A Case Study of R Performance Analysis and Optimization
Time: Tuesday, July 24, 4:15pm - 4:30pm
Description: Although R has become an analytic platform for many scientific domains, high performance has rarely been a trait of R. The inefficiency can come from the R language specification itself or from the implementation of the interpreter environment. Profiling and optimizing useful R code not only directly benefits domain science researchers but also lets R code run more efficiently on high performance computing resources. We use envirotyping analysis as an example. This analysis considers both genetic information and environmental conditions to understand how these factors affect crop yields, using multi-dimensional data collected from fields and simulations. The analysis has the potential to improve breeding schemes for better global crop yield. A central tool used to support this analysis is an R package, “PReMiuM: Dirichlet Process Bayesian Clustering, Profile Regression”, whose computational complexity increases as the numbers of observations and features grow. The package is a useful tool for Bayesian clustering and inference, with broad application potential if its computational bottlenecks can be overcome. In this paper, we detail our experience in detecting the bottlenecks and optimizing the package's performance. We present a general workflow for investigating performance issues such as execution time and memory usage, in order to understand R program behavior and thus guide optimization of the code. The workflow can be applied to other R applications. With the approach presented here, R users can easily identify inefficient code blocks, search for potential optimization solutions, and efficiently utilize high performance computing resources for scientific research.