The R computing environment has become an important tool for quantitative research, from computational biology to financial modeling. In this hands-on workshop, we will explore commonly used strategies to efficiently analyze large-scale data sets in R. Participants will learn to automate their R analyses on a compute cluster, profile memory usage, call fast C++ routines in R, and implement simple parallelization strategies, including multithreaded and distributed computing. The aim is to learn these techniques through hands-on "live coding"; we will analyze several medium to large-scale data sets.

Objectives: Attendees will (1) learn how to automate R analyses on a compute cluster; (2) use simple techniques to profile memory usage in R; (3) learn how to make more effective use of memory in R; (4) use multithreading to speed up R computations; (5) learn how to call C++ code from R using Rcpp; (6) write scripts to distribute "embarrassingly parallel" R computations using the Slurm job scheduler on the RCC Midway compute cluster; (7) learn through "live coding."
All participants are expected to bring a laptop with a Mac, Linux or Windows operating system. Further, participants should be comfortable interacting with the UNIX shell and programming in a non-graphical R environment (not RStudio). An RCC user account is recommended, but not required.
This git repository (the "workshop packet") includes:
- README.md: This file.
- conduct.md: Code of Conduct.
- LICENSE.md: License information for the materials in this repository.
- slides.pdf: The slides for the workshop.
- slides.Rmd: R Markdown source used to generate these slides.
- Makefile: GNU Makefile containing commands to generate the slides from the R Markdown source.
- This workshop attempts to apply elements of the Software Carpentry approach. See also this article. Please also take a look at the Code of Conduct and the license information.
- To generate PDFs of the slides from the R Markdown source, run `make slides` in the root directory of the git repository. For this to work, you will need to install the rmarkdown package in R, as well as the packages used in slides.Rmd. For more details, see the Makefile.
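The prerequisites for building the slides can be checked from the shell before running make. The following is a minimal sketch, not part of the repository: the helper names `check_tools` and `install_rmarkdown` are hypothetical, and the CRAN mirror URL is an assumption. It does not install the additional packages used in slides.Rmd, which must be installed separately.

```shell
# Report which of the given command-line tools are missing.
# Prints nothing and returns 0 when every tool is available.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Install rmarkdown from CRAN only if R cannot already load it.
# The mirror URL is an assumption; substitute your preferred CRAN mirror.
install_rmarkdown() {
  Rscript -e 'if (!requireNamespace("rmarkdown", quietly = TRUE)) install.packages("rmarkdown", repos = "https://cloud.r-project.org")'
}

# Typical use, from the root directory of the repository:
#   check_tools make Rscript && install_rmarkdown && make slides
```

Defining the steps as functions keeps the snippet safe to source: nothing is installed or built until the final command is run explicitly.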
These materials were developed by Peter Carbonetto at the University of Chicago. Thank you to Matthew Stephens for his support and guidance, and to Gao Wang for sharing the Python script for profiling memory usage.