Large-scale data analysis in R

The R computing environment has become an important tool for quantitative research, from computational biology to financial modeling. In this hands-on workshop, we will explore commonly used strategies to efficiently analyze large-scale data sets in R. Participants will learn to automate their R analyses on a compute cluster, profile memory usage, call fast C++ routines in R, and implement simple parallelization strategies, including multithreaded and distributed computing. The aim is to learn these techniques through hands-on "live coding"; we will analyze several medium to large-scale data sets.

Objectives: Attendees will

  • learn how to automate R analyses on a compute cluster;

  • use simple techniques to profile memory usage in R;

  • learn how to make more effective use of memory in R;

  • use multithreading to speed up R computations;

  • learn how to call C++ code from R using Rcpp;

  • write scripts to distribute "embarrassingly parallel" R computations using the Slurm job scheduler on the RCC Midway compute cluster;

  • learn through "live coding."

Prerequisites

All participants are expected to bring a laptop running macOS, Linux, or Windows. Participants should also be comfortable working in the UNIX shell and programming in a non-graphical R environment (that is, not RStudio). An RCC user account is recommended but not required.

What's included

This git repository (the "workshop packet") includes:

  • README.md: This file.

  • conduct.md: Code of Conduct.

  • LICENSE.md: License information for the materials in this repository.

  • slides.pdf: The slides for the workshop.

  • slides.Rmd: R Markdown source used to generate these slides.

  • Makefile: GNU Makefile containing commands to generate the slides from the R Markdown source.

Other information

Credits

These materials were developed by Peter Carbonetto at the University of Chicago. Thank you to Matthew Stephens for his support and guidance. Thanks also to Gao Wang for sharing the Python script for profiling memory usage.