R learning and data analysis resources. Please, contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.
- General
- Cheatsheets
- Courses
- R Package Development
- R Data Analysis
- R Misc
- R tips & tricks
- R questions
-
rstudio.cloud - a web-based instance of R/RStudio. RStudio cloud cheatsheet and notes, from Twitter source. RStudio Cloud for education - Mel Gregory - 24 min video with all the essence of using RStudio cloud for education
-
RStudio Cheatsheets - PDFs, PowerPoint, Keynote formats. GitHub
-
RStudio Webinars - Code and slides for RStudio webinars
-
stat133-cheatsheets - Cheat sheets on R, RStudio, tidyverse, ggplot2, plotly, shiny, git, and more (ucb-stat133/stat133-cheatsheets)
-
ardeeshany/Parallel_Computing - A CheatSheet for Parallel Computation in R
-
awesome-quarto - A curated list of Quarto talks, tools, examples & articles
-
awesome-R - A curated list of awesome R packages, frameworks and software
-
awesome-rshiny - a list of resources to learn R Shiny
-
Awesome ggplot2 - A curated list of awesome ggplot2 tutorials, packages etc.
-
awesome-r-dataviz - Curated resources for Data Visualization, Drawing & Publishing in R
-
awesome-r-pkgtools - A curated list of awesome resources for R package development
-
introverse - alternate documentation for commonly-used functions and concepts in Base R and in the tidyverse. Tweet
-
swirl - R package for interactive R learning
-
learnR4free - all sorts of resources (books, videos, interactive websites, papers) to learn R. Tweet
-
Data Science: A First Introduction - bookdown by Tiffany-Anne Timbers, Trevor Campbell, Melissa Lee. From basics of data wrangling to classification/regression/clustering/statistical inference. Tweet
-
Big Book of R, by Oscar Baruffa. A bookdown bookmarking links to everything R for data science, from basics, visualization, tidyverse to distributed computing, spatial data science, machine learning, and more. GitHub
-
A Complete Tutorial to learn Data Science in R from Scratch - Basics of R, with short and concise examples
-
teach-r - List of Resources for Teaching R
-
The learnr tutorials from RStudio's primers, available on RStudio Cloud
-
FasteR - Fast Lane to Learning R, learn R language in R console by following examples, by Norm Matloff
-
Hands-On Programming with R - introductory book on how to program in R, with hands-on examples, by Garrett Grolemund
-
R for Data Science - a Tidyverse-oriented introduction to R online book, by Garrett Grolemund and Hadley Wickham
-
Advanced R and Advanced R Solutions - Advanced R programming, book and solutions, by Hadley Wickham and others
-
R Cookbook, 2nd Edition - statistics-oriented R introduction, by James Long and Paul Teetor
-
Efficient R programming - advanced concepts for efficient (R) programming, by Colin Gillespie & Robin Lovelace
-
SDS 375 Data Visualization in R by Claus Wilke. GitHub: wilkelab/SDS375. Tweet
-
RMarkdown from RStudio - illustrated guide to RMarkdown
-
Rmarkdown for Scientists book by Nicholas Tierney. GitHub
-
Mastering Shiny book by Hadley Wickham. GitHub
-
Yet another ‘R for Data Science’ study guide, by Bryan Shalloway. Tidyverse-oriented introduction to R
-
One Page R book "Data Science Quick Start: Knowledge Discovery Through R" by Togaware, includes pdf slides and R code templates for various machine learning tasks
-
Data Analysis and Prediction Algorithms with R book by Rafael Irizarry. Data science, statistics topics in R. GitHub. Old version, Labs, and Videos
-
R & Bioconductor Manual - One of the best R manuals to brush up on all major steps in data analysis/visualization, with links to other resources. By Thomas Girke, UC Riverside
-
RProgrammingForResearch - course notes for R Programming for Research. R learning from ground up to tidyverse, with lectures, homeworks, data, source files
-
biostat561 - Computational Skills for Biostatistics, from version control, R programming, ggplot, shiny, to Unix, LaTeX, Markdown, Python. By Amy Willis
-
THRIV datasci 2018 - THRIV Data Science course by Stephen Turner. Comprehensive coverage from R/RStudio introduction, RMarkdown, dplyr, ggplot2, shiny to all practical and statistical aspects of data cleaning, visualization, predictive modeling, survival analysis. Workshops
-
Introduction to Data Science by Rafael Irizarry and Stephanie Hicks. The GitHub repo https://github.com/datasciencelabs/2020 has the latest course material, previous material for the course is available by changing the year number, e.g. https://github.com/datasciencelabs/2019. Data for the course
-
R Programming for Data Science by Roger Peng. Fundamentals of R programming and data science. GitHub
-
Mastering Software Development in R, book by Roger D. Peng, Sean Kross, and Brooke Anderson. From R basics to package development
-
Statistical Computing - Biostatistics 140.776 course by Roger Peng. Youtube Playlists
-
DataScienceSpecialization - Course materials for the Data Science Specialization from JHU folks. Detailed lectures on each topic
-
ds4stats - Data Science for Statisticians Workshop. Tidyverse, Data wrangling, visualization, ggplot2, machine learning. Rmds, lab exercises, links to ready-to-view lectures.
-
master-the-tidyverse - the Master the Tidyverse Workshop, `tidyverse`-oriented tutorials. Instructor's material
-
Statistical Inference via Data Science, A ModernDive into R and the Tidyverse - bookdown, statistics and data science using tidyverse. GitHub
-
Statistical Thinking for the 21st Century - bookdown, statistics theory illustrated with R. GitHub material
-
R (BGU course) by Jonathan D. Rosenblatt. From R basics to regression, machine learning, graphics, shiny, advanced computing. GitHub
-
online-courses - Free courses by RSquareAcademy. From data import to tidyverse, web scraping, regex, databases. Videos, slides, code, data.
-
SISBID - several modules covering various aspects of data science. Summer Institute in Statistics for Big Data. Lectures, exercises, data. Module 1, Big Data, Module 2, Visualization of Biomedical Big Data, Module 3, Reproducible research, Module 4, Module 5
-
Ted Laderas and his Ready for R course, the accompanying Ready for R: Notebook Reference bookdown, and the R Bootcamp tidyverse exercises
-
R Programming For Research course by Brooke Anderson, Rachel Severson, and Nicholas Good, Colorado State University. Basic, intermediate, and advanced data analysis in R. GitHub, Youtube playlist with short videos on R-related topics
-
R for the Rest of Us free R courses by David Keyes.
-
Text Mining with R, by Julia Silge and David Robinson
-
Fundamentals of Data Visualization, by Claus O. Wilke
-
Effective graphs with Microsoft R Open - plots using base R graphics. Blog post about the book, GitHub
-
idem_viz - Materials for the course IDEM 181 "Visualizing Data"
-
Network visualization with R, workshop by Katherine Ognyanova. Blog post
-
bigdataclass - Big Data with R class by RStudio
-
R Consortium playlists - useR! conferences and more
- Package Building - How `DESCRIPTION`, `NAMESPACE`, `roxygen`, and `devtools::document` work together. By Ted Laderas
- Automate testing of your R package using Travis CI, Codecov, and testthat, by Jean Fan
- MangoTheCat/goodpractice - An R package to check for good coding practices when building R packages. Syntax to avoid, package structure, code complexity, code formatting, etc.
- pkgdown - an R package for making a website for your package
- Hexmake - an app that lets users build their own hex stickers. RStudio note
- GuangchuangYu/hexSticker - Hexagon sticker in R, R package
- neurodata/hyppo - an example of well-documented software, informative README, badges
-
Bioc package list - Summary and analysis of popular Bioconductor packages, by Charlotte Soneson. R/tidyverse code, GitHub
-
S4-Bioconductor - Online resources for learning programming with S4 in R in general, and the particular implementation in Bioconductor (BiocGenerics, S4Vectors, IRanges, GRanges, SummarizedExperiment, Biostrings, etc.).
-
code.bioconductor.org - Browse the contents and git history of all Bioconductor software packages, search across all software packages at once, and filter results by file names, types, or packages. Tweet
-
Develop Bioconductor packages with Docker container
- Bioconductor/bioconductor_full - Docker Images which include a complete installation of all software needed to build all Bioconductor packages
-
BiocPkgTools - R package for querying Bioconductor package statistics, downloads, dependencies, visualization as graphs.
- Su, Shian, Vincent J. Carey, Lori Shepherd, Matthew Ritchie, Martin T. Morgan, and Sean Davis. “BiocPkgTools: Toolkit for Mining the Bioconductor Package Ecosystem.” F1000Research 8 (May 29, 2019)
-
seandavi/BuildABiocWorkshop2020 - template for building a bioconductor workshop package using GitHub actions
-
How to run a Bioconductor Workshop on a Google Cloud Instance by Sean Davis
-
Orchestra - workshop platform provider for running docker containers on kubernetes, by Sean Davis
-
autoEDA-resources - A list of software and papers related to automatic/fast Exploratory Data Analysis
-
cmapR - parse and manipulate data in various formats used by the Connectivity Map. Manipulating annotated matrices stored as GCT or GCTX formats.
-
missMDA - Imputation of incomplete continuous or categorical datasets; Missing values are imputed with a principal component analysis (PCA), a multiple correspondence analysis (MCA) model or a multiple factor analysis (MFA) model; Perform multiple imputation with and in PCA or MCA. By Francois Husson
- Josse, Julie, and François Husson. “MissMDA : A Package for Handling Missing Values in Multivariate Data Analysis.” Journal of Statistical Software 70, no. 1 (2016).
-
imputeTS - Imputation (replacement) of missing values in univariate time series. Offers several imputation functions and missing data plots. Available imputation algorithms include: 'Mean', 'LOCF', 'Interpolation', 'Moving Average', 'Seasonal Decomposition', 'Kalman Smoothing on Structural Time Series models', 'Kalman Smoothing on ARIMA models'
- Moritz, Steffen, and Thomas Bartz-Beielstein. “ImputeTS: Time Series Missing Value Imputation in R.” The R Journal, 2017
-
softImpute - matrix completion (imputation). A combination of two algorithms, nuclear-norm-regularized matrix approximation and maximum-margin matrix factorization. R package, distributed version using Spark cluster. By Trevor Hastie
- Hastie, Trevor, Rahul Mazumder, Jason Lee, and Reza Zadeh. “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” ArXiv:1410.2596 [Stat], October 9, 2014.
-
missForest - Nonparametric Missing Value Imputation using Random Forest. Handles continuous and categorical data. Training RF on the complete data, then predicting the missing values iteratively.
- Stekhoven, D. J., and P. Buhlmann. “MissForest--Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28, no. 1 (January 1, 2012)
-
mice - Multivariate Imputation using Fully Conditional Specification (FCS). By Stef van Buuren
- Buuren, Stef van, and Karin Groothuis-Oudshoorn. “Mice : Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45, no. 3 (2011).
-
FriendsDontLetFriends - Visualization dos and don'ts. Friends don't let friends make certain types of data visualization - what they are and why they are bad.
-
basegraphics - Pretty plots using pure base graphics in R
-
circlize - Circular Visualization in R. Documentation
-
cols4all - interactive selection of color palettes (`c4a_gui()`). Categorical (qualitative) palettes, sequential palettes, diverging palettes, and bivariate palettes (divided into three subtypes). Considers color blindness.
-
corrplot - A visual exploratory tool on correlation matrix. R package, CRAN, Documentation
-
DiagrammeR - Graph and network visualization using tabular data in R.
-
ggalluvial - a ggplot2 extension for alluvial plots
-
ggbreak - set axis breaks for ‘ggplot2’
-
ggfortify - Enhanced plotting for commonly used statistics, such as GLM, time series, PCA families, clustering and survival analysis
-
ggtern - ternary diagrams in R
-
ggord - PCA and other dim reduction methods plotting with ellipses
-
ggplot_tricks - ggplot2 tricks, e.g., text contrast on a heatmap, color/fill tricks.
-
ggpointdensity - A Cross Between a Scatter Plot and a 2D Density Plot.
-
ggridges - Ridgeline plots in ggplot2. Introduction to ggridges
-
ggstats - plot regression model coefficients (“forest plots”) using ggplot2, compare models, proportion, cross-tabulation plots. Examples on the website
-
ggstatsplot - Enhancing `{ggplot2}` plots with statistical analysis. Examples on the website.
-
ggstream - A package to make streamplots
-
ggsci - Scientific Journal and Sci-Fi Themed Color Palettes for 'ggplot2'
-
MetBrewer - Color palette package in R inspired by works at the Metropolitan Museum of Art in New York
-
paletteer - Collection of most color palettes in a single R package
-
pcaExplorer - Interactive Visualization of RNA-seq Data Using a Principal Components Approach. Tweet
-
PCAtools - set of tools performing common PCA-related tasks, by Kevin Blighe and Aaron Lun. GitHub
-
RainCloudPlots - Code and tutorials to visualise your data in a way that is both beautiful and statistically valid. R, Python, Matlab examples.
- Allen M, Poggiali D, Whitaker K et al. Raincloud plots: a multi-platform tool for robust data visualization [version 2; peer review: 2 approved]. Wellcome Open Res 2021, 4:63. DOI: 10.12688/wellcomeopenres.15191.2
-
scattermore - very fast scatterplots for R. CRAN
-
visNetwork - an R package for network visualization, using vis.js javascript library.
-
UpSetR - stretched and aligned Venn and Euler diagrams. Slides by Nils Gehlenborg
- Conway, Jake R, Alexander Lex, and Nils Gehlenborg. “UpSetR: An R Package for the Visualization of Intersecting Sets and Their Properties.” Bioinformatics 33, no. 18 (September 15, 2017)
-
How to Create a Bar Chart Race in R - Mapping United States City Population 1790-2010.
- ChromoMap - an R package and a Shiny app for multi-omics data visualization over the genome/chromosome plots. Input - BED files, with annotations (point annotations and segment annotations). D3 JavaScript implementation allows for interactivity. Single function, customizable. ChromLinks - link regions across chromosomes, undirected and directed, Sankey-like. Tweet, Website.
- Anand, Lakshay, and Carlos M. Rodriguez Lopez. “ChromoMap: An R Package for Interactive Visualization of Multi-Omics Data and Annotation of Chromosomes.” BMC Bioinformatics 23, no. 1 (December 2022): 33. https://doi.org/10.1186/s12859-021-04556-z.
-
ggcoverage - Visualize and annotate genomic coverage with ggplot2
-
ngsplot - Quick mining and visualization of NGS data by integrating genomic databases. Average profiles and heatmaps of ChIP-seq-like signals.
-
RIdeogram - an R package for whole-genome overlay of genomic data on an ideogram. Plot continuous and discrete data as heatmaps and track labels. GC content, gene and repeat density, DNA methylation distribution, genomic synteny. Similar functionality: ggbio, IdeoViz, chromPlot, chromDraw, karyoploteR, chromoMap, JavaScript Ideogram.js and karyotypeSVG. Distinctive feature - visualizing changes between two or more genomes using Bezier curves (synteny, genomic rearrangements).
- Hao, Zhaodong, Dekang Lv, Ying Ge, Jisen Shi, Dolf Weijers, Guangchuang Yu, and Jinhui Chen. “RIdeogram: Drawing SVG Graphics to Visualize and Map Genome-Wide Data on the Idiograms,” 2020, 11.
-
trackplot - Generate IGV style locus tracks from bigWig files in R
-
volcano3D - An R package to plot interactive three-way differential expression analysis. A polar coordinate space to plot Z-scores or fold changes of genes among three groups, z-axis is -log10 p-value. CRAN, website, Online demo.
- Lewis, Myles J., Michael R. Barnes, Kevin Blighe, Katriona Goldmann, Sharmila Rana, Jason A. Hackney, Nandhini Ramamoorthi, et al. “Molecular Portraits of Early Rheumatoid Arthritis Identify Clinical and Treatment Response Phenotypes.” Cell Reports 28, no. 9 (August 2019): 2455-2470.e5. https://doi.org/10.1016/j.celrep.2019.07.091.
- valr - R package for bedtools-like genome interval analysis, uses tidyverse approach. Reads BED, bedGraph, and VCF formats as tibble data frame objects. Table 1 - overview of major functions.
- Riemondy, Kent A., Ryan M. Sheridan, Austin Gillen, Yinni Yu, Christopher G. Bennett, and Jay R. Hesselberth. “Valr: Reproducible Genome Interval Analysis in R.” F1000Research 6 (June 29, 2017): 1025. https://doi.org/10.12688/f1000research.11997.1.
-
ClusterEnG - online clustering and interactive visualization (D3). Seven algorithms: k-means, k-medoids, affinity propagation, spectral clustering, Gaussian mixture model, hierarchical clustering and DBSCAN. Clustering validation metrics. Tutorials using NCI60 data and B-cell lymphoma gene expression. R backend. GitHub.
- Manjunath, Mohith, Yi Zhang, Steve H. Yeo, Omar Sobh, Nathan Russell, Christian Followell, Colleen Bushell, Umberto Ravaioli, and Jun S. Song. “ClusterEnG: An Interactive Educational Web Resource for Clustering and Visualizing High-Dimensional Data.” PeerJ. Computer Science 4 (2018).
-
bigcor - Creating Very Large Correlation/Covariance Matrices. Tweet
-
GLM-PCA - multinomial distribution-based analysis methods for UMI counts in scRNA-seq data. Details of scRNA-seq data properties, analysis steps. GitHub
- Townes, F. William, Stephanie C. Hicks, Martin J. Aryee, and Rafael A. Irizarry. “Feature Selection and Dimension Reduction for Single Cell RNA-Seq Based on a Multinomial Model.” BioRxiv, March 11, 2019.
-
PCA course using FactoMineR, with videos, by François Husson, the creator of FactoMineR - Multivariate Exploratory Data Analysis and Data Mining
-
pmcbioc - R package for working with PubMed citations
-
Rcpp for everyone - Rcpp bookdown. GitHub
-
R-parallel - Using R with many CPUs. Wiki with overview, examples, snippets, links, and more
-
containerit - an R package to create a Dockerfile from an R session. Similar functionality is provided by `dockerfiler`, `liftr`, `automagic`.
- Nüst, Daniel, and Matthias Hinz. “Containerit: Generating Dockerfiles for Reproducible Research with R.” Journal of Open Source Software 4, no. 40 (August 21, 2019)
-
disk.frame - an R package providing a disk-based framework for manipulating larger-than-RAM data. Blog post about disk.frame usage
-
ppcor - Partial and Semi-Partial (Part) Correlation
-
corpcor - Efficient Estimation of Covariance and (Partial) Correlation
-
ymlthis - write YAML for R Markdown, bookdown, blogdown, and more. The YAML Fieldguide
-
googledrive - an R package interface to Google Drive. Alternative: Google Drive direct download of big files, gdown.pl
-
philentropy - Similarity and Distance Quantification Between Probability Functions. Computes 46 optimized distance and similarity measures for comparing probability functions
-
gt - Easily generate information-rich, publication-quality tables from R
-
gtsummary - Presentation-Ready Data Summary and Analytic Result Tables
-
r-base-shortcuts - Base R shortcuts: A collection of lesser-known but powerful idioms and coding patterns for writing concise and fast R code
-
Install R as a user. Slack
$ curl -O https://cran.r-project.org/src/base/R-4/R-4.1.1.tar.gz
$ tar xzf R-4.1.1.tar.gz
$ cd R-4.1.1
$ mkdir -p "$HOME/software/R-4.1.1"
$ ./configure --prefix="$HOME/software/R-4.1.1"
$ make
$ make install
# Then add:
export PATH=$HOME/software/R-4.1.1/bin:$PATH
# and you've got R / Rscript (=R 4.1.1) ready to go.
- Convert continuous to categorical value
data(iris)
vari <- iris$Sepal.Length
nb.clusters <- 3
breaks <- quantile(vari, seq(0,1,1/nb.clusters))
Xqual <- cut(vari,breaks, include.lowest=TRUE)
summary(Xqual)
- Barplot with StdErr using standard R graphics
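library(plotrix) # std.error() is assumed to come from the plotrix package
dat = agridat::lasrosas.corn # dataset from the agridat package
means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean)
StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN=std.error)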
BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))
segments(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd = 1.5)
arrows(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd = 1.5, angle = 90, code = 3, length = 0.05)
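The barplot tip above relies on two add-on packages. A minimal base-R-only sketch of the same idea follows. It assumes the built-in `ToothGrowth` dataset as a stand-in for the corn data and a hand-written `std_err()` helper; both are illustrative choices and not part of the original tip.

```r
# Standard error of the mean, ignoring missing values (hypothetical helper)
std_err <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

dat <- ToothGrowth  # built-in dataset, used here only as a stand-in
means <- tapply(dat$len, INDEX = dat$dose, FUN = mean)
ses <- tapply(dat$len, INDEX = dat$dose, FUN = std_err)

# barplot() returns the x-coordinates of the bar midpoints
bp <- barplot(means, ylim = c(0, max(means + 2 * ses) * 1.1),
              xlab = "Dose", ylab = "Mean tooth length")

# Error bars spanning mean +/- 2 standard errors, with flat caps
arrows(bp, means - 2 * ses, bp, means + 2 * ses,
       angle = 90, code = 3, length = 0.05, lwd = 1.5)
```

A single `arrows()` call with `angle = 90` and `code = 3` draws both the vertical bar and its caps, so a separate `segments()` call is not needed in this variant.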
- HDF5Array
library(HDF5Array)
a0 <- array(runif(15000000), dim=c(10000, 300, 5))
A0 <- as(a0, "HDF5Array")
library(pryr)
object_size(A0)
#> 1.94 kB
object_size(a0)
#> 120 MB
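- Convert a vector to a normally distributed one
# `interactions` is assumed to be a data frame (or list) of numeric columns;
# sapply() transforms each column separately
interactions <- log2(interactions) # If highly right-skewed, log2-transform beforehand
# Rank-based inverse normal conversion
interactions <- sapply(interactions, function(x) {
r <- rank(x, na.last = "keep") # ranks, with NAs kept as NA
P <- (r - 0.5) / sum(!is.na(x)) # map ranks to probabilities in (0, 1)
qnorm(P) # map probabilities to standard normal quantiles
})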
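Because the conversion above goes through `sapply()`, it works column by column; applied to a plain numeric vector it would treat every element as its own group. For a single vector, the same rank-based transformation can be sketched as a standalone function. The name `inverse_normal()` and the simulated input are assumptions made for illustration.

```r
# Rank-based inverse normal transformation of a numeric vector.
# NAs are preserved; ties receive the average rank (the rank() default).
inverse_normal <- function(x) {
  r <- rank(x, na.last = "keep")
  p <- (r - 0.5) / sum(!is.na(x))
  qnorm(p)
}

set.seed(1)
x <- c(rexp(100), NA)  # right-skewed toy data with one missing value
z <- inverse_normal(x)

summary(z)  # roughly symmetric around zero
```

The transformation is monotonic, so the ordering of the original values is kept while the marginal distribution becomes approximately standard normal.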
- Why is the vcd package used? The vcd package provides different methods for visualizing multivariate categorical data.
- What is iplots? It is a package that provides bar plots, mosaic plots, box plots, parallel plots, scatter plots, and histograms.
- What is the fitdistr() function? It performs maximum-likelihood fitting of univariate distributions. It is defined in the MASS package.
- Define the loglm() function. loglm() is used to fit log-linear models; it is defined in the MASS package.
- How to create scatterplot matrices? The pairs() function (base graphics) or the splom() function (lattice package) is used to create scatterplot matrices.
- Define leaps(). It performs all-subsets regression and is defined in the leaps package.
- Define cluster.stats(). It is defined in the fpc package and provides a method for comparing the similarity of two cluster solutions using a variety of validation criteria.
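A few of the functions mentioned in the answers above can be tried with packages that ship with a standard R installation. The sketch below is illustrative only: the simulated data, the choice of the built-in `HairEyeColor` and `iris` datasets, and the variable names are assumptions, not part of the original questions.

```r
library(MASS)  # provides fitdistr() and loglm()

# fitdistr(): maximum-likelihood fit of a univariate distribution
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)
fit <- fitdistr(x, densfun = "normal")
fit$estimate  # estimated mean and sd

# loglm(): log-linear model of independence for a two-way contingency table
hec <- margin.table(HairEyeColor, c(1, 2))  # Hair x Eye counts
indep <- loglm(~ Hair + Eye, data = hec)
indep$lrt  # likelihood-ratio statistic
indep$df   # residual degrees of freedom

# pairs(): scatterplot matrix with base graphics
pairs(iris[, 1:4], col = iris$Species)
```

`leaps()` and `cluster.stats()` are left out here because they need the leaps and fpc packages, which are not installed by default.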