mdozmorov/R_notes

R related notes

License: MIT | PRs Welcome

R learning and data analysis resources. Please contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.

Table of contents

General

Cheatsheets

Courses

Conference material

R Package Development

Bioconductor

R Data Analysis

  • autoEDA-resources - A list of software and papers related to automatic/fast Exploratory Data Analysis

  • cmapR - parse and manipulate data in various formats used by the Connectivity Map. Manipulates annotated matrices stored in GCT or GCTX format.

Imputation

  • missMDA - Imputation of incomplete continuous or categorical datasets; Missing values are imputed with a principal component analysis (PCA), a multiple correspondence analysis (MCA) model or a multiple factor analysis (MFA) model; Perform multiple imputation with and in PCA or MCA. By Francois Husson

  • imputeTS - Imputation (replacement) of missing values in univariate time series. Offers several imputation functions and missing data plots. Available imputation algorithms include: 'Mean', 'LOCF', 'Interpolation', 'Moving Average', 'Seasonal Decomposition', 'Kalman Smoothing on Structural Time Series models', 'Kalman Smoothing on ARIMA models'

  • softImpute - matrix completion (imputation). A combination of two algorithms, nuclear-norm-regularized matrix approximation and maximum-margin matrix factorization. R package, distributed version using Spark cluster. By Trevor Hastie

  • missForest - Nonparametric Missing Value Imputation using Random Forest. Handles continuous and categorical data. Trains a random forest on the observed values, then predicts the missing values, repeating iteratively.

  • mice - Multivariate Imputation using Fully Conditional Specification (FCS). By Stef van Buuren
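
To make two of the simpler strategies named above ('Mean' and 'LOCF') concrete, here is a minimal base-R sketch. It does not use the imputeTS API; the helper names `impute_mean` and `impute_locf` are hypothetical and exist only for illustration.

```r
# Hypothetical helpers for illustration; not part of imputeTS or any package.

# Replace every NA with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Last observation carried forward: each NA takes the most recent
# non-missing value before it. Leading NAs stay NA.
impute_locf <- function(x) {
  observed <- !is.na(x)
  # For each position, index of the latest observed value so far (0 if none yet)
  last_idx <- cummax(seq_along(x) * observed)
  out <- x
  fill <- !observed & last_idx > 0
  out[fill] <- x[last_idx[fill]]
  out
}

x <- c(NA, 2, NA, NA, 5, NA)
impute_mean(x)  # NAs replaced by 3.5, the mean of 2 and 5
impute_locf(x)  # NA 2 2 2 5 5
```

Mean imputation shrinks the variance of the series, and LOCF assumes the value stays constant between observations, which is why the packages above offer model-based alternatives.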

Visualization

  • FriendsDontLetFriends - Visualization dos and don'ts. Friends don't let friends make certain types of data visualization: what they are and why they are bad.

  • basegraphics - Pretty plots using pure base graphics in R

  • circlize - Circular Visualization in R. Documentation

  • cols4all - interactive selection of color palettes (c4a_gui()). Covers categorical (qualitative), sequential, diverging, and bivariate palettes (divided into three subtypes). Takes color blindness into account.

  • corrplot - A visual exploratory tool on correlation matrix. R package, CRAN, Documentation

  • DiagrammeR - Graph and network visualization using tabular data in R.

  • ggalluvial - a ggplot2 extension for alluvial plots

  • ggbreak - set axis breaks for ‘ggplot2’

  • ggfortify - Enhanced plotting for commonly used statistics, such as GLM, time series, PCA families, clustering and survival analysis

  • ggtern - ternary diagrams in R

  • ggord - PCA and other dim reduction methods plotting with ellipses

  • ggplot_tricks - ggplot2 tricks, e.g., text contrast on a heatmap, color/fill tricks.

  • ggpointdensity - A Cross Between a Scatter Plot and a 2D Density Plot.

  • ggridges - Ridgeline plots in ggplot2. Introduction to ggridges

  • ggstats - plot regression model coefficients (“forest plots”) using ggplot2, compare models, proportion and cross-tabulation plots. Examples on the website

  • ggstatsplot - Enhancing {ggplot2} plots with statistical analysis. Examples on the website.

  • ggstream - A package to make streamplots

  • ggsci - Scientific Journal and Sci-Fi Themed Color Palettes for 'ggplot2'

  • MetBrewer - Color palette package in R inspired by works at the Metropolitan Museum of Art in New York

  • paletteer - Collection of most color palettes in a single R package

  • pcaExplorer - Interactive Visualization of RNA-seq Data Using a Principal Components Approach. Tweet

  • PCAtools - set of tools performing common PCA-related tasks, by Kevin Blighe and Aaron Lun. GitHub

  • RainCloudPlots - Code and tutorials to visualise your data in a way that is both beautiful and statistically valid. R, Python, Matlab examples.

    Paper Allen M, Poggiali D, Whitaker K et al. Raincloud plots: a multi-platform tool for robust data visualization [version 2; peer review: 2 approved]. Wellcome Open Res 2021, 4:63. DOI: 10.12688/wellcomeopenres.15191.2

Genomics

  • ChromoMap - an R package and a Shiny app for multi-omics data visualization over the genome/chromosome plots. Input - BED files, with annotations (point annotations and segment annotations). D3 JavaScript implementation allows for interactivity. Single function, customizable. ChromLinks - link regions across chromosomes, undirected and directed, Sankey-like. Tweet, Website.
    Paper Anand, Lakshay, and Carlos M. Rodriguez Lopez. “ChromoMap: An R Package for Interactive Visualization of Multi-Omics Data and Annotation of Chromosomes.” BMC Bioinformatics 23, no. 1 (December 2022): 33. https://doi.org/10.1186/s12859-021-04556-z.
  • ggcoverage - Visualize and annotate genomic coverage with ggplot2

  • ngsplot - Quick mining and visualization of NGS data by integrating genomic databases. Average profiles and heatmaps of ChIP-seq-like signals.

  • RIdeogram - an R package for whole-genome overlay of genomic data on an ideogram. Plot continuous and discrete data as heatmaps and track labels. GC content, gene and repeat density, DNA methylation distribution, genomic synteny. Similar functionality: ggbio, IdeoViz, chromPlot, chromDraw, karyoploteR, chromoMap, JavaScript Ideogram.js and karyotypeSVG. Distinctive feature - visualizing changes between two or more genomes using Bezier curves (synteny, genomic rearrangements).

  • trackplot - Generate IGV style locus tracks from bigWig files in R

  • volcano3D - An R package to plot interactive three-way differential expression analysis. A polar coordinate space to plot Z-scores or fold changes of genes among three groups; the z-axis is -log10 p-value. CRAN, website, Online demo.

    Paper Lewis, Myles J., Michael R. Barnes, Kevin Blighe, Katriona Goldmann, Sharmila Rana, Jason A. Hackney, Nandhini Ramamoorthi, et al. “Molecular Portraits of Early Rheumatoid Arthritis Identify Clinical and Treatment Response Phenotypes.” Cell Reports 28, no. 9 (August 2019): 2455-2470.e5. https://doi.org/10.1016/j.celrep.2019.07.091.
  • valr - R package for bedtools-like genome interval analysis, uses tidyverse approach. Reads BED, bedGraph, and VCF formats as tibble data_frame objects. Table 1 - overview of major functions.
    Paper Riemondy, Kent A., Ryan M. Sheridan, Austin Gillen, Yinni Yu, Christopher G. Bennett, and Jay R. Hesselberth. “Valr: Reproducible Genome Interval Analysis in R.” F1000Research 6 (June 29, 2017): 1025. https://doi.org/10.12688/f1000research.11997.1.

Clustering, Dimensionality reduction

R Misc

  • pmcbioc - R package for working with PubMed citations

  • Rcpp for everyone - Rcpp bookdown. GitHub

  • R-parallel - Using R with many CPUs. Wiki with overview, examples, snippets, links, and more

  • containerit - an R package to create a Dockerfile from an R session. Similar functionality is provided by dockerfiler, liftr, automagic.

  • disk.frame - an R package providing a disk-based framework for manipulating larger-than-RAM data. blog post about disk.frame usage

  • ppcor - Partial and Semi-Partial (Part) Correlation

  • corpcor - Efficient Estimation of Covariance and (Partial) Correlation

  • ymlthis - write YAML for R Markdown, bookdown, blogdown, and more. The YAML Fieldguide

  • googledrive - an R package interface to Google Drive. Alternative: Google Drive direct download of big files, gdown.pl

  • philentropy - Similarity and Distance Quantification Between Probability Functions. Computes 46 optimized distance and similarity measures for comparing probability functions

  • mkdocs - Project documentation with Markdown. GitHub

  • gt - Easily generate information-rich, publication-quality tables from R

  • gtsummary - Presentation-Ready Data Summary and Analytic Result Tables
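
As a rough illustration of the partial correlation that ppcor (listed above) computes, the base-R sketch below derives it from regression residuals and checks it against the closed-form expression for a single control variable. It does not call ppcor; the simulated data and variable names are assumptions made for the example.

```r
set.seed(1)
n <- 200
z <- rnorm(n)                 # confounder
x <- 0.8 * z + rnorm(n)       # x depends on z
y <- 0.8 * z + rnorm(n)       # y depends on z, not directly on x

# Partial correlation of x and y given z: correlate what is left of x and y
# after removing the linear effect of z from each.
r_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Closed-form equivalent for one control variable
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
r_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

c(raw = r_xy, partial_resid = r_resid, partial_formula = r_formula)
```

The two partial estimates agree up to floating-point error, while the raw correlation is inflated by the shared dependence on `z`.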

R tips & tricks

  • r-base-shortcuts - Base R shortcuts: A collection of lesser-known but powerful idioms and coding patterns for writing concise and fast R code

  • Install R as a user. Slack

$ curl -O https://cran.r-project.org/src/base/R-4/R-4.1.1.tar.gz
$ tar xzf R-4.1.1.tar.gz
$ cd R-4.1.1
$ mkdir -p "$HOME/software/R-4.1.1"
$ ./configure --prefix="$HOME/software/R-4.1.1"
$ make
$ make install
# Then add:
export PATH=$HOME/software/R-4.1.1/bin:$PATH
# and you've got R / Rscript (=R 4.1.1) ready to go.
  • Convert continuous to categorical value
data(iris)
vari <- iris$Sepal.Length
nb.clusters <- 3
breaks <- quantile(vari, seq(0,1,1/nb.clusters))
Xqual <- cut(vari,breaks, include.lowest=TRUE)
summary(Xqual)
  • Barplot with StdErr using standard R graphics
library(plotrix) # provides std.error(); the data set comes from the agridat package
dat <- agridat::lasrosas.corn
means.nf <- tapply(dat$yield, INDEX = dat$nf, FUN = mean)
StdErr.nf <- tapply(dat$yield, INDEX = dat$nf, FUN = std.error)
BP <- barplot(means.nf, ylim = c(0, max(means.nf) + 10)) # BP holds the bar midpoints
# Error bars of +/- 2 standard errors; arrows() with angle = 90 draws the line and flat caps
arrows(BP, means.nf - (2 * StdErr.nf), BP, means.nf + (2 * StdErr.nf), lwd = 1.5, angle = 90, code = 3, length = 0.05)
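  • HDF5Array - keep a large array on disk to reduce its in-memory size
library(HDF5Array)
a0 <- array(runif(15000000), dim=c(10000, 300, 5))
A0 <- as(a0, "HDF5Array") # writes the data to an HDF5 file; A0 only references it
library(pryr)
object_size(A0)
#> 1.94 kB
object_size(a0)
#> 120 MB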
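  • Convert values to normally distributed ones (rank-based inverse normal transformation)
# `interactions` is assumed to be a numeric data frame; sapply() transforms each column separately
interactions <- log2(interactions) # If highly right-skewed, log2-transform beforehand
# Inverse normal conversion
interactions <- sapply(interactions, function(x) {
  r <- rank(x, na.last = "keep")   # NAs keep their positions
  P <- (r - 0.5) / sum(!is.na(x))  # map ranks into (0, 1)
  qnorm(P)
})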
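
The snippet above iterates with sapply(), which suits a data frame of columns. For a single numeric vector, the same steps can be wrapped in a small helper, sketched below. The name `inverse_normal` and the simulated input are assumptions for illustration.

```r
# Hypothetical helper: rank-based inverse normal transformation of one vector.
inverse_normal <- function(x) {
  r <- rank(x, na.last = "keep")      # ranks; NAs stay NA
  p <- (r - 0.5) / sum(!is.na(x))     # map ranks into (0, 1)
  qnorm(p)                            # corresponding standard-normal quantiles
}

set.seed(42)
skewed <- c(rexp(100), NA)            # right-skewed sample with one missing value
z <- inverse_normal(skewed)

is.na(z[101])                         # the missing value is preserved
range(z, na.rm = TRUE)                # symmetric around 0
```

Because only ranks are used, the result is the same whether or not the input is log-transformed first, as long as the transformation is monotone.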

R questions

  • Why is the vcd package used? The vcd package provides methods for visualizing multivariate categorical data.
  • What is iPlots? It is a package that provides bar plots, mosaic plots, box plots, parallel plots, scatter plots and histograms.
  • What is the fitdistr() function? It performs maximum-likelihood fitting of univariate distributions. It is defined in the MASS package.
  • Define the loglm() function. loglm() fits log-linear models. It is defined in the MASS package.
  • How to create scatterplot matrices? Use pairs() (base graphics) or splom() (lattice package).
  • Define leaps(). It performs all-subsets regression and is defined in the leaps package.
  • Define cluster.stats(). It is defined in the fpc package and provides a method for comparing the similarity of two clustering solutions using a variety of validation criteria.
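
A few of the functions above can be tried with packages that ship with a standard R installation. The sketch below covers fitdistr(), loglm(), and pairs() only; leaps() and cluster.stats() are left out because their packages need separate installation. The simulated data and the choice of the built-in HairEyeColor and iris data sets are assumptions made for the example.

```r
library(MASS)  # recommended package shipped with standard R installations

# fitdistr(): maximum-likelihood fit of a univariate distribution
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)
fit <- fitdistr(x, densfun = "normal")
fit$estimate                      # estimated mean and sd

# loglm(): log-linear model for a contingency table.
# Test independence of hair and eye colour, collapsing over sex.
hair_eye <- margin.table(HairEyeColor, c(1, 2))
indep <- loglm(~ Hair + Eye, data = hair_eye)
indep$df                          # residual degrees of freedom: (4 - 1) * (4 - 1) = 9

# pairs(): scatterplot matrix with base graphics
pairs(iris[, 1:4], col = iris$Species)
```

Printing `fit` or `indep` shows standard errors and the likelihood-ratio and Pearson statistics, respectively.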
