SlideDeck.Rmd

---
title: "Introduction to R: A Computational Workbench for Biological Data Analysis"
author: "Brian Capaldo"
date: "6/10/2017"
output: slidy_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Acknowledgements

- [May Institute](https://computationalproteomics.ccis.northeastern.edu/)

    - [GitHub](https://github.com/MayInstitute/MayInstitute2017)
  
- Association of Biomolecular Research Facilities

- International Society for Advancement of Cytometry

- The R Project for Statisitcal Computing

- The Bioconductor Project

- [RStudio](https://www.rstudio.com/)

- UVA Office of Research Core Administration

## attributes(Brian)

- BS in Cellular and Molecular Biology

- Wet lab experience in yeast/bacterial genetics and materials science

- PhD in Biochemistry and Molecular Genetics

- Core informatician at UVA SOM ORCA
    
    1. Cytometry
    
    2. Mass spectrometry
    
    3. Next Generation Sequencing
    
    4. Microarrays

- [GitHub](https://github.com/bc2zb)    


## Why R?

- It's flexible (this presentation was made using R)

- It's open source

- It's widely supported

- Its design philosophy

- [Installing R](https://cran.r-project.org/)

## History of R

- The R programming language has its roots in S, a language created forty years ago at ATT Bell Labs by Chambers, Becker and Wilks.

- The motivation for S was to simplify the task of calling statistical routines written in FORTRAN (this involved writing code to read in, format and write out the data).

- The goal of S was to free users of the drudgery of writing that interface code and speed up the data analyst’s workflow. The key linguistic design principle was to design a language that could be extended and that was devoid of arbitrary limitations.

- The S language was such a success that users started writing statistical routines in S.

- The R programming language was designed by Ross Ihaka and Robert Gentleman at the University of Auckland around 1992.

- R departed from S in number of ways, it was an open source language released under a GPL license to ensure that everything related to R remains in the public domain, R cleaned up some corners of S and improve the performance of programs written in the language.

## R as a computational bench top 

- Data as experiments

- Data structures as glassware

- Functions as instruments

- Vignettes as protocols

## Data types in R

The base types include `double`, `integer`, `characater` and `logical`.  Values can be tested as belonging to one of those types.

##

```{r, echo = TRUE}
typeof(1)
```

##

```{r, echo = TRUE}
typeof(1L)
```

##

```{r, echo = TRUE}
is.numeric(1.1)
```

##

```{r, echo = TRUE}
is.logical(TRUE)
is.character("Hello")
```


## Vectors

R provides several kinds of compound data structures commonly referred to as *vectors*.  All operations in R are *vectorized*, this means that they take vectors as arguments and they will operate on the individual values (basic types) of these vectors.

The operator `[<-` is the assignment function.

```{r, echo = TRUE}
x <- c(1,2,2)   # 'c()' is the concatenate function
y <- c(2,2,1)   
( x + y ) * x
```

## Atomic vectors

Atomic vectors are homogeneous (possibly multi-dimensional) data structures.

```{r, echo = TRUE}
c(1,2,3)                # again using 'c()' to concatenate
c("one","two","three")
```

## Subsetting

All vectors can be accessed used subsetting operators `[` and `[[`

- `typeof()` 

    - to find out what basic type is in the vector
    
- `length()`

    - how many elements are in it
    
- `attributes()` 

    - query the vector's meta data

## Element of vectors can be accessed individually.

```{r, echo = TRUE}
x <- c(1L, 2L, 3L) # The `L` lets R know you want an interger
x[1] <- 45
x[1]
```

## A two dimensional atomic vector is called a *matrix* and higher-dimensional vector is called an *array*.

```{r, echo = TRUE}
a <- matrix(1:6, ncol = 3, nrow = 2)
b <- array(1:12, c(2,3,4))
```

## a

a <- matrix(1:6, ncol = 3, nrow = 2)

```{r, echo = TRUE}
a
```

## b

b <- array(1:12, c(2,3,4))

```{r, echo = TRUE}
b
```

## Dynamic typing and coercions 

Matrices and array, like vectors, are homogeneous, but it is possible to assign values of different basic types into any one of them. This causes a *coercion* of the entire data structure.

```{r, echo = TRUE}
x
x[[1]] <-1.1
x
```

##

```{r, echo=T}
a
a[[2,1]] <- "one"
a
```

## R has a powerful set of subsetting operations that apply to all vectors uniformly. R allows, to subset ranges, rows, columns of vectors.

```{r, echo = TRUE}
a[1,]       # subset the first row
a[1:2,]     # subset a range consisting of the first and second row
a[1:2,2:3]  # subset a range consisting of the first and second row and the second and third column
```

## Coercions

```{r, echo = TRUE}
aa <- as.integer(a) # as.___ functions coerce one type into another
a
aa
```

##

```{r, echo = TRUE}
aa
dim(aa) <- c(2,3) # dim() refers to the dimension of the data structure
aa
a[1, a[1,] > 1] # here we are subsetting the first row of a and returning which positions are > 1
aa[1, aa[1,] > 1]
```

## When using a predicate such as `a[1,] > 1` for subsetting, behind the scenes R generates a vector of logical values. True indicates a position that should be extracted.  So this is:

```{r, echo = TRUE}
aa[1,]>1
```

## We can use logical arrays to subset as follows:

```{r, echo = TRUE}
aa[1, c(FALSE, TRUE, TRUE)]
```

## Subsetting can be used in assignment operations as well.

```{r, echo = TRUE}
a
a[1:3] <- a[4:6]
a
```


## Lists

Lists are heterogeneous vectors. The elements of list can be of any kind, including lists and vectors.

```{r, echo = TRUE}
list(1, "hi", c(1,2))
```

The `typeof()` a list is `list`. You can test for a list with `is.list()` and coerce to a list with `as.list()`. You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` will coerce.

## Referential transparency

To facilitate equational reasoning, R attempts to provide *referential transparency* for function calls. Referential transparency means that arguments are not changed by the function being called. So the following function `f(x)` does not modify the vector passed in.

```{r, echo = TRUE}
x
f <- function(x) { x[1] <- 1 }
f(x)
x
```

## Questions on the basics of R?

## Packages

- In R, collections of data, functions, and compiled code are stored as *packages*. *Packages* can be thought of as the computational benchtop equivalent of the reagents and instruments required to perfrom a specific computational assay. 

- There are two primary repositories from which to get *packages*, [CRAN](https://cran.r-project.org/) and [Bioconductor](http://bioconductor.org/)

- Bioconductor has made available a number of previous [courses](http://bioconductor.org/help/course-materials/) as well

## Installing packages from CRAN

```{r, echo=TRUE, eval=T}
install.packages("tidyverse", verbose = T, repos='http://cran.us.r-project.org')
```

## Installing packages from Bioconductor

```{r, echo =T, eval=T}
source("https://bioconductor.org/biocLite.R")
biocLite("flowCore", verbose = T, suppressUpdates = T)
```

## Loading libraries

```{r, echo = T}
library(flowCore)
library(tidyverse)
```

## Read Files

```{r, echo = T}
file.name <- system.file("extdata","0877408774.B08", package="flowCore")
x <- read.FCS(file.name, transformation=FALSE)
summary(x)
```

## Search Keywords

```{r, echo = T}
keyword(x,c("$P1E", "$P2E", "$P3E", "$P4E"))
```

## Print Summary

```{r, echo = T}
summary(read.FCS(file.name)) # if you weren't sure if this is the right file
```

## One FCS file

```{r, echo = T}
x # Operator did not annotate the channels!
```

## Biaxial Plot 

```{r, echo = T}
library(ggcyto)
autoplot(x, "FL1-H", "FL2-H")
```

## Histogram Plot

```{r, echo = T}
autoplot(x, "FL1-H")
```

## Handling multiple FCS files

```{r, echo = T}
frames <- lapply(                                      # member of apply family
                 dir(system.file("extdata",            # object to which function will be "applied"
                                 "compdata",           # system.file() is pointing to where the data is
                                 "data",    
                                  package="flowCore"), # which package is the data stored in
                     full.names=TRUE),                 # returns the full file names
                 read.FCS)                             # function being applied
as(frames, "flowSet")                                  # casting all the frames as a flowSet
```

## Storing a Flow Set

```{r, echo = T}
names(frames) <- sapply(frames, keyword, "SAMPLE ID") # naming the frames in the flowSet
fs <- as(frames, "flowSet")                           # storing the flowSet as "fs"
fs                                                    # TIMTOWTDI
```

## Annotating a flowSet

```{r, echo = T}
phenoData(fs)$Filename <- fsApply(fs,        # another apply function
                                  keyword,   # applies keyword() on each frame in fs
                                  "$FIL")    # returns the $FIL (file name)
pData(phenoData(fs))                         # fs is now annotated with the file names
```

## Much more efficient method

```{r, echo = T}
fs <- read.flowSet(path=system.file("extdata",
                                    "compdata",
                                    "data",
                                    package="flowCore"), # pointing to the data
                   name.keyword="SAMPLE ID",             
                   phenoData=list(name="SAMPLE ID", Filename="$FIL")) # annotating the data
fs
pData(phenoData(fs))
```

## Vignettes as Protocols

[ggcyto](http://bioconductor.org/packages/release/bioc/html/ggcyto.html)

[cytofkit](http://bioconductor.org/packages/release/bioc/html/cytofkit.html)

## Tour of RStudio

[RStudio](https://www.rstudio.com/)

## Resources

- Lab's GitHub Repositories

    -[Nolan Lab](https://github.com/nolanlab)
    
    -[Gottardo Lab](https://github.com/rglab)
    
- Other labs to follow

    -[Irish Lab](https://my.vanderbilt.edu/irishlab/)
    
    -[Brinkman Lab](http://www.terryfoxlab.ca/people-detail/ryan-brinkman/)

- [Bioconductor Support](https://support.bioconductor.org/)

- [CyTOF Forum](http://cytoforum.stanford.edu/viewforum.php?f=1)

- Institutional Resources

## Next steps while here at CYTO 2017

- There are lots of posters about computational cytometry here

- Workshop 11: Automating High Volume Cytometry Data Analysis for Central Facilities

    Location: Room 302
    
    Date: Monday, Jun 12 15:45
    
- Keep posting questions on the [GitHub Issues Page](https://github.com/bc2zb/IntroductionToR/issues)