-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSlideDeck.Rmd
423 lines (271 loc) · 10.5 KB
/
SlideDeck.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
---
title: "Introduction to R: A Computational Workbench for Biological Data Analysis"
author: "Brian Capaldo"
date: "6/10/2017"
output: slidy_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Acknowledgements
- [May Institute](https://computationalproteomics.ccis.northeastern.edu/)
- [GitHub](https://github.com/MayInstitute/MayInstitute2017)
- Association of Biomolecular Research Facilities
- International Society for Advancement of Cytometry
- The R Project for Statisitcal Computing
- The Bioconductor Project
- [RStudio](https://www.rstudio.com/)
- UVA Office of Research Core Administration
## attributes(Brian)
- BS in Cellular and Molecular Biology
- Wet lab experience in yeast/bacterial genetics and materials science
- PhD in Biochemistry and Molecular Genetics
- Core informatician at UVA SOM ORCA
1. Cytometry
2. Mass spectrometry
3. Next Generation Sequencing
4. Microarrays
- [GitHub](https://github.com/bc2zb)
## Why R?
- It's flexible (this presentation was made using R)
- It's open source
- It's widely supported
- Its design philosophy
- [Installing R](https://cran.r-project.org/)
## History of R
- The R programming language has its roots in S, a language created forty years ago at ATT Bell Labs by Chambers, Becker and Wilks.
- The motivation for S was to simplify the task of calling statistical routines written in FORTRAN (this involved writing code to read in, format and write out the data).
- The goal of S was to free users of the drudgery of writing that interface code and speed up the data analyst’s workflow. The key linguistic design principle was to design a language that could be extended and that was devoid of arbitrary limitations.
- The S language was such a success that users started writing statistical routines in S.
- The R programming language was designed by Ross Ihaka and Robert Gentleman at the University of Auckland around 1992.
- R departed from S in number of ways, it was an open source language released under a GPL license to ensure that everything related to R remains in the public domain, R cleaned up some corners of S and improve the performance of programs written in the language.
## R as a computational bench top
- Data as experiments
- Data structures as glassware
- Functions as instruments
- Vignettes as protocols
## Data types in R
The base types include `double`, `integer`, `characater` and `logical`. Values can be tested as belonging to one of those types.
##
```{r, echo = TRUE}
typeof(1)
```
##
```{r, echo = TRUE}
typeof(1L)
```
##
```{r, echo = TRUE}
is.numeric(1.1)
```
##
```{r, echo = TRUE}
is.logical(TRUE)
is.character("Hello")
```
## Vectors
R provides several kinds of compound data structures commonly referred to as *vectors*. All operations in R are *vectorized*, this means that they take vectors as arguments and they will operate on the individual values (basic types) of these vectors.
The operator `[<-` is the assignment function.
```{r, echo = TRUE}
x <- c(1,2,2) # 'c()' is the concatenate function
y <- c(2,2,1)
( x + y ) * x
```
## Atomic vectors
Atomic vectors are homogeneous (possibly multi-dimensional) data structures.
```{r, echo = TRUE}
c(1,2,3) # again using 'c()' to concatenate
c("one","two","three")
```
## Subsetting
All vectors can be accessed used subsetting operators `[` and `[[`
- `typeof()`
- to find out what basic type is in the vector
- `length()`
- how many elements are in it
- `attributes()`
- query the vector's meta data
## Element of vectors can be accessed individually.
```{r, echo = TRUE}
x <- c(1L, 2L, 3L) # The `L` lets R know you want an interger
x[1] <- 45
x[1]
```
## A two dimensional atomic vector is called a *matrix* and higher-dimensional vector is called an *array*.
```{r, echo = TRUE}
a <- matrix(1:6, ncol = 3, nrow = 2)
b <- array(1:12, c(2,3,4))
```
## a
a <- matrix(1:6, ncol = 3, nrow = 2)
```{r, echo = TRUE}
a
```
## b
b <- array(1:12, c(2,3,4))
```{r, echo = TRUE}
b
```
## Dynamic typing and coercions
Matrices and array, like vectors, are homogeneous, but it is possible to assign values of different basic types into any one of them. This causes a *coercion* of the entire data structure.
```{r, echo = TRUE}
x
x[[1]] <-1.1
x
```
##
```{r, echo=T}
a
a[[2,1]] <- "one"
a
```
## R has a powerful set of subsetting operations that apply to all vectors uniformly. R allows, to subset ranges, rows, columns of vectors.
```{r, echo = TRUE}
a[1,] # subset the first row
a[1:2,] # subset a range consisting of the first and second row
a[1:2,2:3] # subset a range consisting of the first and second row and the second and third column
```
## Coercions
```{r, echo = TRUE}
aa <- as.integer(a) # as.___ functions coerce one type into another
a
aa
```
##
```{r, echo = TRUE}
aa
dim(aa) <- c(2,3) # dim() refers to the dimension of the data structure
aa
a[1, a[1,] > 1] # here we are subsetting the first row of a and returning which positions are > 1
aa[1, aa[1,] > 1]
```
## When using a predicate such as `a[1,] > 1` for subsetting, behind the scenes R generates a vector of logical values. True indicates a position that should be extracted. So this is:
```{r, echo = TRUE}
aa[1,]>1
```
## We can use logical arrays to subset as follows:
```{r, echo = TRUE}
aa[1, c(FALSE, TRUE, TRUE)]
```
## Subsetting can be used in assignment operations as well.
```{r, echo = TRUE}
a
a[1:3] <- a[4:6]
a
```
## Lists
Lists are heterogeneous vectors. The elements of list can be of any kind, including lists and vectors.
```{r, echo = TRUE}
list(1, "hi", c(1,2))
```
The `typeof()` a list is `list`. You can test for a list with `is.list()` and coerce to a list with `as.list()`. You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` will coerce.
## Referential transparency
To facilitate equational reasoning, R attempts to provide *referential transparency* for function calls. Referential transparency means that arguments are not changed by the function being called. So the following function `f(x)` does not modify the vector passed in.
```{r, echo = TRUE}
x
f <- function(x) { x[1] <- 1 }
f(x)
x
```
## Questions on the basics of R?
## Packages
- In R, collections of data, functions, and compiled code are stored as *packages*. *Packages* can be thought of as the computational benchtop equivalent of the reagents and instruments required to perfrom a specific computational assay.
- There are two primary repositories from which to get *packages*, [CRAN](https://cran.r-project.org/) and [Bioconductor](http://bioconductor.org/)
- Bioconductor has made available a number of previous [courses](http://bioconductor.org/help/course-materials/) as well
## Installing packages from CRAN
```{r, echo=TRUE, eval=T}
install.packages("tidyverse", verbose = T, repos='http://cran.us.r-project.org')
```
## Installing packages from Bioconductor
```{r, echo =T, eval=T}
source("https://bioconductor.org/biocLite.R")
biocLite("flowCore", verbose = T, suppressUpdates = T)
```
## Loading libraries
```{r, echo = T}
library(flowCore)
library(tidyverse)
```
## Read Files
```{r, echo = T}
file.name <- system.file("extdata","0877408774.B08", package="flowCore")
x <- read.FCS(file.name, transformation=FALSE)
summary(x)
```
## Search Keywords
```{r, echo = T}
keyword(x,c("$P1E", "$P2E", "$P3E", "$P4E"))
```
## Print Summary
```{r, echo = T}
summary(read.FCS(file.name)) # if you weren't sure if this is the right file
```
## One FCS file
```{r, echo = T}
x # Operator did not annotate the channels!
```
## Biaxial Plot
```{r, echo = T}
library(ggcyto)
autoplot(x, "FL1-H", "FL2-H")
```
## Histogram Plot
```{r, echo = T}
autoplot(x, "FL1-H")
```
## Handling multiple FCS files
```{r, echo = T}
frames <- lapply( # member of apply family
dir(system.file("extdata", # object to which function will be "applied"
"compdata", # system.file() is pointing to where the data is
"data",
package="flowCore"), # which package is the data stored in
full.names=TRUE), # returns the full file names
read.FCS) # function being applied
as(frames, "flowSet") # casting all the frames as a flowSet
```
## Storing a Flow Set
```{r, echo = T}
names(frames) <- sapply(frames, keyword, "SAMPLE ID") # naming the frames in the flowSet
fs <- as(frames, "flowSet") # storing the flowSet as "fs"
fs # TIMTOWTDI
```
## Annotating a flowSet
```{r, echo = T}
phenoData(fs)$Filename <- fsApply(fs, # another apply function
keyword, # applies keyword() on each frame in fs
"$FIL") # returns the $FIL (file name)
pData(phenoData(fs)) # fs is now annotated with the file names
```
## Much more efficient method
```{r, echo = T}
fs <- read.flowSet(path=system.file("extdata",
"compdata",
"data",
package="flowCore"), # pointing to the data
name.keyword="SAMPLE ID",
phenoData=list(name="SAMPLE ID", Filename="$FIL")) # annotating the data
fs
pData(phenoData(fs))
```
## Vignettes as Protocols
[ggcyto](http://bioconductor.org/packages/release/bioc/html/ggcyto.html)
[cytofkit](http://bioconductor.org/packages/release/bioc/html/cytofkit.html)
## Tour of RStudio
[RStudio](https://www.rstudio.com/)
## Resources
- Lab's GitHub Repositories
-[Nolan Lab](https://github.com/nolanlab)
-[Gottardo Lab](https://github.com/rglab)
- Other labs to follow
-[Irish Lab](https://my.vanderbilt.edu/irishlab/)
-[Brinkman Lab](http://www.terryfoxlab.ca/people-detail/ryan-brinkman/)
- [Bioconductor Support](https://support.bioconductor.org/)
- [CyTOF Forum](http://cytoforum.stanford.edu/viewforum.php?f=1)
- Institutional Resources
## Next steps while here at CYTO 2017
- There are lots of posters about computational cytometry here
- Workshop 11: Automating High Volume Cytometry Data Analysis for Central Facilities
Location: Room 302
Date: Monday, Jun 12 15:45
- Keep posting questions on the [GitHub Issues Page](https://github.com/bc2zb/IntroductionToR/issues)