-
Notifications
You must be signed in to change notification settings - Fork 4
/
PBMCs.Rmd
114 lines (95 loc) · 7.32 KB
/
PBMCs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
---
title: "PBMCs Example"
output: pdf_document
---
```{r,echo=TRUE,warning=FALSE,message=FALSE}
library(org.Hs.eg.db)
library(Seurat)
source('cell_type_identification4_clean.R')
```
We start by reading in our training data, which is downloaded from 10X Genomics: CD4, CD8, CD14, and NK cells. For the purposes of this example, we will withhold 100 cells of each to serve as test data. After loading in the data, we convert the gene symbols to Ensembl IDs (required).
```{r,echo=TRUE,warning=FALSE,message=FALSE}
set.seed(6619)
### Preparation for symbol->ENSEMBL conversion
Hs_symbol <- org.Hs.egSYMBOL
mapped_Hs_genes.symbol <- mappedkeys(Hs_symbol)
Hs_symbol.df <- as.data.frame(Hs_symbol[mapped_Hs_genes.symbol])
Hs_ensembl <- org.Hs.egENSEMBL
mapped_Hs_genes.ensembl <- mappedkeys(Hs_ensembl)
Hs_ensembl.df <- as.data.frame(Hs_ensembl[mapped_Hs_genes.ensembl])
Hs_mapping <- merge(Hs_symbol.df,Hs_ensembl.df)
## CD4: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd4_t_helper?
cd4_facs.data <- Read10X(data.dir = "../cd4_singlecell/")
cd4_facs.data <- as.matrix(cd4_facs.data)
rownames(cd4_facs.data) <- Hs_mapping$ensembl_id[match(rownames(cd4_facs.data),
Hs_mapping$symbol)]
cd4_facs.data <- cd4_facs.data[!is.na(rownames(cd4_facs.data)),]
cd4_facs.data <- na.omit(cd4_facs.data)
cd4.test <- cd4_facs.data[,1:100] # CD4 withheld cells
cd4_facs.data <- cd4_facs.data[,101:5101] # CD4 training cells
## CD8: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cytotoxic_t
cd8_facs.data <- Read10X(data.dir = '../filtered_matrices_cd8/hg19/')
cd8_facs.data <- as.matrix(cd8_facs.data)
rownames(cd8_facs.data) <- Hs_mapping$ensembl_id[match(rownames(cd8_facs.data),
Hs_mapping$symbol)]
cd8_facs.data <- cd8_facs.data[!is.na(rownames(cd8_facs.data)),]
cd8.test <- cd8_facs.data[,1:100] # CD8 withheld cells
cd8_facs.data <- cd8_facs.data[,101:5101] # CD8 training cells
## CD14: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd14_monocytes
cd14_facs.data <- Read10X(data.dir = '../filtered_matrices_cd14/hg19/')
cd14_facs.data <- as.matrix(cd14_facs.data)
rownames(cd14_facs.data) <- Hs_mapping$ensembl_id[match(rownames(cd14_facs.data),
Hs_mapping$symbol)]
cd14_facs.data <- cd14_facs.data[!is.na(rownames(cd14_facs.data)),]
cd14.test <- cd14_facs.data[,1:100] # CD14 withheld cells
cd14_facs.data <- cd14_facs.data[,101:ncol(cd14_facs.data)] # CD14 training cells
## NK: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd56_nk
nk_facs.data <- Read10X(data.dir = '../filtered_matrices_nk/hg19/')
nk_facs.data <- as.matrix(nk_facs.data)
rownames(nk_facs.data) <- Hs_mapping$ensembl_id[match(rownames(nk_facs.data),
Hs_mapping$symbol)]
nk_facs.data <- nk_facs.data[!is.na(rownames(nk_facs.data)),]
nk.test <- nk_facs.data[,1:100] # NK withheld cells
nk_facs.data <- nk_facs.data[,101:5101] # NK training cells
```
Next, we fit our model to these training data. We combine the training data into a single matrix and call the `trainAllReference` function on this matrix, alongside a vector of cell-type labels. By default, this function will only consider the subset of discriminative genes needed for classification. For now, to illustrate the uses of our method additional to classification, we will consider all genes by setting `discrim_only=FALSE`. (Note that regardless of whether this parameter is set to `TRUE`, as it is by default, or `FALSE`, the classification function will still only use the subset of discriminative genes; setting this parameter to `TRUE` just has the effect of speeding up the training process by only fitting to that set of genes.)
```{r,echo=TRUE,warning=FALSE,message=FALSE}
## Input a single matrix with a vector of labels
common_pbmcs_genes <- Reduce(intersect,list(rownames(cd4_facs.data),
rownames(cd14_facs.data),
rownames(cd8_facs.data),
rownames(nk_facs.data)))
pbmcs_reference <- as.matrix(cbind(cd4_facs.data[common_pbmcs_genes,],
cd14_facs.data[common_pbmcs_genes,],
cd8_facs.data[common_pbmcs_genes,],
nk_facs.data[common_pbmcs_genes,]))
pbmcs_reference_labels <- c(rep('CD4',dim(cd4_facs.data)[2]),rep('CD14',dim(cd14_facs.data)[2]),
rep('CD8',dim(cd8_facs.data)[2]),rep('NK',dim(nk_facs.data)[2]))
pbmcs.d <- trainAllReference(pbmcs_reference,pbmcs_reference_labels,discrim_only=FALSE)
```
The list `pbmcs.d` contains one table for each inputted cell-type, which indicates each gene's empirical rate, probability of belonging to the off-low latent state, and probability of belonging to the off-high latent state. This is used as input into the `classifyTarget` function. However, if our goal is to examine each cell-type's barcode, we could pass this list into the `getBarcode` function, which will present the probability that each gene is on in each cell-type. This could then be used beyond the cell-type annotation context to study gene expression within a single cell-type, to compare genes across cell-types, and to identify markers.
As an example, we show below how we can find the barcodes for each cell-type, then identify genes with a high probability of being on in NK cells, but a low probability of being on in the others (i.e. potential markers for this cell-type). Finally, we show how we can identify markers for NK cells compared to global patterns of gene expression, rather than the other cell-types under consideration here.
```{r,echo=TRUE,warning=FALSE,message=FALSE}
## Obtain barcodes
barcodes <- getBarcode(pbmcs.d)
## Identify genes likely to be on in NK cells, but likely to be off in the others
findMarkers('NK',barcodes,thresh_up=0.95,thresh_below=0.05)
## Identify genes likely to be on in NK cells, and not in global patterns
findMarkers('NK',barcodes,thresh_up=0.99,thresh_below=0.01,relative=F)
```
Finally, we can use the original object `pbmcs.d` to annotate our withheld test cells, which should be placed in a single matrix. If we want the possibility of detecting cells that do not belong to any of the cell-types in the reference, we can set `other=T`. Further, we can specify `return.probs=T` if we want a matrix of probabilistic assignments rather than just a vector of highest-probability cell-type labels. We show both below:
```{r,echo=TRUE,warning=FALSE,message=FALSE}
## Put withheld test cells in one matrix
pbmcs_withheld <- as.matrix(cbind(cd4.test[common_pbmcs_genes,],
cd14.test[common_pbmcs_genes,],
cd8.test[common_pbmcs_genes,],
nk.test[common_pbmcs_genes,]))
true_labels <- c(rep('CD4',100),rep('CD14',100),rep('CD8',100),rep('NK',100))
## Annotate!
annotation <- classifyTarget(pbmcs_withheld,pbmcs.d,other=T)
table(annotation,true_labels)
## Return probabilistic assignments instead
probabilistic_annotation <- classifyTarget(pbmcs_withheld,pbmcs.d,other=T,
return.probs=T)
head(round(probabilistic_annotation,2))
```