-
Notifications
You must be signed in to change notification settings - Fork 42
Home
The molecular heterogeneity brings great challenges for cancer diagnosis and treatment. However, recent advance in single cell RNA-sequencing (scRNA-seq) technology make it possible to study cancer heterogeneities at single cell level.
Here we develop a R package named scCancer
which focuses on processing and
analyzing droplet-based scRNA-seq data for cancer research.
Except routine data processing steps, this package takes several special
considerations for cancer-specific features.
The scCancer
pipeline consists of two parts: scStatistics
and scAnnotation
.
-
The
scStatistics
implements quality control for cells and genes. -
The
scAnnotation
implements normal analysis and annotate the cells' low-dimension coordinates, cluster, cell type, cell cycle, malignancy, gene set signatures, expression program, and so on.
After these analyses, a user-friendly graphic report will be generated.
library(devtools)
devtools::install_github("wguo-research/scCancer")
library(scCancer)
The scCancer
is mainly designed for 10X Genomics platform,
and it requires a data folder containing the results generated by the software
Cell Ranger
.
In general, the data folder needs to be organized as following which is the output of Cell Ranger V3
:
/sampleFolder
├── filtered_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── raw_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
└── web_summary.html
Comparing to Cell Ranger V2 (CR2)
, Cell Ranger V3 (CR3)
can identify cells with
low RNA content better. So we suggest to use CR3
to do alignment and cell-calling.
Considering that some open data is from CR2
or the raw matrix isn't supported, we
specially deisgn the pipeline to be compatible with these situations.
A common structure of CR2
is as below.
/sampleFolder
├── filtered_gene_bc_matrices
│ └── hg19
│ ├── barcodes.tsv
│ ├── genes.tsv
│ └── matrix.mtx
├── raw_gene_bc_matrices
│ └── hg19
│ ├── barcodes.tsv
│ ├── genes.tsv
│ └── matrix.mtx
└── web_summary.html
For other droplet-based platforms, the data folder can be prepared likewise.
The scStatistics
mainly implements quality control for the expression matrix
and return some suggested thresholds to filter cells and genes.
Following is an example script to run the first step scStatistics
.
library(scCancer)
# A path containing the cell ranger processed data
dataPath <- "./data/KC-example"
# A path used to save the results files
savePath <- "./results/KC-example"
# The sample name
sampleName <- "KC-example"
# Run scStatistics
stat.results <- runScStatistics(
dataPath = dataPath,
savePath = savePath,
sampleName = sampleName,
authorName = NULL
)
Running the scStatistics
script will generate some files as below:
- report-scStat.html : A HTML report containing all results and corresponding description.
- report-scStat.md : A markdown report.
- figures/ : All figures generated during the pipeline.
- report-figures/: All figures presented in the report.
- cellManifest-all.txt : The results calculated for all droplets.
- cell.QC.thres.txt : The suggested thresholds to filter poor-quality cells.
- geneManifest.txt : The results calculated for each genes.
-
report-cellRanger.html: The summary report generated by
Cell Ranger
.
Using the QC thresholds, the scAnnotation
filters cells and genes firstly, and then
perform normal operations (normalization, data scaling, highly variable genes identification,
PCA, clustering, differential expression analysis and cell cycle estimation) using R
package Seurat V3
[1,2]. Besides, scAnnotation
also provides some cancer-specific analysis:
-
Cancer micro-environmental cell types annotation : In this step, we propose
a one-class logistic regression (OCLR) model to predict cell types. This step can be control by the parameterbool.runCellClassify
. -
Cell malignancy estimation : In this step, we use moving average of expression values across each chromosome to estimate CNV profile (refer to
infercnv
). And then we use clustering information to estimated cells' malignancy scores and types. This step can be control by the parameterbool.runMalignancy
. -
Gene set signature scores calculation : In this step, we provide two methods to calculated gene set signature scores: GSVA and relative average expression levels. By default, we use 50 hallmark gene sets from MSigDB. This step can be control by the parameter
bool.runGeneSets
. -
Expression programs identification : In this step, we use non-negative matrix factorization (NMF) to identify expression programs. This step can be control by the parameter
bool.runExprProgram
.
Following is an example script to run the first step scAnnotation
.
library(scCancer)
# A path containing the cell ranger processed data
dataPath <- "./data/KC-example"
# A path containing the scStatistics results
statPath <- "./results/KC-example"
# A path used to save the results files
savePath <- "./results/KC-example"
# The sample name
sampleName <- "KC-example"
# Run scAnnotation
anno.results <- runScAnnotation(
dataPath = dataPath,
statPath = statPath,
savePath = savePath,
authorName = NULL,
sampleName = sampleName,
bgPercent.max = 0.001,
pc.use = 30,
bool.runDiffExpr = T,
bool.runCellClassify = T,
bool.runMalignancy = T,
bool.runGeneSets = T,
# geneSet.method = "GSVA",
bool.runExprProgram = T
)
Running the scAnnotation
script will generate some files/folders as below:
- report-scAnno.html : A HTML report containing all results and corresponding description.
- report-scAnno.md : A markdown report.
- figures/ : All figures generated during the pipeline.
- report-figures/: All figures used in the report.
- gene.manifest.filter.txt : The filtered genes during QC.
- expr.RDS : A Seurat object.
- diff.expr.genes/ : Differentially expressed genes information for all clusters.
- cellAnnotation.txt : The annotation results for each cells.
- malignancy/: All results of cell malignancy estimation.
- expr.programs/: All results of expression programs identification.
After the step scStatistics
, a HTML report (report-scStat.html
) will be generated, which persents the statistical features of the data from various perspectives (nUMI
, nGene
, mito.percent
, ribo.percent
, diss.percent
).
By identifying outliers from the distribution of these indices, scCancer
adaptively provides some suggested thresholds to filter poor-quality cells and records them in file cell.QC.thres.txt
.
Besides, users can also modify the values in file cell.QC.thres.txt
and then the step scAnnotation
will use the updated thresholds to perform cell QC and downstream analyses.
For the step gene QC, the scCancer
filters genes according to three aspects of information.
- First, it filters mitochondrial, ribosomal, and dissociation-associated genes.
- Second, it filters genes with the number of expressed cells less than argument
nCell.min
(the default is 3). - Third, it filters genes with the background percentage larger than the argument
bgPercent.max
(the default is 0.001). The distribution ofbgPercent.max
can be found in thereport-scStat.html
.
Users can also modify the value of these arguments according to their needs.
For patient-drived tumor xenograft (PDX) samples, it contains both human and mouse cells generally. In order to deal with these human-mouse-mixed data, users can explicitly set the arguments hg.mm.mix
, species
, hg.mm.thres
, and mix.anno
of relevant functions.
More details for the meaning of these arguments can be found by using command help()
.
If the sample data is generated by Cell Ranger V2 (CR2)
and contain 'raw_gene_bc_matrices' matrix, the scCancer package can re-perform cell calling using
the method EmptyDrop
(the name of R package is DropletUtils
). But users need to manually install the DropletUtils
and import it in script.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DropletUtils")
library(DropletUtils)
[1] Butler A, Hoffman P, Smibert P, et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species[J]. Nature biotechnology, 2018, 36(5): 411.
[2] Stuart T, Butler A, Hoffman P, et al. Comprehensive Integration of Single-Cell Data[J]. Cell, 2019.
[3] inferCNV of the Trinity CTAT Project. https://github.com/broadinstitute/inferCNV
Here is the output of sessionInfo()
on the system.
R version 3.6.0 (2019-04-26)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
locale:
[1] en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scCancer_1.0.0
loaded via a namespace (and not attached):
[1] Seurat_3.1.0 Rtsne_0.15 colorspace_1.4-1
[4] ggridges_0.5.1 IRdisplay_0.7.0 markdown_1.1
[7] base64enc_0.1-3 leiden_0.3.1 listenv_0.7.0
[10] npsurv_0.4-0 bit64_0.9-7 ggrepel_0.8.1
[13] AnnotationDbi_1.46.1 codetools_0.2-16 splines_3.6.0
[16] R.methodsS3_1.7.1 lsei_1.2-0 geneplotter_1.62.0
[19] knitr_1.24 shinythemes_1.1.2 zeallot_0.1.0
[22] IRkernel_1.0.1.9000 jsonlite_1.6 annotate_1.62.0
[25] ica_1.0-2 cluster_2.0.9 png_0.1-7
[28] R.oo_1.22.0 pheatmap_1.0.12 graph_1.62.0
[31] uwot_0.1.3 shiny_1.3.2 sctransform_0.2.0
[34] compiler_3.6.0 httr_1.4.1 backports_1.1.4
[37] assertthat_0.2.1 Matrix_1.2-17 lazyeval_0.2.2
[40] later_0.8.0 htmltools_0.3.6 tools_3.6.0
[43] rsvd_1.0.2 igraph_1.2.4.1 gtable_0.3.0
[46] glue_1.3.1 RANN_2.6.1 reshape2_1.4.3
[49] dplyr_0.8.3 Rcpp_1.0.2 Biobase_2.44.0
[52] vctrs_0.2.0 gdata_2.18.0 ape_5.3
[55] nlme_3.1-140 gbRd_0.4-11 lmtest_0.9-37
[58] xfun_0.9 stringr_1.4.0 globals_0.12.4
[61] miniUI_0.1.1.1 mime_0.7 lifecycle_0.1.0
[64] irlba_2.3.3 gtools_3.8.1 XML_3.98-1.20
[67] NNLM_0.4.3 future_1.14.0 MASS_7.3-51.4
[70] zoo_1.8-6 scales_1.0.0 promises_1.0.1
[73] parallel_3.6.0 RColorBrewer_1.1-2 memoise_1.1.0
[76] reticulate_1.13 pbapply_1.4-2 gridExtra_2.3
[79] ggplot2_3.2.1 ggExtra_0.9 RSQLite_2.1.2
[82] stringi_1.4.3 GSVA_1.32.0 highr_0.8
[85] S4Vectors_0.22.0 caTools_1.17.1.2 BiocGenerics_0.30.0
[88] bibtex_0.4.2 repr_1.0.1 Rdpack_0.11-0
[91] SDMTools_1.1-221.1 rlang_0.4.0 pkgconfig_2.0.2
[94] bitops_1.0-6 evaluate_0.14 lattice_0.20-38
[97] ROCR_1.0-7 purrr_0.3.2 labeling_0.3
[100] htmlwidgets_1.3 bit_1.1-14 cowplot_1.0.0
[103] tidyselect_0.2.5 GSEABase_1.46.0 RcppAnnoy_0.0.12
[106] plyr_1.8.4 magrittr_1.5 R6_2.4.0
[109] IRanges_2.18.1 gplots_3.0.1.1 DBI_1.0.0
[112] pbdZMQ_0.3-3 pillar_1.4.2 fitdistrplus_1.0-14
[115] survival_2.44-1.1 RCurl_1.95-4.12 tibble_2.1.3
[118] future.apply_1.3.0 tsne_0.1-3 crayon_1.3.4
[121] uuid_0.1-2 KernSmooth_2.23-15 plotly_4.9.0
[124] grid_3.6.0 data.table_1.12.2 blob_1.2.0
[127] diptest_0.75-7 metap_1.1 digest_0.6.20
[130] xtable_1.8-4 tidyr_1.0.0 httpuv_1.5.2
[133] R.utils_2.9.0 stats4_3.6.0 RcppParallel_4.4.3
[136] munsell_0.5.0 viridisLite_0.3.0