
3. Quick start

wguo-research edited this page May 11, 2020 · 7 revisions

Data preparation

scCancer is mainly designed for the 10X Genomics platform and requires a data folder containing the results generated by Cell Ranger. In general, the data folder should be organized as follows, which matches the output of Cell Ranger V3:

    /sampleFolder
    ├── filtered_feature_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── raw_feature_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    └── web_summary.html

Compared with Cell Ranger V2 (CR2), Cell Ranger V3 (CR3) is better at identifying cells with low RNA content, so we suggest using CR3 for alignment and cell calling. Because some published data come from CR2, or come without the raw matrix, we specifically designed the pipeline to be compatible with these situations. A common CR2 folder structure is shown below.

    /sampleFolder
    ├── filtered_gene_bc_matrices
    │   └── hg19
    │       ├── barcodes.tsv
    │       ├── genes.tsv
    │       └── matrix.mtx
    ├── raw_gene_bc_matrices
    │   └── hg19
    │       ├── barcodes.tsv
    │       ├── genes.tsv
    │       └── matrix.mtx
    └── web_summary.html

For other droplet-based platforms, the data folder should be prepared likewise.
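
Before running the pipeline, it can help to confirm that a sample folder matches one of the layouts above. The following is a minimal sketch in base R; `detectCellRangerLayout` and `missingCR3Files` are hypothetical helpers written for illustration and are not part of scCancer.

```r
# Hypothetical helper (not part of scCancer): guess which Cell Ranger layout
# a sample folder follows by looking for the filtered-matrix directory.
detectCellRangerLayout <- function(sampleFolder) {
    if (!dir.exists(sampleFolder)) {
        stop("Sample folder not found: ", sampleFolder)
    }
    if (dir.exists(file.path(sampleFolder, "filtered_feature_bc_matrix"))) {
        return("CR3")
    }
    if (dir.exists(file.path(sampleFolder, "filtered_gene_bc_matrices"))) {
        return("CR2")
    }
    "unknown"
}

# Hypothetical helper: list the CR3 matrix files that are absent.
# A missing raw matrix is only reported here, not treated as an error.
missingCR3Files <- function(sampleFolder) {
    expected <- file.path(
        rep(c("filtered_feature_bc_matrix", "raw_feature_bc_matrix"), each = 3),
        c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
    )
    expected[!file.exists(file.path(sampleFolder, expected))]
}

# Demo on a temporary folder that mimics part of the CR3 layout (empty files).
demoFolder <- file.path(tempdir(), "sampleFolder")
filteredDir <- file.path(demoFolder, "filtered_feature_bc_matrix")
dir.create(filteredDir, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(
    filteredDir, c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
)))

print(detectCellRangerLayout(demoFolder))
print(missingCR3Files(demoFolder))
```

In this demo only the filtered matrix is created, so the second call lists the raw matrix files as missing.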

Quick start

Here, we provide example kidney cancer data from 10X Genomics. Users can download it and run the following scripts to understand the scCancer workflow. The generated HTML reports are as follows:

For multiple samples, the following is a generated HTML report for an integration analysis of three kidney cancer samples:

scStatistics

scStatistics mainly performs quality control on the expression matrix and returns suggested thresholds for filtering cells and genes. In addition, to better evaluate the influence of ambient RNAs from lysed cells, this step estimates the contamination fraction using the SoupX algorithm.

The following is an example script to run the first module, scStatistics. Use help(runScStatistics) for more details about its arguments and how to customize the settings.

library(scCancer)

dataPath <- "./data/KC-example"     # The path of cell ranger processed data
savePath <- "./results/KC-example"  # A path to save the results
sampleName <- "KC-example"          # The sample name
authorName <- "G-Lab@THU"           # The author name to mark the report

# Run scStatistics
stat.results <- runScStatistics(
    dataPath = dataPath,
    savePath = savePath,
    sampleName = sampleName,
    authorName = authorName
)

Running the scStatistics script generates the following files and folders:

  1. report-scStat.html : An HTML report containing all results.
  2. report-scStat.md : A markdown report.
  3. figures/ : All figures generated during this module.
  4. report-figures/ : All figures presented in the HTML report.
  5. cellManifest-all.txt : The statistical results for all droplets.
  6. cell.QC.thres.txt : The suggested thresholds to filter poor-quality cells.
  7. geneManifest.txt : The statistical results for genes.
  8. ambientRNA-SoupX.txt : The results of estimating contamination fraction.
  9. report-cellRanger.html : The summary report generated by Cell Ranger.
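
A quick way to confirm that a run completed is to check the results folder against the list above. The sketch below uses base R only; `missingStatOutputs` is a hypothetical helper, not an scCancer function, and some outputs may legitimately be absent depending on the input data and settings.

```r
# Hypothetical helper (not part of scCancer): report which of the listed
# scStatistics outputs are absent from a results folder.
missingStatOutputs <- function(savePath) {
    expected <- c(
        "report-scStat.html", "report-scStat.md",
        "figures", "report-figures",
        "cellManifest-all.txt", "cell.QC.thres.txt", "geneManifest.txt",
        "ambientRNA-SoupX.txt", "report-cellRanger.html"
    )
    # file.exists() returns TRUE for both files and directories.
    expected[!file.exists(file.path(savePath, expected))]
}

savePath <- "./results/KC-example"  # Same path as in the script above
missingOutputs <- missingStatOutputs(savePath)
if (length(missingOutputs) > 0) {
    message("Missing outputs: ", paste(missingOutputs, collapse = ", "))
} else {
    message("All listed scStatistics outputs were found.")
}
```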

scAnnotation

Using the QC thresholds, scAnnotation first filters cells and genes, and then performs basic operations (normalization, log-transformation, identification of highly variable genes, removal of unwanted variance, scaling, centering, dimension reduction, clustering, and differential expression analysis) using the R package Seurat V3. In addition, scAnnotation performs several cancer-specific analyses:

  • Doublet score estimation : In this step, we integrate two methods from the R package scds (the binary-classification-based bcds and the co-expression-based cxds) to estimate doublet scores.

  • Cancer micro-environmental cell type classification : In this step, we develop a data-driven OCLR (one-class logistic regression) model to predict cell types, including epithelial cells, endothelial cells, fibroblasts, and immune cells (CD4+ T cells, CD8+ T cells, B cells, natural killer cells, and myeloid cells).

  • Cell malignancy estimation : In this step, we follow the algorithm of the R package infercnv to estimate initial CNV profiles. Then, we use cells’ neighbor information to smooth the CNV values and define the malignancy score as the mean of their squares.

  • Cell cycle analysis : In this step, to analyze intra-tumor heterogeneity in cell phenotype, we define the cell cycle score as the relative average expression of a list of G2/M and S phase markers, using the Seurat function “AddModuleScore”.

  • Cell stemness analysis : In this step, to analyze intra-tumor heterogeneity in cell phenotype, we define the cell stemness score as the Spearman correlation coefficient between each cell’s expression and our pre-trained stemness signature, following the algorithm of Malta et al.

  • Gene set signature analysis : In this step, to analyze intra-tumor heterogeneity at the gene set level, we provide two methods to calculate gene set signature scores: GSVA and relative average expression levels. By default, we use the 50 hallmark gene sets from MSigDB.

  • Expression programs identification : In this step, to analyze intra-tumor heterogeneity at the gene set level, we use non-negative matrix factorization (NMF) to identify potential expression program signatures in an unsupervised manner.

  • Cell-cell interaction analyses : In this step, we follow the methods of Kumar et al. to characterize ligand-receptor interactions across cell clusters.
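
Two of the scores described above reduce to short formulas: the malignancy score (mean of squared, neighbor-smoothed CNV values) and the stemness score (Spearman correlation with a signature). The following base R sketch illustrates them on small synthetic matrices. It is not scCancer's implementation: the function names, the `neighbors` list, the simple neighbor averaging, and all numeric values are assumptions made for illustration.

```r
# Malignancy score sketch: smooth each cell's CNV profile by averaging over
# its neighbors, then take the mean of the squared smoothed values.
# `neighbors` is a hypothetical list giving, for each cell, the column
# indices of the cells to average over (including the cell itself).
malignancyScore <- function(cnv, neighbors) {
    smoothed <- matrix(
        vapply(
            seq_len(ncol(cnv)),
            function(j) rowMeans(cnv[, neighbors[[j]], drop = FALSE]),
            numeric(nrow(cnv))
        ),
        nrow = nrow(cnv),
        dimnames = dimnames(cnv)
    )
    colMeans(smoothed^2)
}

# Stemness score sketch: Spearman correlation between each cell's expression
# and a signature vector, restricted to the genes they share.
stemnessScore <- function(expr, signature) {
    genes <- intersect(rownames(expr), names(signature))
    apply(expr[genes, , drop = FALSE], 2, function(x) {
        cor(x, signature[genes], method = "spearman")
    })
}

# Synthetic CNV values: rows are genomic regions, columns are cells.
cnv <- cbind(
    cell1 = c(0.0, 0.1, -0.1),
    cell2 = c(0.1, 0.0, 0.0),
    cell3 = c(1.0, 0.8, -1.0),
    cell4 = c(1.2, 1.0, -0.8)
)
rownames(cnv) <- paste0("region", 1:3)
neighbors <- list(c(1, 2), c(1, 2), c(3, 4), c(3, 4))
malignancy <- malignancyScore(cnv, neighbors)
print(malignancy)

# Synthetic expression (genes x cells) and a made-up signature.
signature <- c(geneA = 0.9, geneB = 0.5, geneC = -0.2, geneD = -0.7)
expr <- cbind(cellX = c(5, 3, 1, 0), cellY = c(0, 1, 3, 5))
rownames(expr) <- names(signature)
stemness <- stemnessScore(expr, signature)
print(stemness)
```

In the toy data, cells with near-zero CNV values receive a low malignancy score and cells with large gains or losses receive a high one; a cell whose expression ranks genes in the same order as the signature receives the highest stemness score.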

The following is an example script to run the second module, scAnnotation. Use help(runScAnnotation) for more details about its arguments and how to customize the settings.

library(scCancer)

dataPath <- "./data/KC-example"      # The path of cell ranger processed data
statPath <- "./results/KC-example"   # The path of the scStatistics results
savePath <- "./results/KC-example"   # A path to save the results
sampleName <- "KC-example"           # The sample name
authorName <- "G-Lab@THU"            # The author name to mark the report

# Run scAnnotation
anno.results <- runScAnnotation(
    dataPath = dataPath,
    statPath = statPath,
    savePath = savePath,
    authorName = authorName,
    sampleName = sampleName,
    geneSet.method = "average"       # or "GSVA"
)

Running the scAnnotation script generates the following files and folders:

  1. report-scAnno.html : An HTML report containing all results.
  2. report-scAnno.md : A markdown report.
  3. figures/ : All figures generated during this module.
  4. report-figures/ : All figures presented in the HTML report.
  5. geneManifest.txt : The gene annotation results, updated with the filtering information.
  6. expr.RDS : A Seurat object.
  7. diff.expr.genes/ : Differentially expressed genes information for all clusters.
  8. cellAnnotation.txt : The annotation results for each cell.
  9. malignancy/ : All results of cell malignancy estimation.
  10. expr.programs/ : All results of expression programs identification.
  11. InteractionScore.txt : The interaction scores between cell clusters.

scCombination

scCombination mainly performs data integration across multiple samples, batch effect correction, and visualization of the analyses, based on the scAnnotation results of each single sample. Several strategies for integrating data and correcting batch effects are available: NormalMNN (default), SeuratMNN, Harmony, Raw, Regression, and LIGER.

The following is an example script to run the scCombination module. Use help(runScCombination) for more details about its arguments.

library(scCancer)

# The paths of all sample's "runScAnnotation" results
single.savePaths <- c("./results/KC1", "./results/KC2", "./results/KC3")
sampleNames <- c("KC1", "KC2", "KC3")    # The labels for all samples
savePath <- "./results/KC123-comb"       # A path to save the results
combName <- "KC123-comb"                 # A label of the combined samples
authorName <- "G-Lab@THU"                # The author name to mark the report
comb.method <- "NormalMNN"               # Integration methods ("NormalMNN", "SeuratMNN", "Harmony", "Raw", "Regression", "LIGER")

# Run scCombination
comb.results <- runScCombination(
    single.savePaths = single.savePaths,
    sampleNames = sampleNames,
    savePath = savePath,
    combName = combName,
    authorName = authorName,
    comb.method = comb.method
)

Running the scCombination script generates the following files and folders:

  1. report-scAnnoComb.html : An HTML report containing all results.
  2. report-scAnnoComb.md : A markdown report.
  3. figures/ : All figures generated during this step.
  4. report-figures/ : All figures presented in the HTML report.
  5. expr.RDS : A Seurat object.
  6. diff.expr.genes/ : Differentially expressed genes information for all clusters.
  7. cellAnnotation.txt : The annotation results for each cell.
  8. expr.programs/ : All results of expression programs identification.
  9. anchors.RDS : The anchors used for batch correction (only for "NormalMNN" or "SeuratMNN").
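
Because scCombination builds on the single-sample scAnnotation results, each sample has to pass through the first two modules before the combination step. The sketch below chains the three calls for several samples, using only the arguments shown in the scripts above. The `./data/KC1`-style input paths are an assumed layout, and the pipeline runs only if scCancer is installed and those folders exist; otherwise it prints a message and stops.

```r
sampleNames <- c("KC1", "KC2", "KC3")                    # The labels for all samples
dataPaths <- file.path("./data", sampleNames)            # Assumed input layout
single.savePaths <- file.path("./results", sampleNames)  # One results folder per sample
authorName <- "G-Lab@THU"

# Only run when the package and the input folders are actually available.
canRun <- requireNamespace("scCancer", quietly = TRUE) &&
    all(dir.exists(dataPaths))

if (canRun) {
    library(scCancer)
    for (i in seq_along(sampleNames)) {
        # Module 1: quality control and suggested thresholds
        runScStatistics(
            dataPath = dataPaths[i],
            savePath = single.savePaths[i],
            sampleName = sampleNames[i],
            authorName = authorName
        )
        # Module 2: annotation, reading the scStatistics results from statPath
        runScAnnotation(
            dataPath = dataPaths[i],
            statPath = single.savePaths[i],
            savePath = single.savePaths[i],
            authorName = authorName,
            sampleName = sampleNames[i]
        )
    }
    # Module 3: integrate the annotated samples
    comb.results <- runScCombination(
        single.savePaths = single.savePaths,
        sampleNames = sampleNames,
        savePath = "./results/KC123-comb",
        combName = "KC123-comb",
        authorName = authorName,
        comb.method = "NormalMNN"
    )
} else {
    message("Skipping: scCancer is not installed or the data folders are missing.")
}
```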