added 10xRNA tutorial
vivekbhr committed May 21, 2024
1 parent 3e819e9 commit 0e4ec8f
Showing 7 changed files with 302 additions and 8 deletions.
Binary file modified docs/content/images/UMAP_compared_withOrig_10xATAC.png
Binary file added docs/content/images/igv_snapshot_10xRNA.png
1 change: 1 addition & 0 deletions docs/content/tutorials.rst
Expand Up @@ -10,6 +10,7 @@ Tutorials for using sincei on the command line
tutorials/sincei_tutorial_sortChIC.rst
tutorials/sincei_tutorial_10x.rst
tutorials/sincei_tutorial_10xATAC.rst
tutorials/sincei_tutorial_10xRNA.rst

Tutorials for using sincei inside python
-----------------------------------------
15 changes: 8 additions & 7 deletions docs/content/tutorials/sincei_tutorial_10x.rst
Expand Up @@ -95,16 +95,17 @@ of the barcode counts.
2. scATAC-seq analysis
----------------------

Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scATAC-seq
samples from the above data.
Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scATAC-seq samples from the above data.

3. scRNA-seq analysis
---------------------

Please follow :doc: `this tutorial <sincei_tutorial_10xATAC>` for further analysis of scRNA-seq
samples from the above data.
Please follow :doc:`this tutorial <sincei_tutorial_10xRNA>` for further analysis of scRNA-seq samples from the above data.

4. Joint analysis
-----------------
Notes
------------

Tutorial in preparation.
Currently, sincei doesn't provide a method for **doublet estimation and removal**, which is an important step in the analysis of droplet-based data.
Instead, we use simpler filters on the minimum and maximum number of detected features per cell, which mitigate this issue to some extent. However, this could
lead to some differences in results compared to the published data used here. Despite this difference, the major published cell types can be separated
with sincei for both the ATAC and RNA fractions of the data, as shown in the 2 tutorials above.
2 changes: 1 addition & 1 deletion docs/content/tutorials/sincei_tutorial_10xATAC.rst
Expand Up @@ -285,7 +285,7 @@ bigwigs with 1kb bins.
--labels rep1_atac_rep1 rep2_atac_rep2 \
-i sincei_output/atac/scClusterCells_UMAP.tsv \
-o sincei_output/atac/sincei_cluster
# creates 6 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9
# creates 9 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9
We can now inspect these bigwigs on IGV. We can clearly see some regions
with cell-type specific signal, such as the markers described in the
292 changes: 292 additions & 0 deletions docs/content/tutorials/sincei_tutorial_10xRNA.rst
@@ -0,0 +1,292 @@
Analysis of 10x genomics scRNA-seq data using sincei
====================================================

The data used here is part of our 10x genomics multiome tutorial,
`described here <sincei_tutorial_10x.rst>`__

We will use the BAM file, barcodes and peaks.bed file obtained using the
**cellranger-arc** workflow. Alternatively, the BAM file (tagged with
cell barcodes) and peaks.bed can be obtained from a custom mapping/peak
calling workflow, and the list of filtered cell barcodes can be obtained
using sincei (see the parent tutorial for explanation).

Create the output directory:

.. code:: bash

   # create dir (-p also creates sincei_output if it is missing)
   mkdir -p sincei_output/rna

2. Quality control - I (read-level)
-----------------------------------

In order to define high quality cells, we can use the read-level quality
statistics from scFilterStats. There could be several indications of low
quality cells in this data, such as:

- high PCR duplicates (marked by cellranger workflow, detected using
  ``--samFlagExclude``)
- high fraction of reads aligned to blacklisted regions (filtered using
  ``--blacklist``)
- high fraction of reads with poor mapping quality (filtered using
  ``--minMappingQuality``)
- very high/low GC content of the aligned reads, indicating reads
  mostly aligned to low-complexity regions (filtered using
  ``--GCcontentFilter``)
- high level of secondary/supplementary alignments (filtered using
  ``--samFlagExclude/Include``)
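
The read-level criteria listed above can be illustrated with a small, self-contained Python sketch. It is not sincei's implementation: the helper ``passes_filters`` is hypothetical, the reads are invented, and computing GC content on the read sequence itself is an assumption. Only the SAM flag bit values (256, 1024, 2048) come from the SAM specification.

```python
# SAM flag bits as defined by the SAM specification
FLAG_SECONDARY = 256        # not the primary alignment
FLAG_DUPLICATE = 1024       # PCR or optical duplicate
FLAG_SUPPLEMENTARY = 2048   # supplementary alignment


def gc_fraction(seq):
    """Fraction of G/C bases in a read sequence (0.0 for an empty read)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(flag, mapq, seq,
                   exclude_mask=FLAG_SECONDARY | FLAG_SUPPLEMENTARY,
                   min_mapq=10, gc_range=(0.2, 0.8)):
    """Hypothetical helper: True if a read survives all three filters."""
    if flag & exclude_mask:      # any excluded flag bit is set
        return False
    if mapq < min_mapq:          # poor mapping quality
        return False
    low, high = gc_range
    return low <= gc_fraction(seq) <= high


# Invented reads: (SAM flag, mapping quality, sequence)
reads = [
    (0, 60, "ACGTACGTAC"),      # primary, good MAPQ, GC = 0.5
    (256, 60, "ACGTACGTAC"),    # secondary alignment
    (0, 3, "ACGTACGTAC"),       # poor mapping quality
    (16, 60, "GGGGCCCCGG"),     # reverse strand, GC = 1.0
]
kept = [passes_filters(*read) for read in reads]
print(kept)  # [True, False, False, False]
```

The exclusion mask here is the bitwise OR of the two flags (256 | 2048 = 2304), so a read is dropped if either bit is set.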

.. code:: bash

   for r in 1 2
   do
     dir=cellranger_output_rep${r}/outs/
     bamfile=${dir}/gex_possorted_bam.bam
     barcodes=${dir}/filtered_feature_bc_matrix/barcodes.tsv.gz
     # from sincei or cellranger output
     zcat ${barcodes} > ${dir}/filtered_barcodes.txt
     scFilterStats -p 20 \
       --region chr1 \
       --GCcontentFilter '0.2,0.8' \
       --minMappingQuality 10 \
       --samFlagExclude 256 --samFlagExclude 2048 \
       --barcodes ${dir}/filtered_barcodes.txt \
       --cellTag CB \
       --label rna_rep${r} \
       -o sincei_output/rna/scFilterStats_output_rep${r}.txt \
       -b ${bamfile}
   done

`scFilterStats` summarizes these outputs as a table, which can then be
visualized using the `MultiQC tool <https://multiqc.info/docs/>`__, to select
an appropriate list of cells to include for counting.

.. code:: bash

   multiqc sincei_output/rna/ # results in multiqc_report.html

3. Signal aggregation (counting reads)
--------------------------------------

Below we use sincei to aggregate signal from single-cells. For the gene
expression data, it makes sense to aggregate signal from genes only. For
this we can use a GTF file, which defines gene annotations.

If needed, we can use the same parameters as in ``scFilterStats`` to
count only high quality reads from our whitelist of barcodes.
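
Conceptually, aggregating signal on genes means assigning each barcoded read to the gene it overlaps and incrementing a (gene, cell) counter. The Python sketch below illustrates that idea on invented data. The gene coordinates, barcodes and the function ``count_reads`` are hypothetical, and real tools work on BAM records and indexed annotations rather than on lists like these.

```python
from collections import Counter

# Toy gene annotation: gene -> (chrom, start, end), 0-based half-open
genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 500, 800),
}

# Toy aligned reads: (cell barcode, chrom, position)
reads = [
    ("AAAC", "chr1", 150),
    ("AAAC", "chr1", 160),
    ("AAAC", "chr1", 600),
    ("TTTG", "chr1", 550),
    ("TTTG", "chr1", 900),   # outside every gene, not counted
    ("GGGG", "chr1", 150),   # barcode not in the whitelist, not counted
]
whitelist = {"AAAC", "TTTG"}


def count_reads(reads, genes, whitelist):
    """Count reads per (gene, barcode) for whitelisted barcodes only."""
    counts = Counter()
    for barcode, chrom, pos in reads:
        if barcode not in whitelist:
            continue
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[(gene, barcode)] += 1
    return counts


counts = count_reads(reads, genes, whitelist)
for key in sorted(counts):
    print(key, counts[key])
```

The result is a sparse genes-by-cells table, which is the kind of object the later count-level QC and filtering steps operate on.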

.. code:: bash

   ## Download the GTF file
   curl -o sincei_output/hg38.gtf.gz \
     http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz && gunzip sincei_output/hg38.gtf.gz

   # count reads on the GTF file
   for r in 1 2
   do
     dir=cellranger_output_rep${r}/outs/
     bamfile=${dir}/gex_possorted_bam.bam
     barcodes=${dir}/filtered_barcodes.txt
     scCountReads features -p 20 \
       --BED sincei_output/hg38.gtf --cellTag CB --minMappingQuality 10 \
       --samFlagExclude 256 --samFlagExclude 2048 -bc ${barcodes} \
       -o sincei_output/rna/scCounts_rna_genes_rep${r} \
       --label rna_rep${r} \
       -b ${bamfile}
   done
   # Number of bins found: 74538

4. Quality control - II (count-level)
-------------------------------------

Even though we already performed read-level QC before, the counts distribution on our specified regions (bins/genes/peaks) could be different from the whole-genome stats.

**scCountQC**, with the `--outMetrics` option, outputs the count statistics at region and cell level (labelled as <prefix>.regions.tsv and <prefix>.cells.tsv).
Just like `scFilterStats`, these outputs can then be visualized using the
`MultiQC tool <https://multiqc.info/docs/>`__, to select appropriate metrics to
filter out the unwanted cells/regions.

.. code:: bash

   # list the metrics we can use to filter cells/regions
   for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom --describe; done

   # export the single-cell level metrics (both replicates run in the background)
   for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom \
     -om sincei_output/rna/countqc_rna_genes_rep${r} & done
   # wait for the background jobs to finish before running multiQC
   wait

   # visualize output using multiQC
   multiqc sincei_output/rna/ # see results in multiqc_report.html

In total >74500 genes have been detected in >13.7K cells here.

Below, we perform a basic filtering using **scCountQC**. We exclude the
cells with <500 or >15000 detected genes (``--filterCellArgs``).
Also, we exclude the genes that are detected in too few cells (<100) or
in too many cells (>6000, approximately 90% of the cells) (``--filterRegionArgs``).
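
To make the two filters concrete, here is a minimal Python sketch on a toy genes-by-cells matrix. The metric names ``n_cells_by_counts`` and ``n_genes_by_counts`` are the ones used in this tutorial. The matrix, the tiny thresholds, the inclusive bounds, and computing both metrics on the unfiltered matrix are assumptions made for illustration, not a description of how **scCountQC** works internally.

```python
# Toy count matrix: rows are genes, columns are cells
counts = [
    [5, 0, 2, 1],   # gene detected in 3 cells
    [0, 0, 0, 1],   # gene detected in 1 cell
    [3, 1, 4, 2],   # gene detected in 4 cells
    [1, 0, 1, 0],   # gene detected in 2 cells
]


def within(value, low, high):
    """Inclusive range check (inclusiveness is an assumption here)."""
    return low <= value <= high


# Per gene: number of cells with a non-zero count
n_cells_by_counts = [sum(1 for x in row if x > 0) for row in counts]
# Per cell: number of genes with a non-zero count
n_genes_by_counts = [
    sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))
]

gene_range = (2, 3)   # toy stand-in for "n_cells_by_counts: 100, 6000"
cell_range = (2, 4)   # toy stand-in for "n_genes_by_counts: 500, 15000"

keep_genes = [i for i, n in enumerate(n_cells_by_counts) if within(n, *gene_range)]
keep_cells = [j for j, n in enumerate(n_genes_by_counts) if within(n, *cell_range)]
filtered = [[counts[i][j] for j in keep_cells] for i in keep_genes]

print(n_cells_by_counts)  # [3, 1, 4, 2]
print(n_genes_by_counts)  # [3, 1, 3, 3]
print(filtered)           # [[5, 2, 1], [1, 1, 0]]
```

Genes seen in too few or too many cells are dropped by the first range, and cells with too few or too many detected genes by the second.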

.. code:: bash

   for r in 1 2
   do
     scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom \
       -o sincei_output/rna/scCounts_rna_genes_filtered_rep${r}.loom \
       --filterRegionArgs "n_cells_by_counts: 100, 6000" \
       --filterCellArgs "n_genes_by_counts: 500, 15000"
   done

   ## rep 1
   #Applying filters
   #Remaining cells: 5314
   #Remaining features: 48219

   ## rep 2
   #Applying filters
   #Remaining cells: 4894
   #Remaining features: 48660

5. Combine counts for the 2 replicates
--------------------------------------

While it's useful to perform count QC separately for each replicate, the counts can now be combined for downstream analysis.

We provide a tool `scCombineCounts`, which can concatenate counts for cells based on common features. Concatenating the filtered cells for the 2 replicates results in a total of >10K cells.
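
The idea of concatenating cells on common features can be sketched in plain Python as follows. This is only an illustration on invented counts: the function ``combine`` and the nested-dictionary layout are hypothetical. Prefixing cell IDs with the sample label mirrors the ``rep1_``/``rep2_`` prefixes that appear in the cell IDs later in this tutorial.

```python
# Toy per-replicate counts: feature -> {cell -> count}
rep1 = {
    "gene1": {"cellA": 1, "cellB": 2},
    "gene2": {"cellA": 0, "cellB": 1},
    "gene3": {"cellA": 3, "cellB": 0},
}
rep2 = {
    "gene2": {"cellA": 4, "cellC": 0},
    "gene3": {"cellA": 1, "cellC": 2},
}


def combine(samples):
    """Keep features shared by all samples and concatenate their cells.

    samples: label -> {feature -> {cell -> count}}
    Cell IDs are prefixed with the sample label to keep them unique.
    """
    common = set.intersection(*(set(matrix) for matrix in samples.values()))
    merged = {}
    for feature in sorted(common):
        row = {}
        for label, matrix in samples.items():
            for cell, value in matrix[feature].items():
                row[f"{label}_{cell}"] = value
        merged[feature] = row
    return merged


merged = combine({"rep1": rep1, "rep2": rep2})
for feature, row in merged.items():
    print(feature, row)
```

Here ``gene1`` is dropped because it only survived filtering in one replicate, which is why the combined feature count can be smaller than either per-replicate count.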

.. code:: bash

   scCombineCounts \
     -i sincei_output/rna/scCounts_rna_genes_filtered_rep1.loom \
     sincei_output/rna/scCounts_rna_genes_filtered_rep2.loom \
     -o sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \
     --method multi-sample --labels rep1 rep2
   #Combined cells: 10208
   #Combined features: 48059

5. Dimensionality reduction and clustering
------------------------------------------

Finally, we will apply glmPCA to this data, assuming the data follows a
Poisson distribution (which is a reasonable approximation for count-based data
such as scRNA-seq). We will reduce the dimensionality of the data to 20
principal components (the default), followed by a graph-based (Louvain)
clustering of the output.
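
To give an intuition for the Poisson assumption, the sketch below computes Pearson residuals under a simple Poisson null model, where the expected count for a cell/gene pair is proportional to the cell's total counts and the gene's overall abundance. This is **not** glmPCA and not what ``scClusterCells`` runs; it is a much simpler, commonly used approximation, shown on an invented cells-by-genes matrix. The function name is hypothetical.

```python
import math

# Toy count matrix: rows are cells, columns are genes
counts = [
    [4, 0, 2],
    [1, 3, 0],
    [5, 1, 4],
]


def poisson_pearson_residuals(counts):
    """Pearson residuals (x - mu) / sqrt(mu) under a Poisson null model.

    mu[i][j] = (total counts of cell i) * (total counts of gene j) / grand total.
    Assumes every cell and every gene has at least one count, so mu > 0.
    """
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    residuals = []
    for row, n_i in zip(counts, cell_totals):
        res_row = []
        for x, g_j in zip(row, gene_totals):
            mu = n_i * g_j / grand_total
            res_row.append((x - mu) / math.sqrt(mu))
        residuals.append(res_row)
    return residuals


residuals = poisson_pearson_residuals(counts)
for row in residuals:
    print([round(r, 3) for r in row])
```

Large positive residuals mark genes that are more abundant in a cell than sequencing depth alone would predict. A model-based method such as glmPCA fits the latent factors directly on the counts instead of on residuals like these.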

.. code:: bash

   scClusterCells -i sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \
     -m glmPCA -gf poisson --clusterResolution 1 \
     -op sincei_output/rna/scClusterCells_UMAP.png \
     -o sincei_output/rna/scCounts_rna_genes_clustered.loom
   # Coherence Score: -0.9893882070519782
   # also produces the tsv file "sincei_output/rna/scClusterCells_UMAP.tsv"

(optional) Confirmation of clustering using metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Below, we will load this data in R and compare it to the cell metadata
provided with our files to see if our clustering separates celltypes in
a biologically meaningful way.

We can color our UMAP output from ``scClusterCells`` with the cell-type
information based on the metadata provided by the original authors.

.. code:: r

   library(dplyr)
   library(magrittr)
   library(ggplot2)
   library(patchwork)

   umap <- read.delim("sincei_output/rna/scClusterCells_UMAP.tsv")
   meta <- read.csv("metadata_cd34_rna.csv", row.names = 1)
   umap$celltype <- meta[gsub("rep1_|rep2_", "", umap$Cell_ID), "celltype"]

   # keep only cells with published labels
   umap %<>% filter(!is.na(celltype))
   # remove clusters with low number of cells
   cl <- umap %>% group_by(cluster) %>%
     summarise(Cell_ID = dplyr::n()) %>%
     filter(Cell_ID < 50) %>% .$cluster
   umap %<>% filter(!(cluster %in% cl))

   # make plots
   df_center <- group_by(umap, cluster) %>%
     summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
   df_center2 <- group_by(umap, celltype) %>%
     summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))

   # colors for metadata (8 celltypes)
   col_pallete <- RColorBrewer::brewer.pal(8, "Paired")
   names(col_pallete) <- unique(umap$celltype) # grey is for NA

   # colors for sincei UMAP (10 clusters)
   colors_cluster <- RColorBrewer::brewer.pal(10, "Paired")
   names(colors_cluster) <- sort(unique(umap$cluster))

   p1 <- umap %>%
     ggplot(., aes(UMAP1, UMAP2, color=factor(cluster), label=cluster)) +
     geom_point() +
     geom_label(data = df_center, aes(UMAP1, UMAP2)) +
     scale_color_manual(values = colors_cluster) +
     theme_void(base_size = 12) + theme(legend.position = "none")

   p2 <- umap %>% filter(!is.na(celltype)) %>%
     ggplot(., aes(UMAP1, UMAP2, color=factor(celltype), label=celltype)) +
     geom_point() +
     geom_label(data = df_center2, aes(UMAP1, UMAP2)) +
     scale_color_manual(values = col_pallete) + labs(color="Cluster") +
     theme_void(base_size = 12) + theme(legend.position = "none") +
     ggtitle("Published Cell Types")

   pl <- p1 + p2
   ggsave(plot=pl, "sincei_output/rna/UMAP_compared_withOrig.png",
          dpi=300, width = 11, height = 6)

.. image:: ./../images/UMAP_compared_withOrig_10xRNA.png
   :height: 800px
   :width: 1600 px
   :scale: 50 %

The figure above shows that we can easily replicate the expected cell-type
results from the scRNA-seq data using **sincei**. However, there are some
interesting differences, especially a separation of the CLP cluster into 2 clusters,
where one of these clusters is similar to the annotated pDC.

This was done using basic pre-processing steps; better cell/region filtering and optimized analysis parameters could therefore improve these results further.
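
Besides the visual comparison, the agreement between clusters and published labels can be summarized numerically, for example by tabulating cell types per cluster. The Python sketch below shows one way to do this. The (cluster, cell type) pairs are invented for illustration and are not results from this dataset, and the per-cluster "purity" summary is an assumption about what one might want to report.

```python
from collections import Counter, defaultdict

# Invented (cluster, published cell type) pairs, one per cell
assignments = [
    ("0", "HSC"), ("0", "HSC"), ("0", "HMP"),
    ("1", "CLP"), ("1", "CLP"),
    ("2", "CLP"), ("2", "pDC"), ("2", "pDC"),
]

# Contingency table: cluster -> Counter of cell types
table = defaultdict(Counter)
for cluster, celltype in assignments:
    table[cluster][celltype] += 1

# For each cluster, report the dominant cell type and its fraction
summary = {}
for cluster, type_counts in sorted(table.items()):
    label, n = type_counts.most_common(1)[0]
    summary[cluster] = (label, n / sum(type_counts.values()))

for cluster, (label, purity) in summary.items():
    print(cluster, label, round(purity, 2))
```

A cluster whose dominant label accounts for only part of its cells, like the toy cluster ``2`` here, is the kind of case worth inspecting on the UMAP.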

6. Creating bigwigs and visualizing signal on IGV
-------------------------------------------------

For further exploration of data, it's very useful to create in-silico bulk
coverage files (bigwigs) that aggregate the signal across cells in our clusters.
The tool **scBulkCoverage** takes the `.tsv` file with sincei cluster assignments, along with the
corresponding BAM files, and aggregates the signal to create these bigwigs.

The parameters here are the same as for other sincei tools that work on BAM files,
except that we can ask for a normalized bulk signal (specified using the
`--normalizeUsing` option). Below, we produce CPM-normalized bigwigs with 100 bp bins.
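
As a reminder of what CPM (counts per million) normalization does, the sketch below scales per-bin read counts so that they sum to one million. The bin counts are invented, the function ``cpm`` is hypothetical, and exactly which reads enter the total in **scBulkCoverage** is not covered here.

```python
def cpm(bin_counts):
    """Scale per-bin counts to counts per million of the total."""
    total = sum(bin_counts)
    if total == 0:
        return [0.0 for _ in bin_counts]
    return [c * 1e6 / total for c in bin_counts]


# Two toy clusters with the same coverage profile but different depth
cluster_small = [2, 0, 6, 2]
cluster_large = [20, 0, 60, 20]

print(cpm(cluster_small))  # [200000.0, 0.0, 600000.0, 200000.0]
print(cpm(cluster_large))  # [200000.0, 0.0, 600000.0, 200000.0]
```

After normalization the two profiles are identical, which is what makes tracks from clusters with different numbers of cells comparable side by side in a genome browser.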

.. code:: bash

   scBulkCoverage -p 20 --cellTag CB --region chr1 \
     --normalizeUsing CPM --binSize 100 \
     --minMappingQuality 10 --samFlagExclude 2048 \
     -b cellranger_output_rep1/outs/gex_possorted_bam.bam \
     cellranger_output_rep2/outs/gex_possorted_bam.bam \
     --labels rep1_rna_rep1 rep2_rna_rep2 \
     -i sincei_output/rna/scClusterCells_UMAP.tsv \
     -o sincei_output/rna/sincei_cluster
   # creates one bigwig per cluster, named "sincei_cluster_<X>.bw", where X is the cluster ID (0, 1... 9)

We can now inspect these bigwigs on IGV. Looking at the region around one of the markers described in the
original manuscript, **TAL1**, we can see that the CLP (lymphoid) cells and pDCs lack its expression, while the
myeloid cells and HSCs have the signal. The neighboring gene STIL, which is involved in cell-cycle regulation,
is not expressed in the highly proliferative HSC/HMP/MEP cell clusters. Overall, this confirms that the signal
coverage extracted from our clusters broadly reflects the biology of the underlying cell types.

.. image:: ./../images/igv_snapshot_10xRNA.png
   :height: 500px
   :width: 6000 px
   :scale: 50 %
