This repo contains a collection of scripts and notebooks that I use for initial pre-processing and dimensionality reduction of 10X scATAC-seq data. These handle steps that come after peak calling with the cellatac pipeline (CellGenIT can run this for you).
After running cellatac
you will have a results folder. From that we will need the contents of results/peak_matrix
Make a new folder to store all your results
mkdir my_scATAC_dir
outdir=./my_scATAC_dir
cellatac_outs=results/peak_matrix
Install required R packages (other packages are installed directly in the notebooks)
install.packages(c("data.table","Signac", "tidyverse", "argparse"))
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install(c("ensembldb", "EnsDb.Mmusculus.v79", "EnsDb.Hsapiens.v86","GenomicRanges"))
Install required python packages:
pip install scanpy anndata numpy pandas anndata2ri rpy2 scipy
This script uses functionality in GenomicRanges
to compute some basic stats on the peaks identified by cellatac
. We will use these for filtering and in later stages of the analysis.
In terminal:
Rscript annotate_peaks.R $cellatac_outs/peaks.txt $outdir --genome hg38
See notebook: cellatac2anndata.ipynb
Here we make an anndata object, we filter out a lot of peaks and start saving different layers (loading the big matrix takes some time, be patient).
The notebook saves an .h5ad
abject storing the raw counts and peak annotations for the dataset.
See notebook: add_cistopic.ipynb
cisTopic uses Latent Dirichlet Allocation (LDA) to perform dimensionality reduction on a binary (or sort of binary) matrix of peak x cell accessibility. LDA is a robust Bayesian method used in text mining to group documents addressing similar topics and related words into topics. cisTopic treats cells as documents and peaks as words to group cells where the same "regulatory topics" are active. Read all about it here.
Here we use the LDA model implemented in cisTopic for 3 purposes
- Dimensionality reduction: the topic x cell matrix obtained with LDA can be used to construct a KNN graphs, UMAPs, clustering etc... (basically an alternative to PCA in scRNA-seq pipelines)
- Data de-noising: scATAC-seq data from 10X Genomics is very sparse, de-noising based on the LDA model helps in exploratory data analysis and when examining accessibility patterns of single-peaks
- Calculating accessibility scores per gene: the denoised peak x cell matrix can be used to aggregate signal over genes, useful to examine markers
References for uses of cisTopic:
- Bravo Gonzales-Blas et al. (2017) cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data, Nat Methods
- Bravo Gonzales-Blas et al. (2020) Identification of genomic enhancers through spatial integration of single‐cell transcriptomics and epigenomics. Mol. Syst. Biol.
See (short) notebook: peak2genes.ipynb
A range of downstream analyses require to associate genes to peaks in their proximity (i.e. overlapping the gene body or less than n kbs away). The script proximal_peak2gene.R
uses functionality in the GenomicRanges
R package to create a sparse matrix of peak x gene assignment, where 1s indicate that a peak is in the proximity of the gene (N.B. A peak might be proximal to multiple genes). The default proximity window is 50 kbs.