GitHub - bio-ruxandra-tesloianu/scATAC_prep: A not-too-orderly collection of scripts for preprocessing and dimensionality reduction of scATAC-seq data

Pre-processing of scATAC-seq data

This repo contains a collection of scripts and notebooks that I use for initial pre-processing and dimensionality reduction of 10X scATAC-seq data. These handle steps that come after peak calling with the cellatac pipeline (CellGenIT can run this for you).

Input

After running cellatac you will have a results folder. From it we need the contents of results/peak_matrix.

Setting up

Make a new folder to store all your results:

mkdir my_scATAC_dir
outdir=./my_scATAC_dir
cellatac_outs=results/peak_matrix

Install required R packages (other packages are installed directly in the notebooks):

install.packages(c("data.table","Signac", "tidyverse", "argparse"))

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
}
BiocManager::install(c("ensembldb", "EnsDb.Mmusculus.v79", "EnsDb.Hsapiens.v86","GenomicRanges"))

Install required python packages:

pip install scanpy anndata numpy pandas anndata2ri rpy2 scipy

1. Annotate peaks

This script uses functionality in GenomicRanges to compute some basic stats on the peaks identified by cellatac. We will use these for filtering and in later stages of the analysis.

In terminal:

Rscript annotate_peaks.R $cellatac_outs/peaks.txt $outdir --genome hg38
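
To give a feel for the kind of per-peak stats involved, here is a minimal Python sketch that computes peak widths. It is not what annotate_peaks.R does internally. It assumes each peak is identified by a string of the form chrom:start-end, which may not match the actual layout of peaks.txt, and the helper names parse_peak and peak_stats are made up for illustration.

```python
import re

# ASSUMPTION: peaks are named "chrom:start-end"; the real peaks.txt may differ.
PEAK_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak(peak_id):
    """Split an assumed 'chrom:start-end' peak ID into (chrom, start, end)."""
    m = PEAK_RE.match(peak_id.strip())
    if m is None:
        raise ValueError(f"Unrecognised peak ID: {peak_id!r}")
    return m["chrom"], int(m["start"]), int(m["end"])


def peak_stats(peak_ids):
    """Return one dict per peak with its coordinates and width in bp."""
    stats = []
    for peak_id in peak_ids:
        chrom, start, end = parse_peak(peak_id)
        stats.append({
            "peak_id": peak_id,
            "chrom": chrom,
            "start": start,
            "end": end,
            # Width under 1-based closed coordinates (the GenomicRanges convention).
            "width": end - start + 1,
        })
    return stats
```

Other annotations (for example overlap with genes or promoters) need a genome annotation and are left to the R script.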

2. Make anndata and filter peaks

See notebook: cellatac2anndata.ipynb

Here we make an anndata object, filter out a lot of peaks, and start saving different layers (loading the big matrix takes some time, so be patient).

The notebook saves an .h5ad object storing the raw counts and peak annotations for the dataset.
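
As a rough illustration of the peak-filtering idea, the sketch below drops peaks that are accessible in too few cells. The criterion and the min_frac threshold are assumptions chosen for the example; the notebook applies its own filters. The matrix is oriented cells x peaks, as in an AnnData object.

```python
import numpy as np
from scipy import sparse


def filter_peaks(counts, min_frac=0.02):
    """Keep peaks that are accessible in at least `min_frac` of cells.

    counts   : cells x peaks count matrix (dense or sparse).
    min_frac : hypothetical threshold, not a value taken from the notebook.
    Returns the filtered matrix and the boolean mask of kept peaks.
    """
    counts = sparse.csr_matrix(counts)
    n_cells = counts.shape[0]
    # Number of cells with at least one count in each peak.
    cells_per_peak = np.asarray((counts > 0).sum(axis=0)).ravel()
    keep = cells_per_peak >= min_frac * n_cells
    return counts[:, keep], keep
```

The returned mask can be used to subset the peak annotation table in the same way, so that counts and annotations stay aligned.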

3. Run cisTopic

See notebook: add_cistopic.ipynb

cisTopic uses Latent Dirichlet Allocation (LDA) to perform dimensionality reduction on a binary (or nearly binary) peak x cell accessibility matrix. LDA is a robust Bayesian method used in text mining to group documents that address similar topics, and to group related words into topics. cisTopic treats cells as documents and peaks as words, so that cells in which the same "regulatory topics" are active are grouped together. Read all about it here.

Here we use the LDA model implemented in cisTopic for 3 purposes:

Dimensionality reduction: the topic x cell matrix obtained with LDA can be used to construct KNN graphs, UMAPs, clusterings, etc. (basically an alternative to PCA in scRNA-seq pipelines)
Data de-noising: scATAC-seq data from 10X Genomics is very sparse, so de-noising based on the LDA model helps in exploratory data analysis and when examining accessibility patterns of single peaks
Calculating accessibility scores per gene: the de-noised peak x cell matrix can be used to aggregate signal over genes, which is useful for examining markers
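
The de-noising and gene-score steps come down to two matrix products, sketched below in plain NumPy. This is not cisTopic's API (cisTopic is an R package, and the notebook calls it directly); it only illustrates the arithmetic, assuming the fitted model gives a peak x topic matrix and a topic x cell matrix. The names peak_topic, topic_cell and peak_gene are hypothetical.

```python
import numpy as np


def denoise(peak_topic, topic_cell):
    """Model-based accessibility: P(peak | cell) = sum_t P(peak | t) * P(t | cell).

    peak_topic : peaks x topics, non-negative weights (assumed input).
    topic_cell : topics x cells, non-negative weights (assumed input).
    Returns a peaks x cells matrix whose columns each sum to 1.
    """
    peak_topic = np.asarray(peak_topic, dtype=float)
    topic_cell = np.asarray(topic_cell, dtype=float)
    # Normalise so each topic is a distribution over peaks
    # and each cell is a distribution over topics.
    peak_topic = peak_topic / peak_topic.sum(axis=0, keepdims=True)
    topic_cell = topic_cell / topic_cell.sum(axis=0, keepdims=True)
    return peak_topic @ topic_cell


def gene_scores(denoised, peak_gene):
    """Sum de-noised accessibility over the peaks assigned to each gene.

    denoised  : peaks x cells matrix from `denoise`.
    peak_gene : peaks x genes 0/1 assignment matrix (see step 4).
    Returns a genes x cells matrix.
    """
    return np.asarray(peak_gene, dtype=float).T @ np.asarray(denoised, dtype=float)
```

Because a peak can be assigned to several genes, its signal contributes to each of them; whether that is desirable depends on the downstream analysis.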

4. Match genes to proximal peaks

See (short) notebook: peak2genes.ipynb

A range of downstream analyses require associating genes with peaks in their proximity (i.e. peaks overlapping the gene body or less than n kb away). The script proximal_peak2gene.R uses functionality in the GenomicRanges R package to create a sparse peak x gene assignment matrix, where 1s indicate that a peak is in the proximity of the gene (N.B. a peak might be proximal to multiple genes). The default proximity window is 50 kb.
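
The assignment rule can be sketched in Python as follows. This is a brute-force illustration of the proximity test, not a port of proximal_peak2gene.R, which relies on GenomicRanges. The function name, the (chrom, start, end) tuple inputs and the treatment of the window boundary are assumptions made for the example.

```python
import numpy as np
from scipy import sparse


def proximal_peak2gene(peaks, genes, window=50_000):
    """Build a peaks x genes sparse 0/1 matrix of proximity assignments.

    peaks, genes : lists of (chrom, start, end) tuples (assumed input format).
    window       : proximity window in bp; 50 kb mirrors the stated default.
    A peak is assigned to a gene if it overlaps the gene body or lies
    less than `window` bp away from it on the same chromosome.
    """
    rows, cols = [], []
    # Brute-force all pairs: fine for a sketch, too slow for real peak sets.
    for i, (p_chrom, p_start, p_end) in enumerate(peaks):
        for j, (g_chrom, g_start, g_end) in enumerate(genes):
            if p_chrom != g_chrom:
                continue
            # Gap between the two intervals; 0 means they overlap.
            gap = max(0, g_start - p_end, p_start - g_end)
            if gap == 0 or gap < window:
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.int8)
    index = (np.array(rows, dtype=np.int64), np.array(cols, dtype=np.int64))
    return sparse.csr_matrix((data, index), shape=(len(peaks), len(genes)))
```

Row sums of the result give the number of genes each peak is proximal to, and column sums give the number of peaks assigned to each gene.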
