- First version
scCirRL-seq is a computational pipeline designed for single-cell RNA-seq long-read data.
It mainly consists of three parts:
- cell barcode/UMI calling: a standalone module for barcode/UMI calling and gene/transcript quantification
- alternative splicing analysis: modules for identification of cell type-specific and allele-specific splicing
- visualization of gene and transcript expression in single cells
- Haplotype-resolved full-length transcriptome analysis in single cells by scCirRL-seq
scCirRL-seq was written in python3 and tested on Linux/Unix systems, so it may not work well with python2 and/or other systems(Windows/Mac).
git clone git@github.com:Xinglab/scCirRL-seq.git
cd scCirRL-seq
conda create --prefix ./conda_env
conda activate ./conda_env
conda install -c conda-forge -c bioconda python=3.8 --file conda_requirements.txt
python setup.py install
git clone git@github.com:Xinglab/scCirRL-seq.git
cd scCirRL-seq && pip install .
For mapping, we recommend using minimap2 in RNA splice mode. Any other long-read RNA-seq alignment tools can also be used here.
-
Input
long_read.fq/fa
: 1D long reads or RCA consensus sequences in fastq/fasta formatref.fa
: reference genome- (optional)
anno.gtf
: gene annotation file in GTF format - (optional)
n_threads
: number of threads to use
-
Command
# 1. convert GTF to BED12 using paftools.js from minimap2
paftools.js gff2bed anno.gtf > anno_junc.bed12
# 2. mapping with minimap2
minimap2 ref.fa long_read.fq/fa \
--junc-bed anno_junc.bed12 \
-ax splice -ub -k14 -w4 \
--sam-hit-only --secondary=no \
-t n_threads -o long_read.sam
# 3. convert sam to sorted bam
samtools view long_read.sam -b long_read.bam
samtools sort long_read.bam -@ n_threads -o long_read.sorted.bam
For transcript identification, we recommend using ESPRESSO(≥1.3.1), other tools like Bambu or IsoQuant can also be used.
- Input
long_read.sorted.bam
: sorted long-read alignment file in BAM formatref.fa
: reference genome file in FASTA formatanno.gtf
: gene annotation file in GTF formatesp_output_dir
: output directory
- Output
esp_output_dir/esp_cmpt.tsv
: compatible isoforms for all readsesp_output_dir/for_esp_input_N2_R0_updated.gtf
: updated GTF annotation
- Command
# 1. create tab-separated input file for ESPRESSO
mkdir esp_output_dir 2> /dev/null
in_tsv=esp_output_dir/for_esp_input.tsv
abs_path=$(realpath long_read.sorted.bam)
base_name=$(basename long_read.sorted.bam)
echo -e "$abs_path\t$base_name" > $in_tsv
# 2. ESPRESSO S step
perl ESPRESSO_S.pl -L esp_output_dir/for_esp_input.tsv \
-F ref.fa -A anno.gtf \
-M notFilterOutchrM -T n_threads \
-O esp_output_dir
# 3. ESPRESSO C step
perl ESPRESSO_C.pl -I esp_output_dir -F ref.fa \
-X 0 -T n_threads
# 4. ESPRESSO Q step
perl ESPRESSO_Q.pl -L esp_output_dir/for_esp_input.updated \
-A anno.gtf -T n_threads \
-V esp_output_dir/esp_cmpt.tsv
Note that -V esp_cmpt.tsv
is optional in ESPRESSO Q
step, but it is required if you want scCirRL-seq to output gene/transcript quantification information.
scCirRL-seq identifies barcode and UMI from sorted alignment BAM file of single-cell long reads without requiring reference barcode from short-read data.
- Required:
long_read.sorted.bam
: sorted long-read BAM (recommend using minimap2)
- Optional:
read_isoform_compatible.tsv
: tabular file of compatible isoforms for all reads, generated by ESPRESSO(≥1.3.1), Bambu or IsoQuantupdated.gtf
: updated GTF annotation, generated by ESPRESSO(≥1.3.1), Bambu or IsoQuantannotation.gtf
: reference annotation GTF file, to retrieve gene names if gene names are not provided inupdated.gtf
cell_barcode.tsv
: reference cell barcode list. If provided, scCirRL-seq will directly use it to guide the barcode calling, only long reads with cell barcodes in the provided list will be kept- barcode sequence length (default: 16)
- max. allowed edit distance between barcode and reference barcode (default: 2)
- UMI sequence length (default: 12)
- max. allowed edit distance between PCR-duplicated UMIs (default: 1)
scCirRL long_read.sorted.bam \
output_dir \
-t updated.gtf \
-m read_isoform_compatible.tsv \
-g annotation.gtf
bc_umi.bam
: BAM file with barcode/UMI information for each long read, BAM tags:CB
andUB
, for barcode and UMI. Note that only long read with barcode/UMI called are keptbc_umi.tsv
: tabular file with barcode/UMI/gene/transcript information for all barcode-called long readsexpression_matrix/
: folder containing 10X format sparse expression matrix at both gene and transcript level, can be directly parsed using standard single-cell analysis tools, like Seurat/Azimuthref_bc.tsv
: reference cell barcode list identified from long reads. Note that if cell barcode list was provided to scCirRL-seq,ref_bc.tsv
file will not be generatedhigh_qual_bc_umi.rank.tsv
: cell barcode list ranked by unique UMI count based on all high-quality long reads. Note that if cell barcode list was provided to scCirRL-seq,high_qual_bc_umi.rank.tsv
file will not be generated
Example of bc_umi.tsv
:
cell barcode | UMI | # reads | compatible transcript id | gene id | gene name | read names (separeted by `,`) |
---|---|---|---|---|---|---|
ATCACGACACTTTAGG | ATCACATCCATG | 3 | ENST00000407249, ENST00000341832 | ENSG00000248333 | CDK11B | 741aa2c2-5840-4a29-bd90-3bdcb71604ba, 05f8432a-7f7e-446d-a6d5-8ab9e4eb5102, 6bb13ee1-7397-435c-8840-aeb2a90cf4ab |
All the downstream single-cell long-read analyses rely on the cell type clustering result, which could be accomplished by running Seurat on the gene expression matrix.
Annotating the cell clusters is the process of assigning cell types to each cell cluster. This could be performed manually or based on known reference annotation like Azimuth.
For example, for human peripheral blood mononuclear cells (PBMC), the clustering and annotation result can be obtained by mapping to the Azimuth human PBMC reference dataset.
We provide an example R script for human PBMC data. For data from other species/tissues, this needs to be done manually by users.
Note that not having cell clusters annotated with cell types does not affect the downstream splicing analyses by scRMATS-long. You can simply provide scRMATS-long with the cluster IDs for cell barcodes.
Here is an example of the output file for this step, bc_to_cell_type.tsv
:
barcode | cell type |
---|---|
TACGCCGAGCAGGCCA | CD4 T |
ACGATCGAGGCATCAG | Mono |
... | ... |
Or, if no cell type annotation was available:
barcode | cluster ID |
---|---|
TACGCCGAGCAGGCCA | 1 |
ACGATCGAGGCATCAG | 2 |
... | ... |
scRMATS-long performs pairwise comparison to identify differentially spliced genes/transcripts between each two cell types/clusters.
- Required
- Optional
- min. number of reads to keep a transcript for each cell type (default: 10)
- min. number of transcripts to keep a gene for each cell type (default: 2)
- list of genes of interest. If provided, all genes will still be processed, but only differentially spliced genes that are in this list will be output
scCirRL_cell_type_specific_splicing expression_matrix/transcript \
bc_to_cell_type.tsv \
out_prefix
*_cell_specific_spliced_genes.tsv
: differentially spliced genes between 2 cell types. Gene FDR ≤0.05, max. delta ratio ≥0.05*_cell_specific_spliced_transcripts.tsv
: differentially spliced transcripts between 2 cell types. Gene FDR ≤0.05, transcript p value ≤0.05, delta ratio ≥0.05
Example of *_cell_specific_spliced_genes.tsv
:
cell_type1 | cell_type2 | gene_id | gene_name | gene_fdr | max_delta_ratio | transcript_ids | cell1_counts | cell2_counts | cell1_ratios | cell2_ratios |
---|---|---|---|---|---|---|---|---|---|---|
Epithelial | Mesenchymal | ENSG00000026508 | CD44 | 7.96e-65 | 0.68 | ENST00000263398, ENST00000433892, ENST00000527326, ESPRESSO:chr11:443:2, ESPRESSO:chr11:443:3 | 61.43,122.97, 20.0,23.41,13.63 | 322.74,4.24,15.0,1.33-05,1.33e-05 | 0.25,0.50,0.08,0.09,0.05 | 0.94,0.01,0.04,3.89e-08,3.89e-08 |
Example of *_cell_specific_spliced_transcripts.tsv
:
cell_type1 | cell_type2 | gene_id | gene_name | transcript_id | gene_fdr | transcript_p | delta_ratio | count1 | count2 | ratio1 | ratio2 |
---|---|---|---|---|---|---|---|---|---|---|---|
Epithelial | Mesenchymal | ENSG00000026508 | CD44 | ENST00000263398 | 7.96e-65 | 8.25e-73 | 0.68 | 61.43 | 322.74 | 0.25 | 0.94 |
With additional whole-genome sequencing data, scRMATS-long can identify allele-specific splicing within each cell type. The allele-specific splicing analysis mainly consists of three steps:
- Phase long reads with whatshap
- Identify allele-specific spliced genes/transcripts within each cell type
- Search for disease-associated GWAS SNPs from identified allele-specific spliced genes
- Input
- Command
# sort and index `bc_umi.bam`
samtools sort bc_umi.bam -o bc_umi_sorted.bam
samtools index bc_umi_sorted.bam
# identify haplotype
whatshap haplotag wgs_phased.vcf bc_umi_sorted.bam \
--reference ref.fa \
--ignore-read-groups \
--skip-missing-contigs \
-o bc_umi_hap.bam \
--output-haplotag-list hap_list.tsv
- Output
bc_umi_hap.bam
: BAM file with barcode/UMI and additional haplotype informationhap_list.tsv
: tabular file containing haplotype information for barcode-called long reads
-
Input
-
Command
scCirRL_allele_specific_splicing bc_umi.tsv \
bc_to_cell_type.tsv \
hap_list.tsv \
output_prefix
- Output
*_allele_specific_spliced_genes.tsv
: allele-specific spliced genes. Gene FDR ≤0.05, max. delta ratio ≥0.05*_allele_specific_spliced_transcripts.tsv
: allele-specific spliced transcripts. Gene FDR ≤0.05, transcript p value ≤0.05, delta ratio ≥0.05
Example of *_allele_specific_spliced_genes.tsv
:
cell_type | gene_id | gene_name | gene_fdr | transcripts | hap1_counts | hap2_counts | hap1_ratios | hap2_ratios |
---|---|---|---|---|---|---|---|---|
Mono | ENSG00000105383 | CD33 | 4.42e-07 | ENST00000262262, ENST00000421133 | 58,9 | 24,41 | 0.86,0.13 | 0.36,0.63 |
Example of *_allele_specific_spliced_transcripts.tsv
:
cell_type | transcript_id | transcript_p | gene_id | gene_name | gene_fdr | hap1_count | hap2_count | hap1_ratio | hap2_ratio |
---|---|---|---|---|---|---|---|---|---|
Mono | ENST00000262262 | 3.58e-09 | ENSG00000105383 | CD33 | 4.42e-07 | 58 | 24 | 0.86 | 0.36 |
-
Input
*_allele_specific_spliced_genes.tsv
: allele-specific spliced genes, generated by scCirRL_allele_specific_splicingwgs_phased.vcf
: WGS phased VCF file, generated from WGS short-read data using GATK and SHAPEITannotation.gtf
: annotation GTF filegwas_catalog_efo.tsv
: GWAS catalog database with EFO termld.duckdb
: linkage disequilibrium (LD) score database in duckdb format
-
Command
scCirRL_gene_with_gwas_disease -g allele_spliced_genes.tsv \
annotation.gtf \
wgs_phased.vcf \
gwas_catalog_efo.tsv \
ld.duckdb \
output_prefix
- Output
*_gwas_efo_term_stat.tsv
: GWAS disease-associated genes, with traits described in EFO term
*_gwas_snps_ld
: directory of LD scores and GWAS traits for all SNPs, each gene has one separate.ld
file
Example of *_gwas_efo_term_stat.tsv
:
gene_name | gene_id | Cancer | Immune system disorder | Neurological disorder | Cardiovascular disease | Digestive system disorder | Metabolic disorder | Response to drug | Other disease |
---|---|---|---|---|---|---|---|---|---|
CD33 | ENSG00000105383 | False | False | True | False | False | False | False | False |
Example of *_gwas_snps_ld/gene.ld
:
snp_id | rs3865444 | rs2459141 | rs12459419 | rs2455069 | rs7245846 | rs33978622 | rs34813869 | rs1354106 | haplotype | chrom | pos | type | gwas_trait | gwas_trait_efo_term | gwas_pvalue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rs3865444 | 1.0 | 0.326844 | 1.0 | 0.327703 | 0.891224 | 0.967195 | 0.887773 | 0.881092 | H2 | chr19 | 51224706 | gwas | Alzheimer disease | Neurological disorder | 2e-09 |
Given genes/transcripts of interest (cell-type-specific splicing/allele-specific spciling), scRMATS-long also offers visualization of your results in the single-cell UMAP space, based on the Seurat R package.
# gene/transcript expression folder generated by scCirRL-seq
gene_mtx <- "output_dir/expression_matrix/gene"
transcript_mtx <- "output_dir/expression_matrix/transcript"
# script for UMAP plot
source("visualization.R)
scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
# `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
feature_mtx_dir = gene_mtx,
feature_list = c("VIM", "EPCAM"))
scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
# `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
feature_mtx_dir = transcript_mtx,
feature_list = c("ENST00000433892", "ENST00000263398"))
scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
# `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
feature_mtx_dir = c(gene_mtx, transcript_mtx, transcript_mtx),
feature_list = c("CD44", "ENST00000433892", "ENST00000263398"))
Note that, for gene_mtx_dir
and feature_mtx_dir
, you can also provide with Seurat object instead of a matrix folder.
gene_obj = make_seurat_obj(gene_mtx)
transcript_obj = make_seurat_obj(transcript_mtx)
scCirRL_umap_plot(gene_mtx_dir = gene_obj,
# `feature_mtx_dir` needs to be only one folder/object or same size as `feature_list`
feature_mtx_dir = c(gene_obj, transcript_obj, transcript_obj),
feature_list = c("CD44", "ENST00000433892", "ENST00000263398"))