27 Sep 2023:
- fix bugs in vcf filtering.
- implmenting sub-pipline cscQTL_bed.nf: this pipline is to allow cscQTL take inputs from other circRNA calling tools and different quantification approach.
cscQTL is an integrative framework for circular RNA quantitative trait locus (circQTL) discovery. The motivation for this framework is current circQTL studies relies hevily on a single circRNA calling method to annotate and quantify circRNAs. Regardless the efforts of state-of-the-art circRNA calling methods, circRNA detection is still suffer from a certain amount of false positive and circRNA detection exhibits little agreement between calling tools that cause divergence results in circQTL downstream analyses. cscQTL resolves the problem by combining inputs of serveal circRNA calling algorithm and re-quantify circRNA expression with Quasi-mapping by construction the pseudo circRNA references. cscQTL is implemented as an ready to use pipeline based on Nextflow. It also intergrate an automatic procedure to perform collocation with COLOC. For comparison purpose, the tool also allow users perform single method circQTL with Circall, CIRI2, CIRCExplorer2.
cscQTL pipeline
The pipeline requires at least a linux system with 32 CPUs and 64GB of RAM to run.
Since cscQTL is implemted with Nextflow (DSL2), you would need Nextflow to run it. Further information to install Nextflow can be found on its home page:https://www.nextflow.io/
For tool dependency, I have placed all dependencies in their corresponding container, so you would need to run with Docker or Singularity.
The default desgins of cscQTL is running with either Docker using a single machine (-profile standard) or with sigularity using slurm HPC(-profile cluster), you may need to change the nextflow.config to adapt to the your system.
Further customization, I recommend to consult Nextflow homepage: https://www.nextflow.io/docs/latest/config.html
git clone https://github.com/datngu/cscQTL.git
Parameters | Description | Default setting |
---|---|---|
genome | genome in fasta format | $baseDir/data/ref/genome.fa |
cdna | transcripts (cDNA) in fasta format | $baseDir/data/ref/cdna.fa |
bsj | back-splice junction database of Circall in fastq format | $baseDir/data/ref/bsj.fa |
annotation | ensembl annotation in gtf format | $baseDir/data/ref/annotation.gtf |
reads | ribo minus RNA seq reads in fastq.gz format | $baseDir/data/reads/*_{1,2}.fastq.gz |
genotype | genotype in vcf.gz format | $baseDir/data/genotype.vcf.gz |
meta | meta data in csv format - using for matching covariates and rna file names | $baseDir/data/meta.csv |
sumstat_files | summary statistics for COLOC analyses | $baseDir/data/sumstat/*.gz |
meta_sumstat | summary statistics information - for matching file names and detail sample sizes | $baseDir/data/meta_sumstat.csv |
trace_dir | directory for tracing output - all intermediate files in the analyses will be soft-linked here - used for debugging | $baseDir/trace_dir |
outdir | directory of output | $baseDir/results |
Running parameters | ||
chrom | chromsome range used for QTL mapping | 1..22 |
peer | PEER range used for optimize the number of PEER factors | 1..20 |
genotype_PCs | number of genotype principle components used for QTL mapping | 4 |
bsj_filter | BSJ read cutoff for circRNA candidate filtering | 2 |
consensus | number of circRNA calling methods requires to accept a circRNA candidate for consensus filtering | 1 |
exp_prop | population expression cutoff for filtering circRNAs | 0.3 |
maf | MAF cutoff for genotype filtering | 0.05 |
fdr | q-value cutoff | 0.05 |
fastqtl_window | FastQTL window size | 1.00E+06 |
Pipeline parameters | ||
cscqtl | running consensus-based circQTL | true |
coloc | running collocation analysis - summary statistics required | false |
circall | running circQTL with Circall | false |
ciri2 | running circQTL with CIRI2 | false |
circexplorer2 | running circQTL with CIRCExplorer2 | false |
Parameters | Description | Default setting |
---|---|---|
genome | genome in fasta format | $baseDir/data/ref/genome.fa |
bed_files | bed files input | $baseDir/data//data/bed/*.bed |
genotype | genotype in vcf.gz format | $baseDir/data/genotype.vcf.gz |
meta | meta data in csv format - using for matching covariates and rna file names | $baseDir/data/meta.csv |
sumstat_files | summary statistics for COLOC analyses | $baseDir/data/sumstat/*.gz |
meta_sumstat | summary statistics information - for matching file names and detail sample sizes | $baseDir/data/meta_sumstat.csv |
trace_dir | directory for tracing output - all intermediate files in the analyses will be soft-linked here - used for debugging | $baseDir/trace_dir |
outdir | directory of output | $baseDir/results |
Running parameters | ||
chrom | chromsome range used for QTL mapping | 1..22 |
peer | PEER range used for optimize the number of PEER factors | 1..20 |
genotype_PCs | number of genotype principle components used for QTL mapping | 4 |
bsj_filter | BSJ read cutoff for circRNA candidate filtering | 2 |
consensus | number of circRNA calling methods requires to accept a circRNA candidate for consensus filtering | 1 |
exp_prop | population expression cutoff for filtering circRNAs | 0.3 |
maf | MAF cutoff for genotype filtering | 0.05 |
fdr | q-value cutoff | 0.05 |
fastqtl_window | FastQTL window size | 1.00E+06 |
Pipeline parameters | ||
coloc | running collocation analysis - summary statistics required | false |
We prepared a script to download the needed annotation of human genome hg38, annotation v106. Note: the script will automaticly places the files in data/ref
bash download_hg38_annotation.sh
You can also download the example sumary statistics for collocation analyses. Note: the script will automaticly places the files in data/sumstat
bash dowload_sumstats.sh
For other input files, we prepare example input data for your convenient to figure how the tool works:
It is noted that these data are not the real data - they are for explaination purpose only
- example of reads data : data/reads
The input argument to the nextflow pipeline should be: data/reads/*_{1,2}.fastq.gz for this directory. It is noted that, all RNA seq sample must be named in the same format, the "*" is the wildcard of all sample ID.
- example of genotype data : data/genotype.vcf.gz
A standard vcf.gz file is accepted. rsID annotation is required - for collocation testing).
- example of meta.csv: data/meta.csv
The meta.csv file is very important and need to pay careful attention. Its need to includes at least 2 colums: rna_id, genotype_id. The column names must be exacted, and separator must be ",". The purpose of this file is to correctly mapping genotype ID and RNA ID. You can also add orther columns, they will be included as numeric covariates. Categorical variable such as Gender must be one-hot encoded in this file.
- example of meta_sumstat.csv: data/meta_sumstat.csv
This file is to provide the tools file names of summary statistics and its case-control ratio. The column names must be exact: file_name, case_prop.
Bellow is the real scripts I used to apply the tool for the 40 t-cell dataset.
### select the working directory - you may need to custome yourself
cd /sigma4/projects/
# clonning the pipeline
git clone https://github.com/datngu/cscQTL.git
# change to the pipeline directory
cd cscQTL
# download the annotation - the files will be placed to matched the default setting
bash download_hg38_annotation.sh
# download the summary statistics - the files will be placed to matched the default setting
bash dowload_sumstats.sh
# the real RNA seqs of 40 T-cell ribo minus: can be download from https://www.synapse.org/#!Synapse:syn22250947
# you would need to rename the RNA seq file names to be the similar names at data/reads
reads="/sigma4/data/40T-cell-blueprint/RNA_seq_data/total_RNA/*_{1,2}.fastq.gz"
# you would need to change "/sigma4/data/40T-cell-blueprint/RNA_seq_data/total_RNA" to your own path
# the genotype data, I provide the hg38 genome (converted to hg38 with crossmap)
wget https://github.com/datngu/data/releases/download/v.0.0.2/tcel_hg38.vcf.gz -O tcel_hg38.vcf.gz
genotype="$PWD/tcel_hg38.vcf.gz"
If you run in a local computer with Docker:
nextflow run main.nf -resume --reads $reads --outdir "TEST_WITH_LOCAL" --genotype $genotype --consensus 3 --coloc true --circall true --ciri2 true --circexplorer2 true -with-report -profile standard
If you run in a HCP server with Singularity and Slurm:
nextflow run main.nf -resume --reads $reads --outdir "TEST_WITH_HPC" --genotype $genotype --consensus 3 --coloc true --circall true --ciri2 true --circexplorer2 true -with-report -profile cluster
Bellow is scripts I used to apply the tool for the 40 t-cell dataset. Sample bed file data are included in the "data/bed" directory. The required bed file is 1-based coordinate system. Column 4 is the geneID, column 5 is the normalized quantification (used directly for QTL mapping), column 6 is the number of BSJ reads (used for filtering).
### select the working directory - you may need to custome yourself
# clonning the pipeline
git clone https://github.com/datngu/cscQTL.git
# change to the pipeline directory
cd cscQTL
# download the annotation - the files will be placed to matched the default setting
bash download_hg38_annotation.sh
# download the summary statistics - the files will be placed to matched the default setting
bash dowload_sumstats.sh
# the genotype data, I provide the hg38 genome (converted to hg38 with crossmap)
wget https://github.com/datngu/data/releases/download/v.0.0.2/tcel_hg38.vcf.gz -O tcel_hg38.vcf.gz
genotype="$PWD/tcel_hg38.vcf.gz"
If you run in a local computer with Docker:
nextflow run cscQTL_bed.nf -resume --bed_files "$PWD/data/bed/*.bed" --outdir "TEST_WITH_LOCAL" --genotype $genotype --coloc true -with-report -profile standard
If you run in a HCP server with Singularity and Slurm:
nextflow run cscQTL_bed.nf -resume --bed_files "$PWD/data/bed/*.bed" --outdir "TEST_WITH_HPC" --genotype $genotype --coloc true -with-report -profile cluster
cscQTL uses GNU General Public License GPL-3.
Dat Thanh Nguyen. 2023. "An integrative framework for circular RNA quantitative trait locus discovery with application in human T cells." bioRxiv 2023.03.22.533756; doi: https://doi.org/10.1101/2023.03.22.533756