This workflow is a wrapper around the method described by Luebbert et al., 20241 in which Kallisto is used to quantify reads of viral origin in RNA-seq data (bulk + single-cell).
This wrapper is currently bulk only.
- Nextflow (tested with v24.04.4.5917)
- Singularity (tested with v3.11.5-1.el7)
This workflow uses a containerised version of kallisto | bustools v0.28.22 running kallisto v0.50.13 and bustools v0.43.22.
This workflow requires several reference files providing information on both viral and 'host' species to be downloaded prior to running. Here, 'host' species refers to the species of origin of the sequenced samples.
Firstly, the reference genome and transcriptome (cDNA) of the host species is required in FASTA format, along with a GTF annotation.
An example for the GENCODE v46 release for Humans is shown below.
# Genome
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.primary_assembly.genome.fa.gz
# cDNA
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.transcripts.fa.gz
# GTF
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz
Also required are viral reference files (modified PalmDB files) which can be found here and downloaded as shown below
# fasta
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
# T2G
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
This workflow is written in Nextflow - see here for installation info.
Alternatively, Nextflow can be installed into a conda/mamba environments (my preferred method).
micromamba create -n nextflow nextflow
Nextflow configurations are important for optimal functioning - more info here.
An example config file for use with Newcastle University's Rocket HPC is provided - rocket.config
This workflow requires essential parameters described in the table below.
${projectDir}
describes the directory where kallistoViral.nf is located- FASTQ files should be concatenated across lanes if necessary
- In the example above, the FASTQ files are in the format:
- SampleA_R1.fastq.gz, SampleA_R2.fastq.gz,
- SampleB_R1.fastq.gz, SampleB_R2.fastq.gz
- In the example above, the FASTQ files are in the format:
Parameter | Description | Example |
---|---|---|
fastq | Path to FASTQ files | "${projectDir}/data/fastq/*_{R1,R2}.fastq.gz" |
outdir | Name of output folder to be created in the current directory | output |
strandedness | Whether RNA-seq library preparation retains strand information <forward / reverse / unstranded> | reverse |
readType | Whether sequencing generated single- or paired-end reads <single / paired> | paired |
viral_fasta | Path to viral FASTA file | "${projectDir}/data/ref/palmdb_rdrp_seqs.fa" |
viral_t2g | Path to viral T2G file | "${projectDir}/data/ref/palmdb_clustered_t2g.txt" |
cdna | Path to host transcriptome reference. | "${projectDir}/data/ref/gencode.v46.transcripts.fa.gz" |
genome | Path to host genomic reference | "${projectDir}/data/ref/GRCh38.primary_assembly.genome.fa.gz" |
gtf | Path to GTF annotation. | "${projectDir}/data/ref/gencode.v46.annotation.gtf.gz" |
The above parameters should be contained with a parameters file. An example is given in params.json
for the file structure shown below.
.
├── data
│ ├── fastq
│ │ ├── SampleA_R1.fastq.gz
│ │ ├── SampleA_R2.fastq.gz
│ │ ├── SampleB_R1.fastq.gz
│ │ └── SampleB_R2.fastq.gz
│ └── ref
│ ├── gencode.v46.annotation.gtf.gz
│ ├── gencode.v46.transcripts.fa.gz
│ ├── GRCh38.primary_assembly.genome.fa.gz
│ ├── palmdb_clustered_t2g.txt
│ └── palmdb_rdrp_seqs.fa
├── kallistoViral.nf
├── launcher_slurm.sh
├── params.json
├── README.md
└── rocket.config
The basic command for running the workflow is
micromamba activate nextflow
nextflow run kallistoViral.nf -config <config_file> -params-file <params_file>
However, it is advised to submit the nextflow job as a batch job to save resources on the login/head node. An example of this for use with the SLURM workload manager is given in launcher_slurm.sh
which can be run using
sbatch launcher_slurm.sh
When finished successfully, the specified output folder should contain host (transcript + gene) and viral abundances for each sample.
Information regarding the run + quantification is contained within JSON files.
output
├── host
│ ├── SampleA
│ │ ├── inspect.json
│ │ ├── kb_info.json
│ │ ├── quant_unfiltered
│ │ │ ├── abundance_1.tsv
│ │ │ └── abundance.gene_1.tsv
│ │ └── run_info.json
│ └── SampleB
│ ├── inspect.json
│ ├── kb_info.json
│ ├── quant_unfiltered
│ │ ├── abundance_1.tsv
│ │ └── abundance.gene_1.tsv
│ └── run_info.json
└── viral
├── SampleA
│ ├── inspect.json
│ ├── kb_info.json
│ ├── quant_unfiltered
│ │ ├── abundance_1.tsv
│ │ ├── abundance_2.tsv
│ │ ├── abundance.gene_1.tsv
│ │ └── abundance.gene_2.tsv
│ └── run_info.json
└── SampleB
├── inspect.json
├── kb_info.json
├── quant_unfiltered
│ ├── abundance_1.tsv
│ ├── abundance_2.tsv
│ ├── abundance.gene_1.tsv
│ └── abundance.gene_2.tsv
└── run_info.json
Paired-end sequencing produces two abundance files when quantifying viral reads as the use of amino acid sequences in pseudoalignment (kb count -aa
) is not currently supported in paired-end reads.
Mapping the viral IDs to their respective taxonomies can be performed using this file for downstream analysis.
- Parse JSON log outputs
- Automatically map viral IDs to taxonomies
- Produce formatted count matrices
- Develop single cell workflow wrapper
1 Luebbert, L., Sullivan, D.K., Carilli, M., Hjörleifsson, K.E., Winnett, A.V., Chari, T. & Pachter, L. (2024) ‘Efficient and accurate detection of viral sequences at single-cell resolution reveals putative novel viruses perturbing host gene expression’, bioRxiv: The Preprint Server for Biology, p. 2023.12.11.571168.
2 Melsted, P., Booeshaghi, A.S., Liu, L., Gao, F., Lu, L., Min, K.H. (Joseph), da Veiga Beltrame, E., Hjörleifsson, K.E., Gehring, J. & Pachter, L. (2021) ‘Modular, efficient and constant-memory single-cell RNA-seq preprocessing’, Nature Biotechnology, 39(7), pp. 813–818.
3 Bray, N.L., Pimentel, H., Melsted, P. & Pachter, L. (2016) ‘Near-optimal probabilistic RNA-seq quantification’, Nature Biotechnology, 34(5), pp. 525–527.