Snakemake pipeline to generate and QC genome (co)assemblies from single-cell (e.g., G&T-Seq) or metagenome data.
THIS PIPELINE DOES NOT TRIM READS. IT EXPECTS THE INPUT READS TO HAVE ALREADY BEEN ADAPTER TRIMMED
The input to this pipeline is a comma separated file. The first line specifies the name of the coassembly, the following lines contain 4 comma separated values to specify the gDNA and cDNA libraries:
- sample name
- library type (gDNA or cDNA)
- path to forward reads
- path to reverse reads
Sample names should not contain periods
Example input file: input.csv
COASSEMBLY_NAME
Sample1,gDNA,Sample1_gDNA_R1.fq.gz,Sample1_gDNA_R2.fq.gz
Sample1,cDNA,Sample1_cDNA_R1.fg.gz,Sample1_cDNA_R2.fq.gz
Sample2,gDNA,Sample2_gDNA_R1.fq.gz,Sample2_gDNA_R2.fq.gz
Sample2,cDNA,Sample2_cDNA_R1.fg.gz,Sample2_cDNA_R2.fq.gz
Sample3,gDNA,Sample3_gDNA_R1.fq.gz,Sample3_gDNA_R2.fq.gz
Sample4,cDNA,Sample4_cDNA_R1.fq.gz,Sample4_cDNA_R2.fq.gz
Example config file: config.yaml
# File listing input reads, better to set this at command line with "--config input=XXX.csv"
# Should use absolute paths
input:
# Mode to run SPAdes in - "sc" for single-cell or "meta" for metagenome
assembly_type: [sc / meta]
# Optional tools to run
run_checkm: true
run_busco: true
# SPAdes parameters
kmers: "21,33,55,77"
min_scaffold_length: 1000 # for bbtools reformat
# BUSCO parameters
busco_version: 3
busco_database: XXXXX
# MetaBat2 parameters
metabat2_min_contig: 1500
# Tiara parameters
tiara_min_length: 1000
# EukRep parameters
eukrep_min_length: 1000
# Blobtools parameters
diamond_database: XXXXX
megablast_database: XXXXX
nodesdb: XXXXX
n_chunks: 4 # chunk the fasta file into N splits for the megablastn search; randomly to balance as input fasta will be sorted by sequence length
# CAT parameters
cat_database: XXXXX
taxonomy_dir: XXXXX
diamond_path: XXXXX
# pr2 database
pr2_database: XXXXX
# cleanup options
cleanup_spades: true
cleanup_cat: true
Example command to launch snakemake:
snakemake -p --config input=input.csv -j 20 --retries 1 --latency-wait 60 --snakefile CoassemblyPipeline.smk --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} --mem={cluster.memory} --job-name={cluster.J} --time={cluster.time} --exclude={cluster.exclude} -o slurm_logs/slurm.{cluster.J}.%j.out"
This pipeline runs:
- genome assembly using SPAdes in single-cell or metagenome mode
- annotation of rRNA genes using barrnap, followed by classification based on comparison with the pr2 database
- taxonomic classification using:
- assembly stats using QUAST
- coverage stats based on mapping reads with minimap2 and QualiMap
- contig binning using MetaBAT2
- BUSCO
- CheckM
- Optionally, aligns RNA-Seq reads using hisat2 and assembles transcripts using StringTie2. RNA-Seq reads can come from any source (e.g., single-cell, metatranscriptome, or isolate RNA-Seq)
The Singularity definition file Singularity.def
contains most of the required software. Additional software requirements not included in the definition file (due to compatbility issues):
Example DAG