GitHub - EI-CoreBioinformatics/CoassemblyPipeline: Snakemake pipeline for genome coassembly

CoassemblyPipeline

Snakemake pipeline to generate and QC genome (co)assemblies from single-cell (e.g., G&T-Seq) or metagenome data.

THIS PIPELINE DOES NOT TRIM READS. IT EXPECTS THE INPUT READS TO HAVE ALREADY BEEN ADAPTER TRIMMED

The input to this pipeline is a comma separated file. The first line specifies the name of the coassembly, the following lines contain 4 comma separated values to specify the gDNA and cDNA libraries:

sample name
library type (gDNA or cDNA)
path to forward reads
path to reverse reads

Sample names should not contain periods

Example input file: input.csv

COASSEMBLY_NAME
Sample1,gDNA,Sample1_gDNA_R1.fq.gz,Sample1_gDNA_R2.fq.gz
Sample1,cDNA,Sample1_cDNA_R1.fg.gz,Sample1_cDNA_R2.fq.gz
Sample2,gDNA,Sample2_gDNA_R1.fq.gz,Sample2_gDNA_R2.fq.gz
Sample2,cDNA,Sample2_cDNA_R1.fg.gz,Sample2_cDNA_R2.fq.gz
Sample3,gDNA,Sample3_gDNA_R1.fq.gz,Sample3_gDNA_R2.fq.gz
Sample4,cDNA,Sample4_cDNA_R1.fq.gz,Sample4_cDNA_R2.fq.gz

Example config file: config.yaml

# File listing input reads, better to set this at command line with "--config input=XXX.csv"
# Should use absolute paths
input:

# Mode to run SPAdes in - "sc" for single-cell or "meta" for metagenome
assembly_type: [sc / meta]

# Optional tools to run
run_checkm: true
run_busco: true

# SPAdes parameters
kmers: "21,33,55,77"
min_scaffold_length: 1000 # for bbtools reformat

# BUSCO parameters
busco_version: 3
busco_database: XXXXX

# MetaBat2 parameters
metabat2_min_contig: 1500

# Tiara parameters
tiara_min_length: 1000

# EukRep parameters
eukrep_min_length: 1000

# Blobtools parameters
diamond_database: XXXXX
megablast_database: XXXXX
nodesdb: XXXXX
n_chunks: 4 # chunk the fasta file into N splits for the megablastn search; randomly to balance as input fasta will be sorted by sequence length

# CAT parameters
cat_database: XXXXX
taxonomy_dir: XXXXX
diamond_path: XXXXX

# pr2 database
pr2_database: XXXXX

# cleanup options
cleanup_spades: true
cleanup_cat: true

Example command to launch snakemake:

snakemake -p --config input=input.csv -j 20 --retries 1 --latency-wait 60 --snakefile CoassemblyPipeline.smk --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} --mem={cluster.memory} --job-name={cluster.J} --time={cluster.time} --exclude={cluster.exclude} -o slurm_logs/slurm.{cluster.J}.%j.out"

This pipeline runs:

genome assembly using SPAdes in single-cell or metagenome mode
annotation of rRNA genes using barrnap, followed by classification based on comparison with the pr2 database
taxonomic classification using:
- Blobtools
- CAT
- EukRep
- Tiara
assembly stats using QUAST
coverage stats based on mapping reads with minimap2 and QualiMap
contig binning using MetaBAT2
BUSCO
CheckM
Optionally, aligns RNA-Seq reads using hisat2 and assembles transcripts using StringTie2. RNA-Seq reads can come from any source (e.g., single-cell, metatranscriptome, or isolate RNA-Seq)

The Singularity definition file Singularity.def contains most of the required software. Additional software requirements not included in the definition file (due to compatbility issues):

Example DAG

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
scripts		scripts
.gitignore		.gitignore
CoassemblyPipeline.smk		CoassemblyPipeline.smk
LICENSE		LICENSE
README.md		README.md
Singularity.def		Singularity.def
cluster.json		cluster.json
config.yaml		config.yaml
example_DAG.png		example_DAG.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CoassemblyPipeline

About

Releases

Packages

Languages

License

EI-CoreBioinformatics/CoassemblyPipeline

Folders and files

Latest commit

History

Repository files navigation

CoassemblyPipeline

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages