Ultra Turbo Pipeline

Overview

This pipeline processes Nanopore sequencing data: it basecalls and assembles the reads, assesses the quality of the assemblies, performs taxonomic classification, and screens for genes of interest, including antibiotic resistance genes via RGI. The pipeline uses Nextflow for orchestration and supports Conda and Apptainer environments for software dependencies.

How to run the pipeline

Fetch the pipeline

The easiest way to download the pipeline is to clone this GitHub repository into a directory on your local machine, cluster, or server, as in the example below.

git clone https://github.com/mnyt-aqw/Ultra-Turbo-Pipeline.git

Download containers

This script builds a custom container from the provided definition file and pulls pre-built containers for all other tools used in the pipeline. It will take roughly 20 minutes to run.

cd Containers/
bash containers.sh

Prerequisites

The only prerequisites you need to run this pipeline are Apptainer (or Docker) and Nextflow. The pipeline installs everything else by itself.
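
A quick way to check that both are available on your machine:

nextflow -version
apptainer --version   # or: docker --version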

Start pipeline

To start the pipeline, execute this line:

NXF_VER=24.04.2 nextflow run main.nf -profile {server or cluster} {other args}

This will download and use a specific version of Nextflow, and you have to specify whether you are running the pipeline on a server or a cluster. If you want to resume the pipeline, use the -resume flag; the pipeline will then skip processes it has already finished.

SLURM

When running Nextflow on a SLURM-managed cluster, Nextflow submits its own jobs with the resources specified in the configuration file. Each process in the pipeline can have specific resource requirements (e.g., CPUs, time) defined in nextflow.config. Nextflow handles the job submission, management, and resource allocation automatically.

If you need to submit the pipeline itself as a job on a cluster like Vera using an SBATCH script, you can use the script below, which is identical to submit_sbatch.sh:

#!/usr/bin/env bash
#SBATCH -A C3SE2021-2-3 -p vera     # Add YOUR Project name and Partition/queue name.
#SBATCH -J {name}                   # Job name
#SBATCH -c 1                        # Number of CPU cores. You do not need more than 1 here
#SBATCH -t 24:00:00                 # Maximum runtime
#SBATCH --error={PATH}/job.%J.err   # Path to the error file
#SBATCH --output={PATH}/job.%J.out  # Path to the output file

# Unload all modules and load Nextflow
module purge
module load Nextflow

# Start the pipeline with specific version of Nextflow
NXF_VER=24.04.2 nextflow run main.nf -profile "cluster" -resume
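
Submit the script with sbatch; the Nextflow head job then runs inside that allocation and submits the individual pipeline processes as their own SLURM jobs:

sbatch submit_sbatch.sh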

If you need to change the resources allocated to a process, open the nextflow.config file and adjust the values, for example:

process {
    withName:PRODIGAL {
        cpus = 1 // nr CPUs
        time = '1m' // max duration
        container = "${projectDir}/Containers/apptainer.sif" // container
        storeDir = "${params.storeDir}/Prodigal/" // where to store cache
        publishDir = "${params.directory_out}/Prodigal/" // where to publish results
    }
}
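
As a sketch, giving the assembly step more resources could look like this (the process name FLYE comes from the workflow below; the values here are purely illustrative):

process {
    withName:FLYE {
        cpus = 8          // number of CPUs
        memory = '32 GB'  // maximum memory
        time = '12h'      // maximum duration
    }
}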

Workflow

  1. Basecalling
    • DORADO_BASECALL: Converts raw Pod5 files to basecalled BAM files.
    • DORADO_DEMULTIPLEX: Splits the BAM files into barcoded reads and provides a summary for quality control.
    • PYCOQC: Generates quality control metrics from the basecalling summary.
    • FILTLONG: Filters and trims the reads to ensure high quality.
  2. Assembly
    • FLYE: Assembles the filtered reads into contigs.
    • MEDAKA: Polishes the assembly to correct errors.
    • CHECKM2_DATABASEDOWNLOAD: Downloads the CheckM2 database.
    • CHECKM2: Assesses the quality of the assembly.
  3. ARG_Mapping
    • RGI: Screens for antibiotic resistance genes.
  4. BLASTX
    • PRODIGAL: Predicts coding regions in the assembly.
    • DIAMOND_MAKEDB: Creates a DIAMOND database.
    • DIAMOND_BLASTP: Screens for genes using the DIAMOND tool.
  5. Taxonomy
    • GTDB_TK_MAKEDB: Downloads and prepares the GTDB-TK database.
    • GTDB_TK: Classifies the assembly based on the GTDB-TK database.
    • MMSEQS2_MAKEDB: Creates an MMseqs2 database.
    • MMSEQS2_CLASSIFY: Classifies the assembly using MMseqs2.

Pipeline Workflow

graph TD
    A[Input: Pod5/FASTQ files]

    subgraph Basecalling
        B[DORADO_BASECALL]
        B -->|BAM| C[DORADO_DEMULTIPLEX]
        C -->|Summary| D[PYCOQC]
        C -->|FASTQ| E[FILTLONG]
    end

    subgraph Assembly
        F[FLYE]
        F -->|Draft assembly| G[MEDAKA]
        G -->|Polished assembly| H[CHECKM2]
        G -->|Polished assembly| I[BUSCO]
    end

    subgraph Taxonomy
        J[GTDB_TK]
        K[MMSEQS2]
    end

    subgraph ARG_Mapping
        L[RGI]
    end

    subgraph BLASTX
        M[PRODIGAL]
        N[DIAMOND_MAKEDB]
        O[DIAMOND_BLASTP]
    end

    A -->|Pod5| B
    E -->|Filtered FASTQ| F
    G -->|Polished assembly| J
    G -->|Polished assembly| K
    G -->|Polished assembly| L
    G -->|Polished assembly| M
    M -->|Predicted genes| O
    N -->|Database| O

Parameters

The pipeline parameters are defined in the config file. You can either override each value on the command line, like this: --input_files {VAL} --databases {PATH}, or change the values in the nextflow.config file.
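
For example, a run that overrides the input and database paths on the command line could look like this (both globs are placeholders):

NXF_VER=24.04.2 nextflow run main.nf -profile server --input_files "data/*.pod5" --databases "db/*.fasta"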

Note

When you specify paths to files, you can use either full paths or relative paths. For relative paths, instead of specifying "../../db/*.gz", write "${projectDir}/../../db/*.gz". If you do that, do not use single quotes ('), use double quotes (").
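
Assuming the value is set in the params block of nextflow.config, and using the databases parameter purely as an example, the path from the note above would look like this:

params {
    // Double quotes are required so ${projectDir} is interpolated;
    // single quotes would keep it as a literal string.
    databases = "${projectDir}/../../db/*.gz"
}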

Mandatory

Input/output

  • input_files: Path to the input Pod5 files or basecalled fasta files, depending on which modules you want to run. Use a glob pattern, e.g. {PATH}/*pod5

Machine options

  • clusterOptions: Specifies additional cluster options, similar to the options used in an SBATCH script. For example, use -A {PROJECT_NAME} -p {CLUSTER_NAME} when running on a cluster. Leave this empty if not applicable. (Only if -profile "cluster")
  • DORADO_device: Specifies the device for running Dorado (e.g., “cpu”, “gpu”, “cuda:0”, “all”). Using a GPU is highly recommended as it significantly speeds up the process. For the Vera cluster, specify cuda:0 to ensure proper GPU detection. (Only if Basecalling:true)
  • gpu_allocated: Specifies the GPU allocation for running Dorado. For the Vera cluster, use --gpus-per-node=V100:1 to allocate one V100 GPU per process. Note that Dorado only supports V100 and A100 GPUs. (Only if Basecalling:true) See the sketch after this list for how these options fit together.
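
Assuming these options are set in the params scope of nextflow.config, a Vera-oriented setup could look like this (the project name is taken from the SBATCH example above; adjust it to your own allocation):

params {
    clusterOptions = '-A C3SE2021-2-3 -p vera'  // your project and partition
    DORADO_device  = 'cuda:0'                   // GPU device for Dorado
    gpu_allocated  = '--gpus-per-node=V100:1'   // one V100 GPU per basecalling job
}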

Process setting

  • DORADO_kit: Kit name for barcodes.
  • mmseq2_db: Name of MMseqs2 database. Read about the different options here. (Only if Taxonomy:true)

Which modules to run

To run a module, enter true; to skip it, enter false. You can see what each module does in the flow chart above. A sketch of the corresponding config entries follows the list.

  • Basecalling: true
  • Assembly: true
  • Taxonomy: false
  • ARG_Mapping: true
  • BLASTX: false
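
Assuming the switches live in the params scope of nextflow.config, the selection above would be written as:

params {
    Basecalling = true
    Assembly    = true
    Taxonomy    = false
    ARG_Mapping = true
    BLASTX      = false
}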

Important

Taxonomy currently does not work on Vera due to limited tmp directory space. This will be fixed after the summer. If you want to run Taxonomy, you will have to do that on the server.

Settings with predefined defaults

Input/output

  • storeDir: Directory to store intermediate files. Default: ./Store
  • directory_out: Directory for final output. Default: ./Results
  • databases: Path to protein fasta files. Can be specified either as a single file, like {PATH}/file.fasta, or as multiple files (as many as you like) via a glob, like {PATH}/*.fasta.

Process settings

  • diamond_id: Minimum percent identity for DIAMOND BLASTP. Default: 90 %
  • diamond_subject_cov: Minimum subject coverage for DIAMOND BLASTP. Default: 80 %
  • filtlong_min_lenght: Minimum length of reads to keep. Default: 6000
  • filtlong_keep_percent: Percentage of the longest reads to keep. Default: 90 %
  • DORADO_model: Model for Dorado basecalling. This determines the accuracy and the time it takes. Choose between three models:
    fast: fastest and least accurate.
    hac: slower and more accurate.
    sup: slowest and most accurate.
  • GTDB_db: URL for the GTDB database. The default downloads the latest version, currently v220.

The assembly-related defaults are based on the values identified by the authors of this paper.
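
Any of these defaults can be overridden for a single run on the command line, for example (the values here are illustrative):

NXF_VER=24.04.2 nextflow run main.nf -profile server --diamond_id 95 --filtlong_keep_percent 95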

Pipeline Outputs

The Ultra Turbo Pipeline generates several output files in the directory specified by params.directory_out (default: “./Results”):

Basecalling (Dorado/)

  • *.fa.gz: Demultiplexed sequence reads for each barcode
  • dorado_summary.tsv: Summary statistics of the basecalling run

Quality Control (pycoQC/)

  • summary.html: Interactive report of sequencing run quality
  • summary.json: Raw data of sequencing quality metrics

Read Filtering (Filtlong/)

  • *_trimmed.fq.gz: Filtered and trimmed high-quality reads

Assembly (Flye/)

  • *.assembly.fa.gz: Final assembled genome sequences
  • *.assembly_graph.gfa.gz: Assembly graph in GFA format
  • *.assembly_info.txt: Statistics about the assembly
  • *.log: Log file of the Flye assembly process

Assembly Polishing (Medaka/)

  • *.fa.gz: Polished genome assembly with reduced errors

Assembly Quality Assessment (CheckM2/)

  • quality_report.tsv: Completeness and contamination estimates of the assembly

Gene Prediction (Prodigal/)

  • *.translations.faa: Predicted protein sequences
  • *.gbk: Gene annotations in GenBank format

Antibiotic Resistance Gene Detection (RGI/)

  • *.txt: Identified antibiotic resistance genes and their annotations

Protein Search (Diamond_Blastp/)

  • *_diamond.tsv: Results of protein similarity searches

Taxonomic Classification

GTDB_TK/:

  • gtdbtk.*.summary.tsv: Taxonomic classification based on GTDB
  • gtdbtk.log: Log file of the GTDB-Tk run

mmseqs2/:

  • *_taxonomy_lca.tsv: Taxonomic classification of contigs using the LCA algorithm
