Skip to content

Quality control pipeline for nanopore long read data

Notifications You must be signed in to change notification settings

Snitkin-Lab-Umich/nanoQC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

nanoQC

A quality control workflow implemented in Snakemake to assemble Nanopore (long read) data.

Summary

As part of the Data Flow SOP in the Snitkin lab, this pipeline should be run on raw sequencing data as soon as it is available from Plasmidsaurus.

In short, it performs the following steps:

  • Runs Filtlong to remove low quality reads and discards reads less than 1000 Bp.
  • Generates pre and post-Filtlong QC plots using Nanoplot.
  • Assemble clean filtlong nanopore reads with Flye assembler.
  • Flye assembly is then polished with long reads using Medaka
  • The medaka assembly is then passed through Prokka for annotation, skani to identify closest reference genome, and MLST for determining sequence type based on sequences of housekeeping genes
  • Flye and medaka assemblies are then run on BUSCO for assembly completeness statistics and QUAST for assembly statistics.

The workflow generates all the output in the prefix folder set in config/config.yaml. Each workflow steps gets its own individual folder as shown below. This structure provides a general view of how outputs are organized, with each tool or workflow step having its own directory. Note that this overview does not capture all possible outputs from each tool; it only highlights the primary directories and their contents.

results/2024-05-01_Project_IMPALA_nanoQC/
├── 2024-05-01_Project_IMPALA_nanoQC_report
|   ├── 2024-05-01_Project_IMPALA_nanoQC_report.csv
|   └── 2024-08-26_Lojek_G6C_rerun_nanoQC_Skani_report_final.csv
├── busco
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── filtlong
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── flye
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── medaka
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── mlst
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── nanoplot
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── prokka
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
├── quast
|   ├── Lojek_R5G_1
|   └── Lojek_R5G_2
└── skani
    ├── Lojek_R5G_1
    └── Lojek_R5G_2

Installation

Clone the github directory onto your system.


git clone https://github.com/Snitkin-Lab-Umich/nanoQC.git

Ensure you have successfully cloned nanoQC. Type ls and you should see the newly created directory nanoQC. Move to the newly created directory.


cd nanoQC

Load bioinformatics, snakemake and singularity modules from Great Lakes modules.


module load Bioinformatics snakemake singularity

This workflow makes use of singularity containers available through State Public Health Bioinformatics group. If you are working on Great Lakes (UofM HPC)—you can load snakemake and singularity modules as shown above. However, if you are running it on your local machine or other computing platform, ensure you have snakemake and singularity installed.

Setup config and sample files

Customize config.yaml and set tool specific parameters

As an input, the snakemake file takes a config file where you can set the path to samples.tsv, path to Nanopore long reads, path to adapter file etc. Instructions on how to modify config/config.yaml is found in the file itself.

Samples

config/samples.tsv should be a comma seperated file consisting of one column—barcode_id.

You can create samples.tsv file using the following for loop. It assumes that you are running for loop from the folder that contains your long read.

If your raw reads folder looks like this, you will not be able to run nanoQC.

(base) [dhatrib@gl-login2 nanoQC]$ ls /nfs/turbo/umms-esnitkin/Raw_sequencing_data/G6C
Lojek_G6C_10.fastq.gz  Lojek_G6C_17.fastq.gz  Lojek_G6C_23.fastq.gz  Lojek_G6C_2.fastq.gz   Lojek_G6C_36.fastq.gz  Lojek_G6C_42.fastq.gz  Lojek_G6C_49.fastq.gz  Lojek_G6C_9.fastq.gz
Lojek_G6C_11.fastq.gz  Lojek_G6C_18.fastq.gz  Lojek_G6C_24.fastq.gz  Lojek_G6C_30.fastq.gz  Lojek_G6C_37.fastq.gz 

To ensure the format of the raw reads are compatible with the pipeline, we need to create a folder with the corresponding sample name.

To move the raw read to its corresponding folder name, we will use move_long_reads_to_corr_folders.sh. Type the folllowing command on your terminal.

./workflow/scripts/move_long_reads_to_corr_folders.sh 

You now need to give it a fastq direcotry

./workflow/scripts/move_long_reads_to_corr_folders.sh /path/to/long/reads

You should see the folders corresponding with your sample name in your long read directory

(base) [dhatrib@gl-login2 nanoQC]$ ls /nfs/turbo/umms-esnitkin/Raw_sequencing_data/G6C
Lojek_G6C_10  Lojek_G6C_17 Lojek_G6C_23  Lojek_G6C_2   Lojek_G6C_36  Lojek_G6C_42

You can create config/samples.tsv file using the following for loop. Instead of /path/to/your/data/, give the path to your long reads. fastq-File-Pattern* would be if your samples looked like this: Lojek_G6C_10, Lojek_G6C_17, Lojek_G6C_23, the wildcard pattern that would catch all the samples would Lojek_G6C*.

echo "barcode_id" > config/samples.tsv

for folder in /path/to/your/data/fastq-File-Pattern*; do
    if [ -d "$folder" ]; then
        echo "$(basename "$folder")";
    fi;
done >> config/samples.tsv

Check if samples.tsv is populated with your samples. Hit q to exit.

less config/samples.tsv

Quick start

Run nanoQC on a set of samples.

Preview the steps in nanoQC by performing a dryrun of the pipeline.


snakemake -s workflow/nanoQC.smk --dryrun -p

Run the pipeline.


snakemake -s workflow/nanoQC.smk -p --use-singularity --use-envmodules --use-conda -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes}  -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem} --output=slurm_out/slurm-%j.out" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000 --nolock 

Alt text

You can generate a MultiQC report on prokka, quast, busco and nanoplot folders after you finish running the snakemake workflow above.

Activate multiqc using conda.


module load multiqc

Change multiqc_report_filename to a file name of your choice and multiqc_output_dir to an output directory name of your choosing. Run multiqc on output folders specified above.


multiqc --filename multiqc_report_filename -o multiqc_output_dir -f prokka/ busco/ quast/ nanoplot/

Dependencies

Near Essential

Tool stack used in workflow

About

Quality control pipeline for nanopore long read data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published