A quality control workflow implemented in Snakemake to assemble Nanopore (long read) data.
As part of the Data Flow SOP in the Snitkin lab, this pipeline should be run on raw sequencing data as soon as it is available from Plasmidsaurus.
In short, it performs the following steps:
- Runs Filtlong to remove low quality reads and discards reads less than 1000 Bp.
- Generates pre and post-Filtlong QC plots using Nanoplot.
- Assemble clean filtlong nanopore reads with Flye assembler.
- Flye assembly is then polished with long reads using Medaka
- The medaka assembly is then passed through Prokka for annotation, skani to identify closest reference genome, and MLST for determining sequence type based on sequences of housekeeping genes
- Flye and medaka assemblies are then run on BUSCO for assembly completeness statistics and QUAST for assembly statistics.
The workflow generates all the output in the prefix
folder set in config/config.yaml
. Each workflow steps gets its own individual folder as shown below. This structure provides a general view of how outputs are organized, with each tool or workflow step having its own directory. Note that this overview does not capture all possible outputs from each tool; it only highlights the primary directories and their contents.
results/2024-05-01_Project_IMPALA_nanoQC/
├── 2024-05-01_Project_IMPALA_nanoQC_report
| ├── 2024-05-01_Project_IMPALA_nanoQC_report.csv
| └── 2024-08-26_Lojek_G6C_rerun_nanoQC_Skani_report_final.csv
├── busco
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── filtlong
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── flye
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── medaka
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── mlst
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── nanoplot
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── prokka
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
├── quast
| ├── Lojek_R5G_1
| └── Lojek_R5G_2
└── skani
├── Lojek_R5G_1
└── Lojek_R5G_2
Clone the github directory onto your system.
git clone https://github.com/Snitkin-Lab-Umich/nanoQC.git
Ensure you have successfully cloned
nanoQC
. Typels
and you should see the newly created directory nanoQC. Move to the newly created directory.
cd nanoQC
Load bioinformatics, snakemake and singularity modules from Great Lakes modules.
module load Bioinformatics snakemake singularity
This workflow makes use of singularity containers available through State Public Health Bioinformatics group. If you are working on Great Lakes (UofM HPC)—you can load snakemake and singularity modules as shown above. However, if you are running it on your local machine or other computing platform, ensure you have snakemake and singularity installed.
As an input, the snakemake file takes a config file where you can set the path to samples.tsv
, path to Nanopore long reads, path to adapter file etc. Instructions on how to modify config/config.yaml
is found in the file itself.
config/samples.tsv
should be a comma seperated file consisting of one column—barcode_id.
You can create samples.tsv
file using the following for loop. It assumes that you are running for loop from the folder that contains your long read.
If your raw reads folder looks like this, you will not be able to run nanoQC.
(base) [dhatrib@gl-login2 nanoQC]$ ls /nfs/turbo/umms-esnitkin/Raw_sequencing_data/G6C
Lojek_G6C_10.fastq.gz Lojek_G6C_17.fastq.gz Lojek_G6C_23.fastq.gz Lojek_G6C_2.fastq.gz Lojek_G6C_36.fastq.gz Lojek_G6C_42.fastq.gz Lojek_G6C_49.fastq.gz Lojek_G6C_9.fastq.gz
Lojek_G6C_11.fastq.gz Lojek_G6C_18.fastq.gz Lojek_G6C_24.fastq.gz Lojek_G6C_30.fastq.gz Lojek_G6C_37.fastq.gz
To ensure the format of the raw reads are compatible with the pipeline, we need to create a folder with the corresponding sample name.
To move the raw read to its corresponding folder name, we will use
move_long_reads_to_corr_folders.sh
. Type the folllowing command on your terminal.
./workflow/scripts/move_long_reads_to_corr_folders.sh
You now need to give it a fastq direcotry
./workflow/scripts/move_long_reads_to_corr_folders.sh /path/to/long/reads
You should see the folders corresponding with your sample name in your long read directory
(base) [dhatrib@gl-login2 nanoQC]$ ls /nfs/turbo/umms-esnitkin/Raw_sequencing_data/G6C
Lojek_G6C_10 Lojek_G6C_17 Lojek_G6C_23 Lojek_G6C_2 Lojek_G6C_36 Lojek_G6C_42
You can create
config/samples.tsv
file using the following for loop. Instead of/path/to/your/data/
, give the path to your long reads.fastq-File-Pattern*
would be if your samples looked like this:Lojek_G6C_10, Lojek_G6C_17, Lojek_G6C_23
, the wildcard pattern that would catch all the samples wouldLojek_G6C*
.
echo "barcode_id" > config/samples.tsv
for folder in /path/to/your/data/fastq-File-Pattern*; do
if [ -d "$folder" ]; then
echo "$(basename "$folder")";
fi;
done >> config/samples.tsv
Check if
samples.tsv
is populated with your samples. Hitq
to exit.
less config/samples.tsv
Preview the steps in nanoQC by performing a dryrun of the pipeline.
snakemake -s workflow/nanoQC.smk --dryrun -p
Run the pipeline.
snakemake -s workflow/nanoQC.smk -p --use-singularity --use-envmodules --use-conda -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes} -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem} --output=slurm_out/slurm-%j.out" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000 --nolock
You can generate a MultiQC report on prokka, quast, busco and nanoplot folders after you finish running the snakemake workflow above.
Activate multiqc using conda.
module load multiqc
Change
multiqc_report_filename
to a file name of your choice andmultiqc_output_dir
to an output directory name of your choosing. Run multiqc on output folders specified above.
multiqc --filename multiqc_report_filename -o multiqc_output_dir -f prokka/ busco/ quast/ nanoplot/