nf-cmgg/preprocessing is a bioinformatics pipeline that demultiplexes and aligns raw sequencing data. It also performs basic QC and coverage analysis.
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, which makes installation trivial and results highly reproducible.
Steps include:
- Demultiplexing using BCLconvert
- Read QC and trimming using fastp
- Alignment using either bwa, bwa-mem2, bowtie2, dragmap or snap for DNA-seq and STAR for RNA-seq
- Duplicate marking using bamsormadup or samtools markdup
- Coverage analysis using mosdepth and samtools coverage
- Alignment QC using samtools flagstat, samtools stats, samtools idxstats, picard CollectHsMetrics, picard CollectWgsMetrics and picard CollectMultipleMetrics
- QC aggregation using multiqc
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
The full documentation can be found here
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv for FASTQ inputs:
id,samplename,organism,library,fastq_1,fastq_2
sample1,sample1,Homo sapiens,Library_Name,reads1.fq.gz,reads2.fq.gz
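For more than a handful of samples, the FASTQ samplesheet can be generated rather than typed by hand. The following is a minimal sketch, not part of the pipeline: the function name `make_samplesheet` is hypothetical, and it assumes paired files named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` in one directory, with the same organism and library for every sample. Adjust the naming pattern to match your own data.

```shell
# Hypothetical helper: print a FASTQ samplesheet for every *_R1.fastq.gz
# file in a directory. The _R1/_R2 naming scheme is an assumption.
make_samplesheet() {
    local fastq_dir="$1"
    local organism="${2:-Homo sapiens}"
    local library="${3:-Library_Name}"
    local r1 r2 sample

    # Header row, with the columns from the example samplesheet.
    echo "id,samplename,organism,library,fastq_1,fastq_2"

    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        # Skip the unexpanded glob pattern when no files match.
        [ -e "$r1" ] || continue
        # Derive the mate file and the sample name from the R1 path.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        sample="$(basename "$r1" _R1.fastq.gz)"
        echo "${sample},${sample},${organism},${library},${r1},${r2}"
    done
}
```

Calling `make_samplesheet /path/to/fastq > samplesheet.csv` writes one row per read pair. The sketch does not check that the R2 file exists, so review the result before launching a run.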
samplesheet.csv for flowcell inputs:
id,samplesheet,lane,flowcell,sample_info
flowcell_id,/path/to/illumina_samplesheet.csv,1,/path/to/sequencer_uploaddir,/path/to/sampleinfo.csv
sampleinfo.csv for use with flowcell inputs:
samplename,library,organism,tag
fc_sample1,test,Homo sapiens,WES
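Because flowcell runs involve two linked files, a quick header check before launching can catch a misnamed column early. The sketch below is an illustration only and not the pipeline's own validation: `has_columns` is a hypothetical helper, and treating every column from the examples above as required is an assumption.

```shell
# Hypothetical helper: succeed only if every named column appears in the
# header (first line) of a CSV file. Column order does not matter.
has_columns() {
    local file="$1"
    shift
    local header col
    # Wrap the header in commas so each column name can be matched whole.
    header=",$(head -n 1 "$file" | tr -d '\r'),"
    for col in "$@"; do
        case "$header" in
            *",$col,"*) ;;
            *)
                echo "Missing column '$col' in $file" >&2
                return 1
                ;;
        esac
    done
}
```

For the files above, this could be called as `has_columns samplesheet.csv id samplesheet lane flowcell sample_info` and `has_columns sampleinfo.csv samplename library organism tag`. The function returns a non-zero status and names the first missing column on stderr.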
Now, you can run the pipeline using:
nextflow run nf-cmgg/preprocessing \
-profile <docker/singularity/.../institute> \
--igenomes_base /path/to/genomes \
--input samplesheet.csv \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
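As an illustration of the -params-file route, the parameters from the run command above can be collected in a YAML file. This is a sketch with placeholder values: the file name `params.yaml` and the `results` output directory are assumptions, and the paths must be replaced with your own.

```shell
# Write the pipeline parameters to a YAML file that Nextflow can read
# through its -params-file option. All values are placeholders.
cat > params.yaml <<'EOF'
igenomes_base: "/path/to/genomes"
input: "samplesheet.csv"
outdir: "results"
EOF
```

The run then becomes `nextflow run nf-cmgg/preprocessing -profile docker -params-file params.yaml`, with `docker` swapped for whichever profile fits your environment.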
nf-cmgg/preprocessing was originally written by the CMGG ICT team.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.