This lesson is an introduction to the workflow manager Nextflow, and nf-core, a community effort to collect a curated set of analysis pipelines built using Nextflow.
Nextflow enables scalable and reproducible scientific workflows using software containers such as Docker and Singularity. It allows pipelines written in the most common scripting languages, such as R and Python, to be adapted. Nextflow's Domain Specific Language (DSL) simplifies the implementation and deployment of complex parallel and reactive workflows on clouds and clusters.
This lesson motivates the use of Nextflow and nf-core as a development tool for building and sharing computational pipelines that facilitate reproducible (data) science workflows.
From the terminal of your local computer, you can log into the HPC using the following command, followed by pressing Enter. You will be prompted to type in your password. On a Linux system, you can use the Ctrl-Alt-T keyboard shortcut to open a terminal.

```bash
ssh <train11>@172.16.13.171
```
If using PuTTY, type 172.16.13.171 in the Host Name (or IP address) field and open the program. Log in with your user name when prompted and key in your password.
In your home directory, follow these steps:

- Clone the repo in your home directory:

  ```bash
  git clone https://github.com/ajodeh-juma/ngs-academy-africa-nfcore.git
  ```
- Open your first Nextflow script, `wc.nf`, using your favourite text editor (`nano` or `vim`).
- Run the script using `nextflow`:

  ```bash
  nextflow run wc.nf
  ```
- Create a `process` in the script to `print` the number of reads in the input file provided. Ensure that you capture the `output` as `stdout`.

  Quiz: How many reads are in the input file?
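A minimal sketch of such a process, assuming DSL2 and a placeholder input path (the lesson's actual input file name is not shown here):

```nextflow
// wc.nf -- hypothetical sketch; params.reads is a placeholder path
nextflow.enable.dsl = 2

params.reads = "data/sample.fastq"

process NUM_READS {
    input:
    path reads

    output:
    stdout

    script:
    // each FASTQ record spans 4 lines, so reads = lines / 4
    """
    printf '%s\\n' \$(( \$(wc -l < ${reads}) / 4 ))
    """
}

workflow {
    reads_ch = channel.fromPath(params.reads)
    NUM_READS(reads_ch).view()
}
```

Because each FASTQ record occupies four lines, dividing the line count by four gives the number of reads.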
- Create a `conda` environment:

  ```bash
  conda env create -f environment.yaml
  ```

  The `environment.yaml` file has all the required tools/software and dependencies for the simple pipeline that we will run. (You can have a preview of the file.)
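As an illustration, a minimal `environment.yaml` for this pipeline might look like the sketch below; the actual file in the repo may list different tools or pinned versions:

```yaml
# Hypothetical environment.yaml sketch -- the repo's actual file may differ
name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon
  - multiqc
```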
- Activate the conda environment:

  ```bash
  conda activate rnaseq-env
  ```
- Run the script using the `conda` profile:

  ```bash
  nextflow run main.nf -profile conda
  ```
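The `conda` profile is defined in the pipeline's configuration. A hypothetical `nextflow.config` fragment (an assumption; the repo's actual config may differ) could point every process at the environment file:

```groovy
// nextflow.config -- hypothetical sketch, not the repo's actual file
profiles {
    conda {
        // build and use a conda environment from the YAML file for all processes
        process.conda = "$baseDir/environment.yaml"
    }
}
```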
- In the `environment.yaml` file, add a dependency `fastp`, and update the conda environment using the command:

  ```bash
  conda env update -f environment.yaml
  ```
- In your `workflow`:

  (a) Add a `process` that preprocesses the raw reads using `fastp`, and use the preprocessed reads as `input` for the `quantification` step with `salmon`.

  (b) As output(s), emit the `.json`, `.html` and `.log` files as output `channels`.

  (c) Use the `.json` outputs as input to the `multiqc` process to summarize and visualize.

  (d) Add a `process` that counts the number of reads before (raw reads) and after (preprocessed reads) preprocessing. Print the output(s) to `stdout`.
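A sketch of the `fastp` process from (a) and (b), assuming single-end reads and illustrative file names (not taken from the lesson):

```nextflow
// Hypothetical fastp preprocessing process (single-end reads assumed)
process FASTP {
    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}.trimmed.fastq.gz"), emit: reads
    path "${sample_id}.fastp.json", emit: json
    path "${sample_id}.fastp.html", emit: html
    path "${sample_id}.fastp.log",  emit: log

    script:
    """
    fastp -i ${reads} -o ${sample_id}.trimmed.fastq.gz \\
        --json ${sample_id}.fastp.json \\
        --html ${sample_id}.fastp.html \\
        2> ${sample_id}.fastp.log
    """
}
```

The named `emit:` outputs let the workflow wire `FASTP.out.reads` into the `salmon` quantification step and `FASTP.out.json` into the `multiqc` process.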
- Deactivate the `conda` environment:

  ```bash
  conda deactivate
  ```
- Build a Docker image:

  ```bash
  docker build -t rnaseq-image .
  ```

  This may take a couple of minutes.
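The build expects a `Dockerfile` in the current directory. A hypothetical minimal version (the repo's actual Dockerfile may differ) could install the same conda environment into the image:

```dockerfile
# Hypothetical Dockerfile sketch -- the repo's actual file may differ
FROM continuumio/miniconda3
COPY environment.yaml .
RUN conda env update -n base -f environment.yaml && conda clean -afy
```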
- Test the container by checking the `Salmon` version:

  ```bash
  docker run rnaseq-image salmon --version
  ```
- Mount the parent directory to an identical path inside the container using the `-v` (or `--volume`) flag, and generate the genome index using `salmon` by running the container in interactive mode.
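One way to do this, with illustrative paths and file names (adjust to your own data; `transcriptome.fa` is a placeholder):

```bash
# Start an interactive shell with the current directory mounted at the same path inside the container
docker run -it -v $PWD:$PWD -w $PWD rnaseq-image bash

# Inside the container, build the salmon index
salmon index -t transcriptome.fa -i salmon_index
```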
- Run the script using `docker`:

  ```bash
  nextflow run main.nf -with-docker rnaseq-image
  ```