This section of the workshop is a step by step tutorial on how to run the Grape pipeline on the workshop sample data.
The pipeline codebase is on Github. The Grape repository is configured to fullfil Nextflow requirements for pipeline sharing. In order to get the latest version of the pipeline run the following command:
nextflow pull guigolab/grape-nf
Checking guigolab/grape-nf ... downloaded from https://github.com/guigolab/grape-nf.git
The pipeline has been cloned from the github repository and saved in:
~/.nextflow/assets/guigolab/grape-nf/
You can see the help message of the pipeline by running:
nextflow run grape-nf --help
link:https://raw.githubusercontent.com/guigolab/grape-nf/master/docs/src/quickstart.adoc[role=include]
Note
|
since the pipeline has beed already pulled to the local system we can omit the repository owner name (guigolab ) when running the pipeline.
|
We need to create a base directory for the pipeline run. Create a folder named grape
in your home folder and go inside the folder:
mkdir grape && cd grape
Here is an example:
mouse_cns_E14_rep1 mouse_cns_E14_rep1 mouse_cns_E14_rep1_1.fastq.gz fastq FqRd1 mouse_cns_E14_rep1 mouse_cns_E14_rep1 mouse_cns_E14_rep1_2.fastq.gz fastq FqRd2
You can have have a look at the input index file for the workshop:
cat /users/ngs00/data/index.tsv
Create a nextflow.config
file in the current directory with the following content:
params { // (1)
index = '/users/ngs00/data/index.tsv'
genome = '/users/ngs00/refs/mouse_genome_mm9.fa'
annotation = '/users/ngs00/refs/mm65.long.ok.gtf'
genomeIndex = '/users/ngs00/refs/mouse_genome_mm9_STAR_index'
steps = 'mapping,bigwig'
}
trace.enabled = true // (2)
timeline.enabled = true // (3)
process { // (4)
executor = 'sge'
queue = 'NGS'
penv = 'smp'
}
includeConfig "/users/ngs00/ngs.config" // (5)
-
Parameter definitions. It could also be done from the command line (e.g.
--genome /users/ngs00/refs/mouse_genome_mm9.fa
). -
Enable process execution tracing
-
Enable HTML timeline generation
-
Process options for running in the workshop HPC cluster
-
Include an external configuration file
Grape uses Nextflow profiles to define the set of tools to be used for the pipeline run. To date three main profiles are available:
gemflux |
GEM for mapping, custom scripts for RNAseq Signal and Contig and FluxCapacitor for genes and transcripts expression quantification |
starrsem |
STAR for mapping and RNAseq Signal, custom script for Contig and RSEM for gene and transcripts expression quantification |
starflux |
STAR for mapping and RNAseq Signal, custom script for Contig and FluxCapacitor for gene and transcripts expression quantification |
Profiles can be specified explicitly on the command line. There is also a standard
profile used as the default when no profile is specified. For Grape, it is the same as starrsem
plus an additional configuration option and we are going to use this profile. It looks like:
link:https://raw.githubusercontent.com/guigolab/grape-nf/master/nextflow.config[role=include]
The bamSort
options tells the pipeline to use samtools
for sorting the mapping output file.
In order to run the pipeline, use this command:
nextflow -bg run grape-nf > pipeline.log
The status of the pipeline run can be monitored by looking at the pipeline.log
files we specified above.
tail -f ~/grape/pipeline.log
…
===============
Output files db → /nfs/users/ngs00/grape/pipeline.db
===============
[warm up] executor > sge
[c3/ecc8e6] Submitted process > fastaIndex (mouse_genome_mm9-SAMtools-0.1.19)
[26/3dba1e] Submitted process > mapping (mouse_cns_E14_rep1-STAR-2.4.0j)
[49/30e442] Submitted process > mapping (mouse_cns_E18_rep2-STAR-2.4.0j)
[00/681c79] Submitted process > mapping (mouse_cns_E18_rep1-STAR-2.4.0j)
[d0/1f18c1] Submitted process > mapping (mouse_cns_E14_rep2-STAR-2.4.0j)
[circle o notch]
The highlighted string specifies the prefix of the process folder. It is relative to the pipeline working folder which, by default, is work
inside the base folder. To get the full name of the folder you need to use bash autocompletion
feature. Type the path to the folder, include the prefix and press kbd:[TAB]. For the previous example process:
ls /nfs/users/ngs00/grape/work/00/681c79kbd:[TAB]
ls /nfs/users/ngs00/grape/work/00/681c79f509a8b084d19a57f6abf784/
If the run finishes without errors the pipeline.log
file ends with a success message:
… ----------------------- Pipeline run completed. -----------------------
The path and metadata of output files generated by the pipeline run are written inside a TAB
separated file, named pipeline.db
. The file can be found inside the pipeline working folder.
The first 5 columns follow the same schema as the input index file. Here is a description of the remaining columns:
1 |
read configuration |
the configuration of the reads (e.g. |
2 |
strandness |
the strandness of the experiments, automaticalliy inferred from the
mapping files (e.g. |
Have a look at the generated db file:
less -S ~/grape/pipeline.db
For this run, the 5th field of the db file can have one of the following values:
GenomeAlignments |
genomic mapping file (references are chromosomes, e.g. |
TranscriptomeAlignments |
transcriptomic mapping file (references are transcripts, e.g. |
PlusRawSignal |
RNA-seq signal for the plus strand from uniquely mapped reads |
MinusRawSignal |
RNA-seq signal for the minus strand from uniquely mapped reads |
MultiplePlusRawSignal |
RNA-seq signal for the plus strand from all mapped reads |
MultipleMinusRawSignal |
RNA-seq signal for the minus strand from all mapped reads |
When the mapping steps are done, we can have a look at the mapping files. We can get the genomic BAM
file for the experiment mouse_cns_E18_rep1
BAM=$(awk '$1=="mouse_cns_E18_rep1" && $5=="GenomeAlignments"{print $3}' ~/grape/pipeline.db)
We can explore the contents of the mapping file:
samtools view -h $BAM | less
or get the mapping statistics collected by STAR:
cat $(dirname $BAM)/Log.final.out
Started job on | Apr 04 12:28:45 Started mapping on | Apr 04 12:31:08 Finished on | Apr 04 12:31:43 Mapping speed, Million of reads per hour | 77.15 Number of input reads | 750067 Average input read length | 202 UNIQUE READS: Uniquely mapped reads number | 645130 Uniquely mapped reads % | 86.01% Average mapped length | 200.58 Number of splices: Total | 346827 Number of splices: Annotated (sjdb) | 341765 Number of splices: GT/AG | 343268 Number of splices: GC/AG | 2924 Number of splices: AT/AC | 404 Number of splices: Non-canonical | 231 Mismatch rate per base, % | 0.19% Deletion rate per base | 0.01% Deletion average length | 1.92 Insertion rate per base | 0.01% Insertion average length | 1.45 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 27454 % of reads mapped to multiple loci | 3.66% Number of reads mapped to too many loci | 847 % of reads mapped to too many loci | 0.11% UNMAPPED READS: % of reads unmapped: too many mismatches | 0.00% % of reads unmapped: too short | 10.06% % of reads unmapped: other | 0.15%