How to run YAMP
This tutorial explains how to set the YAMP parameters and how to run it using Docker, and describes the output files. It requires a recent personal computer to run.
Please note that the parameters used in this tutorial are those you will find in the ./conf/test.config file, and the results can be replicated using the following command:
nextflow run YAMP.nf -profile test,docker
All paths are relative to the YAMP repository, and the folder layout is described here.
We will use a small simulated dataset that is available within the YAMP repository. The dataset is located in the ./data/test_data folder. For this tutorial we will use a paired-end layout and, more specifically, the following files:
random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz
random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz
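Before launching the pipeline it can be worth confirming that both mates of the pair are actually in place. The following is a minimal sketch; check_pair is a hypothetical helper and is not part of YAMP.

```shell
# Hypothetical helper (not part of YAMP): verify that both mates of a
# paired-end sample exist and are non-empty before launching the pipeline.
check_pair() {
    local r1="$1" r2="$2" f
    for f in "$r1" "$r2"; do
        # -s is true only when the file exists and has a size above zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    echo "ok: $(basename "$r1") + $(basename "$r2")"
}
```

From the repository root it could be called as check_pair ./data/test_data/random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz ./data/test_data/random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz. It only checks presence and size, not whether the two files contain the same number of reads.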
You will also need access to a set of databases that are queried during the YAMP execution, as explained in the Getting started page. The folder layout used in this tutorial can be found here. Please note that the layout we are using here is only an example: you can use any layout you wish and specify the resource paths as described in the following paragraphs.
Please also remember to download the MetaPhlAn databases: these are not available within the demo datasets! See here for details (section: "Notes on the MetaPhlAn databases").
For the sake of simplicity, we are going to walk through the ./conf/test.config file. More information on Nextflow config files and profiles is available here.
We will be using the ./tests folder as the output directory by setting the outdir parameter to:
outdir = "$baseDir/tests"
($baseDir is a Nextflow implicit variable that points to the directory containing the main script.)
To make it clearer that this sample is a test sample, we will set the prefix parameter to test:
prefix="test"
This parameter is mostly useful when the user would like to assign a human-readable name to the sample, for instance, "treated_patient_ID123".
Both options can also be set when running from the command line, as:
nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
    --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
    --prefix test --outdir ./tests
We are using paired-end reads, so we should set the singleEnd parameter to false (if using single-end reads, this should be set to true). We also want to run YAMP in complete mode, that is, we want to run both the QC and the characterisation steps.
To do so, we use the following assignments:
singleEnd = false
mode = "complete"
The simulated dataset also includes identical duplicates, so we also set the dedup parameter to true, thereby telling YAMP to perform de-duplication:
dedup = true
Again, these parameters could be specified on the command line, as:
nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
    --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
    --singleEnd false --mode complete --dedup true
We will keep the default parameters for trimming, decontamination, and taxonomic profiling (specified both in ./conf/test.config and ./conf/base.config):
qin=33 //Input quality offset: 33 (ASCII+33) or 64 (ASCII+64)
kcontaminants = 23 //Kmer length used for finding contaminants
phred = 10 //regions with average quality BELOW this will be trimmed
minlength = 60 //reads shorter than this after trimming will be discarded
mink = 11 //shorter kmers at read tips to look for
hdist = 1 //maximum Hamming distance for ref kmers
mind = 0.95 //Approximate minimum alignment identity to look for
maxindel = 3 //longest indel to look for
bwr=0.16 //restrict alignment band to this
bt2options="very-sensitive" //presets options for BowTie2 (MetaPhlAn)
For the sake of simplicity, we are going to use the demo files provided with YAMP (please remember to download the actual MetaPhlAn files; there is no demo for these!).
We are then going to set the following paths:
artefacts = "$baseDir/assets/data/sequencing_artifacts.fa.gz"
phix174ill = "$baseDir/assets/data/phix174_ill.ref.fa.gz"
adapters = "$baseDir/assets/data/adapters.fa"
metaphlan_databases="$baseDir/assets/data/metaphlan_databases/"
chocophlan="$baseDir/assets/demo/chocophlan"
uniref="$baseDir/assets/demo/uniref"
We also need to specify a contaminant (pan)genome. In this case, we don't have an already indexed genome available, so we set foreign_genome while leaving foreign_genome_ref empty:
foreign_genome = "$baseDir/assets/demo/genome.fa"
foreign_genome_ref = ""
If we had the reference genome already indexed, we would have done the opposite:
foreign_genome = ""
foreign_genome_ref = "path/to/my/indexed/genome"
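The either/or choice between the two parameters can be sketched as a small pre-flight check. This is an illustration only: genome_source is a hypothetical helper, and since this tutorial does not say how YAMP behaves when both parameters are set, the sketch conservatively rejects that case.

```shell
# Hypothetical pre-flight check (not part of YAMP): as described in the
# tutorial, one of foreign_genome / foreign_genome_ref is set and the
# other is left empty. Rejecting "both set" is an assumption of this sketch.
genome_source() {
    local fasta="$1" indexed="$2"
    if [ -n "$fasta" ] && [ -z "$indexed" ]; then
        echo "index-from-fasta"        # the genome will be indexed during the run
    elif [ -z "$fasta" ] && [ -n "$indexed" ]; then
        echo "use-existing-index"      # the indexing step can be skipped
    else
        echo "error: set exactly one of foreign_genome and foreign_genome_ref" >&2
        return 1
    fi
}
```

For the settings used in this tutorial, genome_source "assets/demo/genome.fa" "" reports index-from-fasta.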
Please remember that using an already indexed genome will potentially save you a lot of time!
We now need to provide the amount of time, CPUs, and memory required by each analysis step (these are upper limits). For instance, let's say we expect our deduplication step to run in no more than 15 minutes given 2 CPUs, and to use no more than 6 GB of memory. We specify this as:
withName: dedup
{
time = '15m'
cpus = 2
memory = '6 GB'
}
where the process name is the one used in the main YAMP.nf file.
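If you would rather not edit the shipped config files, the same limits can be kept in a separate file and passed to Nextflow with its -c option. The sketch below writes such a file; write_limits and the file name custom_limits.config are hypothetical. Note that in a Nextflow config file, withName selectors sit inside the process scope.

```shell
# Hypothetical helper: write a small override file with the dedup limits
# from the example above. The quoted 'EOF' stops the shell from expanding
# anything inside the heredoc.
write_limits() {
    cat > "$1" <<'EOF'
process {
    withName: dedup {
        time = '15m'
        cpus = 2
        memory = '6 GB'
    }
}
EOF
}

# Possible usage (file name is an assumption):
#   write_limits custom_limits.config
#   nextflow run YAMP.nf -profile test,docker -c custom_limits.config
```

Keeping overrides in their own file makes it easy to try different limits without touching ./conf/test.config.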
We provide some standard values for time, CPUs, and memory. Feel free to play with them (they are overestimates, by the way).
We also provide pre-set values for real-world analysis (in ./conf/base.config). These values have been optimised using our in-house metagenomic dataset, which is composed of about 2000 faecal samples with very different data quality and, thus, very different requirements. These values may require some tuning, but we are confident that they will cover most users' scenarios.
We can now run YAMP with the following command:
nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
    --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
    -profile test,docker
where -profile test,docker tells Nextflow to use the following profiles (as specified in nextflow.config):
test {
includeConfig 'conf/test.config'
}
docker {
docker.enabled = true
docker.runOptions = '-u \$(id -u):\$(id -g)'
}
that is, to use the parameters specified in ./conf/test.config and to enable the use of Docker containers. You should always provide a profile. Please read the How to use Nextflow profiles tutorial for details.
We used mode complete because we wanted to run the entire workflow. If we wanted to limit our analysis to the quality control steps, we should have set --mode QC. In this case, the command would have been:
nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
    --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
    --mode QC -profile test,docker
Please note that parameters set on the command line override those specified in the config files.
It is also possible to run YAMP on externally QC'ed files using --mode characterisation, as detailed in the How to run YAMP with QC'ed reads tutorial.
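The three mode names used in this tutorial (complete, QC, characterisation) can be checked in a wrapper script before the run is launched. This is a sketch only: check_mode is a hypothetical helper, and it assumes the names must be spelled and capitalised exactly as written here.

```shell
# Hypothetical helper (not part of YAMP): accept only the mode names that
# appear in this tutorial. Exact, case-sensitive matching is an assumption.
check_mode() {
    case "$1" in
        complete|QC|characterisation)
            echo "$1"
            ;;
        *)
            echo "unknown mode: $1 (expected complete, QC or characterisation)" >&2
            return 1
            ;;
    esac
}
```

A wrapper could then use something like mode="$(check_mode "$1")" || exit 1 before building the nextflow run command, so that a typo is caught before any container is started.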
When the computation finishes, the tests folder will contain a subfolder called test (that is, named after the selected prefix) with the following files:
├── fastqc
│ ├── test_QCd_fastqc.html
│ ├── test_QCd_fastqc.zip
│ ├── random_ncbi_reads_with_duplicated_and_contaminants_R1_fastqc.html
│ ├── random_ncbi_reads_with_duplicated_and_contaminants_R1_fastqc.zip
│ ├── random_ncbi_reads_with_duplicated_and_contaminants_R2_fastqc.html
│ └── random_ncbi_reads_with_duplicated_and_contaminants_R2_fastqc.zip
├── test_alpha_diversity.tsv
├── test.biom
├── test_genefamilies.tsv
├── test_HUMAnN.log
├── test_metaphlan_bugs_list.tsv
├── test_multiqc_data_complete
│ ├── multiqc_data.json
│ ├── multiqc_fastqc_fastqc_qcd.yaml
│ ├── multiqc_fastqc_fastqc_raw.yaml
│ ├── multiqc_general_stats.yaml
│ ├── multiqc.log
│ └── multiqc_sources.yaml
├── test_multiqc_report_complete.html
├── test_pathabundance.tsv
├── test_pathcoverage.tsv
└── test_QCd.fq.gz
The fastqc folder will include information on the quality of the raw and QC'd reads, as generated by FastQC (this folder will not be present when YAMP is run in characterisation mode).
The QC steps will generate a single file, test_QCd.fq.gz, which contains all the reads that survived the quality control.
The taxonomic binning and profiling will return two files:
test_metaphlan_bugs_list.tsv
test.biom
the functional annotation four:
test_genefamilies.tsv
test_pathabundance.tsv
test_pathcoverage.tsv
test_HUMAnN.log
and the alpha-diversity one:
test_alpha_diversity.tsv
Finally, a report is produced by MultiQC (test_multiqc_report_complete.html), while detailed MultiQC information is included in a separate folder (test_multiqc_data_complete). More information on the logs can be found here.
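After a run in complete mode, the presence of the top-level files shown in the listing above can be checked with a short script. This is a minimal sketch; missing_outputs is a hypothetical helper, and the list of suffixes is copied from the listing in this tutorial.

```shell
# Hypothetical post-run check (not part of YAMP): print the name of every
# expected top-level output that is missing from <outdir>/<prefix>.
# Returns 0 when nothing is missing, 1 otherwise.
missing_outputs() {
    local dir="$1" prefix="$2" suffix status=0
    for suffix in _QCd.fq.gz _metaphlan_bugs_list.tsv .biom \
                  _genefamilies.tsv _pathabundance.tsv _pathcoverage.tsv \
                  _HUMAnN.log _alpha_diversity.tsv \
                  _multiqc_report_complete.html; do
        if [ ! -e "$dir/${prefix}${suffix}" ]; then
            echo "${prefix}${suffix}"
            status=1
        fi
    done
    return "$status"
}
```

For this tutorial it would be called as missing_outputs ./tests/test test. It prints nothing when every file is present. Runs in QC or characterisation mode produce only a subset of these files, so the list would need trimming for those modes.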
Nextflow will also create a working folder (work) in the folder where it has been run. We suggest removing it once the run has finished, since it can become very large:
rm -rf work/
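Since the deletion cannot be undone, a slightly more cautious variant reports the size of the folder first and removes it only on explicit confirmation. This is a sketch; clean_work is a hypothetical helper, not a YAMP or Nextflow command.

```shell
# Hypothetical helper: show how much space a Nextflow work directory uses,
# and delete it only when the second argument is "yes".
clean_work() {
    local dir="${1:-work}" confirm="${2:-no}"
    [ -d "$dir" ] || { echo "nothing to clean: $dir"; return 0; }
    du -sh "$dir"                      # human-readable total size
    if [ "$confirm" = "yes" ]; then
        rm -rf -- "$dir"
        echo "removed $dir"
    fi
}
```

Running clean_work work prints the size only; clean_work work yes prints the size and then removes the folder. Keep in mind that Nextflow relies on the work folder to resume a previous run, so delete it only when the results are final.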