How to run YAMP

Alessia Visconti edited this page Apr 13, 2021 · 12 revisions

This tutorial explains how to set the YAMP parameters and how to run it using Docker, and describes the output files. It requires a recent personal computer to run.

Please note that the parameters used in this tutorial are those you will find in the ./conf/test.config file, and its results can be replicated using the following command:

nextflow run YAMP.nf -profile test,docker

All paths are relative to the YAMP repository, and the folder layout is described here.

Locate the example data and prepare the running environment

We will use a small simulated dataset that is available within the YAMP repository. The dataset is located in the ./data/test_data folder. For this tutorial we will use a paired-end layout, and, more specifically, the following files:

random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz
random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz
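
Before running a paired-end analysis, it can be worth confirming that the two mate files hold the same number of reads. The sketch below is only an illustration and not part of YAMP: the function names count_reads and same_read_count are hypothetical, and the count assumes standard four-line FASTQ records.

```shell
# Count the reads in a gzipped FASTQ file.
# Assumption: every read spans exactly four lines.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Succeed only if both mate files hold the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}
```

For example, from the ./data/test_data folder, calling same_read_count with the R1 and R2 files above as its two arguments should succeed if the pair is intact.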

You will also need access to a set of databases that are queried during the YAMP execution, as explained in the Getting started page. The folder layout used in this tutorial can be found here. Please note that the layout we are using here is only an example: you can use any layout you wish and specify the resource paths as described in the following paragraphs.

Please also remember to download the MetaPhlAn databases: these are not available within the demo datasets! See here for details (section: "Notes on the MetaPhlAn databases").

Set the parameters in your config file

For the sake of simplicity, we are going to walk through the ./conf/test.config file. More information on Nextflow config and profile files is available here.

Select the output folder and the prefix name

We will be using the ./tests folder as the output directory by setting the outdir parameter to:

outdir = "$baseDir/tests"

($baseDir is a Nextflow implicit variable that points to the directory containing the main script).

To make it clearer that this sample is a test sample we will set the prefix parameter to test:

prefix="test"

This parameter is mostly useful when the user would like to assign a human-readable name to the sample, for instance, "treated_patient_ID123".

Both options can also be set when running from the command line:

nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
   --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
   --prefix test --outdir ./tests
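
When several samples are processed, each one needs its own prefix. As a rough sketch (the helper name yamp_cmd is hypothetical, and it only prints the command instead of running it), the command line above could be assembled per sample like this:

```shell
# Print (do not run) the YAMP command for one paired-end sample.
# Arguments: prefix, forward reads, reverse reads, output directory.
# Assumption: none of the arguments contains spaces.
yamp_cmd() {
    printf 'nextflow run YAMP.nf --reads1 %s --reads2 %s --prefix %s --outdir %s\n' \
        "$2" "$3" "$1" "$4"
}
```

Printing the command first makes it easy to review before launching it. A -profile option would still need to be appended, as described in the Run YAMP section.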

Configure the running parameters

We are using paired-end reads, so we should set the singleEnd parameter to false (if using single-end reads, this should be set to true). We also want to run YAMP in complete mode, that is, we want to run both the QC and the characterisation steps.

To do so, we use the following assignments:

singleEnd = false
mode = "complete"

The simulated dataset also includes identical duplicates, so we also set the dedup parameter to true, therefore telling YAMP to perform de-duplication:

dedup = true

Again, these parameters could be specified on the command line:

nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
   --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
   --singleEnd false --mode complete --dedup true

We will keep the default parameters for trimming, decontamination, and taxonomic profiling (specified both in ./conf/test.config and ./conf/base.config):

qin=33             //Input quality offset: 33 (ASCII+33) or 64 (ASCII+64)
kcontaminants = 23 //Kmer length used for finding contaminants	
phred = 10         //regions with average quality BELOW this will be trimmed 
minlength = 60     //reads shorter than this after trimming will be discarded
mink = 11          //shorter kmers at read tips to look for 
hdist = 1          //maximum Hamming distance for ref kmers            
	
mind = 0.95        //Approximate minimum alignment identity to look for
maxindel = 3       //longest indel to look for
bwr=0.16           //restrict alignment band to this

bt2options="very-sensitive" //presets options for BowTie2 (MetaPhlAn)

Configure the path to the external databases

For the sake of simplicity, we are going to use the demo files provided with YAMP (please remember to download the actual MetaPhlAn files: there is no demo for these!).

We are then going to set the following paths:

artefacts = "$baseDir/assets/data/sequencing_artifacts.fa.gz"
phix174ill = "$baseDir/assets/data/phix174_ill.ref.fa.gz"
adapters = "$baseDir/assets/data/adapters.fa"

metaphlan_databases="$baseDir/assets/data/metaphlan_databases/"

chocophlan="$baseDir/assets/demo/chocophlan"
uniref="$baseDir/assets/demo/uniref"	

We also need to specify a contaminant (pan)genome. In this case, we don't have an already indexed genome available, so we set foreign_genome while leaving foreign_genome_ref empty:

foreign_genome = "$baseDir/assets/demo/genome.fa" 
foreign_genome_ref = "" 

If we had the reference genome already indexed, we would have done the opposite:

foreign_genome = "" 
foreign_genome_ref = "path/to/my/indexed/genome" 

Please remember that using an already indexed genome will potentially save you a lot of time!
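
A missing resource is easier to spot before the pipeline starts than halfway through a run. The sketch below is a hypothetical helper (missing_resources is not part of YAMP) that takes the repository root, which is what $baseDir resolves to in this tutorial, and lists any of the paths configured above that do not exist. Adapt the list if you use a different layout.

```shell
# Print every configured resource path that is missing under the given
# base directory. Prints nothing when all of them are present.
# The relative paths mirror those set in this tutorial's config.
missing_resources() {
    for rel in \
        assets/data/sequencing_artifacts.fa.gz \
        assets/data/phix174_ill.ref.fa.gz \
        assets/data/adapters.fa \
        assets/data/metaphlan_databases \
        assets/demo/chocophlan \
        assets/demo/uniref \
        assets/demo/genome.fa
    do
        [ -e "$1/$rel" ] || printf '%s\n' "$rel"
    done
}
```

Running missing_resources . from the YAMP repository should print nothing once the MetaPhlAn databases have been downloaded into place.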

Configure the computational resources

We now need to provide the amount of time, CPUs, and memory required by each analysis step (these are upper limits). For instance, let's say we expect our deduplication step to run in no more than 15 minutes given 2 CPUs. We also believe it will not take more than 6GB. We specify this as:

withName: dedup
{
	time =  '15m'
	cpus = 2
	memory = '6 GB'  
}	

where the process name is the one used in the main YAMP.nf file.

We provide some standard values for time, CPUs, and memory. Feel free to play with them (they are an overestimate, by the way).

We also provide pre-set values for real-world analysis (in ./conf/base.config). These values have been optimised using our in-house metagenomic dataset which is composed of about 2000 faecal samples with very different data quality and, thus, very different requirements. These values may require some tuning, but we are confident that they will cover most of the users' scenarios.

Run YAMP

We can now run YAMP with the following command:

nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
   --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
   -profile test,docker

where -profile test,docker tells YAMP to use the following profiles (as specified in nextflow.config):

test {
  includeConfig 'conf/test.config'
}

docker {
  docker.enabled = true
  docker.runOptions = '-u \$(id -u):\$(id -g)'
}

that is, to use the parameters specified in ./conf/test.config and to enable the use of Docker containers. You should always provide a profile. Please read the How to use Nextflow profiles tutorial for details.

We used mode complete because we wanted to run the entire workflow. If we wanted to limit our analysis to the quality control steps, we would have set --mode QC. In this case, the command would have been:

nextflow run YAMP.nf --reads1 random_ncbi_reads_with_duplicated_and_contaminants_R1.fastq.gz \
   --reads2 random_ncbi_reads_with_duplicated_and_contaminants_R2.fastq.gz \
   --mode QC -profile test,docker

Please note that parameters set on the command line override those specified in the config files.

It is also possible to run YAMP on externally QC'ed files using --mode characterisation, as detailed in the How to run YAMP with QC'ed reads tutorial.

Output files

When the computation finishes, the tests folder will contain a subfolder called test (that is, named after the selected prefix), which contains the following files:

├── fastqc
│   ├── test_QCd_fastqc.html
│   ├── test_QCd_fastqc.zip
│   ├── random_ncbi_reads_with_duplicated_and_contaminants_R1_fastqc.html
│   ├── random_ncbi_reads_with_duplicated_and_contaminants_R1_fastqc.zip
│   ├── random_ncbi_reads_with_duplicated_and_contaminants_R2_fastqc.html
│   └── random_ncbi_reads_with_duplicated_and_contaminants_R2_fastqc.zip
├── test_alpha_diversity.tsv
├── test.biom
├── test_genefamilies.tsv
├── test_HUMAnN.log
├── test_metaphlan_bugs_list.tsv
├── test_multiqc_data_complete
│   ├── multiqc_data.json
│   ├── multiqc_fastqc_fastqc_qcd.yaml
│   ├── multiqc_fastqc_fastqc_raw.yaml
│   ├── multiqc_general_stats.yaml
│   ├── multiqc.log
│   └── multiqc_sources.yaml
├── test_multiqc_report_complete.html
├── test_pathabundance.tsv
├── test_pathcoverage.tsv
└── test_QCd.fq.gz

The fastqc folder will include information on the raw and QC'd reads quality as generated by FastQC (this folder will not be present when YAMP is run in characterisation mode).

The QC steps will generate a single file test_QCd.fq.gz which contains all the reads that survived the quality control.

The taxonomic binning and profiling will return two files:

test.biom
test_metaphlan_bugs_list.tsv

the functional annotation four:

test_genefamilies.tsv
test_pathabundance.tsv
test_pathcoverage.tsv
test_HUMAnN.log

and the alpha-diversity one:

test_alpha_diversity.tsv

Finally, a log file is produced by MultiQC (test_multiqc_report_complete.html), while detailed MultiQC information is included in a separate folder (test_multiqc_data_complete). More information on the logs can be found here.
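
All top-level output files are named after the prefix, so a complete-mode run can be checked mechanically. The following is a minimal sketch under that assumption. Both function names are hypothetical, and the suffix list is copied from the listing above, so it would need adjusting for other modes.

```shell
# Print the top-level files expected from a complete-mode run
# for the given prefix (suffixes taken from the listing above).
expected_outputs() {
    for suffix in _alpha_diversity.tsv .biom _genefamilies.tsv _HUMAnN.log \
        _metaphlan_bugs_list.tsv _multiqc_report_complete.html \
        _pathabundance.tsv _pathcoverage.tsv _QCd.fq.gz
    do
        printf '%s%s\n' "$1" "$suffix"
    done
}

# Print the expected files that are absent from a sample folder.
# Arguments: sample folder (for example tests/test), prefix.
missing_outputs() {
    expected_outputs "$2" | while IFS= read -r name; do
        [ -e "$1/$name" ] || printf '%s\n' "$name"
    done
}
```

For this tutorial, missing_outputs tests/test test should print nothing after a successful run.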

Nextflow will also create, in the folder where it has been run, a working folder called work, which can be removed with the following command (we suggest doing so since it could be very large):

rm -rf work/
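
If you would like to see how much space is being reclaimed, a small wrapper can report the folder size before deleting it. This is only a sketch, and clean_work_dir is a hypothetical name. Note that removing the working folder also discards the cached intermediate results that Nextflow relies on to resume a run.

```shell
# Report the size of a Nextflow working folder, then delete it.
# Defaults to ./work; does nothing (and succeeds) if the folder is absent.
clean_work_dir() {
    dir="${1:-work}"
    [ -d "$dir" ] || return 0
    du -sh "$dir"
    rm -rf "$dir"
}
```

Nextflow also provides a clean subcommand that removes the working files of previous runs; see nextflow clean -h for its options.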