From 58d7b85c944f01d4c8de8ce41d5415ddc287b376 Mon Sep 17 00:00:00 2001 From: MaxUlysse Date: Mon, 21 Sep 2020 15:58:13 +0200 Subject: [PATCH 01/11] update docs, SciLifeLab logo, F1000 article --- README.md | 58 +- docs/README.md | 25 +- docs/abstracts/2020-10-VCBS.md | 36 + docs/annotation.md | 90 --- docs/ascat.md | 176 ----- docs/containers.md | 147 ---- docs/images/SciLifeLab_logo.png | Bin 8387 -> 12212 bytes docs/images/SciLifeLab_logo.svg | 178 +++-- docs/input.md | 238 ------ docs/install_bianca.md | 202 ----- docs/output.md | 481 +++++++----- docs/reference.md | 62 -- docs/sentieon.md | 73 -- docs/usage.md | 1213 ++++++++++++++++++++++++------- docs/use_cases.md | 125 ---- docs/variant_calling.md | 57 -- 16 files changed, 1421 insertions(+), 1740 deletions(-) create mode 100644 docs/abstracts/2020-10-VCBS.md delete mode 100644 docs/annotation.md delete mode 100644 docs/ascat.md delete mode 100644 docs/containers.md delete mode 100644 docs/input.md delete mode 100644 docs/install_bianca.md delete mode 100644 docs/reference.md delete mode 100644 docs/sentieon.md delete mode 100644 docs/use_cases.md delete mode 100644 docs/variant_calling.md diff --git a/README.md b/README.md index 29fab9a95d..9d486710a5 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ > **An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing** -[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.07.0--RC1-brightgreen.svg)](https://www.nextflow.io/) +[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A519.10.0-brightgreen.svg)](https://www.nextflow.io/) [![nf-core](https://img.shields.io/badge/nf--core-pipeline-brightgreen.svg)](https://nf-co.re/) [![DOI](https://zenodo.org/badge/184289291.svg)](https://zenodo.org/badge/latestdoi/184289291) @@ -10,11 +10,9 @@ [![GitHub Actions Linting status](https://github.com/nf-core/sarek/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/sarek/actions?query=workflow%3A%22nf-core+linting%22) [![CircleCi build status](https://img.shields.io/circleci/project/github/nf-core/sarek?logo=circleci)](https://circleci.com/gh/nf-core/sarek/) -[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/) -[![Docker Container available](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek/) -[![Install with Singularity](https://img.shields.io/badge/use%20with-singularity-purple.svg)](https://www.sylabs.io/docs/) - -[![Join us on Slack](https://img.shields.io/badge/slack-nfcore/sarek-blue.svg)](https://nfcore.slack.com/channels/sarek) +[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](https://bioconda.github.io/) +[![Docker](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek) +[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23sarek-4A154B?logo=slack)](https://nfcore.slack.com/channels/sarek) ## Introduction @@ -33,49 +31,31 @@ It's listed on [Elixir - Tools and Data Services Registry](https://bio.tools/Sar ## Quick Start -i. Install [`Nextflow`](https://nf-co.re/usage/installation) +1. Install [`Nextflow`](https://nf-co.re/usage/installation) -ii. 
Install either [`Docker`](https://docs.docker.com/engine/installation/) or [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) for full pipeline reproducibility (please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))
+2. Install either [`Docker`](https://docs.docker.com/engine/installation/) or [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_

-iii. Download the pipeline and test it on a minimal dataset with a single command
+3. Download the pipeline and test it on a minimal dataset with a single command:

-```bash
-nextflow run nf-core/sarek -profile test,<docker/singularity/conda/institute>
-```
+   ```bash
+   nextflow run nf-core/sarek -profile test,<docker/singularity/conda/institute>
+   ```

-> Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute.
-> If so, you can simply use `-profile <institute>` in your command.
-> This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+   > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute.
+   > If so, you can simply use `-profile <institute>` in your command.
+   > This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.

-iv. Start running your own analysis!
+4. Start running your own analysis!

-```bash
-nextflow run nf-core/sarek -profile <docker/singularity/conda/institute> --input '*.tsv' --genome GRCh38
-```
+   ```bash
+   nextflow run nf-core/sarek -profile <docker/singularity/conda/institute> --input '*.tsv' --genome GRCh38
+   ```

 See [usage docs](docs/usage.md) for all of the available options when running the pipeline.

 ## Documentation

-The nf-core/sarek pipeline comes with documentation about the pipeline, found in the `docs/` directory:
-
-1. [Installation](https://nf-co.re/usage/installation)
-2. Pipeline configuration
-    * [Local installation](https://nf-co.re/usage/local_installation)
-    * [Adding your own system config](https://nf-co.re/usage/adding_own_config)
-    * [Install on a secure cluster](docs/install_bianca.md)
-    * [Reference genomes](https://nf-co.re/usage/reference_genomes)
-    * [Extra documentation on reference](docs/reference.md)
-3. [Running the pipeline](docs/usage.md)
-    * [Examples](docs/use_cases.md)
-    * [Input files documentation](docs/input.md)
-    * [Documentation about containers](docs/containers.md)
-4. [Output and how to interpret the results](docs/output.md)
-    * [Extra documentation on variant calling](docs/variant_calling.md)
-    * [Complementary information about ASCAT](docs/ascat.md)
-    * [Complementary information about Sentieon](docs/sentieon.md)
-    * [Extra documentation on annotation](docs/annotation.md)
-5. [Troubleshooting](https://nf-co.re/usage/troubleshooting)
+The nf-core/sarek pipeline comes with documentation about the pipeline which you can read at [https://nf-co.re/sarek/docs](https://nf-co.re/sarek/docs) or find in the [`docs/` directory](docs).

 ## Credits

@@ -135,7 +115,7 @@ For further information or help, don't hesitate to get in touch on [Slack](https

 ## Citation

 If you use `nf-core/sarek` for your analysis, please cite the `Sarek` article as follows:

-> Garcia M, Juhos S, Larsson M et al. **Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 1; peer review: 2 approved]** *F1000Research* 2020, 9:63 [doi: 10.12688/f1000research.16665.1](http://dx.doi.org/10.12688/f1000research.16665.1).
+> Garcia M, Juhos S, Larsson M et al. **Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 2; peer review: 2 approved]** *F1000Research* 2020, 9:63 [doi: 10.12688/f1000research.16665.2](http://dx.doi.org/10.12688/f1000research.16665.2).

 You can cite the sarek zenodo record for a specific version using the following [doi: 10.5281/zenodo.3476426](https://zenodo.org/badge/latestdoi/184289291)

diff --git a/docs/README.md b/docs/README.md
index 43fe182c67..9ae0a5eaf5 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,21 +1,10 @@
 # nf-core/sarek: Documentation

-The nf-core/sarek documentation is split into the following files:
+The nf-core/sarek documentation is split into the following pages:

-1. [Installation](https://nf-co.re/usage/installation)
-2. Pipeline configuration
-    * [Local installation](https://nf-co.re/usage/local_installation)
-    * [Adding your own system config](https://nf-co.re/usage/adding_own_config)
-    * [Install on a secure cluster](install_bianca.md)
-    * [Reference genomes](https://nf-co.re/usage/reference_genomes)
-    * [Extra documentation on reference](reference.md)
-3. [Running the pipeline](usage.md)
-    * [Examples](use_cases.md)
-    * [Input files documentation](input.md)
-    * [Documentation about containers](containers.md)
-4. [Output and how to interpret the results](output.md)
-    * [Extra documentation on variant calling](variant_calling.md)
-    * [Complementary information about ASCAT](ascat.md)
-    * [Complementary information about Sentieon](sentieon.md)
-    * [Extra documentation on annotation](annotation.md)
-5. [Troubleshooting](https://nf-co.re/usage/troubleshooting)
+- [Usage](usage.md)
+  - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
+- [Output](output.md)
+  - An overview of the different results produced by the pipeline and how to interpret them.
+
+You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re)

diff --git a/docs/abstracts/2020-10-VCBS.md b/docs/abstracts/2020-10-VCBS.md
new file mode 100644
index 0000000000..ca4b00314c
--- /dev/null
+++ b/docs/abstracts/2020-10-VCBS.md
@@ -0,0 +1,36 @@
# Victorian Cancer Bioinformatics Symposium - online, 2020-10-23

## Sarek, a reproducible and portable workflow for analysis of matching tumor-normal NGS data

Maxime Garcia [1], Szilveszter Juhos [1], Teresita Díaz de Ståhl [1], Markus Mayrhofer [2], Johanna Sandgren [1], Björn Nystedt [2], Monica Nistér [1]

[1] Dept. of Oncology Pathology, The Swedish Childhood Tumor Biobank (Barntumörbanken, BTB); Karolinska Institutet
[2] Dept. of Cell and Molecular Biology; National Bioinformatics Infrastructure Sweden, Science for Life Laboratory; Uppsala University

### Introduction

High-throughput sequencing for precision medicine is now a routine method.
Numerous tools have to be combined, and the analysis is time-consuming.
We propose Sarek, an open-source, container-based bioinformatics workflow for germline samples or tumor/normal pairs (optionally including matched relapses), written in Nextflow, to process WGS, whole-exome or gene-panel samples.

### Methods

Sarek is part of nf-core, a collection of high-quality, peer-reviewed workflows; supported environments are Docker, Singularity and Conda, enabling version tracking and reproducibility.
It is designed with flexible environments in mind: a local fat node, an HTC cluster or a cloud environment like AWS.
Several model organism references are available (including Human GRCh37 and GRCh38).
Sarek is based on GATK best practices to prepare short-read data.
The pipeline then reports germline and somatic SNVs and SVs (HaplotypeCaller, Strelka, Mutect2, Manta and TIDDIT).
CNVs, purity and ploidy are estimated with ASCAT and Control-FREEC.
At the end of the analysis, the resulting VCF files can be annotated by snpEff and/or VEP to facilitate further downstream processing.
Furthermore, a broad set of QC metrics is reported as a final step of the workflow with MultiQC.
Additional software can be included as new modules.

### Results

From FASTQs to annotated VCFs, running the complete set of tools on a paired 90X/90X WGS sample takes four days on a 48-core node.
Processing can be sped up with the optional use of Sentieon (C).
Sarek is used in production at the National Genomics Infrastructure Sweden for germline and cancer samples from the Swedish Childhood Tumor Biobank and other research groups.

### Conclusion

Sarek is an easy-to-use tool for germline or cancer NGS samples, to be downloaded from [nf-co.re/sarek](https://nf-co.re/sarek) under the MIT license.

diff --git a/docs/annotation.md b/docs/annotation.md
deleted file mode 100644
index ba94d85671..0000000000
--- a/docs/annotation.md
+++ /dev/null
@@ -1,90 +0,0 @@
# Annotation

## Tools

With Sarek, annotation is done using `snpEff`, `VEP`, or even both consecutively:

- `--tools snpEff`
  - To annotate using `snpEff`
- `--tools VEP`
  - To annotate using `VEP`
- `--tools snpEff,VEP`
  - To annotate using `snpEff` and `VEP`
- `--tools merge`
  - To annotate using `snpEff` followed by `VEP`

VCF files produced by Sarek will be annotated if `snpEff` or `VEP` are specified with the `--tools` option.
As Sarek will use `bgzip` and `tabix` to compress and index the annotated VCF files, it expects the input VCF files to be sorted.

In these examples, all command lines will be launched starting with `--step annotate`.
The pipeline can of course be started directly from any other step instead.

## Using genome-specific containers

Sarek provides ready-made containers with `snpEff` and `VEP` files for Human (`GRCh37`, `GRCh38`), Mouse (`GRCm38`), Dog (`CanFam3.1`) and Roundworm (`WBcel235`).
Default settings will run using these containers.

The main Sarek container also has `snpEff` and `VEP` installed, but without the cache files, which can be downloaded separately.

## Download cache

A Nextflow helper script has been designed to help downloading `snpEff` and `VEP` caches.
Such files are meant to be shared between multiple users, so this script is mainly meant for people administrating servers or clusters, and for advanced users.

```bash
nextflow run download_cache.nf --snpeff_cache </path/to/snpeff_cache> --snpeff_db <snpEff DB version> --genome <GENOME>
nextflow run download_cache.nf --vep_cache </path/to/vep_cache> --species <species> --vep_cache_version <VEP cache version> --genome <GENOME>
```
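For example, to populate a shared cache for the human `GRCh38` build (a sketch; the paths are placeholders, and the snpEff database and VEP cache versions shown are illustrative and must be compatible with the snpEff and VEP versions in the containers):

```bash
# Download the snpEff cache for GRCh38 into a shared directory
nextflow run download_cache.nf --snpeff_cache /data/cache/snpeff_cache --snpeff_db GRCh38.86 --genome GRCh38

# Download the matching VEP cache for the same genome build
nextflow run download_cache.nf --vep_cache /data/cache/vep_cache --species homo_sapiens --vep_cache_version 99 --genome GRCh38
```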
## Using downloaded cache

Both `snpEff` and `VEP` enable usage of a cache.
If a cache is available on the machine where Sarek is run, annotation can be run using that cache.
You need to specify the cache directories using `--snpeff_cache` and `--vep_cache` on the command line or within configuration files.
The cache will only be used when `--annotation_cache` and the cache directories are specified (either on the command line or in a configuration file).

Example:

```bash
nextflow run nf-core/sarek --tools snpEff --step annotate --sample <file.vcf.gz> --snpeff_cache </path/to/snpeff_cache> --annotation_cache
nextflow run nf-core/sarek --tools VEP --step annotate --sample <file.vcf.gz> --vep_cache </path/to/vep_cache> --annotation_cache
```

## Using VEP CADD plugin

To enable the use of the VEP CADD plugin:

- Download the CADD files
- Specify them (either on the command line, as in the example, or in a configuration file)
- Use the `--cadd_cache` flag

Example:

```bash
nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --cadd_cache \
    --cadd_indels </path/to/CADD/InDels.tsv.gz> \
    --cadd_indels_tbi </path/to/CADD/InDels.tsv.gz.tbi> \
    --cadd_wg_snvs </path/to/CADD/whole_genome_SNVs.tsv.gz> \
    --cadd_wg_snvs_tbi </path/to/CADD/whole_genome_SNVs.tsv.gz.tbi>
```

### Downloading CADD files

A helper script has been designed to help downloading CADD files.
Such files are meant to be shared between multiple users, so this script is mainly meant for people administrating servers or clusters, and for advanced users.

```bash
nextflow run download_cache.nf --cadd_cache </path/to/CADD/cache> --cadd_version <CADD version> --genome <GENOME>
```

## Using VEP GeneSplicer plugin

To enable the use of the VEP GeneSplicer plugin, use the `--genesplicer` flag.

Example:

```bash
nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --genesplicer
```

diff --git a/docs/ascat.md b/docs/ascat.md
deleted file mode 100644
index 718363e00a..0000000000
--- a/docs/ascat.md
+++ /dev/null
@@ -1,176 +0,0 @@
# ASCAT

## Introduction

ASCAT is a software tool for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination).
ASCAT is written in R and available here: [github.com/Crick-CancerGenomics/ascat](https://github.com/Crick-CancerGenomics/ascat).

To run ASCAT on NGS data we need BAM files for the tumor and normal samples, as well as a loci file with SNP positions.
If ASCAT is run on SNP array data, the loci file contains the SNPs on the chip.
When running ASCAT on NGS data we can use the same loci file, for example the one corresponding to the Affymetrix Genome-Wide Human SNP Array 6.0, but we can also choose a loci file of our choice, e.g. with SNPs detected in the 1000 Genomes project.

### BAF and LogR values

Running ASCAT on NGS data requires that the BAM files are converted into BAF and LogR values.
This can be done using the software [AlleleCount](https://github.com/cancerit/alleleCount) followed by a simple R script (a sketch of the counting step is shown after the formulas below).
AlleleCount extracts the number of reads in a BAM file supporting each allele at specified SNP positions.
Based on this, BAF and LogR can be calculated for every SNP position i as:

```R
BAFi(tumor) = countsBi(tumor) / (countsAi(tumor) + countsBi(tumor))
BAFi(normal) = countsBi(normal) / (countsAi(normal) + countsBi(normal))
LogRi(tumor) = log2((countsAi(tumor) + countsBi(tumor)) / (countsAi(normal) + countsBi(normal))) - median(log2((countsA(tumor) + countsB(tumor)) / (countsA(normal) + countsB(normal))))
LogRi(normal) = 0
```

For male samples, the X and Y chromosome markers have special treatment:

```R
LogRi(tumor) = log2((countsAi(tumor) + countsBi(tumor)) / (countsAi(normal) + countsBi(normal))) - 1 - median(log2((countsA(tumor) + countsB(tumor)) / (countsA(normal) + countsB(normal))) - 1)
```

where:

- *i* corresponds to the positions of all SNPs in the loci file
- *countsA* and *countsB* are vectors containing the number of reads supporting the *A* and *B* alleles of all SNPs
- *A* = the major allele
- *B* = the minor allele
- *minor* and *major* alleles are defined in the loci file (it actually doesn't matter which one is defined as A and B in this application)
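The counting step itself can be run with the `alleleCounter` executable from the AlleleCount package. A minimal sketch, with placeholder file names; check the flags against the AlleleCount version you have installed:

```bash
# Count reads supporting each allele at the loci positions, for the tumor and the normal BAM
alleleCounter -l 1000G_phase3_20130502_SNP_maf0.3.loci -r reference.fasta -b tumor.bam -o tumor.alleleCount
alleleCounter -l 1000G_phase3_20130502_SNP_maf0.3.loci -r reference.fasta -b normal.bam -o normal.alleleCount
```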
Calculation of LogR and BAF based on AlleleCount output is done as in [runASCAT.R](https://github.com/cancerit/ascatNgs/tree/dev/perl/share/ascat/runASCAT.R) in the ascatNgs repository on GitHub.

### Loci file

The loci file was created based on the latest 1000 Genomes release (phase 3, release date 20130502), available [here](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp//release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz).
The following filter was applied: only bi-allelic SNPs with a minor allele frequency > 0.3.

The loci file was originally generated for GRCh37.
It was translated into GRCh38 using the tool liftOver, available at the UCSC Genome Browser.
To run liftOver, the loci file was first written in BED format:

```bash
awk '{print "chr"$1":"$2"-"$2}' 1000G_phase3_20130502_SNP_maf0.3.loci > 1000G_phase3_20130502_SNP_maf0.3.bed
```

Using the web interface to liftOver at [genome.ucsc.edu](https://genome.ucsc.edu/cgi-bin/hgLiftOver), the file was translated into GRCh38 coordinates.
LiftOver was possible for 3261270 out of 3268043 SNPs.
The converted SNP positions were printed in the format required by AlleleCounter by:

```bash
more hglft_genome_5834_13aba0.bed | awk 'BEGIN{FS="chr"} {print $2}' | awk 'BEGIN{FS="-"} {print $1}' | awk 'BEGIN{FS=":";OFS="\t"} {print $1,$2}' > 1000G_phase3_GRCh38_maf0.3.loci
```

### GC correction file

Input files for ASCAT's GC correction were created for the above loci files, using the scripts and instructions in the [gcProcessing directory of ASCAT's GitHub repository](https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing).

#### Scripts and data for generating the GC correction file

The following scripts were downloaded from the [gcProcessing directory](https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing):

- *createGCcontentFile.R*
- *createWindowBed.pl*
- *GCfileCreation.sh*

To generate the GC correction file, additional files are needed:

- *locifile*: the loci file described above
- *reference.fasta*: the genome reference file in FASTA format (both files are described in the [genomes and reference files documentation](reference.md))
- *chromosomesizes.txt*: a tab-delimited text file containing the sizes of all chromosomes included in the loci file

An example file is available in

#### Modification of createWindowBed.pl for our GRCh37 loci file

The genome reference file we use in Sarek for GRCh37 is coded without "chr" in the chromosome names, while the genome reference file we use in Sarek for GRCh38 includes "chr" in the chromosome names.
The script [createWindowBed.pl](https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing/createWindowBed.pl) assumes that `chr` is included in the chromosome names of the reference file, so a small modification of this script was done for the process to work on our GRCh37 loci file.
These two lines in createWindowBed.pl generate output (lines 61 and 64):

```perl
(61) print OUT "chr".$tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2+1)."\n";
(64) print OUT "chr".$tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2)."\n";
```

and were changed to:

```perl
(61) print OUT $tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2+1)."\n";
(64) print OUT $tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2)."\n";
```

After this modification the script works for our GRCh37 loci file.

#### Process

The following sbatch script was run on the UPPMAX cluster Rackham to generate the GC correction files:

```bash
#!/bin/bash -l
#SBATCH -A projid
#SBATCH -p node
#SBATCH -t 24:00:00
#SBATCH -J createGCfile
module load bioinfo-tools
module load BEDTools
module load R/3.5.0
module load R_packages/3.5.0
./GCfileCreation.sh 1000G_phase3_20130502_SNP_maf0.3.loci chrom.sizes 19 human_g1k_v37_decoy.fasta
```

where:

- *1000G_phase3_20130502_SNP_maf0.3.loci* is the loci file for GRCh37 described above
- *human_g1k_v37_decoy.fasta* is the genome reference file used for GRCh37
- *chrom.sizes* is the list of the chromosome lengths in GRCh37
- *19* means that 19 cores are available for the script

The names of the chromosomes in the chrom.sizes file must be the same as in the genome reference, so for GRCh37 we used "1", "2" etc., and for GRCh38 we used "chr1", "chr2" etc.

This created GC correction files with the following column headers:

| | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Chr|Position|25|50|100|200|500|1000|2000|5000|10000|20000|50000|100000|200000|500000|1M|2M|5M|10M|

This file gave an error when running ASCAT, and the error message suggested that it had to do with the column headers.
The Readme.txt in the [gcProcessing directory](https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing) suggested that the column headers should be:

| | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Chr|Position|25bp|50bp|100bp|200bp|500bp|1000bp|2000bp|5000bp|10000bp|20000bp|50000bp|100000bp|200000bp|500000bp|1M|2M|5M|10M|

The column headers of the generated GC correction files were therefore edited manually.

#### Format of GC correction file

The final files are tab-delimited with the following columns (and some example data); the first, unnamed column contains the SNP identifiers:

| |Chr|Position|25bp|50bp|100bp|200bp|500bp|1000bp|2000bp|5000bp|10000bp|20000bp|50000bp|100000bp|200000bp|500000bp|1M|2M|5M|10M|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|snp1|1|14930|0.541667|0.58|0.61|0.585|0.614|0.62|0.6|0.5888|0.588|0.4277|0.395041|0.380702|0.383259|0.341592|0.339747|0.386343|0.500537|0.511514|
|snp2|1|15211|0.625|0.64|0.67|0.63|0.61|0.612|0.6135|0.591|0.5922|0.4358|0.39616|0.380411|0.383167|0.34163|0.339771|0.386417|0.500558|0.511511|
|snp3|1|15820|0.541667|0.56|0.62|0.655|0.65|0.612|0.5885|0.5936|0.5797|0.4511|0.397771|0.379945|0.382999|0.341791|0.339832|0.386554|0.500579|0.511504|

### Output

The ASCAT process gives several images as output, described in detail in this [book chapter](http://www.ncbi.nlm.nih.gov/pubmed/22130873).
The script also outputs a text file (*tumor.cnvs.txt*) with information about the copy number state of all the segments predicted by ASCAT.
The output is a tab-delimited text file with the following columns:

```text
chr startpos endpos nMajor nMinor
```

where:

- *chr* is the chromosome number
- *startpos* is the start position of the segment
- *endpos* is the end position of the segment
- *nMajor* is the number of copies of one of the alleles (for example the chromosome inherited from the father)
- *nMinor* is the number of copies of the other allele (for example the chromosome inherited from the mother)

The file *tumor.cnvs.txt* contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor = 1) and those corresponding to copy number aberrations.

diff --git a/docs/containers.md b/docs/containers.md
deleted file mode 100644
index a926d1f709..0000000000
--- a/docs/containers.md
+++ /dev/null
@@ -1,147 +0,0 @@
# Containers

Our main container is designed using [Conda](https://conda.io/) to install all tools used in Sarek:

- [sarek](#sarek-)

For annotation, the main container can be used, but the cache has to be downloaded, or additional containers with the cache included are available (see the [extra annotation documentation](annotation.md)):

- [sareksnpeff](#sareksnpeff-)
- [sarekvep](#sarekvep-)

## What is actually inside the containers

### sarek [![sarek-docker status](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek)

- Based on `nfcore/base:1.9`
- Contains **[ASCAT](https://github.com/Crick-CancerGenomics/ascat)** 2.5.2
- Contains **[AlleleCount](https://github.com/cancerit/alleleCount)** 4.0.2
- Contains **[BCFTools](https://github.com/samtools/bcftools)** 1.9
- Contains **[bwa-mem2](https://github.com/bwa-mem2/bwa-mem2)** 2.0
- Contains **[Control-FREEC](https://github.com/BoevaLab/FREEC)** 11.5
- Contains **[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)** 0.11.9
- Contains **[FreeBayes](https://github.com/ekg/freebayes)** 1.3.2
- Contains **[GATK4-spark](https://github.com/broadinstitute/gatk)** 4.1.6.0
- Contains **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0
- Contains **[ggplot2](https://github.com/tidyverse/ggplot2)** 3.3.0
- Contains **[HTSlib](https://github.com/samtools/htslib)** 1.9
- Contains **[Manta](https://github.com/Illumina/manta)** 1.6.0
- Contains **[msisensor](https://github.com/ding-lab/msisensor)** 0.5
- Contains **[MultiQC](https://github.com/ewels/MultiQC/)** 1.8
- Contains **[Qualimap](http://qualimap.bioinfo.cipf.es)** 2.2.2d
- Contains **[samtools](https://github.com/samtools/samtools)** 1.9
- Contains **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t
- Contains **[Strelka2](https://github.com/Illumina/strelka)** 2.9.10
- Contains **[TIDDIT](https://github.com/SciLifeLab/TIDDIT)** 2.7.1
- Contains **[pigz](https://zlib.net/pigz/)** 2.3.4
- Contains **[Trim Galore](https://github.com/FelixKrueger/TrimGalore)** 0.6.5
- Contains **[VCFanno](https://github.com/brentp/vcfanno)** 0.3.2
- Contains **[VCFtools](https://vcftools.github.io/index.html)** 0.1.16
- Contains **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2

### sareksnpeff [![sareksnpeff-docker status](https://img.shields.io/docker/automated/nfcore/sareksnpeff.svg)](https://hub.docker.com/r/nfcore/sareksnpeff)

- Based on `nfcore/base:1.9`
- Contains **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t
- Contains cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235`

### sarekvep [![sarekvep-docker status](https://img.shields.io/docker/automated/nfcore/sarekvep.svg)](https://hub.docker.com/r/nfcore/sarekvep)
- Based on `nfcore/base:1.9`
- Contains **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0
- Contains **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2
- Contains cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235`

## Building your own

Our containers are designed using [Conda](https://conda.io/).
The [`environment.yml`](../environment.yml) file can be modified if particular versions of tools are more suited to your needs.

The following commands can be used to build/download containers on your own system:

- Adjust `VERSION` for the Sarek version (typically a release or `dev`).

### Build with Conda

```Bash
conda env create -f environment.yml
```

### Build with Docker

- `sarek`

```Bash
docker build -t nfcore/sarek:<VERSION> .
```

- `sareksnpeff`

Adjust arguments for the `GENOME` version and the snpEff `CACHE_VERSION`:

```Bash
docker build -t nfcore/sareksnpeff:<VERSION>.<GENOME> containers/snpeff/. --build-arg GENOME=<GENOME> --build-arg CACHE_VERSION=<CACHE_VERSION>
```

- `sarekvep`

Adjust arguments for the `GENOME` version, the `SPECIES` name and the VEP `VEP_VERSION`:

```Bash
docker build -t nfcore/sarekvep:<VERSION>.<GENOME> containers/vep/. --build-arg GENOME=<GENOME> --build-arg SPECIES=<SPECIES> --build-arg VEP_VERSION=<VEP_VERSION>
```

### Pull with Docker

- `sarek`

```Bash
docker pull nfcore/sarek:<VERSION>
```

- `sareksnpeff`

Adjust arguments for the `GENOME` version:

```Bash
docker pull nfcore/sareksnpeff:<VERSION>.<GENOME>
```

- `sarekvep`

Adjust arguments for the `GENOME` version:

```Bash
docker pull nfcore/sarekvep:<VERSION>.<GENOME>
```

### Pull with Singularity

You can directly pull a Singularity image into the path used by the Nextflow environment variable `NXF_SINGULARITY_CACHEDIR`, i.e.:

```Bash
cd $NXF_SINGULARITY_CACHEDIR
singularity build ...
```

- `sarek`

```Bash
singularity build nfcore-sarek-<VERSION>.img docker://nfcore/sarek:<VERSION>
```

- `sareksnpeff`

Adjust arguments for the `GENOME` version:

```Bash
singularity build nfcore-sareksnpeff-<VERSION>.<GENOME>.img docker://nfcore/sareksnpeff:<VERSION>.<GENOME>
```

- `sarekvep`

Adjust arguments for the `GENOME` version:

```Bash
singularity build nfcore-sarekvep-<VERSION>.<GENOME>.img docker://nfcore/sarekvep:<VERSION>.<GENOME>
```

diff --git a/docs/images/SciLifeLab_logo.png b/docs/images/SciLifeLab_logo.png
index bc4dbda623f8f70d82f883c5019638d86016a298..e71d44b9319ed8f02e860517807e1269eaec9646 100644
GIT binary patch
literal 12212
[base85-encoded binary payload for the updated PNG logo omitted]
diff --git a/docs/images/SciLifeLab_logo.svg b/docs/images/SciLifeLab_logo.svg
index b8a44b794e..3602a3b855 100644
--- a/docs/images/SciLifeLab_logo.svg
+++ b/docs/images/SciLifeLab_logo.svg
[SVG markup changes omitted: the logo is updated to the "SciLifeLab_Logotype_Green_POS" version, with new dimensions and transform]

diff --git a/docs/input.md b/docs/input.md
deleted file mode 100644
index c92c84d5e6..0000000000
--- a/docs/input.md
+++ /dev/null
@@ -1,238 +0,0 @@
# Input Documentation

## General information about the TSV files

Input files for Sarek can be specified using a TSV (Tab Separated Values) file given to the `--input` parameter (note that the delimiter is the tab (`\t`) character).
There are different kinds of TSV files that can be used as input, depending on the input files available (FASTQ, uBAM, BAM...).
For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:

- `subject` designates the subject; it should be the ID of the subject and must be unique for each subject, but one subject can have multiple samples (e.g. normal and tumor)
- `sex` is the sex chromosome configuration of the subject (XX or XY)
- `status` is the status of the measured sample (0 for Normal or 1 for Tumor)
- `sample` designates the sample; it should be the ID of the sample (it is possible to have more than one tumor sample for each subject, i.e. a tumor and a relapse); it must be unique, but samples can have multiple lanes (which will later be merged)
- `lane` is used when the sample is multiplexed on several lanes; it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
- `fastq1` is the path to the first FASTQ file of the read pair
- `fastq2` is the path to the second FASTQ file of the read pair
- `bam` is the path to the BAM file
- `bai` is the path to the BAM index file
- `recaltable` is the path to the recalibration table
- `mpileup` is the path to the mpileup file

It is recommended to add the absolute paths of the files, but relative paths should also work.

All examples are given for a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it were a normal sample instead of a normal/tumor pair, producing the germline Variant Calling results only.

Sarek will output results in a different directory for each sample.
If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
Multiple TSV files can be specified if the path is enclosed in quotes, as shown in the example below.

Output from Variant Calling and/or Annotation will be in a specific directory for each sample (or normal/tumor pair if applicable).
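For example, a set of such TSV files can be passed at once with a quoted glob, the same way the quick-start command in the README does (the directory layout here is hypothetical):

```bash
nextflow run nf-core/sarek --input 'subjects/*.tsv' -profile <docker/singularity/conda/institute> --genome GRCh38
```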
## Starting from the mapping step

When starting from the mapping step (`--step mapping`), the first step of Sarek, the input can have three different forms:

- A TSV file containing the sample metadata and the paths to the paired-end FASTQ files
- The path to a directory containing the FASTQ files
- A TSV file containing the sample metadata and the paths to the unmapped BAM (uBAM) files

### Providing a TSV file with the path to FASTQ files

The TSV file to start the mapping step with paired-end FASTQs should contain the columns:

`subject sex status sample lane fastq1 fastq2`

In this example, the normal sample has 3 read groups and the tumor sample has 2:

|subject|sex|status|sample|lane|fastq1|fastq2|
|-|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|C09DF_1|/path/to/C09DFACXX111207.1_1.fastq.gz|/path/to/C09DFACXX111207.1_2.fastq.gz|
|G15511|XX|0|C09DFN|C09DF_2|/path/to/C09DFACXX111207.2_1.fastq.gz|/path/to/C09DFACXX111207.2_2.fastq.gz|
|G15511|XX|0|C09DFN|C09DF_3|/path/to/C09DFACXX111207.3_1.fastq.gz|/path/to/C09DFACXX111207.3_2.fastq.gz|
|G15511|XX|1|D0ENMT|D0ENM_1|/path/to/D0ENMACXX111207.1_1.fastq.gz|/path/to/D0ENMACXX111207.1_2.fastq.gz|
|G15511|XX|1|D0ENMT|D0ENM_2|/path/to/D0ENMACXX111207.2_1.fastq.gz|/path/to/D0ENMACXX111207.2_2.fastq.gz|
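Such a TSV file can be written with any tool that produces real tab characters between columns; a quick sketch with `printf` for the first read group above (the output file name is arbitrary):

```bash
# '\t' produces the literal tab delimiter that Sarek expects
printf 'G15511\tXX\t0\tC09DFN\tC09DF_1\t/path/to/C09DFACXX111207.1_1.fastq.gz\t/path/to/C09DFACXX111207.1_2.fastq.gz\n' > fastq_input.tsv
```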
### Providing the path to a FASTQ directory

Input files for Sarek can be specified using the path to a FASTQ directory given to the `--input` parameter, only with the `mapping` step:

```bash
nextflow run nf-core/sarek --input </path/to/FASTQ/directory> ...
```

#### Input FASTQ file name best practices

The input folder, containing the FASTQ files for one subject (ID), should be organized into one sub-folder for every sample.
All FASTQ files for that sample should be collected there:

```text
ID
+--sample1
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
```

FASTQ filename structure:

- `sample_lib_flowcell-index_lane_R1_1000.fastq.gz` and
- `sample_lib_flowcell-index_lane_R2_1000.fastq.gz`

where:

- `sample` = sample id
- `lib` = identifier of library preparation
- `flowcell` = identifier of flow cell for the sequencing run
- `lane` = identifier of the lane of the sequencing run

Read group information will be parsed from the FASTQ file names according to this:

- `RGID` = "sample_lib_flowcell_index_lane"
- `RGPL` = "Illumina"
- `PU` = sample
- `RGLB` = lib

### Providing a TSV file with the paths to uBAM files

The TSV file for starting the mapping from uBAM files should contain the columns:

- `subject sex status sample lane bam`

In this example, the normal sample has 3 read groups and the tumor sample has 2:

|subject|sex|status|sample|lane|bam|
|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|C09DF_1|/path/to/C09DFAC_1.bam|
|G15511|XX|0|C09DFN|C09DF_2|/path/to/C09DFAC_2.bam|
|G15511|XX|0|C09DFN|C09DF_3|/path/to/C09DFAC_3.bam|
|G15511|XX|1|D0ENMT|D0ENM_1|/path/to/D0ENMAC_1.bam|
|G15511|XX|1|D0ENMT|D0ENM_2|/path/to/D0ENMAC_2.bam|

## Starting from the BAM prepare recalibration step

To start from the preparation of the recalibration step (`--step prepare_recalibration`), a TSV file for a normal/tumor pair needs to be given as input, containing the paths to the non-recalibrated but already mapped BAM files.
The TSV needs to contain the following columns:

- `subject sex status sample bam bai`

Likewise, if you have non-recalibrated BAMs and their indexes, you should use a structure like:

|subject|sex|status|sample|bam|bai|
|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.md.bam|/path/to/G15511.C09DFN.md.bai|
|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.md.bam|/path/to/G15511.D0ENMT.md.bai|

When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process.
This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using the parameter `--step prepare_recalibration` will automatically take this file as input.

Additionally, individual TSV files for each sample (`duplicates_marked_no_table_[SAMPLE].tsv`) can be found in the same directory.

If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:

|subject|sex|status|sample|bam|bai|
|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.bam|/path/to/G15511.C09DFN.bai|
|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.bam|/path/to/G15511.D0ENMT.bai|

When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `Mapping` processes.
This TSV file is stored under `results/Preprocessing/TSV/mapped.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using the parameters `--step prepare_recalibration --skip_markduplicates` will automatically take this file as input.

Additionally, individual TSV files for each sample (`mapped_[SAMPLE].tsv`) can be found in the same directory.
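For example, to resume from the duplicate-marked BAMs of a previous run using that automatically generated TSV (profile and genome values are illustrative):

```bash
nextflow run nf-core/sarek --step prepare_recalibration \
    --input results/Preprocessing/TSV/duplicates_marked_no_table.tsv \
    -profile <docker/singularity/conda/institute> --genome GRCh38
```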
## Starting from the BAM recalibration step

To start from the recalibration step (`--step recalibrate`), a TSV file for a normal/tumor pair needs to be given as input, containing the paths to the non-recalibrated but already mapped BAM files.
The TSV needs to contain the following columns:

- `subject sex status sample bam bai recaltable`

Likewise, if you have non-recalibrated BAMs, their indexes and their recalibration tables, you should use a structure like:

|subject|sex|status|sample|bam|bai|recaltable|
|-|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.md.bam|/path/to/G15511.C09DFN.md.bai|/path/to/G15511.C09DFN.recal.table|
|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.md.bam|/path/to/G15511.D0ENMT.md.bai|/path/to/G15511.D0ENMT.recal.table|

When starting Sarek from the mapping step, a TSV file is generated automatically after the `BaseRecalibrator` processes.
This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step recalibrate` will automatically take this file as input.

Additionally, individual TSV files for each sample (`duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.

If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:

|subject|sex|status|sample|bam|bai|recaltable|
|-|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.bam|/path/to/G15511.C09DFN.bai|/path/to/G15511.C09DFN.recal.table|
|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.bam|/path/to/G15511.D0ENMT.bai|/path/to/G15511.D0ENMT.recal.table|

When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `BaseRecalibrator` processes.
This TSV file is stored under `results/Preprocessing/TSV/mapped_no_duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step recalibrate` will automatically take this file as input.

Additionally, individual TSV files for each sample (`mapped_no_duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.

## Starting from the variant calling step

A TSV file for a normal/tumor pair with recalibrated BAM files and their indexes can be provided to start Sarek from the variant calling step (`--step variantcalling`).
The TSV file should contain the columns:

- `subject sex status sample bam bai`

Here is an example for two samples from the same subject:

|subject|sex|status|sample|bam|bai|
|-|-|-|-|-|-|
|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.recal.bam|/path/to/G15511.C09DFN.recal.bai|
|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.recal.bam|/path/to/G15511.D0ENMT.recal.bai|

When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes.
This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files.
Using the parameter `--step variantcalling` will automatically take this file as input.

Additionally, individual TSV files for each sample (`recalibrated_[SAMPLE].tsv`) can be found in the same directory.
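For example, to resume at variant calling with that TSV (the tool selection shown is illustrative):

```bash
nextflow run nf-core/sarek --step variantcalling \
    --input results/Preprocessing/TSV/recalibrated.tsv \
    --tools HaplotypeCaller,Strelka,Manta \
    -profile <docker/singularity/conda/institute> --genome GRCh38
```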
- -## Starting from the mpileup file with the Control-FREEC step - -To start from the Control-FREEC step (`--step Control-FREEC`), a TSV file for a normal/tumor pair needs to be given as input containing the paths to the mpileup files. -The TSV needs to contain the following columns: - -- `subject sex status sample mpileup` - -Here is an example for one normal/tumor pair from one subjects: - -| | | | | | -|-|-|-|-|-| -|G15511|XX|0|C09DFN|/path/to/G15511.C09DFN.pileup| -|G15511|XX|1|D0ENMT|/path/to/G15511.D0ENMT.pileup| - -When starting Sarek from the Control-FREEC step, a TSV file is generated automatically after the `mpileup` process. -This TSV file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and can be used to restart Sarek from the mpileup files. -Using the parameter `--step Control-FREEC` will automatically take this file as input. - -Additionally, individual TSV files for each sample (`control-freec_mpileup_[SAMPLE].tsv`) can be found in the same directory. - -## VCF files for annotation - -Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the annotation step (`--step annotate`). -As Sarek will use `bgzip` and `tabix` to compress and index VCF files annotated, it expects VCF files to be sorted. -Multiple VCF files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes. -For example: - -```bash -nextflow run nf-core/sarek --step annotate --input "results/VariantCalling/*/{HaplotypeCaller,Manta,Mutect2,Strelka,TIDDIT}/*.vcf.gz" ... -``` diff --git a/docs/install_bianca.md b/docs/install_bianca.md deleted file mode 100644 index 5418db21d1..0000000000 --- a/docs/install_bianca.md +++ /dev/null @@ -1,202 +0,0 @@ -# Installation on a secure cluster - -This small tutorial will explain to you how to install and run nf-core/sarek on a small sample test data on the Swedish UPPMAX cluster `bianca` made for sensitive data. -It can be followed to install on any similar secure cluster. - -For more information about `bianca`, follow the [`bianca` user guide](http://uppmax.uu.se/support/user-guides/bianca-user-guide/). -For more information about using Singularity with UPPMAX, follow the [Singularity UPPMAX guide](https://www.uppmax.uu.se/support-sv/user-guides/singularity-user-guide/). 
- -## Install Nextflow - -```bash -# Connect to rackham -> ssh -AX [USER]@rackham.uppmax.uu.se -# Or just open a terminal - -# Download the all Nextflow bundle -> wget https://github.com/nextflow-io/nextflow/releases/download/v[xx.yy.zz]/nextflow-[xx.yy.zz]-all - -# Send to bianca (here using sftp) -# For FileZilla follow the bianca user guide -> sftp [USER]-[PROJECT]@bianca-sftp.uppmax.uu.se:[USER]-[PROJECT] -> put nextflow-[xx.yy.zz]-all - -# Exit sftp -> exit - -# Connect to bianca -> ssh -A [USER]-[PROJECT]@bianca.uppmax.uu.se - -# Go to your project -> cd /castor/project/proj_nobackup - -# Make directory for Nextflow -> mkdir tools -> mkdir tools/nextflow - -# Move Nextflow from wharf to its directory -> mv /castor/project/proj_nobackup/wharf/[USER]/[USER]-[PROJECT]/nextflow-[xx.yy.zz]-all /castor/project/proj_nobackup/tools/nextflow - -# Establish permission -> chmod a+x /castor/project/proj_nobackup/tools/nextflow/nextflow-[xx.yy.zz]-all - -# If you want other people to use it -# Be sure that your group has rights to the directory as well - -> chown -R .[PROJECT] /castor/project/proj_nobackup/tools/nextflow/nextflow-[xx.yy.zz]-all - -# Make a link to it -> ln -s /castor/project/proj_nobackup/tools/nextflow/nextflow-[xx.yy.zz]-all /castor/project/proj_nobackup/tools/nextflow/nextflow - -# And everytime you're launching Nextflow, don't forget to export the following ENV variables -# Or add them to your .bashrc file -> export NXF_HOME=/castor/project/proj/nobackup/tools/nextflow/ -> export PATH=${NXF_HOME}:${PATH} -> export NXF_TEMP=$SNIC_TMP -> export NXF_LAUNCHER=$SNIC_TMP -> export NXF_SINGULARITY_CACHEDIR=/sw/data/uppnex/ToolBox/sarek -``` - -## Install nf-core/sarek - -nf-core/sarek use Singularity containers to package all the different tools. -All containers, and all Reference files are already stored on UPPMAX. - -As `bianca` is secure, no direct download is available, so nf-core/sarek will have to be installed and updated manually. - -You can either download nf-core/sarek on your computer or on `rackham`, make an archive, and send it to `bianca` using `FileZilla` or `sftp` given your preferences. 
- -```bash -# Connect to rackham -> ssh -AX [USER]@rackham.uppmax.uu.se -# Or just open a terminal - -# Clone the repository -> git clone https://github.com/nf-core/sarek.git - -# Go to the newly created directory -> cd sarek - -# It is also possible to checkout a specific version using -> git checkout - -# You also include the nf-core/test-datasets and nf-core/configs using git-archive-all -# Install pip -> curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py -> python get-pip.py - -# If it fails due to permission, you could consider using -> python get-pip.py --user - -# Install git-archive-all using pip -> pip install git-archive-all -# If you used --user before, you might want to do that here too -> pip install git-archive-all --user -> ./scripts/make_snapshot.sh --include-test-data --include-configs - -# Or you can just include nf-core/sarek: -> ./scripts/make_snapshot.sh - -# You will get this message in your terminal -Wrote sarek-[snapID].tar.gz - -# Send the archive to bianca (here using sftp) -# For FileZilla follow the bianca user guide -> sftp [USER]-[PROJECT]@bianca-sftp.uppmax.uu.se:[USER]-[PROJECT] -> put sarek-[snapID].tar.gz -> exit - -# The archive will be in the wharf folder in your user home on your bianca project - -# Connect to bianca -> ssh -A [USER]-[PROJECT]@bianca.uppmax.uu.se - -# Go to your project -> cd /castor/project/proj_nobackup - -# Make and go into a nf-core/sarek directoy (where you will store all nf-core/sarek versions) -> mkdir sarek -> cd sarek - -# Copy the tar from wharf to the project -> cp /castor/project/proj_nobackup/wharf/[USER]/[USER]-[PROJECT]/sarek-[snapID].tgz /castor/project/proj_nobackup/sarek - -# extract the archive. Also remember to extract the containers you uploaded. -> tar xvzf sarek-[snapID].tgz - -# If you want other people to use it, -# Be sure that your group has rights to the directory as well -> chown -R .[PROJECT] sarek-[snapID] - -# Make a symbolic link to the extracted repository -> ln -s sarek-[snapID] default -``` - -The principle is to have every member of your project to be able to use the same nf-core/sarek version at the same time. -So every member of the project who wants to use nf-core/sarek will need to do: - -```bash -# Connect to bianca -> ssh -A [USER]-[PROJECT]@bianca.uppmax.uu.se - -# Go to your user directory -> cd /home/[USER] - -# Make a symbolic link to the default nf-core/sarek -> ln -s /castor/project/proj_nobackup/sarek/default sarek -``` - -Singularity images for Sarek are available on Uppmax in `/sw/data/uppnex/ToolBox/sarek`. -Sometimes Nextflow needs write access to the image folder, and if so the images needs to be copied to a location with write permission, for example in a subfolder of your project folder. - -```bash -#Create a folder for the singularity images somewhere in your project: -mkdir sarek_simg - -#Copy the relevant singularity image from the write protected folder on Uppmax to the folder where you have write permission: -cp /sw/data/uppnex/ToolBox/sarek/nfcore-sarek-dev.img /path/to/your/sarek_simg/. - -#Update the ENV parameter NXF_SINGULARITY_CACHEDIR -export NXF_SINGULARITY_CACHEDIR=/path/to/your/sarek_simg -``` - -And then nf-core/sarek can be used with: - -```bash -> nextflow run ~/sarek/main.nf -profile uppmax --custom_config_base ~/sarek/configs --project [PROJECT] --genome [GENOME ASSEMBLY] ... 
-``` - -This command worked on Bianca 20190906: - -```bash ->screen -S SAMPLE /path/to/nextflow run /path/to/sarek/main.nf -profile uppmax --project PROJID --sample SAMPLE.tsv --genome GRCh37 --genomes_base /sw/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37 --step variantcalling --tools ASCAT --igenomesIgnore - -#To detach screen: -ctrl-A-D -``` - -This is an example of how to run sarek with the tool Manta and the genome assembly version GRCh38: - -```bash -> nextflow run ~/sarek/main.nf -profile uppmax --custom_config_base ~/sarek/configs --project [PROJECT] --tools Manta --sample [SAMPLE.TSV] --genome GRCh38 -``` - -## Update nf-core/sarek - -Repeat the same steps as for installing nf-core/sarek, and once the tar has been extracted, you can replace the link. - -```bash -# Connect to bianca (Connect to rackham first if needed) -> ssh -A [USER]-[PROJECT]@bianca.uppmax.uu.se - -# Go to the sarek directory in your project -> cd /castor/project/proj_nobackup/sarek - -# Remove link -> rm default - -# Link to new nf-core/sarek version -> ln -s sarek-[NEWsnapID] default -``` - -You can for example keep a `default` version that you are sure is working, an make a link for a `testing` or `development` diff --git a/docs/output.md b/docs/output.md index 0e500afd00..ce5758a240 100644 --- a/docs/output.md +++ b/docs/output.md @@ -2,19 +2,25 @@ This document describes the output produced by the pipeline. +The directories listed below will be created in the results directory after the pipeline has finished. +All paths are relative to the top-level results directory. + ## Pipeline overview The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: - [Preprocessing](#preprocessing) - [Map to Reference](#map-to-reference) - - [BWA mem](#bwa-mem) + - [bwa](#bwa) + - [BWA-mem2](#bwa-mem2) - [Mark Duplicates](#mark-duplicates) - - [GATK MarkDuplicatesSpark](#gatk-markduplicatesspark) + - [GATK MarkDuplicates](#gatk-markduplicates) - [Base (Quality Score) Recalibration](#base-quality-score-recalibration) - [GATK BaseRecalibrator](#gatk-baserecalibrator) - [GATK ApplyBQSR](#gatk-applybqsr) - [TSV files](#tsv-files) + - [TSV files with `--skip_markduplicates`](#tsv-files-with---skip_markduplicates) + - [TSV files with `--sentieon`](#tsv-files-with---sentieon) - [Variant Calling](#variant-calling) - [SNVs and small indels](#snvs-and-small-indels) - [FreeBayes](#freebayes) @@ -43,7 +49,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [QC](#qc) - [FastQC](#fastqc) - [bamQC](#bamqc) - - [MarkDuplicates reports](#markduplicates-reports) + - [GATK MarkDuplicates reports](#gatk-markduplicates-reports) - [samtools stats](#samtools-stats) - [bcftools stats](#bcftools-stats) - [VCFtools](#vcftools) @@ -51,97 +57,123 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [VEP reports](#vep-reports) - [Reporting](#reporting) - [MultiQC](#multiqc) +- [Pipeline information](#pipeline-information) ## Preprocessing -Sarek preprocesses raw FastQ files or unmapped BAM files, based on [GATK best practices](https://software.broadinstitute.org/gatk/best-practices/). 
- -BAM files with Recalibration tables can also be used as an input to start with the recalibration of said BAM files, for more information see [TSV files output information](#tsv-files) +`Sarek` pre-processes raw `FASTQ` files or `unmapped BAM` files, based on [GATK best practices](https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows). ### Map to Reference -#### BWA mem +#### bwa -[BWA mem](http://bio-bwa.sourceforge.net/) is a software package for mapping low-divergent sequences against a large reference genome. +[bwa](https://github.com/lh3/bwa) is a software package for mapping low-divergent sequences against a large reference genome. Such files are intermediate and not kept in the final files delivered to users. -### Mark Duplicates +#### BWA-mem2 + +[BWA-mem2](https://github.com/bwa-mem2/bwa-mem2) is a software package for mapping low-divergent sequences against a large reference genome. -#### GATK MarkDuplicatesSpark +Such files are intermediate and not kept in the final files delivered to users. + +### Mark Duplicates -[GATK MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php) is a Spark implementation of [Picard MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php) and locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. +#### GATK MarkDuplicates -If the pipeline is run with the option `--no_gatk_spark` then [GATK MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.4.0/picard_sam_markduplicates_MarkDuplicates.php) is used instead. +By default, `Sarek` will use [GATK MarkDuplicatesSpark](https://gatk.broadinstitute.org/hc/en-us/articles/360042912511-MarkDuplicatesSpark), `Spark` implementation of [GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/360042477492-MarkDuplicates-Picard), which locates and tags duplicate reads in a `BAM` or `SAM` file, where duplicate reads are defined as originating from a single fragment of DNA. -This directory is the location for the BAM files delivered to users. -Besides the duplicates-marked BAM files, the recalibration tables (`*.recal.table`) are also stored, and can be used to create base recalibrated files. +Specify `--no_gatk_spark` to use `GATK MarkDuplicates` instead. -For further reading and documentation see the [data pre-processing workflow from the GATK best practices](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165). +This directory is the location for the `BAM` files delivered to users. +Besides the `duplicates-marked BAM` files, the recalibration tables (`*.recal.table`) are also stored, and can be used to create `recalibrated BAM` files. For all samples: + **Output directory: `results/Preprocessing/[SAMPLE]/DuplicatesMarked`** - `[SAMPLE].md.bam` and `[SAMPLE].md.bai` - - BAM file and index + - `BAM` file and index + +For further reading and documentation see the [data pre-processing for variant discovery from the GATK best practices](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery). 
### Base (Quality Score) Recalibration #### GATK BaseRecalibrator -[GATK BaseRecalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_BaseRecalibrator.php) generates a recalibration table based on various covariates. +[GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360042477672-BaseRecalibrator) generates a recalibration table based on various co-variates. For all samples: + **Output directory: `results/Preprocessing/[SAMPLE]/DuplicatesMarked`** - `[SAMPLE].recal.table` - - Recalibration Table associated to the duplicates marked BAMs. + - Recalibration table associated to the `duplicates-marked BAM` file. #### GATK ApplyBQSR -[GATK ApplyBQSR](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php) recalibrates the base qualities of the input reads based on the recalibration table produced by the [`BaseRecalibrator`](#gatk-baserecalibrator) tool. +[GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360042476852-ApplyBQSR) recalibrates the base qualities of the input reads based on the recalibration table produced by the [GATK BaseRecalibrator](#gatk-baserecalibrator) tool. -This directory is usually empty, it is the location for the final recalibrated BAM files. -Recalibrated BAM files are usually 2-3 times larger than the duplicates-marked BAM files. -To re-generate recalibrated BAM file you have to apply the recalibration table delivered to the `DuplicatesMarked` directory either within Sarek, or doing this recalibration step yourself. - -For further reading and documentation see the [data pre-processing workflow from the GATK best practices](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165). +This directory is the location for the final `recalibrated BAM` files. +`Recalibrated BAM` files are usually 2-3 times larger than the `duplicates-marked BAM` files. +To re-generate `recalibrated BAM` file you have to apply the recalibration table delivered to the `DuplicatesMarked\` folder either using `Sarek` ( [`--step recalibrate`](usage.md#step-recalibrate) ) , or doing this recalibration yourself. For all samples: + **Output directory: `results/Preprocessing/[SAMPLE]/Recalibrated`** - `[SAMPLE].recal.bam` and `[SAMPLE].recal.bam.bai` - - BAM file and index + - `BAM` file and index + +For further reading and documentation see the [data pre-processing for variant discovery from the GATK best practices](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery). ### TSV files -The TSV files are auto-generated and can be used by Sarek for further processing and/or variant calling. +The `TSV` files are auto-generated and can be used by `Sarek` for further processing and/or variant calling. -For further reading and documentation see the [input documentation](https://github.com/nf-core/sarek/blob/master/docs/input.md). +For further reading and documentation see the [`--input`](usage.md#--input) section in the usage documentation. For all samples: + **Output directory: `results/Preprocessing/TSV`** - `duplicates_marked_no_table.tsv`, `duplicates_marked.tsv` and `recalibrated.tsv` - - TSV files to start Sarek from `prepare_recalibration`, `recalibrate` or `variantcalling` steps. 
-- `duplicates_marked_no_table_[SAMPLE].tsv` `duplicates_marked_[SAMPLE].tsv` and `recalibrated_[SAMPLE].tsv` - - TSV files to start Sarek from `prepare_recalibration`, `recalibrate` or `variantcalling` steps for a specific sample. + - `TSV` files to start `Sarek` from `prepare_recalibration`, `recalibrate` or `variantcalling` steps. +- `duplicates_marked_no_table_[SAMPLE].tsv`, `duplicates_marked_[SAMPLE].tsv` and `recalibrated_[SAMPLE].tsv` + - `TSV` files to start `Sarek` from `prepare_recalibration`, `recalibrate` or `variantcalling` steps for a specific sample. -> `/!\` Only with [`--sentieon`](usage.md#--sentieon) +### TSV files with `--skip_markduplicates` + +> **WARNING** Only with [`--skip_markduplicates`](usage.md#--skip_markduplicates) For all samples: + +**Output directory: `results/Preprocessing/TSV`** + +- `mapped.tsv`, `mapped_no_duplicates_marked.tsv` and `recalibrated.tsv` + - `TSV` files to start `Sarek` from `prepare_recalibration`, `recalibrate` or `variantcalling` steps. +- `mapped_[SAMPLE].tsv`, `mapped_no_duplicates_marked_[SAMPLE].tsv` and `recalibrated_[SAMPLE].tsv` + - `TSV` files to start `Sarek` from `prepare_recalibration`, `recalibrate` or `variantcalling` steps for a specific sample. + +### TSV files with `--sentieon` + +> **WARNING** Only with [`--sentieon`](usage.md#--sentieon) + +For all samples: + **Output directory: `results/Preprocessing/TSV`** - `sentieon_deduped.tsv` and `recalibrated_sentieon.tsv` - - TSV files to start Sarek from `variantcalling` step. + - `TSV` files to start `Sarek` from `variantcalling` step. - `sentieon_deduped_[SAMPLE].tsv` and `recalibrated_sentieon_[SAMPLE].tsv` - - TSV files to start Sarek from `variantcalling` step for a specific sample. + - `TSV` files to start `Sarek` from `variantcalling` step for a specific sample. ## Variant Calling -All the results regarding Variant Calling are collected in this directory. If some results from a variant caller do not appear here, please check out the [Variant calling](./variant_calling.md) documentation. +All the results regarding Variant Calling are collected in this directory. +If some results from a variant caller do not appear here, please check out the [`--tools`](usage.md#--tools) section in the usage documentation. -Recalibrated BAM files can also be used as an input to start the Variant Calling, for more information see [TSV files output information](#tsv-files) +`Recalibrated BAM` files can used as an input to start the Variant Calling. ### SNVs and small indels @@ -149,146 +181,158 @@ Recalibrated BAM files can also be used as an input to start the Variant Calling [FreeBayes](https://github.com/ekg/freebayes) is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. -For further reading and documentation see the [FreeBayes manual](https://github.com/ekg/freebayes/blob/master/README.md#user-manual-and-guide). - For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/FreeBayes`** - `FreeBayes_[SAMPLE].vcf.gz` and `FreeBayes_[SAMPLE].vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index -#### GATK HaplotypeCaller +For further reading and documentation see the [FreeBayes manual](https://github.com/ekg/freebayes/blob/master/README.md#user-manual-and-guide). -[GATK HaplotypeCaller](https://github.com/broadinstitute/gatk) calls germline SNPs and indels via local re-assembly of haplotypes. 
+#### GATK HaplotypeCaller -Germline calls are provided for all samples, to able comparison of both tumor and normal for possible mixup. +[GATK HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360042913231-HaplotypeCaller) calls germline SNPs and indels via local re-assembly of haplotypes. -For further reading and documentation see the [HaplotypeCaller manual](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php). +Germline calls are provided for all samples, to enable comparison of both, tumor and normal, for possible mixup. For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/HaploTypeCaller`** - `HaplotypeCaller_[SAMPLE].vcf.gz` and `HaplotypeCaller_[SAMPLE].vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index -#### GATK GenotypeGVCFs +For further reading and documentation see the [HaplotypeCaller manual](https://gatk.broadinstitute.org/hc/en-us/articles/360042913231-HaplotypeCaller). -[GATK GenotypeGVCFs](https://github.com/broadinstitute/gatk) performs joint genotyping on one or more samples pre-called with HaplotypeCaller. +#### GATK GenotypeGVCFs -Germline calls are provided for all samples, to able comparison of both tumor and normal for possible mixup. +[GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360042914991-GenotypeGVCFs) performs joint genotyping on one or more samples pre-called with HaplotypeCaller. -For further reading and documentation see the [GenotypeGVCFs manual](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/org_broadinstitute_hellbender_tools_walkers_GenotypeGVCFs.php). +Germline calls are provided for all samples, to enable comparison of both, tumor and normal, for possible mixup. For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/HaplotypeCallerGVCF`** - `HaplotypeCaller_[SAMPLE].g.vcf.gz` and `HaplotypeCaller_[SAMPLE].g.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [GenotypeGVCFs manual](https://gatk.broadinstitute.org/hc/en-us/articles/360042914991-GenotypeGVCFs). #### GATK Mutect2 -[GATK Mutect2](https://github.com/broadinstitute/gatk) calls somatic SNVs and indels via local assembly of haplotypes. +[GATK Mutect2](https://gatk.broadinstitute.org/hc/en-us/articles/360042477952-Mutect2) calls somatic SNVs and indels via local assembly of haplotypes. + +For further reading and documentation see the [Mutect2 manual](https://gatk.broadinstitute.org/hc/en-us/articles/360042477952-Mutect2). +It is recommended to have [panel of normals (PON)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON) for this version of `GATK Mutect2` using at least 40 normal samples. +Additionally, you can add your `PON` file to get filtered somatic calls. -For further reading and documentation see the [Mutect2 manual](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php). -It is recommended to have panel of normals [PON](https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2) for this version of Mutect2 using at least 40 normal samples, and you can add your PON file to get filtered somatic calls. 
+For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/Mutect2`** Files created: - `Mutect2_unfiltered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz` and `Mutect2_unfiltered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi` - - unfiltered (raw) Mutect2 calls VCF with Tabix index + - unfiltered (raw) Mutect2 calls `VCF` with Tabix index - `Mutect2_filtered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz` and `Mutect2_filtered_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi` - - filtered Mutect2 calls VCF with Tabix index: these entries has a PASS filter, you can get these when supplying a panel of normals using the `--pon` option + - filtered Mutect2 calls `VCF` with Tabix index: these entries have a `PASS` filter, you can get these when supplying a panel of normals using the `--pon` option - `[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.stats` - - a stats file generated during calling raw variants (needed for filtering) + - a stats file generated during calling of raw variants (needed for filtering) - `[TUMORSAMPLE]_contamination.table` - - a text file exported when panel-of-normals provided about sample contamination + - a text file exported when panel-of-normals about sample contamination are provided #### samtools mpileup -[samtools mpileup](https://www.htslib.org/doc/samtools.html) generate pileup for a BAM file. - -For further reading and documentation see the [samtools manual](https://www.htslib.org/doc/samtools.html#COMMANDS_AND_OPTIONS). +[samtools mpileup](https://www.htslib.org/doc/samtools.html) generates pileup of a `BAM` file. For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/mpileup`** - `[SAMPLE].pileup.gz` - - The pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. Alignment records are grouped by sample (SM) identifiers in @RG header lines. + - The pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. Alignment records are grouped by sample (`SM`) identifiers in `@RG` header lines. + +For further reading and documentation see the [samtools manual](https://www.htslib.org/doc/samtools.html#COMMANDS_AND_OPTIONS). #### Strelka2 [Strelka2](https://github.com/Illumina/strelka) is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. -For further reading and documentation see the [Strelka2 user guide](https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/README.md). 
- For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/Strelka`** - `Strelka_Sample_genome.vcf.gz` and `Strelka_Sample_genome.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Strelka_Sample_variants.vcf.gz` and `Strelka_Sample_variants.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index For a Tumor/Normal pair: + **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka`** - `Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz` and `Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz` and `Strelka_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +Using [Strelka Best Practices](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md#somatic-configuration-example) with the `candidateSmallIndels` from `Manta`: -Using [Strelka Best Practices](https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/README.md#somatic-configuration-example) with the `candidateSmallIndels` from `Manta`: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/Strelka`** - `StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz` and `StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_indels.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz` and `StrelkaBP_[TUMORSAMPLE]_vs_[NORMALSAMPLE]_somatic_snvs.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Strelka2 user guide](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md). #### Sentieon DNAseq -> `/!\` Only with [`--sentieon`](usage.md#--sentieon) +> **WARNING** Only with [`--sentieon`](usage.md#--sentieon) [Sentieon DNAseq](https://www.sentieon.com/products/#dnaseq) implements the same mathematics used in the Broad Institute's BWA-GATK HaplotypeCaller 3.3-4.1 Best Practices Workflow pipeline. -For further reading and documentation see the [Sentieon DNAseq user guide](https://support.sentieon.com/manual/DNAseq_usage/dnaseq/). - For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/SentieonDNAseq`** - `DNAseq_Sample.vcf.gz` and `DNAseq_Sample.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Sentieon DNAseq user guide](https://support.sentieon.com/manual/DNAseq_usage/dnaseq/). #### Sentieon DNAscope -> `/!\` Only with [`--sentieon`](usage.md#--sentieon) +> **WARNING** Only with [`--sentieon`](usage.md#--sentieon) [Sentieon DNAscope](https://www.sentieon.com/products) calls SNPs and small indels. -For further reading and documentation see the [Sentieon DNAscope user guide](https://support.sentieon.com/manual/DNAscope_usage/dnascope/). - For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/SentieonDNAscope`** - `DNAscope_Sample.vcf.gz` and `DNAscope_Sample.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Sentieon DNAscope user guide](https://support.sentieon.com/manual/DNAscope_usage/dnascope/). #### Sentieon TNscope -> `/!\` Only with [`--sentieon`](usage.md#--sentieon) +> **WARNING** Only with [`--sentieon`](usage.md#--sentieon) [Sentieon TNscope](https://www.sentieon.com/products/#tnscope) calls SNPs and small indels on an Tumor/Normal pair. 
-For further reading and documentation see the [Sentieon TNscope user guide](https://support.sentieon.com/manual/TNscope_usage/tnscope/). - For a Tumor/Normal pair: + **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/SentieonTNscope`** - `TNscope_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz` and `TNscope_[TUMORSAMPLE]_vs_[NORMALSAMPLE].vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Sentieon TNscope user guide](https://support.sentieon.com/manual/TNscope_usage/tnscope/). ### Structural Variants @@ -296,86 +340,92 @@ For a Tumor/Normal pair: [Manta](https://github.com/Illumina/manta) calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. -`Manta` provides a candidate list for small indels also that can be fed to `Strelka` following [Strelka Best Practices](https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/README.md#somatic-configuration-example). - -For further reading and documentation see the [Manta user guide](https://github.com/Illumina/manta/blob/master/docs/userGuide/README.md). +`Manta` provides a candidate list for small indels that can be fed to `Strelka` following [Strelka Best Practices](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md#somatic-configuration-example). For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/Manta`** - `Manta_[SAMPLE].candidateSmallIndels.vcf.gz` and `Manta_[SAMPLE].candidateSmallIndels.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Manta_[SAMPLE].candidateSV.vcf.gz` and `Manta_[SAMPLE].candidateSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index For Normal sample only: - `Manta_[NORMALSAMPLE].diploidSV.vcf.gz` and `Manta_[NORMALSAMPLE].diploidSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index For a Tumor sample only: - `Manta_[TUMORSAMPLE].tumorSV.vcf.gz` and `Manta_[TUMORSAMPLE].tumorSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/Manta`** - `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz` and `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz` and `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz` and `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz` and `Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Manta user guide](https://github.com/Illumina/manta/blob/master/docs/userGuide/README.md). #### TIDDIT [TIDDIT](https://github.com/SciLifeLab/TIDDIT) identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions. -Germline calls are provided for all samples, to able comparison of both tumor and normal for possible mixup. -Low quality calls are removed internally, to simplify processing of variant calls but they are saved by Sarek. 
- -For further reading and documentation see the [TIDDIT manual](https://github.com/SciLifeLab/TIDDIT/blob/master/README.md). +Germline calls are provided for all samples, to enable comparison of both, tumor and normal, for possible mixup. +Low quality calls are removed internally, to simplify processing of variant calls but they are saved by `Sarek`. For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/TIDDIT`** - `TIDDIT_[SAMPLE].vcf.gz` and `TIDDIT_[SAMPLE].vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index - `TIDDIT_[SAMPLE].signals.tab` - tab file describing coverage across the genome, binned per 50 bp - `TIDDIT_[SAMPLE].ploidy.tab` - tab file describing the estimated ploidy and coverage across each contig - `TIDDIT_[SAMPLE].old.vcf` - - VCF including the low qualiy calls + - `VCF` including the low qualiy calls - `TIDDIT_[SAMPLE].wig` - wiggle file containing coverage across the genome, binned per 50 bp - `TIDDIT_[SAMPLE].gc.wig` - wiggle file containing fraction of gc content, binned per 50 bp +For further reading and documentation see the [TIDDIT manual](https://github.com/SciLifeLab/TIDDIT/blob/master/README.md). + #### Sentieon DNAscope SV -> `/!\` Only with [`--sentieon`](usage.md#--sentieon) +> **WARNING** Only with [`--sentieon`](usage.md#--sentieon) [Sentieon DNAscope](https://www.sentieon.com/products) can perform structural variant calling in addition to calling SNPs and small indels. -For further reading and documentation see the [Sentieon DNAscope user guide](https://support.sentieon.com/manual/DNAscope_usage/dnascope/). - For all samples: + **Output directory: `results/VariantCalling/[SAMPLE]/SentieonDNAscope`** - `DNAscope_SV_Sample.vcf.gz` and `DNAscope_SV_Sample.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [Sentieon DNAscope user guide](https://support.sentieon.com/manual/DNAscope_usage/dnascope/). ### Sample heterogeneity, ploidy and CNVs #### ConvertAlleleCounts -[ConvertAlleleCounts](https://github.com/nf-core/sarek/blob/master/bin/convertAlleleCounts.r) is a R-script for converting output from AlleleCount to BAF and LogR values. +Running ASCAT on NGS data requires that the `BAM` files are converted into BAF and LogR values. +This can be done using the software [AlleleCount](https://github.com/cancerit/alleleCount) followed by the provided [ConvertAlleleCounts](https://github.com/nf-core/sarek/blob/master/bin/convertAlleleCounts.r) R-script. + +For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT`** - `[TUMORSAMPLE].BAF` and `[NORMALSAMPLE].BAF` @@ -385,12 +435,13 @@ For a Tumor/Normal pair only: #### ASCAT -[ASCAT](https://github.com/Crick-CancerGenomics/ascat) is a method to derive copy number profiles of tumor cells, accounting for normal cell admixture and tumor aneuploidy. -ASCAT infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles. +[ASCAT](https://github.com/Crick-CancerGenomics/ascat) is a software for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination). +It infers tumor purity and ploidy and calculates whole-genome allele-specific copy number profiles. +`ASCAT` is written in `R` and available here: [github.com/Crick-CancerGenomics/ascat](https://github.com/Crick-CancerGenomics/ascat). 
+The `ASCAT` process gives several images as output, described in detail in this [book chapter](http://www.ncbi.nlm.nih.gov/pubmed/22130873). -For further reading and documentation see [the Sarek documentation about ASCAT](https://github.com/nf-core/sarek/blob/master/docs/ascat.md) or the [ASCAT manual](https://www.crick.ac.uk/research/labs/peter-van-loo/software). +For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/ASCAT`** - `[TUMORSAMPLE].aberrationreliability.png` @@ -412,15 +463,27 @@ For a Tumor/Normal pair only: - `[TUMORSAMPLE].purityploidy.txt` - file with information about purity ploidy +The text file `[TUMORSAMPLE].cnvs.txt` countains predictions about copy number state for all the segments. +The output is a tab delimited text file with the following columns: + +- *chr*: chromosome number +- *startpos*: start position of the segment +- *endpos*: end position of the segment +- *nMajor*: number of copies of one of the allels (for example the chromosome inherited from the father) +- *nMinor*: number of copies of the other allele (for example the chromosome inherited of the mother) + +The file `[TUMORSAMPLE].cnvs.txt` contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor =1) and those corresponding to copy number aberrations. + +For further reading and documentation see the [ASCAT manual](https://www.crick.ac.uk/research/labs/peter-van-loo/software). + #### Control-FREEC -[Control-FREEC](https://github.com/BoevaLab/FREEC) is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data. -Control-FREEC automatically computes, normalizes, segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH. -And also detects subclonal gains and losses and evaluate the likeliest average ploidy of the sample. +[Control-FREEC](https://github.com/BoevaLab/FREEC) is a tool for detection of copy-number changes and allelic imbalances (including loss of heterozygoity (LOH)) using deep-sequencing data. +`Control-FREEC` automatically computes, normalizes, segments copy number and beta allele frequency profiles, then calls copy number alterations and LOH. +And also detects subclonal gains and losses and evaluate the most likely average ploidy of the sample. -For further reading and documentation see the [Control-FREEC manual](http://boevalab.com/FREEC/tutorial.html). +For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMOR_vs_NORMAL]/ControlFREEC`** - `[TUMORSAMPLE]_vs_[NORMALSAMPLE].config.txt` @@ -432,83 +495,81 @@ For a Tumor/Normal pair only: - `[TUMORSAMPLE].pileup.gz_BAF.txt` and `[NORMALSAMPLE].pileup.gz_BAF.txt` - file with beta allele frequencies for each possibly heterozygous SNP position +For further reading and documentation see the [Control-FREEC manual](http://boevalab.com/FREEC/tutorial.html). + ### MSI status -[Microsatellite instability](https://en.wikipedia.org/wiki/Microsatellite_instability) -is a genetic condition associated to deficiencies in the -mismatch repair (MMR) system which causes a tendency to accumulate a high -number of mutations (SNVs and indels). +[Microsatellite instability](https://en.wikipedia.org/wiki/Microsatellite_instability) is a genetic condition associated to deficiencies in the mismatch repair (MMR) system which causes a tendency to accumulate a high number of mutations (SNVs and indels). 
+An altered distribution of microsatellite length is associated to a missed replication slippage which would be corrected under normal MMR conditions. #### MSIsensor -[MSIsensor](https://github.com/ding-lab/msisensor) is a tool to detect the MSI -status of a tumor scanning the length of the microsatellite regions. An altered -distribution of microsatellite length is associated to a missed replication -slippage which would be corrected under normal mismatch repair (MMR) conditions. It requires -a normal sample for each tumour to differentiate the somatic and germline -cases. +[MSIsensor](https://github.com/ding-lab/msisensor) is a tool to detect the MSI status of a tumor scanning the length of the microsatellite regions. +It requires a normal sample for each tumour to differentiate the somatic and germline cases. -For further reading see the [MSIsensor paper](https://www.ncbi.nlm.nih.gov/pubmed/24371154). +For a Tumor/Normal pair: -For a Tumor/Normal pair only: **Output directory: `results/VariantCalling/[TUMORSAMPLE]_vs_[NORMALSAMPLE]/MSIsensor`** -- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]`_msisensor +- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor` - MSI score output, contains information about the number of somatic sites. -- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]`_msisensor_dis +- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_dis` - The normal and tumor length distribution for each microsatellite position. -- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]`_msisensor_germline - - somatic sites detected -- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]`_msisensor_somatic - - germ line sites detected +- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_germline` + - Somatic sites detected. +- `[TUMORSAMPLE]_vs_[NORMALSAMPLE]_msisensor_somatic` + - Germline sites detected. + +For further reading see the [MSIsensor paper](https://www.ncbi.nlm.nih.gov/pubmed/24371154). ## Variant annotation -This directory contains results from the final annotation steps: two software are used for annotation, [snpEff](http://snpeff.sourceforge.net/) and [VEP](https://www.ensembl.org/info/docs/tools/vep/index.html). -Only a subset of the VCF files are annotated, and only variants that have a PASS filter. -FreeBayes results are not annotated in the moment yet as we are lacking a decent somatic filter. -For HaplotypeCaller the germline variations are annotated for both the tumor and the normal sample. +This directory contains results from the final annotation steps: two tools are used for annotation, [snpEff](http://snpeff.sourceforge.net/) and [VEP](https://www.ensembl.org/info/docs/tools/vep/index.html). +Only a subset of the `VCF` files are annotated, and only variants that have a `PASS` filter. +Currently, `FreeBayes` results are not annotated as we are lacking a decent somatic filter. ### snpEff [snpeff](http://snpeff.sourceforge.net/) is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes) using multiple databases for annotations. -The generated VCF header contains the software version and the used command line. - -For further reading and documentation see the [snpEff manual](http://snpeff.sourceforge.net/SnpEff_manual.html#outputSummary) +The generated `VCF` header contains the software version and the used command line. 
For all samples: + **Output directory: `results/Annotation/[SAMPLE]/snpEff`** - `VariantCaller_Sample_snpEff.ann.vcf.gz` and `VariantCaller_Sample_snpEff.ann.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index -### VEP +For further reading and documentation see the [snpEff manual](http://snpeff.sourceforge.net/SnpEff_manual.html#outputSummary) -[VEP (Variant Effect Predictor)](https://www.ensembl.org/info/docs/tools/vep/index.html), based on Ensembl, is a tools to determine the effects of all sorts of variants, including SNPs, indels, structural variants, CNVs. -The generated VCF header contains the software version, also the version numbers for additional databases like Clinvar or dbSNP used in the "VEP" line. -The format of the [consequence annotations](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) is also in the VCF header describing the INFO field. -In the moment it contains: - -- Consequence: impact of the variation, if there is any -- Codons: the codon change, i.e. cGt/cAt -- Amino_acids: change in amino acids, i.e. R/H if there is any -- Gene: ENSEMBL gene name -- SYMBOL: gene symbol -- Feature: actual transcript name -- EXON: affected exon -- PolyPhen: prediction based on [PolyPhen](http://genetics.bwh.harvard.edu/pph2/) -- SIFT: prediction by [SIFT](http://sift.bii.a-star.edu.sg/) -- Protein_position: Relative position of amino acid in protein -- BIOTYPE: Biotype of transcript or regulatory feature +### VEP -For further reading and documentation see the [VEP manual](https://www.ensembl.org/info/docs/tools/vep/index.html) +[VEP (Variant Effect Predictor)](https://www.ensembl.org/info/docs/tools/vep/index.html), based on `Ensembl`, is a tool to determine the effects of all sorts of variants, including SNPs, indels, structural variants, CNVs. +The generated `VCF` header contains the software version, also the version numbers for additional databases like `Clinvar` or `dbSNP` used in the `VEP` line. +The format of the [consequence annotations](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) is also in the `VCF` header describing the `INFO` field. +Currently, it contains: + +- *Consequence*: impact of the variation, if there is any +- *Codons*: the codon change, i.e. cGt/cAt +- *Amino_acids*: change in amino acids, i.e. R/H if there is any +- *Gene*: ENSEMBL gene name +- *SYMBOL*: gene symbol +- *Feature*: actual transcript name +- *EXON*: affected exon +- *PolyPhen*: prediction based on [PolyPhen](http://genetics.bwh.harvard.edu/pph2/) +- *SIFT*: prediction by [SIFT](http://sift.bii.a-star.edu.sg/) +- *Protein_position*: Relative position of amino acid in protein +- *BIOTYPE*: Biotype of transcript or regulatory feature For all samples: + **Output directory: `results/Annotation/[SAMPLE]/VEP`** - `VariantCaller_Sample_VEP.ann.vcf.gz` and `VariantCaller_Sample_VEP.ann.vcf.gz.tbi` - - VCF with Tabix index + - `VCF` with Tabix index + +For further reading and documentation see the [VEP manual](https://www.ensembl.org/info/docs/tools/vep/index.html) ## QC and reporting @@ -516,105 +577,112 @@ For all samples: #### FastQC -[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your reads. -It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). -You get information about adapter contamination and other overrepresented sequences. 
- -For further reading and documentation see the [FastQC help](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). +[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. +It provides information about the quality score distribution across your reads, per base sequence content (`%A/T/G/C`), adapter contamination and overrepresented sequences. For all samples: + **Output directory: `results/Reports/[SAMPLE]/fastqc`** - `sample_R1_XXX_fastqc.html` and `sample_R2_XXX_fastqc.html` - - FastQC report, containing quality metrics for each pair of the raw fastq files + - `FastQC` report containing quality metrics for your untrimmed raw `FASTQ` files - `sample_R1_XXX_fastqc.zip` and `sample_R2_XXX_fastqc.zip` - - zip file containing the FastQC reports, tab-delimited data files and plot images + - Zip archive containing the FastQC report, tab-delimited data file and plot images + +> **NB:** The `FastQC` plots displayed in the `MultiQC` report shows _untrimmed_ reads. +> They may contain adapter sequence and potentially regions with low quality. + +For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). #### bamQC -[Qualimap bamqc](http://qualimap.bioinfo.cipf.es/) reports information for the evaluation of the quality of the provided alignment data. In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced. +[Qualimap bamqc](http://qualimap.bioinfo.cipf.es/) reports information for the evaluation of the quality of the provided alignment data. +In short, the basic statistics of the alignment (number of reads, coverage, GC-content, etc.) are summarized and a number of useful graphs are produced. Plot will show: - Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc. For all samples: + **Output directory: `results/Reports/[SAMPLE]/bamQC`** - `VariantCaller_[SAMPLE].bcf.tools.stats.out` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` -For more information about how to use Qualimap bamqc reports, see [Qualimap bamqc manual](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#id7) +For further reading and documentation see the [Qualimap bamqc manual](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#id7) -#### MarkDuplicates reports +#### GATK MarkDuplicates reports -[[GATK MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php), Spark implementation of [Picard MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php) locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. - -If the pipeline is run with the option `--no_gatk_spark` then [GATK MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.4.0/picard_sam_markduplicates_MarkDuplicates.php) is used instead. - -Collecting duplicate metrics slows down performance. -To disable them use `--skip_qc MarkDuplicates`. +More information in the [GATK MarkDuplicates section](#gatk-markduplicates) Duplicates can arise during sample preparation _e.g._ library construction using PCR. 
Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates. For all samples: + **Output directory: `results/Reports/[SAMPLE]/MarkDuplicates`** - `[SAMPLE].bam.metrics` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` For further reading and documentation see the [MarkDuplicates manual](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_markduplicates_MarkDuplicates.php). #### samtools stats -[samtools stats](https://www.htslib.org/doc/samtools.html) collects statistics from BAM files and outputs in a text format. +[samtools stats](https://www.htslib.org/doc/samtools.html) collects statistics from `BAM` files and outputs in a text format. + Plots will show: - Alignment metrics. For all samples: + **Output directory: `results/Reports/[SAMPLE]/SamToolsStats`** - `[SAMPLE].bam.samtools.stats.out` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` -For further reading and documentation see the [samtools manual](https://www.htslib.org/doc/samtools.html#COMMANDS_AND_OPTIONS) +For further reading and documentation see the [`samtools` manual](https://www.htslib.org/doc/samtools.html#COMMANDS_AND_OPTIONS) #### bcftools stats -[bcftools](https://samtools.github.io/bcftools/) is a program for variant calling and manipulating files in the Variant Call Format. +[bcftools](https://samtools.github.io/bcftools/) is a program for variant calling and manipulating `VCF` files. + Plot will show: - Stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc. For all samples: + **Output directory: `results/Reports/[SAMPLE]/BCFToolsStats`** - `VariantCaller_[SAMPLE].bcf.tools.stats.out` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` For further reading and documentation see the [bcftools stats manual](https://samtools.github.io/bcftools/bcftools.html#stats) #### VCFtools -[VCFtools](https://vcftools.github.io/) is a program package designed for working with VCF files. +[VCFtools](https://vcftools.github.io/) is a program package designed for working with `VCF` files. + Plots will show: -- the summary counts of each type of transition to transversion ratio for each FILTER category. +- the summary counts of each type of transition to transversion ratio for each `FILTER` category. - the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs). - the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs). For all samples: + **Output directory: `results/Reports/[SAMPLE]/VCFTools`** - `VariantCaller_[SAMPLE].FILTER.summary` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` - `VariantCaller_[SAMPLE].TsTv.count` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` - `VariantCaller_[SAMPLE].TsTv.qual` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` For further reading and documentation see the [VCFtools manual](https://vcftools.github.io/man_latest.html#OUTPUT%20OPTIONS) @@ -631,10 +699,11 @@ Plots will shows : - the quantity as function of the variant quality score. 
For all samples: + **Output directory: `results/Reports/[SAMPLE]/snpEff`** - `VariantCaller_Sample_snpEff.csv` - - RAW statistics used by MultiQC + - Raw statistics used by `MultiQC` - `VariantCaller_Sample_snpEff.html` - Statistics to be visualised with a web browser - `VariantCaller_Sample_snpEff.genes.txt` @@ -644,9 +713,10 @@ For further reading and documentation see the [snpEff manual](http://snpeff.sour #### VEP reports -[VEP (Variant Effect Predictor)](https://www.ensembl.org/info/docs/tools/vep/index.html), based on Ensembl, is a tools to determine the effects of all sorts of variants, including SNPs, indels, structural variants, CNVs. +[VEP (Variant Effect Predictor)](https://www.ensembl.org/info/docs/tools/vep/index.html), based on `Ensembl`, is a tools to determine the effects of all sorts of variants, including SNPs, indels, structural variants, CNVs. For all samples: + **Output directory: `results/Reports/[SAMPLE]/VEP`** - `VariantCaller_Sample_VEP.summary.html` @@ -658,17 +728,30 @@ For further reading and documentation see the [VEP manual](https://www.ensembl.o #### MultiQC -[MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. -Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory. +[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarizing all samples in your project. +Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. + +The pipeline has special steps which also allow the software versions to be reported in the `MultiQC` output for future traceability. + +**Output files:** + +- `multiqc/` + - `multiqc_report.html` + - Standalone HTML file that can be viewed in your web browser + - `multiqc_data/` + - Directory containing parsed statistics from the different tools used in the pipeline + - `multiqc_plots/` + - Directory containing static images from the report in various formats + +For more information about how to use `MultiQC` reports, see [https://multiqc.info](https://multiqc.info). -The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability. +## Pipeline information -For the whole Sarek run: -**Output directory: `results/Reports/MultiQC`** +[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. -- `multiqc_report.html` - - MultiQC report - a standalone HTML file that can be viewed in your web browser -- `multiqc_data/` - - Directory containing parsed statistics from the different tools used in the pipeline +**Output files:** -For further reading and documentation see the [MultiQC website](http://multiqc.info) +- `pipeline_info/` + - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. + - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`. + - Documentation for interpretation of results in HTML format: `results_description.html`. 
diff --git a/docs/reference.md b/docs/reference.md deleted file mode 100644 index 741c61bd1b..0000000000 --- a/docs/reference.md +++ /dev/null @@ -1,62 +0,0 @@ -# Genomes and reference files - -## AWS iGenomes - -Sarek is using [AWS iGenomes](https://ewels.github.io/AWS-iGenomes/), which facilitate storing and sharing references. -Sarek currently uses `GRCh38` by default. -`GRCh37`, `GRCh38` and `GRCm38` are available with `--genome GRCh37`, `--genome GRCh38` or `--genome GRCm38` respectively with any profile using the `conf/igenomes.config` file, or you can specify it with `-c conf/igenomes.config`. -Use `--genome smallGRCh37` to map against a small reference genome based on GRCh37. -Settings in `igenomes.config` can be tailored to your needs. - -### Intervals - -To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces. -The intervals are chromosomes cut at their centromeres (so each chromosome arm processed separately) also additional unassigned contigs. -We are ignoring the `hs37d5` contig that contains concatenated decoy sequences. -Parts of preprocessing and variant calling are done by these intervals, and the different resulting files are then merged. -This can parallelize processes, and push down wall clock time significantly. - -The calling intervals can be defined using a `.list` or a `.bed` file. -A `.list` file contains one interval per line in the format `chromosome:start-end` (1-based coordinates). - -When the intervals file is in BED format, the file must be a tab-separated text file with one interval per line. -There must be at least three columns: chromosome, start, and end. -In BED format, the coordinates are 0-based, so the interval `chrom:1-10` becomes `chrom010`. - -Additionally, the "score" column of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval. -The fourth column remains unused. -Example (the fields would actually be tab-separated, this is not shown here): - -`chr1 10000 207666 NA 47.3` - -This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds. - -The runtime estimate is used in two different ways. -First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that needs to be spawned. -Second, the jobs with largest processing time are started first, which reduces wall-clock time. -If no runtime is given, a time of 1000 nucleotides per second is assumed. -Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second. - -If no intervals files are specified, one will be automatically generated following: - -```bash -awk -v FS='\t' -v OFS='\t' '{ print \$1, \"0\", \$2 }' .fasta.fai > .bed -``` - -To disable this feature, please use [`--no_intervals`](usage.md#--no_intervals) - -### Working with whole exome (WES) or panel data - -The `--targetBED` parameter does _not_ imply that the workflow is running alignment or variant calling only for the supplied targets. -Instead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file. -Adding every exon as an interval in case of WES can generate >200K processes or jobs, much more forks, and similar number of directories in the Nextflow work directory. 
-Furthermore, primers and/or baits are not 100% specific, (certainly not for MHC and KIR, etc.), quite likely there going to be reads mapping to multiple locations. -If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is an overkill, better to change the reference itself. - -### Working with other genomes - -> :warning: This is a new feature, in active development, so usage could change. - -Sarek can also do limited preprocessing from any genome, providing a `fasta` file as a reference genome, followed by limited variant calling using `mpileup`, `Manta` and `Strelka`. - -Limited support for `TAIR10`, `EB2`, `UMD3.1`, `bosTau8`, `WBcel235`, `ce10`, `CanFam3.1`, `canFam3`, `GRCz10`, `danRer10`, `BDGP6`, `dm6`, `EquCab2`, `equCab2`, `EB1`, `Galgal4`, `galGal4`, `Gm01`, `hg38`, `hg19`, `Mmul_1`, `mm10`, `IRGSP-1.0`, `CHIMP2.1.4`, `panTro4`, `Rnor_6.0`, `rn6`, `R64-1-1`, `sacCer3`, `EF2`, `Sbi1`, `Sscrofa10.2`, `susScr3`, `AGPv3`. diff --git a/docs/sentieon.md b/docs/sentieon.md deleted file mode 100644 index d675386df0..0000000000 --- a/docs/sentieon.md +++ /dev/null @@ -1,73 +0,0 @@ -# nf-core/sarek: Usage with sentieon - -- [Introduction](#introduction) -- [Sentieon Analysis Pipelines & Tools](#sentieon-analysis-pipelines--tools) - - [Alignment](#alignment) - - [Germline SNV/INDEL Variant Calling - DNAseq](#germline-snvindel-variant-calling---dnaseq) - - [Germline SNV/INDEL Variant Calling - DNAscope](#germline-snvindel-variant-calling---dnascope) - - [Somatic SNV/INDEL Variant Calling - TNscope](#somatic-snvindel-variant-calling---tnscope) - - [Structural Variant Calling](#structural-variant-calling) -- [usage](#usage) - - [--sentieon](#--sentieon) - - [--tools](#--tools) - -## Introduction - -[Sentieon](https://www.sentieon.com/) is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency. - -If [Sentieon](https://www.sentieon.com/) is available, use this `--sentieon` params to enable with Sarek to use some Sentieon Analysis Pipelines & Tools. - -Please refer to the [nf-core/configs](https://github.com/nf-core/configs#adding-a-new-pipeline-specific-config) repository on how to make a pipeline-specific configuration file based on the [munin-sarek specific configuration file](https://github.com/nf-core/configs/blob/master/conf/pipeline/sarek/munin.config). - -Or ask us on the [nf-core Slack](http://nf-co.re/join/slack) on the following channels: [#sarek](https://nfcore.slack.com/channels/sarek) or [#configs](https://nfcore.slack.com/channels/configs). - -## Sentieon Analysis Pipelines & Tools - -The following Sentieon Analysis Pipelines & Tools are available within Sarek. - -### Alignment - -> Sentieon BWA matches BWA-MEM with > 2X speedup. - -This tool is enabled by default within Sarek if `--sentieon` is specified and if the pipeline is started with the `mapping` [step](usage.md#--step). - -### Germline SNV/INDEL Variant Calling - DNAseq - -> Precision FDA award-winning software. -> Matches GATK 3.3-4.1, and without down-sampling. -> Results up to 10x faster and 100% consistent every time. - -This tool is enabled within Sarek if `--sentieon` is specified and if `--tools DNAseq` is specified cf [--tools](#--tools). - -### Germline SNV/INDEL Variant Calling - DNAscope - -> Improved accuracy and genome characterization. -> Machine learning enhanced filtering producing top variant calling accuracy. 
- -This tool is enabled within Sarek if `--sentieon` is specified and if `--tools DNAscope` is specified cf [--tools](#--tools). - -### Somatic SNV/INDEL Variant Calling - TNscope - -> Winner of ICGC-TCGA DREAM challenge. -> Improved accuracy, machine learning enhanced filtering. -> Supports molecular barcodes and unique molecular identifiers. - -This tool is enabled within Sarek if `--sentieon` is specified and if `--tools TNscope` is specified cf [--tools](#--tools). - -### Structural Variant Calling - -> Germline and somatic SV calling, including translocations, inversions, duplications and large INDELs - -This tool is enabled within Sarek if `--sentieon` is specified and if `--tools DNAscope` is specified cf [--tools](#--tools). - -## usage - -### --sentieon - -Adds the following tools for the [`--tools`](#--tools) options: `DNAseq`, `DNAscope` and `TNscope`. - -### --tools - -For main usage of tools, follow the [usage/tools](usage.md#--tools) documentation. - -With `--sentieon` the following tools options are also available within Sarek: `DNAseq`, `DNAscope` and `TNscope`. diff --git a/docs/usage.md b/docs/usage.md index 90461d1bfe..6a75ae6ba2 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,20 +1,52 @@ # nf-core/sarek: Usage -- [Introduction](#introduction) - [Running the pipeline](#running-the-pipeline) - [Updating the pipeline](#updating-the-pipeline) - [Reproducibility](#reproducibility) -- [Main arguments](#main-arguments) +- [Core Nextflow arguments](#core-nextflow-arguments) - [-profile](#-profile) - - [--input](#--input) + - [-resume](#-resume) + - [-c](#-c) + - [Custom resource requests](#custom-resource-requests) + - [Running in the background](#running-in-the-background) + - [Nextflow memory requirements](#nextflow-memory-requirements) +- [Pipeline specific arguments](#pipeline-specific-arguments) - [--step](#--step) + - [--input](#--input) + - [--input <FASTQ> --step mapping](#--input-fastq---step-mapping) + - [--input <uBAM> --step mapping](#--input-ubam---step-mapping) + - [--input <sample/> --step mapping](#--input-sample---step-mapping) + - [--input <TSV> --step prepare_recalibration](#--input-tsv---step-prepare_recalibration) + - [--input <TSV> --step prepare_recalibration --skip_markduplicates](#--input-tsv---step-prepare_recalibration---skip_markduplicates) + - [--input <TSV> --step recalibrate](#--input-tsv---step-recalibrate) + - [--input <TSV> --step recalibrate --skip_markduplicates](#--input-tsv---step-recalibrate---skip_markduplicates) + - [--input <TSV> --step variant_calling](#--input-tsv---step-variant_calling) + - [--input <TSV> --step Control-FREEC](#--input-tsv---step-control-freec) + - [--input <VCF> --step annotate](#--input-vcf---step-annotate) - [--help](#--help) - [--no_intervals](#--no_intervals) - [--nucleotides_per_second](#--nucleotides_per_second) - [--sentieon](#--sentieon) + - [Alignment](#alignment) + - [Germline SNV/INDEL Variant Calling - DNAseq](#germline-snvindel-variant-calling---dnaseq) + - [Germline SNV/INDEL Variant Calling - DNAscope](#germline-snvindel-variant-calling---dnascope) + - [Somatic SNV/INDEL Variant Calling - TNscope](#somatic-snvindel-variant-calling---tnscope) + - [Structural Variant Calling](#structural-variant-calling) - [--skip_qc](#--skip_qc) - [--target_bed](#--target_bed) - - [--tools](#--tools) + - [--tools for Variant Calling](#--tools-for-variant-calling) + - [Germline variant calling](#germline-variant-calling) + - [Somatic variant calling with tumor - normal 
pairs](#somatic-variant-calling-with-tumor---normal-pairs) + - [Somatic variant calling with tumor only samples](#somatic-variant-calling-with-tumor-only-samples) + - [--tools --sentieon](#--tools---sentieon) + - [--tools for Annotation](#--tools-for-annotation) + - [Annotation tools](#annotation-tools) + - [Using genome specific containers](#using-genome-specific-containers) + - [Download cache](#download-cache) + - [Using downloaded cache](#using-downloaded-cache) + - [Using VEP CADD plugin](#using-vep-cadd-plugin) + - [Downloading CADD files](#downloading-cadd-files) + - [Using VEP GeneSplicer plugin](#using-vep-genesplicer-plugin) - [Modify fastqs (trim/split)](#modify-fastqs-trimsplit) - [--trim_fastq](#--trim_fastq) - [--clip_r1](#--clip_r1) @@ -25,6 +57,7 @@ - [--save_trimmed](#--save_trimmed) - [--split_fastq](#--split_fastq) - [Preprocessing](#preprocessing) + - [--aligner](#--aligner) - [--markdup_java_options](#--markdup_java_options) - [--no_gatk_spark](#--no_gatk_spark) - [--save_bam_mapped](#--save_bam_mapped) @@ -39,6 +72,10 @@ - [--no_strelka_bp](#--no_strelka_bp) - [--pon](#--pon) - [--pon_index](#--pon_index) + - [--ignore_soft_clipped_bases](#--ignore_soft_clipped_bases) + - [--umi](#--umi) + - [--read_structure1](#--read_structure1) + - [--read_structure2](#--read_structure2) - [Annotation](#annotation) - [--annotate_tools](#--annotate_tools) - [--annotation_cache](#--annotation_cache) @@ -86,46 +123,24 @@ - [--plaintext_email](#--plaintext_email) - [--max_multiqc_email_size](#--max_multiqc_email_size) - [-name](#-name) - - [-resume](#-resume) - - [-c](#-c) - [--custom_config_version](#--custom_config_version) - [--custom_config_base](#--custom_config_base) - [Job resources](#job-resources) - [Automatic resubmission](#automatic-resubmission) - - [Custom resource requests](#custom-resource-requests) - [--max_memory](#--max_memory) - [--max_time](#--max_time) - [--max_cpus](#--max_cpus) - [--single_cpu_mem](#--single_cpu_mem) +- [Containers](#containers) + - [Building your owns](#building-your-owns) + - [Build with Conda](#build-with-conda) + - [Build with Docker](#build-with-docker) + - [Pull with Docker](#pull-with-docker) + - [Pull with Singularity](#pull-with-singularity) - [AWSBatch specific parameters](#awsbatch-specific-parameters) - [--awsqueue](#--awsqueue) - [--awsregion](#--awsregion) - [--awscli](#--awscli) -- [Deprecated params](#deprecated-params) - - [--annotateVCF](#--annotatevcf) - - [--noGVCF](#--nogvcf) - - [--noReports](#--noreports) - - [--noStrelkaBP](#--nostrelkabp) - - [--nucleotidesPerSecond](#--nucleotidespersecond) - - [--publishDirMode](#--publishdirmode) - - [--sample](#--sample) - - [--sampleDir](#--sampledir) - - [--skipQC](#--skipqc) - - [--targetBED](#--targetbed) - -## Introduction - -Nextflow handles job submissions on SLURM or other environments, and supervises running the jobs. -Thus the Nextflow process must run until the pipeline is finished. -We recommend that you put the process running in the background through `screen` / `tmux` or similar tool. -Alternatively you can run nextflow within a cluster job submitted your job scheduler. - -It is recommended to limit the Nextflow Java virtual machines memory. 
-We recommend adding the following line to your environment (typically in `~/.bashrc` or `~./bash_profile`): - -```bash -NXF_OPTS='-Xms1g -Xmx4g' -``` ## Running the pipeline @@ -150,11 +165,10 @@ results # Finished results (configurable, see below) The nf-core/sarek pipeline comes with more documentation about running the pipeline, found in the `docs/` directory: - [Output and how to interpret the results](output.md) -- [Extra Documentation on annotation](annotation.md) ### Updating the pipeline -When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. +When you run the above command, `Nextflow` automatically pulls the pipeline code from `GitHub` and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: @@ -168,36 +182,41 @@ It's a good idea to specify a pipeline version when running the pipeline on your This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. -First, go to the [nf-core/sarek releases page](https://github.com/nf-core/sarek/releases) and find the latest version number - numeric only (eg. `2.6`). -Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 2.6`. +First, go to the [nf-core/sarek releases page](https://github.com/nf-core/sarek/releases) and find the latest version number - numeric only (eg. `2.6.1`). +Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 2.6.1`. This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. -## Main arguments +## Core Nextflow arguments + +> **NB:** These options are part of `Nextflow` and use a _single_ hyphen (pipeline parameters use a double-hyphen). ### -profile -Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. +Use this parameter to choose a configuration profile. +Profiles can give configuration presets for different compute environments. -Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Conda) - see below. +Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (`Docker`, `Singularity`, `Conda`) - see below. -> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. +> We highly recommend the use of `Docker` or `Singularity` containers for full pipeline reproducibility, however when this is not possible, `Conda` is also supported. -The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). 
+The pipeline also dynamically loads configurations from [github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time.
+For more information and to see if your system is available in these configs, please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation).

Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.

-If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended.
+If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`.
+This is _not_ recommended.

- `docker`
  - A generic configuration profile to be used with [Docker](http://docker.com/)
-  - Pulls software from dockerhub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/)
+  - Pulls software from DockerHub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/)
- `singularity`
  - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
  - Pulls software from DockerHub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/)
- `conda`
-  - Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker or Singularity.
+  - Please only use `Conda` as a last resort, i.e. when it's not possible to run the pipeline with `Docker` or `Singularity`.
  - A generic configuration profile to be used with [conda](https://conda.io/docs/)
  - Pulls most software from [Bioconda](https://bioconda.github.io/)
- `test`
  - A profile with a complete configuration for automated testing
  - Includes links to test data so needs no other parameters
- `test_annotation`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Input data is a `VCF` for testing annotation
- `test_no_gatk_spark`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Specifies `--no_gatk_spark`
- `test_split_fastq`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Specifies `--split_fastq 500`
- `test_targeted`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Includes a link to a target `BED` file and uses `Manta` and `Strelka` for Variant Calling
- `test_tool`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Tests Variant Calling directly with a specific `TSV` file and `--step variantcalling`
- `test_trimming`
  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
+  - Tests trimming options
+- `test_umi_qiaseq`
+  - A profile with a complete configuration for automated testing
+  - Tests a specific `UMI` structure
+- `test_umi_tso`
+  - A profile with a complete configuration for automated testing
+  - Tests a specific `UMI` structure
+
+### -resume
+
+Specify this when restarting a pipeline.
+`Nextflow` will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
+
+You can also supply a run name or a session ID to resume a specific run: `-resume [run-name/session id]`.
+Use the `nextflow log` command to show previous run names and session IDs.
+
+### -c
+
+Specify the path to a specific config file (this is a core `Nextflow` command).
+See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information.
+
+#### Custom resource requests
+
+Each step in the pipeline has a default set of requirements for number of CPUs, memory and time.
+For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources), it will automatically resubmit with higher requests (2 x original, then 3 x original).
+If it still fails after three times then the pipeline is stopped.
+
+Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests.
+You can do this by creating a custom config file.
+For example, to give the workflow process `VEP` 32GB of memory, you could use the following config:
+
```nextflow
process {
  withName: VEP {
    memory = 32.GB
  }
}
```
+
+See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information.
+
+If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository.
+Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see the [-c section](#-c)).
+You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.
+
+If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).
+
+### Running in the background
+
+`Nextflow` handles job submissions and supervises the running jobs.
+The `Nextflow` process must run until the pipeline is finished.
+
+The `Nextflow` `-bg` flag launches `Nextflow` in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
+
+Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time.
+Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
+
+#### Nextflow memory requirements
+
+In some cases, the `Nextflow` Java virtual machines can start to request a large amount of memory.
+We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):
+
+```bash
+NXF_OPTS='-Xms1g -Xmx4g'
+```
+
+## Pipeline specific arguments
+
+### --step
+
+> **NB** Only one step must be specified
+
+Use this to specify the starting step.
+
+Default: `mapping`
+
+Available: `mapping`, `prepare_recalibration`, `recalibrate`, `variant_calling`, `annotate`, `Control-FREEC`
+
+> **NB** The step can be specified with no concern for case, or the presence of `-` or `_`

### --input

-Use this to specify the location of your input TSV file.
-For example:
-TSV file should correspond to the correct step, see [`--step`](#--step) and [input](input.md) documentation for more information
+Use this to specify the location of your input `TSV` (Tab Separated Values) file.
+
+> **NB** The delimiter is the tab (`\t`) character, and no header is required
+
+There are different kinds of `TSV` files that can be used as input, depending on the input files available (`FASTQ`, `unmapped BAM`, `recalibrated BAM`...).
+The `TSV` file should correspond to the correct step, see [`--step`](#--step) for more information.
+For all possible `TSV` files, described in the next sections, here is an explanation of what the columns refer to:
+
+`Sarek` auto-generates `TSV` files for all samples and for each individual sample, depending on the options specified.
+
+- `subject` designates the subject; it should be the ID of the subject, and it must be unique for each subject, but one subject can have multiple samples (e.g. normal and tumor)
+- `sex` are the sex chromosomes of the subject (i.e. `XX`, `XY`...)
+- `status` is the status of the measured sample (`0` for Normal or `1` for Tumor)
+- `sample` designates the sample; it should be the ID of the sample (it is possible to have more than one tumor sample for each subject, i.e. a tumor and a relapse); it must be unique, but samples can have multiple lanes (which will later be merged)
+- `lane` is used when the sample is multiplexed on several lanes; it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
+- `fastq1` is the path to the first `FASTQ` file of the pair
+- `fastq2` is the path to the second `FASTQ` file of the pair
+- `bam` is the path to the `BAM` file
+- `bai` is the path to the `BAM` index file
+- `recaltable` is the path to the recalibration table
+- `mpileup` is the path to the mpileup file
+
+It is recommended to add the absolute paths of the files, but relative paths should also work.
+
+If necessary, a tumor sample can be associated with a normal sample as a pair, if specified with the same `subject` and a different `sample`.
+An additional tumor sample (such as a relapse, for example) can be added if specified with the same `subject` and a different `sample`.
+
+`Sarek` will output results in a different directory for each sample.
+If multiple samples are specified in the `TSV` file, `Sarek` will consider all files to be from different samples.
+Multiple `TSV` files can be specified if the path is enclosed in quotes.
+
+Output from Variant Calling and/or Annotation will be in a specific directory for each sample (or normal/tumor pair if applicable).
+
+#### --input <FASTQ> --step mapping
+
+The `TSV` file to start the mapping step with paired-end `FASTQ` files should contain the columns:
+
+- `subject sex status sample lane fastq1 fastq2`
+
+In this example (`example_fastq.tsv`), there are 3 read groups.
+
| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|

```bash
--input example_fastq.tsv
```

Or, for a normal/tumor pair:

In this example (`example_pair_fastq.tsv`), there are 3 read groups for the normal sample and 2 for the tumor sample.

| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|1|/samples/tumor1_1.fastq.gz|/samples/tumor1_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|2|/samples/tumor2_1.fastq.gz|/samples/tumor2_2.fastq.gz|

```bash
--input example_pair_fastq.tsv
```

#### --input <uBAM> --step mapping

The `TSV` file for starting the mapping from `unmapped BAM` files should contain the columns:

- `subject sex status sample lane bam`

In this example (`example_ubam.tsv`), there are 3 read groups.

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|1|/samples/normal_1.bam|
|SUBJECT_ID|XX|0|SAMPLE_ID|2|/samples/normal_2.bam|
|SUBJECT_ID|XX|0|SAMPLE_ID|3|/samples/normal_3.bam|

```bash
--input example_ubam.tsv
```

Or, for a normal/tumor pair:

In this example (`example_pair_ubam.tsv`), there are 3 read groups for the normal sample and 2 for the tumor sample.

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|1|/samples/normal_1.bam|
|SUBJECT_ID|XX|0|SAMPLE_ID1|2|/samples/normal_2.bam|
|SUBJECT_ID|XX|0|SAMPLE_ID1|3|/samples/normal_3.bam|
|SUBJECT_ID|XX|1|SAMPLE_ID2|1|/samples/tumor_1.bam|
|SUBJECT_ID|XX|1|SAMPLE_ID2|2|/samples/tumor_2.bam|

```bash
--input example_pair_ubam.tsv
```

#### --input <sample/> --step mapping

Use this to specify the location of a directory with `FASTQ` files for the `mapping` step of a single germline sample only.
For example:

```bash
--input </path/to/directory>
```

> **NB** All of the found `FASTQ` files are considered to belong to the same sample.

The input folder, containing the `FASTQ` files for one subject (ID) should be organized into one sub-folder for every sample.
The given directory is searched recursively for `FASTQ` files that are named `*_R1_*.fastq.gz`, and a matching pair with the same name except `_R2_` instead of `_R1_` is expected to exist alongside.
All `FASTQ` files for that sample should be collected here.
+
```text
ID
+--sample1
+------sample1_<lib>_<flowcell-index>_lane1_R1_1000.fastq.gz
+------sample1_<lib>_<flowcell-index>_lane1_R2_1000.fastq.gz
+------sample1_<lib>_<flowcell-index>_lane2_R1_1000.fastq.gz
+------sample1_<lib>_<flowcell-index>_lane2_R2_1000.fastq.gz
+--sample2
+------sample2_<lib>_<flowcell-index>_lane1_R1_1000.fastq.gz
+------sample2_<lib>_<flowcell-index>_lane1_R2_1000.fastq.gz
+--sample3
+------sample3_<lib>_<flowcell-index>_lane1_R1_1000.fastq.gz
+------sample3_<lib>_<flowcell-index>_lane1_R2_1000.fastq.gz
+------sample3_<lib>_<flowcell-index>_lane2_R1_1000.fastq.gz
+------sample3_<lib>_<flowcell-index>_lane2_R2_1000.fastq.gz
```

`FASTQ` filename structure:

- `<sample>_<lib>_<flowcell-index>_<lane>_R1_<number>.fastq.gz` and
- `<sample>_<lib>_<flowcell-index>_<lane>_R2_<number>.fastq.gz`

Where:

- `sample` = sample id
- `lib` = identifier of library preparation
- `flowcell-index` = identifier of flow cell for the sequencing run
- `lane` = identifier of the lane of the sequencing run

Read group information will be parsed from `FASTQ` file names according to this:

- `RGID` = "sample_lib_flowcell_index_lane"
- `RGPL` = "Illumina"
- `PU` = sample
- `RGLB` = lib

Each `FASTQ` file pair gets its own read group (`@RG`) in the resulting `BAM` file in the following way.

- The sample name (`SM`) is derived from the last component of the path given to `--input`.
That is, you should make sure that that directory has a meaningful name! For example, with `--input=/my/fastqs/sample123`, the sample name will be `sample123`.
- The read group id is set to *flowcell.samplename.lane*.
The flowcell id and lane number are auto-detected from the name of the first read in the `FASTQ` file.

#### --input <TSV> --step prepare_recalibration

To start from the preparation of the recalibration step (`--step prepare_recalibration`), a `TSV` file needs to be given as input containing the paths to the `non-recalibrated BAM` files.
The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and will automatically be used as an input when specifying the parameter `--step prepare_recalibration`.

The `TSV` contains the following columns:

- `subject sex status sample bam bai`

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.md.bam|/samples/normal.md.bai|

Or, for a normal/tumor pair:

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.md.bam|/samples/normal.md.bai|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.md.bam|/samples/tumor.md.bai|

#### --input <TSV> --step prepare_recalibration --skip_markduplicates

The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/mapped.tsv` and will automatically be used as an input when specifying the parameter `--step prepare_recalibration --skip_markduplicates`.
+The `TSV` file contains the same columns, but the content is slightly different:

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.bam|/samples/normal.bai|

Or, for a normal/tumor pair:

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.bam|/samples/normal.bai|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.bam|/samples/tumor.bai|

#### --input <TSV> --step recalibrate

To start from the recalibrate step (`--step recalibrate`), a `TSV` file needs to be given as input containing the paths to the `non-recalibrated BAM` file and the associated recalibration table.
The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and will automatically be used as an input when specifying the parameter `--step recalibrate`.

The `TSV` contains the following columns:

- `subject sex status sample bam bai recaltable`

| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.md.bam|/samples/normal.md.bai|/samples/normal.recal.table|

Or, for a normal/tumor pair:

| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.md.bam|/samples/normal.md.bai|/samples/normal.recal.table|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.md.bam|/samples/tumor.md.bai|/samples/tumor.recal.table|

#### --input <TSV> --step recalibrate --skip_markduplicates

The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/mapped_no_duplicates_marked.tsv` and will automatically be used as an input when specifying the parameter `--step recalibrate --skip_markduplicates`.
The `TSV` file contains the same columns, but the content is slightly different:

| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.bam|/samples/normal.bai|/samples/normal.recal.table|

Or, for a normal/tumor pair:

| | | | | | | |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.bam|/samples/normal.bai|/samples/normal.recal.table|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.bam|/samples/tumor.bai|/samples/tumor.recal.table|

#### --input <TSV> --step variant_calling

To start from the variant calling step (`--step variant_calling`), a `TSV` file needs to be given as input containing the paths to the `recalibrated BAM` file and the associated index.
The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and will automatically be used as an input when specifying the parameter `--step variant_calling`.

The `TSV` file should contain the columns:

- `subject sex status sample bam bai`

Here is an example for a single normal sample:

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.recal.bam|/samples/normal.recal.bai|

Or, for a normal/tumor pair:

| | | | | | |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.recal.bam|/samples/normal.recal.bai|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.recal.bam|/samples/tumor.recal.bai|

#### --input <TSV> --step Control-FREEC

To start from the Control-FREEC step (`--step Control-FREEC`), a `TSV` file needs to be given as input containing the paths to the mpileup files.
The `Sarek`-generated `TSV` file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and will automatically be used as an input when specifying the parameter `--step Control-FREEC`.
+
The `TSV` file should contain the columns:

- `subject sex status sample mpileup`

Here is an example for one normal/tumor pair from one subject:

| | | | | |
|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.pileup|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.pileup|

#### --input <VCF> --step annotate

Input files for `Sarek` can be specified using the path to a `VCF` file given to `--input`, only for the annotation step (`--step annotate`).
As `Sarek` will use `bgzip` and `tabix` to compress and index the annotated `VCF` files, it expects the input `VCF` files to be sorted.
Multiple `VCF` files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes.
For example:

```bash
--step annotate --input "results/VariantCalling/*/{HaplotypeCaller,Manta,Mutect2,Strelka,TIDDIT}/*.vcf.gz"
```

### --help

Will display the help message.

### --no_intervals

Disable usage of the [`intervals`](#--intervals) file.

### --nucleotides_per_second

Use this to provide an estimate of how many seconds it will take to call variants on any interval; the default value is `1000` when not specified in the [`intervals`](#--intervals) file.

### --sentieon

[Sentieon](https://www.sentieon.com/) is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.

If [Sentieon](https://www.sentieon.com/) is available, use the `--sentieon` parameter to enable `Sarek` to use some of the `Sentieon Analysis Pipelines & Tools`.
Adds the following tools for the [`--tools`](#--tools---sentieon) options: `DNAseq`, `DNAscope` and `TNscope`.

Please refer to the [nf-core/configs](https://github.com/nf-core/configs#adding-a-new-pipeline-specific-config) repository on how to make a pipeline-specific configuration file based on the [munin-sarek specific configuration file](https://github.com/nf-core/configs/blob/master/conf/pipeline/sarek/munin.config).

Or ask us on the [nf-core Slack](http://nf-co.re/join/slack) on the following channels: [#sarek](https://nfcore.slack.com/channels/sarek) or [#configs](https://nfcore.slack.com/channels/configs).

The following `Sentieon Analysis Pipelines & Tools` are available within `Sarek`:

#### Alignment

> Sentieon BWA matches BWA-MEM with > 2X speedup.

This tool is enabled by default within `Sarek` if `--sentieon` is specified and if the pipeline is started with the `mapping` [step](#--step).

#### Germline SNV/INDEL Variant Calling - DNAseq

> Precision FDA award-winning software.
> Matches GATK 3.3-4.1, and without down-sampling.
> Results up to 10x faster and 100% consistent every time.

This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAseq` is specified, cf [--tools --sentieon](#--tools---sentieon).

#### Germline SNV/INDEL Variant Calling - DNAscope
-More information in the [sentieon](sentieon.md) documentation.
+> Improved accuracy and genome characterization.
+> Machine learning enhanced filtering producing top variant calling accuracy.
+
+This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAscope` is specified, cf [--tools --sentieon](#--tools---sentieon).
+
+#### Somatic SNV/INDEL Variant Calling - TNscope
+
+> Winner of ICGC-TCGA DREAM challenge.
+> Improved accuracy, machine learning enhanced filtering.
+> Supports molecular barcodes and unique molecular identifiers.
+
+This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools TNscope` is specified, cf [--tools --sentieon](#--tools---sentieon).
+
+#### Structural Variant Calling
+
+> Germline and somatic SV calling, including translocations, inversions, duplications and large INDELs
+
+This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAscope` is specified, cf [--tools --sentieon](#--tools---sentieon).

### --skip_qc

Use this to disable specific QC and Reporting tools.
Multiple tools can be specified, separated by commas.

-Available: `all`, `bamQC`, `BaseRecalibrator`, `BCFtools`, `Documentation`, `FastQC`, `MultiQC`, `samtools`, `vcftools`, `versions`
+Available: `all`, `bamQC`, `BaseRecalibrator`, `BCFtools`, `Documentation`, `FastQC`, `MarkDuplicates`, `MultiQC`, `samtools`, `vcftools`, `versions`

Default: `None`

+> **NB** `--skip_qc MarkDuplicates` does not skip `MarkDuplicates`, but prevents the collection of duplicate metrics, which slows down performance
+
### --target_bed

-Use this to specify the target BED file for targeted or whole exome sequencing.
+Use this to specify the target `BED` file for targeted or whole exome sequencing.
+
+The `--target_bed` parameter does _not_ imply that the workflow is running alignment or variant calling only for the supplied targets.
+Instead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file.
+Adding every exon as an interval in case of `WES` can generate >200K processes or jobs, many more forks, and a similar number of directories in the Nextflow work directory.
+Furthermore, primers and/or baits are not 100% specific (certainly not for MHC and KIR, etc.), so quite likely there are going to be reads mapping to multiple locations.
+If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is overkill, it is better to change the reference itself.
+
+The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a `BED` file containing targets for all steps using the `--target_bed` option.
+The workflow will pick up these intervals, and activate the `--exome` flag in any tools that support it, to process the deeper coverage.
+It is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow.
+To add the target `BED` file, configure the command line like:
+
+```bash
+--target_bed <target.bed>
+```

-### --tools
+### --tools for Variant Calling

Use this parameter to specify the variant calling and annotation tools to be used.
Multiple tools can be specified, separated by commas.
For example:

```bash
---tools 'Strelka,mutect2,SnpEff'
+--tools Strelka,mutect2,SnpEff
```

-Available variant callers: `ASCAT`, `ControlFREEC`, `FreeBayes`, `HaplotypeCaller`, `Manta`, `mpileup`, `MSIsensor`, `Mutect2`, `Strelka`, `TIDDIT`.
+Available variant callers: `ASCAT`, `Control-FREEC`, `FreeBayes`, `HaplotypeCaller`, `Manta`, `mpileup`, `MSIsensor`, `Mutect2`, `Strelka`, `TIDDIT`.
+
+> **NB** Tools can be specified with no concern for case
+
+For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation.

> **WARNING** Not all variant callers are available for both germline and somatic variant calling.
-For more details please check the [variant calling](variant_calling.md) extra documentation.

-Available annotation tools: `VEP`, `SnpEff`, `merge`. For more details, please check the [annotation](annotation.md) extra documentation.
+For more information, please read the following documentation on [Germline variant calling](#germline-variant-calling), [Somatic variant calling with tumor - normal pairs](#somatic-variant-calling-with-tumor---normal-pairs) and [Somatic variant calling with tumor only samples](#somatic-variant-calling-with-tumor-only-samples).
+
+#### Germline variant calling
+
+Using `Sarek`, germline variant calling will always be performed if a variant calling tool with a germline mode is selected.
+Germline variant calling can currently only be performed with the following variant callers:
+
+- *FreeBayes*
+- *HaplotypeCaller*
+- *Manta*
+- *mpileup*
+- *Sentieon*
+- *Strelka*
+- *TIDDIT*
+
+For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation.
+
+#### Somatic variant calling with tumor - normal pairs
+
+Using `Sarek`, somatic variant calling will be performed if your input `TSV` file contains tumor / normal pairs (see the [input](#--input) documentation for more information).
+Different samples belonging to the same patient, where at least one is marked as normal (`0` in the `Status` column) and at least one is marked as tumor (`1` in the `Status` column), are treated as tumor / normal pairs.
+
+If tumor-normal pairs are provided, both germline variant calling and somatic variant calling will be performed, provided that the selected variant caller allows for it.
+If the selected variant caller allows only for somatic variant calling, then only somatic variant calling results will be generated.
+
+Here is a list of the variant calling tools that support somatic variant calling:
+
+- *ASCAT*
+- *Control-FREEC*
+- *FreeBayes*
+- *Manta*
+- *MSIsensor*
+- *Mutect2*
+- *Sentieon*
+- *Strelka*
+
+For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation.
+
+#### Somatic variant calling with tumor only samples
+
+Somatic variant calling with only tumor samples (no matching normal sample) is not recommended by the `GATK best practices`.
+This is only supported by a limited number of variant callers.
+
+Here is a list of the variant calling tools that support tumor-only somatic variant calling:
+
+- *Manta*
+- *mpileup*
+- *Mutect2*
+- *TIDDIT*
+
+### --tools --sentieon
+
+> **WARNING** Only with `--sentieon`
+
+If [Sentieon](https://www.sentieon.com/) is available, use the `--sentieon` parameter to enable `Sarek` to use some of the `Sentieon Analysis Pipelines & Tools`.
+Adds the following tools for the [`--tools`](#--tools-for-variant-calling) options: `DNAseq`, `DNAscope` and `TNscope`.
+
+### --tools for Annotation
+
+Available annotation tools: `VEP`, `SnpEff`, `merge`.
+For more details, please check the [annotation](#annotation-tools) documentation, and see the example sketched below.
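+
+For example, a minimal sketch of an annotation run (the profile and the `VCF` glob path here are illustrative placeholders, to be adapted to your own setup and results):
+
+```bash
+# Hypothetical example: annotate existing Strelka calls with both snpEff and VEP
+nextflow run nf-core/sarek -profile docker \
+    --step annotate \
+    --tools snpEff,VEP \
+    --input "results/VariantCalling/*/Strelka/*.vcf.gz"
+```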
+
+#### Annotation tools
+
+With `Sarek`, annotation is done using `snpEff`, `VEP`, or even both consecutively:
+
+- `--tools snpEff`
+  - To annotate using `snpEff`
+- `--tools VEP`
+  - To annotate using `VEP`
+- `--tools snpEff,VEP`
+  - To annotate using `snpEff` and `VEP`
+- `--tools merge`
+  - To annotate using `snpEff` followed by `VEP`
+
+`VCF` files produced by `Sarek` will be annotated if `snpEff` or `VEP` are specified with the `--tools` option.
+As `Sarek` will use `bgzip` and `tabix` to compress and index the annotated `VCF` files, it expects the input `VCF` files to be sorted.
+
+In these examples, all command lines will be launched starting with `--step annotate`.
+The pipeline can, of course, be started directly from any other step instead.
+
+#### Using genome specific containers
+
+`Sarek` already provides containers with `snpEff` and `VEP` files for Human (`GRCh37`, `GRCh38`), Mouse (`GRCm38`), Dog (`CanFam3.1`) and Roundworm (`WBcel235`).
+
+Default settings will run using these containers.
+The main `Sarek` container also has `snpEff` and `VEP` installed, but without the cache files, which can be downloaded separately.
+See the [containers documentation](#containers) for more information.
+
+#### Download cache
+
+A `Nextflow` helper script has been designed to help downloading `snpEff` and `VEP` caches.
+Such files are meant to be shared between multiple users, so this script is mainly meant for people administering servers and clusters, and for advanced users.
+
+```bash
+nextflow run download_cache.nf --snpeff_cache </path/to/snpEff/cache> --snpeff_db <snpEff DB version> --genome <genome>
+nextflow run download_cache.nf --vep_cache </path/to/VEP/cache> --species <species> --vep_cache_version <VEP cache version> --genome <genome>
+```
+
+#### Using downloaded cache
+
+Both `snpEff` and `VEP` enable the usage of a cache.
+If a cache is available on the machine where `Sarek` is run, it is possible to run annotation using that cache.
+You need to specify the cache directory using `--snpeff_cache` and `--vep_cache` in the command lines or within configuration files.
+The cache will only be used when `--annotation_cache` and cache directories are specified (either in command lines or in a configuration file).
+
+Example:
+
+```bash
+nextflow run nf-core/sarek --tools snpEff --step annotate --sample <file.vcf.gz> --snpeff_cache </path/to/snpEff/cache> --annotation_cache
+nextflow run nf-core/sarek --tools VEP --step annotate --sample <file.vcf.gz> --vep_cache </path/to/VEP/cache> --annotation_cache
+```
+
+#### Using VEP CADD plugin
+
+To enable the use of the `VEP` `CADD` plugin:
+
+- Download the `CADD` files
+- Specify them (either on the command line, as in the example, or in a configuration file)
+- Use the `--cadd_cache` flag
+
+Example:
+
+```bash
+nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --cadd_cache \
+    --cadd_indels </path/to/CADD/InDels.tsv.gz> \
+    --cadd_indels_tbi </path/to/CADD/InDels.tsv.gz.tbi> \
+    --cadd_wg_snvs </path/to/CADD/whole_genome_SNVs.tsv.gz> \
+    --cadd_wg_snvs_tbi </path/to/CADD/whole_genome_SNVs.tsv.gz.tbi>
+```
+
+#### Downloading CADD files
+
+A helper script has been designed to help downloading `CADD` files.
+Such files are meant to be shared between multiple users, so this script is mainly meant for people administering servers and clusters, and for advanced users.
+
+```bash
+nextflow run download_cache.nf --cadd_cache </path/to/CADD/cache> --cadd_version <CADD version> --genome <genome>
+```
+
+#### Using VEP GeneSplicer plugin
+
+To enable the use of the `VEP` `GeneSplicer` plugin:
+
+- Use the `--genesplicer` flag
+
+Example:
+
+```bash
+nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --genesplicer
+```

## Modify fastqs (trim/split)

### --trim_fastq

Use this to perform adapter trimming with [Trim Galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md)

### --clip_r1

-Instructs Trim Galore to remove a number of bp from the 5' end of read 1 (or single-end reads).
+Instructs `Trim Galore` to remove a number of bp from the 5' end of read 1 (or single-end reads).
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.

### --clip_r2

-Instructs Trim Galore to remove a number of bp from the 5' end of read 2 (paired-end reads only).
+Instructs `Trim Galore` to remove a number of bp from the 5' end of read 2 (paired-end reads only).
This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.

### --three_prime_clip_r1

-Instructs Trim Galore to remove a number of bp from the 3' end of read 1 (or single-end reads) AFTER adapter/quality trimming has been performed.
+Instructs `Trim Galore` to remove a number of bp from the 3' end of read 1 (or single-end reads) AFTER adapter/quality trimming has been performed.
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.

### --three_prime_clip_r2

-Instructs Trim Galore to remove a number of bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.
+Instructs `Trim Galore` to remove a number of bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.
This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.

### --trim_nextseq

This enables the option `--nextseq-trim=3'CUTOFF` within `Cutadapt`, which will set a quality cutoff (that is normally given with `-q` instead), but qualities of G bases are ignored.
-This trimming is in common for the NextSeq and NovaSeq-platforms, where basecalls without any signal are called as high-quality G bases.
+This trimming is common for the `NextSeq` and `NovaSeq` platforms, where basecalls without any signal are called as high-quality G bases.

### --save_trimmed

-Option to keep trimmed FASTQs
+Option to keep trimmed `FASTQs`

### --split_fastq

-Use the Nextflow [`splitFastq`](https://www.nextflow.io/docs/latest/operator.html#splitfastq) operator to specify how many reads should be contained in the split fastq file.
+Use the `Nextflow` [`splitFastq`](https://www.nextflow.io/docs/latest/operator.html#splitfastq) operator to specify how many reads should be contained in the split fastq file.
For example:

```bash
@@ -349,9 +909,33 @@

## Preprocessing

+### --aligner
+
+To control which aligner is used for mapping the reads.
+
+Available: `bwa-mem` and `bwa-mem2`
+
+Default: `bwa-mem`
+
+Example:
+
+```bash
+--aligner "bwa-mem"
+```
+
+> **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.
+> Use `--bwa=false` to have `Sarek` build them automatically.
+
+Example:
+
+```bash
+--aligner "bwa-mem2" --bwa=false
+```
+
### --markdup_java_options

-To control the java options necessary for the GATK `MarkDuplicates` process, you can set this parameter.
+To control the java options necessary for the `GATK MarkDuplicates` process, you can set this parameter.
+
Default: "-Xms4000m -Xmx7g"

For example:

```bash
@@ -361,41 +945,45 @@

### --no_gatk_spark

-Use this to disable usage of GATK Spark implementation of their tools in local mode.
+Use this to disable usage of the `Spark` implementation of the `GATK` tools in local mode.

### --save_bam_mapped

-Will save mapped BAMs.
+Will save the mapped `BAM` files.

### --skip_markduplicates

-Will skip MarkDuplicates. This params will also save the mapped BAMS, to enable restart from step `prepare_recalibration`
+Will skip `GATK MarkDuplicates`.
+This parameter will also save the mapped `BAM` files, to enable restart from the `prepare_recalibration` step.

## Variant Calling

### --ascat_ploidy

-Use this parameter to overwrite default behavior from ASCAT regarding ploidy.
-Requires that [`--ascat_purity`](#--ascat_purity) is set
+Use this parameter to overwrite default behavior from `ASCAT` regarding `ploidy`.
+Requires that [`--ascat_purity`](#--ascat_purity) is set.

### --ascat_purity

-Use this parameter to overwrite default behavior from ASCAT regarding purity.
-Requires that [`--ascat_ploidy`](#--ascat_ploidy) is set
+Use this parameter to overwrite default behavior from `ASCAT` regarding `purity`.
+Requires that [`--ascat_ploidy`](#--ascat_ploidy) is set.

### --cf_coeff

-Control-FREEC `coefficientOfVariation`
-Default: 0.015
+Use this parameter to overwrite default behavior from `Control-FREEC` regarding `coefficientOfVariation`
+
+Default: `0.015`

### --cf_ploidy

-Control-FREEC `ploidy`
-Default: 2
+Use this parameter to overwrite default behavior from `Control-FREEC` regarding `ploidy`
+
+Default: `2`

### --cf_window

-Control-FREEC `window size`
+Use this parameter to overwrite default behavior from `Control-FREEC` regarding `window size`
+
Default: Disabled

### --no_gvcf
@@ -408,39 +996,67 @@

Use this not to use `Manta` `candidateSmallIndels` for `Strelka` (not recommended for WGS).

### --pon

-When a panel of normals [PON](https://gatkforums.broadinstitute.org/gatk/discussion/24057/how-to-call-somatic-mutations-using-gatk4-mutect2#latest) is defined, it will be use to filter somatic calls.
-Without PON, there will be no calls with PASS in the INFO field, only an _unfiltered_ VCF is written.
-It is recommended to make your own panel-of-normals, as it depends on sequencer and library preparation.
-For tests in iGenomes there is a dummy PON file in the Annotation/GermlineResource directory, but it _should not be used_ as a real panel-of-normals file.
-Provide your PON by:
+When a [panel of normals (PON)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON) is defined, it will be used to filter somatic calls.
+Without a `PON`, there will be no calls with `PASS` in the `FILTER` field; only an _unfiltered_ `VCF` is written.
+It is recommended to make your own `PON`, as it depends on sequencer and library preparation.
+For tests in `iGenomes` there is a dummy `PON` file in the Annotation/GermlineResource directory, but it _should not be used_ as a real panel-of-normals file.
+Provide your `PON` with:

```bash
--pon </path/to/PON.vcf.gz>
```

-PON file should be bgzipped.
+The `PON` file should be bgzipped.

### --pon_index

-Tabix index of the panel-of-normals bgzipped VCF file.
-If none provided, will be generated automatically from the panel-of-normals bgzipped VCF file.
+Tabix index of the `PON` bgzipped VCF file.
+If none is provided, it will be generated automatically from the `PON` bgzipped VCF file.
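+
+For example, a minimal sketch combining both parameters (the paths are placeholders to adapt):
+
+```bash
+--tools Mutect2 --pon </path/to/PON.vcf.gz> --pon_index </path/to/PON.vcf.gz.tbi>
+```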
+
+### --ignore_soft_clipped_bases
+
+Do not analyze soft clipped bases in the reads for `GATK Mutect2`, using the `--dont-use-soft-clipped-bases` option.
+
+### --umi
+
+If provided, `Unique Molecular Identifiers (UMIs)` steps will be run to extract and annotate the reads with `UMIs` and create consensus reads.
+This part of the pipeline uses [fgbio](https://github.com/fulcrumgenomics/fgbio) to convert the `FASTQ` files into an `unmapped BAM`, where reads are tagged with the `UMIs` extracted from the `FASTQ` sequences.
+In order to allow the correct tagging, the `UMI` sequence must be contained in the read sequence itself, and not in the `FASTQ` filename.
+Following this step, the `unmapped BAM` is aligned and reads are then grouped based on mapping position and `UMI` tag.
+Finally, reads in the same groups are collapsed to create a consensus read.
+To create consensus, we have chosen to use the *adjacency method* [ref](https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/).
+In order for the correct tagging to be performed, a read structure needs to be specified as indicated below.
+
+### --read_structure1
+
+When processing `UMIs`, a read structure should always be provided for each of the `FASTQ` files, to allow the correct annotation of the `BAM` file.
+If the read does not contain any `UMI`, the structure will be +T (i.e. only template of any length).
+The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures).
+
+### --read_structure2
+
+When processing `UMIs`, a read structure should always be provided for each of the `FASTQ` files, to allow the correct annotation of the `BAM` file.
+If the read does not contain any `UMI`, the structure will be +T (i.e. only template of any length).
+The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures).

## Annotation

### --annotate_tools

-Specify from which tools Sarek should look for VCF files to annotate, only for step `Annotate`.
+Specify from which tools `Sarek` should look for `VCF` files to annotate, only for step `Annotate`.
+
Available: `HaplotypeCaller`, `Manta`, `Mutect2`, `Strelka`, `TIDDIT`
+
Default: `None`

### --annotation_cache

-Enable usage of annotation cache, and disable usage of already built containers within Sarek.
-For more information, follow the [annotation guidelines](annotation.md#using-downloaded-cache).
+Enable usage of annotation cache, and disable usage of already built containers within `Sarek`.
+For more information, follow the [downloaded cache guidelines](#using-downloaded-cache).

### --snpeff_cache

-To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache snpEff directory:
+To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache `snpEff` directory:

```bash
--snpeff_cache </path/to/snpEff/cache>
@@ -448,7 +1064,7 @@ To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify
```

### --vep_cache

-To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache VEP directory:
+To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache `VEP` directory:

```bash
--vep_cache </path/to/VEP/cache>
@@ -456,39 +1072,39 @@ To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify
```

### --cadd_cache

-Enable CADD cache.
+Enable `CADD` cache. 
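+
+Tying the cache options above together (`--annotation_cache`, `--snpeff_cache`, `--vep_cache`), an annotation run could look like the following sketch; the cache locations and the input `VCF` are placeholders:
+
+```bash
+nextflow run nf-core/sarek --step annotate --tools snpEff,VEP --input <file.vcf.gz> \
+    --annotation_cache --snpeff_cache </path/to/snpEff/cache> --vep_cache </path/to/VEP/cache>
+```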
### --cadd_indels -Path to CADD InDels file. +Path to `CADD InDels` file. ### --cadd_indels_tbi -Path to CADD InDels index. +Path to `CADD InDels` index. ### --cadd_wg_snvs -Path to CADD SNVs file. +Path to `CADD SNVs` file. ### --cadd_wg_snvs_tbi -Path to CADD SNVs index. +Path to `CADD SNVs` index. ### --genesplicer -Enable genesplicer within VEP. +Enable `genesplicer` within `VEP`. ## Reference genomes -The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. -If running with docker or AWS, the configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource. +The pipeline config files come bundled with paths to the `Illumina iGenomes` reference index files. +The configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource. ### --genome (using iGenomes) -There are 2 different species supported by Sarek in the iGenomes references. +Sarek is using [AWS iGenomes](https://ewels.github.io/AWS-iGenomes/), which facilitate storing and sharing references. To run the pipeline, you must specify which to use with the `--genome` flag. -You can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config). +You can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config)). Genomes that are supported are: - Homo sapiens @@ -515,70 +1131,70 @@ Limited support for: - `--genome ce10` (UCSC) - Canis familiaris - - `--genome CanFam3.1` (Ensembl) - - `--genome canFam3` (UCSC) + - `--genome CanFam3.1` (Ensembl) + - `--genome canFam3` (UCSC) - Danio rerio - - `--genome GRCz10` (Ensembl) - - `--genome danRer10` (UCSC) + - `--genome GRCz10` (Ensembl) + - `--genome danRer10` (UCSC) - Drosophila melanogaster - - `--genome BDGP6` (Ensembl) - - `--genome dm6` (UCSC) + - `--genome BDGP6` (Ensembl) + - `--genome dm6` (UCSC) - Equus caballus - - `--genome EquCab2` (Ensembl) - - `--genome equCab2` (UCSC) + - `--genome EquCab2` (Ensembl) + - `--genome equCab2` (UCSC) - Escherichia coli K 12 DH10B - - `--genome EB1` (Ensembl) + - `--genome EB1` (Ensembl) - Gallus gallus - - `--genome Galgal4` (Ensembl) - - `--genome galgal4` (UCSC) + - `--genome Galgal4` (Ensembl) + - `--genome galgal4` (UCSC) - Glycine max - - `--genome Gm01` (Ensembl) + - `--genome Gm01` (Ensembl) - Homo sapiens - - `--genome hg19` (UCSC) - - `--genome hg38` (UCSC) + - `--genome hg19` (UCSC) + - `--genome hg38` (UCSC) - Macaca mulatta - - `--genome Mmul_1` (Ensembl) + - `--genome Mmul_1` (Ensembl) - Mus musculus - - `--genome mm10` (Ensembl) + - `--genome mm10` (Ensembl) - Oryza sativa japonica - - `--genome IRGSP-1.0` (Ensembl) + - `--genome IRGSP-1.0` (Ensembl) - Pan troglodytes - - `--genome CHIMP2.1.4` (Ensembl) - - `--genome panTro4` (UCSC) + - `--genome CHIMP2.1.4` (Ensembl) + - `--genome panTro4` (UCSC) - Rattus norvegicus - - `--genome Rnor_6.0` (Ensembl) - - `--genome rn6` (UCSC) + - `--genome Rnor_6.0` (Ensembl) + - `--genome rn6` (UCSC) - Saccharomyces cerevisiae - - `--genome R64-1-1` (Ensembl) - - `--genome sacCer3` (UCSC) + - `--genome R64-1-1` (Ensembl) + - `--genome sacCer3` (UCSC) - Schizosaccharomyces pombe - - `--genome EF2` (Ensembl) + - `--genome EF2` (Ensembl) - Sorghum bicolor - - `--genome Sbi1` (Ensembl) + - `--genome Sbi1` (Ensembl) - Sus scrofa - - `--genome Sscrofa10.2` (Ensembl) - - `--genome susScr3` (UCSC) + - `--genome Sscrofa10.2` (Ensembl) + - `--genome susScr3` (UCSC) - Zea mays - - `--genome AGPv3` (Ensembl) + - `--genome AGPv3` (Ensembl) 
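+
+For example, a germline run on human data aligned against the `GRCh37` build could be launched as follows (the TSV path is a placeholder):
+
+```bash
+nextflow run nf-core/sarek --input <samples.tsv> --tools HaplotypeCaller --genome GRCh37
+```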
-Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. +Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the `AWS iGenomes` resource. See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to save such a file. The syntax for this reference configuration is as follows: @@ -615,6 +1231,7 @@ params { ### --igenomes_base Specify base path to AWS iGenomes + Default: `s3://ngi-igenomes/igenomes/` ### --igenomes_ignore @@ -624,21 +1241,13 @@ You may choose this option if you observe clashes between custom parameters and This option will load the `genomes.config` file instead. You can then specify the `--genome custom` and specify any reference file on the command line or within a config file. -```bash ---igenomes_ignore -``` - ### --genomes_base Specify base path to reference genome ### --save_reference -Enable saving reference indexes and other files built within Sarek. - -```bash ---save_reference -``` +Enable saving reference indexes and other files built within `Sarek`. ### --ac_loci @@ -694,7 +1303,7 @@ If you prefer, you can specify the full path to your reference genome when you r If you prefer, you can specify the full path to your reference genome when you run the pipeline: -> If none provided, will be generated automatically from the fasta reference. +> If none provided, will be generated automatically from dbsnp `VCF` file. ```bash --dbsnp_index @@ -731,8 +1340,8 @@ If you prefer, you can specify the full path to your reference genome when you r ### --germline_resource The [germline resource VCF file](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php#--germline-resource) (bgzipped and tabixed) needed by GATK4 Mutect2 is a collection of calls that are likely present in the sample, with allele frequencies. -The AF info field must be present. -You can find a smaller, stripped gnomAD VCF file (most of the annotation is removed and only calls signed by PASS are stored) in the iGenomes Annotation/GermlineResource folder. +The `AF` info field must be present. +You can find a smaller, stripped gnomAD `VCF` file (most of the annotation is removed and only calls signed by PASS are stored) in the `AWS iGenomes` `Annotation/GermlineResource` folder. If you prefer, you can specify the full path to your reference genome when you run the pipeline: ```bash @@ -744,7 +1353,7 @@ If you prefer, you can specify the full path to your reference genome when you r Tabix index of the germline resource specified at [`--germline_resource`](#--germline_resource). If you prefer, you can specify the full path to your reference genome when you run the pipeline: -> If none provided, will be generated automatically from the fasta reference. +> If none provided, will be generated automatically from the germline resource `VCF` file. ```bash --germline_resource_index @@ -752,8 +1361,33 @@ If you prefer, you can specify the full path to your reference genome when you r ### --intervals -Used to speed up Preprocessing and/or Variant Calling, for more information, read the [intervals section in the extra documentation on reference](reference.md#Intervals). +To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces. 
+The intervals are chromosomes cut at their centromeres (so each chromosome arm is processed separately), plus additional unassigned contigs.
+We are ignoring the `hs37d5` contig that contains concatenated decoy sequences.
+Parts of preprocessing and variant calling are done over these intervals, and the different resulting files are then merged.
+This can parallelize processes, and push down wall clock time significantly.
+
+The calling intervals can be defined using a `.list` or a `.bed` file.
+A `.list` file contains one interval per line in the format `chromosome:start-end` (1-based coordinates).
+
+When the intervals file is in `BED` format, the file must be a tab-separated text file with one interval per line.
+There must be at least three columns: chromosome, start, and end.
+In `BED` format, the coordinates are 0-based, so the interval `chrom:1-10` becomes `chrom 0 10` (tab-separated).
+
+Additionally, the `score` column of the `BED` file can be used to provide an estimate of how many seconds it will take to call variants on that interval.
+The fourth column remains unused.
+
+|chromosome|start|end|name (unused)|score (seconds)|
+|-|-|-|-|-|
+|chr1|10000|207666|NA|47.3|
+
+This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
+The runtime estimate is used in two different ways.
+First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that need to be spawned.
+Second, the jobs with the largest processing time are started first, which reduces wall-clock time.
+If no runtime is given, a time of 1000 nucleotides per second is assumed.
+Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
If you prefer, you can specify the full path to your reference genome when you run the pipeline:

> If none provided, will be generated automatically from the fasta reference.
@@ -775,7 +1409,7 @@ If you prefer, you can specify the full path to your reference genome when you r

If you prefer, you can specify the full path to your reference genome when you run the pipeline:

-> If none provided, will be generated automatically from the fasta reference.
+> If none provided, will be generated automatically from the known indels `VCF` file.

```bash
--known_indels_index
```
@@ -799,7 +1433,12 @@ If you prefer, you can specify the DB version when you run the pipeline:

### --species

-This specifies the species used for running VEP annotation. For human data, this needs to be set to `homo_sapiens`, for mouse data `mus_musculus` as the annotation needs to know where to look for appropriate annotation references. If you use iGenomes or a local resource with `genomes.conf`, this has already been set for you appropriately.
+This specifies the species used for running `VEP` annotation.
+If you use iGenomes or a local resource with `genomes.conf`, this has already been set for you appropriately.
+
+```bash
+--species <species>
+```

### --vep_cache_version

@@ -814,21 +1453,24 @@ If you prefer, you can specify the cache version when you run the pipeline:

### --outdir

The output directory where the results will be saved.
+
Default: `results/`

### --publish_dir_mode

The file publishing method.
+
Available: `symlink`, `rellink`, `link`, `copy`, `copyNoFollow`, `move`
+
Default: `copy`

### --sequencing_center

-The sequencing center that will be used in the BAM CN field
+The sequencing center that will be used in the `BAM` `CN` field.

### --multiqc_config

-Specify a path to a custom MultiQC configuration file. 
+Specify a path to a custom `MultiQC` configuration file.

### --monochrome_logs

@@ -849,39 +1491,25 @@ Set to receive plain-text e-mails instead of HTML formatted.

### --max_multiqc_email_size

-Threshold size for MultiQC report to be attached in notification email. If file generated by pipeline exceeds the threshold, it will not be attached (Default: 25MB).
+Threshold size for the `MultiQC` report to be attached in the notification email.
+If the file generated by the pipeline exceeds the threshold, it will not be attached.
+
+Default: `25MB`.

### -name

Name for the pipeline run.
-If not specified, Nextflow will automatically generate a random mnemonic.
-
-This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).
+If not specified, `Nextflow` will automatically generate a random mnemonic.

-**NB:** Single hyphen (core Nextflow option)
+This is used in the `MultiQC` report (if not default) and in the summary HTML / e-mail (always).

-### -resume
-
-Specify this when restarting a pipeline.
-Nextflow will used cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
-
-You can also supply a run name to resume a specific run: `-resume [run-name]`.
-Use the `nextflow log` command to show previous run names.
-
-**NB:** Single hyphen (core Nextflow option)
-
-### -c
-
-Specify the path to a specific config file (this is a core NextFlow command).
-
-**NB:** Single hyphen (core Nextflow option)
-
-Note - you can use this to override pipeline defaults.
+**NB:** Single hyphen (core `Nextflow` option)

### --custom_config_version

Provide git commit id for custom Institutional configs hosted at `nf-core/configs`.
This was implemented for reproducibility purposes.
+
Default is set to `master`.

```bash
@@ -891,25 +1519,17 @@ Default is set to `master`.
```

### --custom_config_base

-If you're running offline, nextflow will not be able to fetch the institutional config files
+If you're running offline, `Nextflow` will not be able to fetch the institutional config files
from the internet.
If you don't need them, then this is not a problem.
-If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option.
+If you do need them, you should download the files from the repository and tell `Nextflow` where to find them with the `custom_config_base` option.
For example:

```bash
## Download and unzip the config files
cd /path/to/my/configs
wget https://github.com/nf-core/configs/archive/master.zip
unzip master.zip

## Run the pipeline
cd /path/to/my/data
nextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/
```

> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.

## Job resources

### Automatic resubmission

Each step in the pipeline has a default set of requirements for number of CPUs, memory and time.
For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original).
If it still fails after three times then the pipeline is stopped. 
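+
+As an illustrative sketch, the resource ceilings documented below can also be lowered on the command line for a smaller machine (the TSV path and the values here are arbitrary examples):
+
+```bash
+nextflow run nf-core/sarek -profile docker --input <samples.tsv> --tools Strelka \
+    --max_cpus 8 --max_memory '32.GB' --max_time '24.h'
+```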
-### Custom resource requests - -Wherever process-specific requirements are set in the pipeline, the default value can be changed by creating a custom config file. -See the files hosted at [`nf-core/configs`](https://github.com/nf-core/configs/tree/master/conf) for examples. - -If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. -Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter (see definition below). -You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. - -If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack). - ### --max_memory Use to set a top-limit for the default memory requirement for each process. @@ -950,67 +1559,177 @@ Should be a string in the format integer-unit eg. `--max_cpus 1` Use to set memory for a single CPU. Should be a string in the format integer-unit eg. `--single_cpu_mem '8.GB'` -## AWSBatch specific parameters +## Containers -Running the pipeline on AWSBatch requires a couple of specific parameters to be set according to your AWSBatch configuration. -Please use [`-profile awsbatch`](https://github.com/nf-core/configs/blob/master/conf/awsbatch.config) and then specify all of the following parameters. +`sarek`, our main container is designed using [Conda](https://conda.io/). -### --awsqueue +[![sarek-docker status](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek) -The JobQueue that you intend to use on AWSBatch. 
+Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: -### --awsregion +- **[ASCAT](https://github.com/Crick-CancerGenomics/ascat)** 2.5.2 +- **[AlleleCount](https://github.com/cancerit/alleleCount)** 4.0.2 +- **[BCFTools](https://github.com/samtools/bcftools)** 1.9 +- **[bwa](https://github.com/lh3/bwa)** 0.7.17 +- **[bwa-mem2](https://github.com/bwa-mem2/bwa-mem2)** 2.0 +- **[CNVkit](https://github.com/etal/cnvkit)** 0.9.6 +- **[Control-FREEC](https://github.com/BoevaLab/FREEC)** 11.5 +- **[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)** 0.11.9 +- **[fgbio](https://github.com/fulcrumgenomics/fgbio)** 1.1.0 +- **[FreeBayes](https://github.com/ekg/freebayes)** 1.3.2 +- **[GATK4-spark](https://github.com/broadinstitute/gatk)** 4.1.6.0 +- **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0 +- **[ggplot2](https://github.com/tidyverse/ggplot2)** 3.3.0 +- **[HTSlib](https://github.com/samtools/htslib)** 1.9 +- **[Manta](https://github.com/Illumina/manta)** 1.6.0 +- **[msisensor](https://github.com/ding-lab/msisensor)** 0.5 +- **[MultiQC](https://github.com/ewels/MultiQC/)** 1.8 +- **[Qualimap](http://qualimap.bioinfo.cipf.es)** 2.2.2d +- **[SAMBLASTER](https://github.com/GregoryFaust/samblaster)** 0.1.24 +- **[samtools](https://github.com/samtools/samtools)** 1.9 +- **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t +- **[Strelka2](https://github.com/Illumina/strelka)** 2.9.10 +- **[TIDDIT](https://github.com/SciLifeLab/TIDDIT)** 2.7.1 +- **[pigz](https://zlib.net/pigz/)** 2.3.4 +- **[Trim Galore](https://github.com/FelixKrueger/TrimGalore)** 0.6.5 +- **[VCFanno](https://github.com/brentp/vcfanno)** 0.3.2 +- **[VCFtools](https://vcftools.github.io/index.html)** 0.1.16 +- **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2 -The AWS region to run your job in. -Default is set to `eu-west-1` but can be adjusted to your needs. +For annotation, the main container can be used, but the cache has to be downloaded, or additional containers are available with cache (see [annotation documentation](#using-downloaded-cache)): -### --awscli +`sareksnpeff`, our `snpeff` container is designed using [Conda](https://conda.io/). -The [AWS CLI](https://www.nextflow.io/docs/latest/awscloud.html#aws-cli-installation) path in your custom AMI. -Default: `/home/ec2-user/miniconda/bin/aws`. +[![sareksnpeff-docker status](https://img.shields.io/docker/automated/nfcore/sareksnpeff.svg)](https://hub.docker.com/r/nfcore/sareksnpeff) -Please make sure to also set the `-w/--work-dir` and `--outdir` parameters to a S3 storage bucket of your choice - you'll get an error message notifying you if you didn't. +Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: + +- **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t +- Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235` + +`sarekvep`, our `vep` container is designed using [Conda](https://conda.io/). + +[![sarekvep-docker status](https://img.shields.io/docker/automated/nfcore/sarekvep.svg)](https://hub.docker.com/r/nfcore/sarekvep) + +Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: + +- **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0 +- **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2 +- Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235` + +### Building your owns -## Deprecated params +Our containers are designed using [Conda](https://conda.io/). 
+The [`environment.yml`](../environment.yml) file can be modified if particular versions of tools are more suited to your needs. -> **WARNING** These params are deprecated -- They will be removed in a future release. +The following commands can be used to build/download containers on your own system: -### --annotateVCF +- Adjust `VERSION` for sarek version (typically a release or `dev`). -> Please check: [`--input`](#--input) +#### Build with Conda -### --noGVCF +```Bash +conda env create -f environment.yml +``` + +#### Build with Docker + +- `sarek` + +```Bash +docker build -t nfcore/sarek: . +``` + +- `sareksnpeff` + +Adjust arguments for `GENOME` version and snpEff `CACHE_VERSION` + +```Bash +docker build -t nfcore/sareksnpeff:. containers/snpeff/. --build-arg GENOME= --build-arg CACHE_VERSION= +``` + +- `sarekvep` + +Adjust arguments for `GENOME` version, `SPECIES` name and VEP `VEP_VERSION` + +```Bash +docker build -t nfcore/sarekvep:. containers/vep/. --build-arg GENOME= --build-arg SPECIES= --build-arg VEP_VERSION= +``` + +#### Pull with Docker + +- `sarek` + +```Bash +docker pull nfcore/sarek: +``` + +- `sareksnpeff` + +Adjust arguments for `GENOME` version + +```Bash +docker pull nfcore/sareksnpeff:. +``` + +- `sarekvep` + +Adjust arguments for `GENOME` version -> Please check: [`--no_gvcf`](#--no_gvcf) +```Bash +docker pull nfcore/sarekvep:. +``` + +#### Pull with Singularity -### --noReports +You can directly pull singularity image, in the path used by the Nextflow ENV variable `NXF_SINGULARITY_CACHEDIR`, ie: -> Please check: [`--skipQC`](#--skipQC) +```Bash +cd $NXF_SINGULARITY_CACHEDIR +singularity build ... +``` -### --noStrelkaBP +- `sarek` -> Please check: [`--no_strelka_bp`](#--no_strelka_bp) +```Bash +singularity build nfcore-sarek-.img docker://nfcore/sarek: +``` -### --nucleotidesPerSecond +- `sareksnpeff` -> Please check: [`--nucleotides_per_second`](#--nucleotides_per_second) +Adjust arguments for `GENOME` version -### --publishDirMode +```Bash +singularity build nfcore-sareksnpeff-..img docker://nfcore/sareksnpeff:. +``` -> Please check: [`--publish_dir_mode`](#--publish_dir_mode) +- `sarekvep` -### --sample +Adjust arguments for `GENOME` version -> Please check: [`--input`](#--input) +```Bash +singularity build nfcore-sarekvep-..img docker://nfcore/sarekvep:. +``` -### --sampleDir +## AWSBatch specific parameters -> Please check: [`--input`](#--input) +Running the pipeline on AWSBatch requires a couple of specific parameters to be set according to your AWSBatch configuration. +Please use [`-profile awsbatch`](https://github.com/nf-core/configs/blob/master/conf/awsbatch.config) and then specify all of the following parameters. -### --skipQC +### --awsqueue -> Please check: [`--skip_qc`](#--skip_qc) +The JobQueue that you intend to use on AWSBatch. -### --targetBED +### --awsregion -> Please check: [`--target_bed`](#--target_bed) +The AWS region to run your job in. + +Default is set to `eu-west-1` but can be adjusted to your needs. + +### --awscli + +The [AWS CLI](https://www.nextflow.io/docs/latest/awscloud.html#aws-cli-installation) path in your custom AMI. + +Default: `/home/ec2-user/miniconda/bin/aws`. + +Please make sure to also set the `-w/--work-dir` and `--outdir` parameters to a S3 storage bucket of your choice - you'll get an error message notifying you if you didn't. 
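+
+Putting the above together, a hypothetical AWSBatch launch could look like the following sketch, where every queue, region and bucket name is a placeholder for your own AWS setup:
+
+```bash
+nextflow run nf-core/sarek -profile awsbatch \
+    --awsqueue <my-job-queue> --awsregion eu-west-1 \
+    --input s3://<my-bucket>/samples.tsv --tools Strelka \
+    --outdir s3://<my-bucket>/results -w s3://<my-bucket>/work
+```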
diff --git a/docs/use_cases.md b/docs/use_cases.md deleted file mode 100644 index f49838c567..0000000000 --- a/docs/use_cases.md +++ /dev/null @@ -1,125 +0,0 @@ -# Use cases - -The workflow has two pre-processing options: `mapping` and `recalibrate`. -Using the `mapping` directive one will have a pair of mapped, deduplicated and recalibrated BAM files in the `Preprocessing/Recalibrated/` directory. -This is the usual option you have to give when you are starting from raw FASTQ data: - -```bash -nextflow run nf-core/sarek --input --tools -``` - -`mapping` will start by default, you do not have to give any additional parameters, only the TSV file describing the sample (see below). - -During the execution of the workflow a `execution_trace.txt`, a `execution_timeline.html` and a `execution_report.html` files are generated automatically. -These files contain statistics about resources used and processes finished. -If you start a new workflow or restart/resume a sample, the previous version will be renamed as `execution_trace.txt.1`, `execution_timeline.html.1` and `execution_report.html.1` respectively. -Also, older version are renamed with incremented numbers. - -## Starting from raw FASTQ - pair of FASTQ files - -The workflow should be started in this case with the smallest set of options as written above: - -```bash -nextflow run nf-core/sarek --input --tools -``` - -The TSV file should look like: - -| | | | | | | | -|-|-|-|-|-|-|-| -|SUBJECT_ID|XX|0|SAMPLE_ID|1|/samples/normal_1.fastq.gz|/samples/normal_2.fastq.gz| - -See the [input files documentation](input.md) for more information. - -## Starting from raw FASTQ - a directory with normal sample only - -The `--input` option can be also used to point Sarek to a directory with FASTQ files: - -```bash -nextflow run nf-core/sarek --input --tools -``` - -The given directory is searched recursively for FASTQ files that are named `*_R1_*.fastq.gz`, and a matching pair with the same name except `_R2_` instead of `_R1_` is expected to exist alongside. -All of the found FASTQ files are considered to belong to the sample. -Each FASTQ file pair gets its own read group (`@RG`) in the resulting BAM file. - -### Metadata when using `--input` with a directory - -When using `--input` with a directory, the metadata about the sample that are written to the BAM header in the `@RG` tag are determined in the following way. - -- The sample name (`SM`) is derived from the the last component of the path given to `--input`. -That is, you should make sure that that directory has a meaningful name! For example, with `--input=/my/fastqs/sample123`, the sample name will be `sample123`. -- The read group id is set to *flowcell.samplename.lane*. -The flowcell id and lane number are auto-detected from the name of the first read in the FASTQ file. - -## Starting from raw FASTQ - pair of FASTQ files for tumor/normal samples - -The workflow command line is just the same as before, but the TSV contains extra lines. -You can see the second column is used to distinguish normal and tumor samples. -You can add as many relapse samples as many you have, providing their name in the third column is different. -Each will be compared to the normal one-by-one. -Usually there are more read groups - sequencing lanes - for a single sequencing run, and in a flowcell different lanes have to be recalibrated separately. -This is captured in the TSV file only in the following manner, adding read group numbers or IDs in the fourth column. 
-All lanes belonging to the same Sample will be merged together after the FASTQ pairs are mapped to the reference genome. -Obviously, if you do not have relapse samples, you can leave out the two last lines. - -| | | | | | | | -|-|-|-|-|-|-|-| -|SUBJECT_ID|XX|0|SAMPLE_ID_N|1|/path/to/normal1_1.fastq.gz|/path/to/normal1_2.fastq.gz| -|SUBJECT_ID|XX|0|SAMPLE_ID_N|2|/path/to/normal2_1.fastq.gz|/path/to/normal2_2.fastq.gz| -|SUBJECT_ID|XX|1|SAMPLE_ID_T|3|/path/to/tumor3_1.fastq.gz|/path/to/tumor3_2.fastq.gz| -|SUBJECT_ID|XX|1|SAMPLE_ID_T|4|/path/to/tumor4_1.fastq.gz|/path/to/tumor4_2.fastq.gz| -|SUBJECT_ID|XX|1|SAMPLE_ID_T|5|/path/to/tumor5_1.fastq.gz|/path/to/tumor5_2.fastq.gz| -|SUBJECT_ID|XX|1|SAMPLE_ID_R|7|/path/to/relapse7_1.fastq.gz|/path/to/relapse7_2.fastq.gz| -|SUBJECT_ID|XX|1|SAMPLE_ID_R|9|/path/to/relapse9_1.fastq.gz|/path/to/relapse9_2.fastq.gz| - -See the [input files documentation](input.md) for more information. - -## Starting from recalibration - -```bash -nextflow run nf-core/sarek --input --step recalibrate --tools -``` - -And the corresponding TSV file should be like: -Obviously, if you do not have tumor or relapse samples, you can leave out the two last lines. - -| | | | | | | | -|-|-|-|-|-|-|-| -|SUBJECT_ID|XX|0|SAMPLE_ID_N|/path/to/SAMPLE_ID_N.bam|/path/to/SAMPLE_ID_N.bai|/path/to/SAMPLE_ID_N.recal.table| -|SUBJECT_ID|XX|1|SAMPLE_ID_T|/path/to/SAMPLE_ID_T.bam|/path/to/SAMPLE_ID_T.bai|/path/to/SAMPLE_ID_T.recal.table| -|SUBJECT_ID|XX|1|SAMPLE_ID_R|/path/to/SAMPLE_ID_R.bam|/path/to/SAMPLE_ID_R.bai|/path/to/SAMPLE_ID_R.recal.table| - -See the [input files documentation](input.md) for more information. - -## Starting from a recalibrated BAM file - -At this step we are assuming that all the required preprocessing is over, we only want to run variant callers or other tools using recalibrated BAM files. - -```bash -nextflow run nf-core/sarek --step variantcalling --tools -``` - -And the corresponding TSV file should be like: - -| | | | | | | -|-|-|-|-|-|-| -|SUBJECT_ID|XX|0|SAMPLE_ID_N|/path/to/SAMPLE_ID_N.bam|/path/to/SAMPLE_ID_N.bai| -|SUBJECT_ID|XX|1|SAMPLE_ID_T|/path/to/SAMPLE_ID_T.bam|/path/to/SAMPLE_ID_T.bai| -|SUBJECT_ID|XX|1|SAMPLE_ID_R|/path/to/SAMPLE_ID_R.bam|/path/to/SAMPLE_ID_R.bai| - -See the [input files documentation](input.md) for more information. - -If you want to restart a previous run of the pipeline, you may not have a recalibrated BAM file. -In this case, you need to start with `--step=recalibrate` (see previous section). - -## Using Sarek with targeted (whole exome or panel) sequencing data - -The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a BED file containing targets for all steps using the `--targetBED` option. -The workflow will pick up these intervals, and activate any `--exome` flag in any tools that allow it to process deeper coverage. -It is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow. 
-To add the target BED file configure the command line like: - -```bash -nextflow run nf-core/sarek --tools haplotypecaller,strelka,mutect2 --target_bed --input -``` diff --git a/docs/variant_calling.md b/docs/variant_calling.md deleted file mode 100644 index 5b4d4a72b6..0000000000 --- a/docs/variant_calling.md +++ /dev/null @@ -1,57 +0,0 @@ -# Variant calling - -- [Germline variant calling](#germline-variant-calling) -- [Somatic variant calling with tumor - normal pairs](#somatic-variant-calling-with-tumor---normal-pairs) -- [Somatic variant calling with tumor only samples](#somatic-variant-calling-with-tumor-only-samples) - -## Germline variant calling - -Using Sarek, germline variant calling will always be performed if a variant calling tool with a germline mode is selected. -You can specify the variant caller to use with the `--tools` parameter (see [usage](./usage.md) for more information). - -Germline variant calling can currently only be performed with the following variant callers: - -- FreeBayes -- HaplotypeCaller -- Manta -- mpileup -- Sentieon (check the specific [sentieon](sentieon.md) documentation) -- Strelka -- TIDDIT - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. - -## Somatic variant calling with tumor - normal pairs - -Using Sarek, somatic variant calling will be performed, if your input tsv file contains tumor / normal pairs (see [input](input.md) documentation for more information). -Different samples belonging to the same patient, where at least one is marked as normal (`0` in the `Status` column) and at least one is marked as tumor (`1` in the `Status` column) are treated as tumor / normal pairs. - -If tumor-normal pairs are provided, both germline variant calling and somatic variant calling will be performed, provided that the selected variant caller allows for it. -If the selected variant caller allows only for somatic variant calling, then only somatic variant calling results will be generated. - -Here is a list of the variant calling tools that support somatic variant calling: - -- ASCAT (check the specific [ASCAT](ascat.md) documentation) -- ControlFREEC -- FreeBayes -- Manta -- MSIsensor -- Mutect2 -- Sentieon (check the specific [sentieon](sentieon.md) documentation) -- Strelka - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. - -## Somatic variant calling with tumor only samples - -Somatic variant calling with only tumor samples (no matching normal sample), is not recommended by the GATK best practices. -This is just supported for a limited variant callers. - -Here is a list of the variant calling tools that support tumor-only somatic variant calling: - -- Manta -- mpileup -- Mutect2 -- TIDDIT - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. 
From 3fca619b3f915401d23700abe52170a48b085d9d Mon Sep 17 00:00:00 2001 From: MaxUlysse Date: Mon, 21 Sep 2020 16:04:42 +0200 Subject: [PATCH 02/11] update Nextflow required version --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9d486710a5..3646930512 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ > **An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing** -[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A519.10.0-brightgreen.svg)](https://www.nextflow.io/) +[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.07.1-brightgreen.svg)](https://www.nextflow.io/) [![nf-core](https://img.shields.io/badge/nf--core-pipeline-brightgreen.svg)](https://nf-co.re/) [![DOI](https://zenodo.org/badge/184289291.svg)](https://zenodo.org/badge/latestdoi/184289291) From 3b1d3a81105e6c4914c959ea5e55a6adec2cb106 Mon Sep 17 00:00:00 2001 From: MaxUlysse Date: Mon, 21 Sep 2020 16:07:16 +0200 Subject: [PATCH 03/11] add nextflow_schema.json --- nextflow.config | 35 +- nextflow_schema.json | 736 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 753 insertions(+), 18 deletions(-) create mode 100644 nextflow_schema.json diff --git a/nextflow.config b/nextflow.config index ce60594cf7..4d96e55f67 100644 --- a/nextflow.config +++ b/nextflow.config @@ -5,16 +5,6 @@ * Default config options for all environments. */ -manifest { - name = 'nf-core/sarek' - author = 'Maxime Garcia, Szilveszter Juhos' - homePage = 'https://github.com/nf-core/sarek' - description = 'An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing' - mainScript = 'main.nf' - nextflowVersion = '>=20.07.0' - version = '3.0dev' -} - // Global default params, used in configs params { // Workflow flags: @@ -47,10 +37,10 @@ params { three_prime_clip_r2 = 0 trim_nextseq = 0 save_trimmed = false - single_end = false // No single end split_fastq = null // Fastq files will not be split by default // Preprocessing + aligner = 'bwa-mem' markdup_java_options = '"-Xms4000m -Xmx7g"' // Established values for markDuplicates memory consumption, see https://github.com/SciLifeLab/Sarek/pull/689 for details no_gatk_spark = null // GATK Spark implementation of their tools in local mode used by default save_bam_mapped = null // Mapped BAMs not saved @@ -107,7 +97,7 @@ params { // Base specifications // Defaults only, expecting to be overwritten - cpus = 8 + cpus = 8 max_cpus = 16 max_memory = 128.GB max_time = 240.h @@ -122,9 +112,6 @@ process.container = 'nfcore/sarek:dev' // Load base.config by default for all pipelines includeConfig 'conf/base.config' -// Load modules.config by default for all pipelines -includeConfig 'conf/modules.config' - // Load nf-core custom profiles from different Institutions try { includeConfig "${params.custom_config_base}/nfcore_custom.config" @@ -165,8 +152,8 @@ profiles { test_targeted { includeConfig 'conf/test_targeted.config' } test_tool { includeConfig 'conf/test_tool.config' } test_trimming { includeConfig 'conf/test_trimming.config' } - test_haplotypecaller { includeConfig 'conf/test_germline_variantcalling.config' } - + test_umi_tso { includeConfig 'conf/test_umi_tso.config' } + test_umi_qiaseq { includeConfig 'conf/test_umi_qiaseq.config' } } // Load genomes.config or igenomes.config @@ -176,9 +163,11 @@ if (!params.igenomes_ignore) { includeConfig 'conf/genomes.config' } -// Export this variable to prevent local Python 
libraries from conflicting with those in the container +// Export these variables to prevent local Python/R libraries from conflicting with those in the container env { PYTHONNOUSERSITE = 1 + R_PROFILE_USER = "/.Rprofile" + R_ENVIRON_USER = "/.Renviron" } // Capture exit codes from upstream processes when piping @@ -201,6 +190,16 @@ trace { file = "${params.tracedir}/execution_trace.txt" } +manifest { + name = 'nf-core/sarek' + author = 'Maxime Garcia, Szilveszter Juhos' + homePage = 'https://github.com/nf-core/sarek' + description = 'An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing' + mainScript = 'main.nf' + nextflowVersion = '>=20.07.1' + version = '3.0dev' +} + // Return the minimum between requirements and a maximum limit to ensure that resource requirements don't go over def check_resource(obj) { try { diff --git a/nextflow_schema.json b/nextflow_schema.json new file mode 100644 index 0000000000..577e125a78 --- /dev/null +++ b/nextflow_schema.json @@ -0,0 +1,736 @@ +{ + "$schema": "https://json-schema.org/draft-07/schema", + "$id": "https://raw.githubusercontent.com/nf-core/sarek/master/nextflow_schema.json", + "title": "nf-core/sarek pipeline parameters", + "description": "An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing", + "type": "object", + "definitions": { + "input_output_options": { + "title": "Input/output options", + "type": "object", + "fa_icon": "fas fa-terminal", + "description": "Define where the pipeline should find input data and save output data.", + "required": [ + "input", + "step" + ], + "properties": { + "input": { + "type": "string", + "fa_icon": "fas fa-dna", + "description": "Path to input file.", + "help_text": "Use this to specify the location of your input TSV file. Input TSV file on `mapping`, `prepare_recalibration`, `recalibrate`, `variant_calling` and `Control-FREEC` steps\nMultiple TSV files can be specified with quotes\nWorks also with the path to a directory on `mapping` step with a single germline sample only\nAlternatively, path to VCF input file on `annotate` step\nMultiple VCF files can be specified with quotes." + }, + "step": { + "type": "string", + "default": "mapping", + "fa_icon": "fas fa-play", + "description": "Starting step", + "help_text": "(only one)", + "enum": [ + "mapping", + "prepare_recalibration", + "recalibrate", + "variant_calling", + "annotate", + "Control-FREEC" + ] + }, + "outdir": { + "type": "string", + "description": "The output directory where the results will be saved.", + "default": "./results", + "fa_icon": "fas fa-folder-open" + }, + "email": { + "type": "string", + "description": "Email address for completion summary.", + "fa_icon": "fas fa-envelope", + "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. 
If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.", + "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$" + } + } + }, + "main_options": { + "title": "Main options", + "type": "object", + "description": "Option used for most of the pipeline", + "default": "", + "properties": { + "tools": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-toolbox", + "description": "Specify tools to use for variant calling and/or for annotation", + "help_text": "Multiple separated with commas\n\n`DNAseq`, `DNAscope` and `TNscope` are only available with `--sentieon`", + "enum": [ + "null", + "ASCAT", + "CNVkit", + "ControlFREEC", + "FreeBayes", + "HaplotypeCaller", + "Manta", + "mpileup", + "MSIsensor", + "Mutect2", + "Strelka", + "TIDDIT", + "snpEff", + "VEP", + "merge", + "DNAseq", + "DNAscope", + "TNscope" + ] + }, + "no_intervals": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-ban", + "description": "Disable usage of intervals", + "help_text": "Intervals are part of the genome chopped up, used to speed up preprocessing and variant calling" + }, + "nucleotides_per_second": { + "type": "string", + "fa_icon": "fas fa-clock", + "description": "Estimate interval size", + "help_text": "Intervals are part of the genome chopped up, used to speed up preprocessing and variant calling" + }, + "sentieon": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-tools", + "description": "Enable Sentieon if available", + "help_text": "Adds the following options for --tools: DNAseq, DNAscope and TNscope" + }, + "skip_qc": { + "type": "string", + "fa_icon": "fas fa-forward", + "description": "Specify which QC tools to skip", + "help_text": "Multiple separated with commas\n\n`--skip_qc BaseRecalibrator` does not skip the process, but is actually just not saving the reports", + "enum": [ + "null", + "all", + "bamQC", + "BaseRecalibrator", + "BCFtools", + "Documentation", + "FastQC", + "MultiQC", + "samtools", + "vcftools", + "versions" + ] + }, + "target_bed": { + "type": "string", + "fa_icon": "fas fa-crosshairs", + "description": "Target BED file for whole exome or targeted sequencing" + } + }, + "fa_icon": "fas fa-user-cog" + }, + "trim_split_fastq": { + "title": "Trim/split FASTQ", + "type": "object", + "description": "", + "default": "", + "fa_icon": "fas fa-cut", + "properties": { + "trim_fastq": { + "type": "string", + "fa_icon": "fas fa-cut", + "description": "Run Trim Galore" + }, + "clip_r1": { + "type": "integer", + "fa_icon": "fas fa-cut", + "description": "Remove bp from the 5' end of read 1", + "help_text": "With Trim Galore" + }, + "clip_r2": { + "type": "integer", + "description": "Remove bp from the 5' end of read 5", + "help_text": "With Trim Galore" + }, + "three_prime_clip_r1": { + "type": "integer", + "fa_icon": "fas fa-cut", + "description": "Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed", + "help_text": "With Trim Galore" + }, + "three_prime_clip_r2": { + "type": "integer", + "fa_icon": "fas fa-cut", + "description": "Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed", + "help_text": "With Trim Galore" + }, + "trim_nextseq": { + "type": "integer", + "fa_icon": "fas fa-cut", + "description": "Apply the --nextseq=X option, to trim based on quality after removing poly-G tails", + "help_text": "With Trim Galore" + }, + "save_trimmed": { + "type": "string", + "fa_icon": "fas fa-save", + 
"description": "Specify how many reads should be contained in the split FastQ file", + "help_text": "If none specified, FastQs won't be split" + }, + "split_fastq": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-cut", + "description": "Save trimmed FastQ file intermediates", + "help_text": "If not used, FastQs won't be split" + } + } + }, + "preprocessing": { + "title": "Preprocessing", + "type": "object", + "description": "", + "default": "", + "fa_icon": "fas fa-toolbox", + "properties": { + "aligner": { + "type": "string", + "default": "bwa-mem", + "fa_icon": "fas fa-puzzle-piece" + }, + "markdup_java_options": { + "type": "string", + "default": "-Xms4000m -Xmx7g", + "fa_icon": "fas fa-memory", + "description": "Establish values for GATK MarkDuplicates memory consumption", + "help_text": "See https://github.com/SciLifeLab/Sarek/pull/689" + }, + "no_gatk_spark": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-ban", + "description": "Disable usage of GATK Spark implementation" + }, + "save_bam_mapped": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-download", + "description": "Save Mapped BAMs" + }, + "skip_markduplicates": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-forward", + "description": "Skip GATK MarkDuplicates" + } + }, + "required": [ + "aligner" + ] + }, + "variant_calling": { + "title": "Variant Calling", + "type": "object", + "description": "", + "default": "", + "fa_icon": "fas fa-toolbox", + "properties": { + "ascat_ploidy": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-wrench", + "description": "Overwrite ASCAT ploidy", + "help_text": "Requires that --ascat_purity is set" + }, + "ascat_purity": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-wrench", + "description": "Overwrite ASCAT purity", + "help_text": "Requires that --ascat_ploidy is set" + }, + "cf_coeff": { + "type": "number", + "default": 0.015, + "fa_icon": "fas fa-wrench", + "description": "Overwrite Control-FREEC coefficientOfVariation" + }, + "cf_ploidy": { + "type": "integer", + "default": 2, + "fa_icon": "fas fa-wrench", + "description": "Overwrite Control-FREEC ploidy" + }, + "cf_window": { + "type": "string", + "fa_icon": "fas fa-wrench", + "description": "Overwrite Control-FREEC window size" + }, + "no_gvcf": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-ban", + "description": "No g.vcf output from GATK HaplotypeCaller" + }, + "no_strelka_bp": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-ban", + "description": "Will not use Manta candidateSmallIndels for Strelka", + "help_text": "Not recommended by Best Practices" + }, + "pon": { + "type": "string", + "fa_icon": "fas fa-file", + "description": "Panel-of-normals VCF (bgzipped) for GATK Mutect2 / Sentieon TNscope", + "help_text": "See https://gatk.broadinstitute.org/hc/en-us/articles/360042479112-CreateSomaticPanelOfNormals-BETA" + }, + "pon_index": { + "type": "string", + "fa_icon": "fas fa-file", + "description": "Index of PON panel-of-normals VCF", + "help_text": "If none provided, will be generated automatically from the PON" + }, + "ignore_soft_clipped_bases": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-ban", + "description": "Do not analyze soft clipped bases in the reads for GATK Mutect2" + }, + "umi": { + "type": "boolean", + "default": "false", + "fa_icon": "fas fa-tape", + "description": "If provided, UMIs steps will be run to extract and annotate the reads with UMI and create consensus 
reads" + }, + "read_structure1": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-clipboard-list", + "description": "When processing UMIs, a read structure should always be provided for each of the fastq files.", + "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nhttps://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures" + }, + "read_structure2": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-clipboard-list", + "description": "When processing UMIs, a read structure should always be provided for each of the fastq files.", + "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nhttps://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures" + } + } + }, + "annotation": { + "title": "Annotation", + "type": "object", + "description": "", + "default": "", + "fa_icon": "fas fa-toolbox", + "properties": { + "annotate_tools": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-hammer", + "description": "Specify from which tools Sarek should look for VCF files to annotate", + "help_text": "Only for step `annotate`", + "enum": [ + "null", + "HaplotypeCaller", + "Manta", + "Mutect2", + "Strelka", + "TIDDIT" + ] + }, + "annotation_cache": { + "type": "boolean", + "default": "false", + "fa_icon": "fas fa-database", + "description": "Enable the use of cache for annotation", + "help_text": "To be used with `--snpeff_cache` and/or `--vep_cache`" + }, + "cadd_cache": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-database", + "description": "Enable CADD cache" + }, + "cadd_indels": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-file", + "description": "Path to CADD InDels file" + }, + "cadd_indels_tbi": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-file", + "description": "Path to CADD InDels index" + }, + "cadd_wg_snvs": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-file", + "description": "Path to CADD SNVs file" + }, + "cadd_wg_snvs_tbi": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-file", + "description": "Path to CADD SNVs index" + }, + "genesplicer": { + "type": "boolean", + "default": "false", + "fa_icon": "fas fa-gavel", + "description": "Enable genesplicer within VEP" + }, + "snpeff_cache": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-database", + "description": "Path to snpEff cache", + "help_text": "To be used with `--annotation_cache`" + }, + "vep_cache": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-database", + "description": "Path to VEP cache", + "help_text": "To be used with `--annotation_cache`" + } + } + }, + "reference_genome_options": { + "title": "Reference genome options", + "type": "object", + "fa_icon": "fas fa-dna", + "description": "Options for the reference genome indices used to align reads.", + "properties": { + "genome": { + "type": "string", + "description": "Name of iGenomes reference.", + "fa_icon": "fas fa-book", + "help_text": "If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. `--genome GRCh38`.\n\nSee the [nf-core website docs](https://nf-co.re/usage/reference_genomes) for more details." 
+ }, + "ac_loci": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "ac_loci_gc": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "bwa": { + "type": "string", + "fa_icon": "fas fa-copy" + }, + "chr_dir": { + "type": "string", + "fa_icon": "fas fa-folder-open" + }, + "chr_length": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "dbsnp": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "dbsnp_index": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "dict": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "fasta": { + "type": "string", + "fa_icon": "fas fa-font", + "description": "Path to FASTA genome file.", + "help_text": "If you have no genome reference available, the pipeline can build one using a FASTA file. This requires additional time and resources, so it's better to use a pre-build index if possible." + }, + "fasta_fai": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "germline_resource": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "germline_resource_index": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "intervals": { + "type": "string", + "fa_icon": "fas fa-file-alt" + }, + "known_indels": { + "type": "string", + "fa_icon": "fas fa-copy" + }, + "known_indels_index": { + "type": "string", + "fa_icon": "fas fa-copy" + }, + "mappability": { + "type": "string", + "fa_icon": "fas fa-file" + }, + "snpeff_db": { + "type": "string", + "fa_icon": "fas fa-database" + }, + "species": { + "type": "string", + "fa_icon": "fas fa-microscope" + }, + "vep_cache_version": { + "type": "string", + "fa_icon": "fas fa-tag" + }, + "save_reference": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-download", + "description": "Save built references" + }, + "igenomes_base": { + "type": "string", + "description": "Directory / URL base for iGenomes references.", + "default": "s3://ngi-igenomes/igenomes/", + "fa_icon": "fas fa-cloud-download-alt", + "hidden": true + }, + "genomes_base": { + "type": "string", + "default": "null", + "fa_icon": "fas fa-map-marker-alt", + "description": "Directory / URL base for genomes references.", + "help_text": "All files are supposed to be in the same folder", + "hidden": true + }, + "igenomes_ignore": { + "type": "boolean", + "description": "Do not load the iGenomes reference config.", + "fa_icon": "fas fa-ban", + "hidden": true, + "help_text": "Do not load `igenomes.config` when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in `igenomes.config`." + } + } + }, + "generic_options": { + "title": "Generic options", + "type": "object", + "fa_icon": "fas fa-file-import", + "description": "Less common options for the pipeline, typically set in a config file.", + "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", + "properties": { + "help": { + "type": "boolean", + "description": "Display help text.", + "hidden": true, + "fa_icon": "fas fa-question-circle" + }, + "publish_dir_mode": { + "type": "string", + "default": "copy", + "hidden": true, + "description": "Method used to save pipeline results to output directory.", + "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. 
This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.",
+                    "fa_icon": "fas fa-copy",
+                    "enum": [
+                        "symlink",
+                        "rellink",
+                        "link",
+                        "copy",
+                        "copyNoFollow",
+                        "move"
+                    ]
+                },
+                "name": {
+                    "type": "string",
+                    "description": "Workflow name.",
+                    "fa_icon": "fas fa-fingerprint",
+                    "hidden": true,
+                    "help_text": "A custom name for the pipeline run. Unlike the core nextflow `-name` option with one hyphen this parameter can be reused multiple times, for example if using `-resume`. Passed through to steps such as MultiQC and used for things like report filenames and titles."
+                },
+                "email_on_fail": {
+                    "type": "string",
+                    "description": "Email address for completion summary, only when pipeline fails.",
+                    "fa_icon": "fas fa-exclamation-triangle",
+                    "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$",
+                    "hidden": true,
+                    "help_text": "This works exactly as with `--email`, except emails are only sent if the workflow is not successful."
+                },
+                "plaintext_email": {
+                    "type": "boolean",
+                    "description": "Send plain-text email instead of HTML.",
+                    "fa_icon": "fas fa-remove-format",
+                    "hidden": true,
+                    "help_text": "Set to receive plain-text e-mails instead of HTML formatted."
+                },
+                "max_multiqc_email_size": {
+                    "type": "string",
+                    "description": "File size limit when attaching MultiQC reports to summary emails.",
+                    "default": "25.MB",
+                    "fa_icon": "fas fa-file-upload",
+                    "hidden": true,
+                    "help_text": "If file generated by pipeline exceeds the threshold, it will not be attached."
+                },
+                "monochrome_logs": {
+                    "type": "boolean",
+                    "description": "Do not use coloured log outputs.",
+                    "fa_icon": "fas fa-palette",
+                    "hidden": true,
+                    "help_text": "Set to disable colourful command line output and live life in monochrome."
+                },
+                "multiqc_config": {
+                    "type": "string",
+                    "description": "Custom config file to supply to MultiQC.",
+                    "fa_icon": "fas fa-cog",
+                    "hidden": true
+                },
+                "tracedir": {
+                    "type": "string",
+                    "description": "Directory to keep pipeline Nextflow logs and reports.",
+                    "default": "${params.outdir}/pipeline_info",
+                    "fa_icon": "fas fa-cogs",
+                    "hidden": true
+                },
+                "sequencing_center": {
+                    "type": "string",
+                    "default": "null",
+                    "fa_icon": "fas fa-university",
+                    "description": "Name of sequencing center to be displayed in BAM file"
+                }
+            }
+        },
+        "max_job_request_options": {
+            "title": "Max job request options",
+            "type": "object",
+            "fa_icon": "fab fa-acquisitions-incorporated",
+            "description": "Set the top limit for requested resources for any single job.",
+            "help_text": "If you are running on a smaller system, a pipeline step requesting more resources than are available may cause the Nextflow to stop the run with an error. These options allow you to cap the maximum resources requested by any single job so that the pipeline will run on your system.\n\nNote that you can not _increase_ the resources requested by any job using these options. For that you will need your own configuration file. 
See [the nf-core website](https://nf-co.re/usage/configuration) for details.", + "properties": { + "cpus": { + "type": "integer", + "default": 8, + "fa_icon": "fas fa-microchip" + }, + "single_cpu_mem": { + "type": "string", + "default": "7 GB", + "fa_icon": "fas fa-sd-card" + }, + "max_cpus": { + "type": "integer", + "description": "Maximum number of CPUs that can be requested for any single job.", + "default": 16, + "fa_icon": "fas fa-microchip", + "hidden": true, + "help_text": "Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. `--max_cpus 1`" + }, + "max_memory": { + "type": "string", + "description": "Maximum amount of memory that can be requested for any single job.", + "default": "128.GB", + "fa_icon": "fas fa-memory", + "hidden": true, + "help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`" + }, + "max_time": { + "type": "string", + "description": "Maximum amount of time that can be requested for any single job.", + "default": "240.h", + "fa_icon": "far fa-clock", + "hidden": true, + "help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. `--max_time '2.h'`" + } + } + }, + "institutional_config_options": { + "title": "Institutional config options", + "type": "object", + "fa_icon": "fas fa-university", + "description": "Parameters used to describe centralised config profiles. These should not be edited.", + "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.", + "properties": { + "custom_config_version": { + "type": "string", + "description": "Git commit id for Institutional configs.", + "default": "master", + "hidden": true, + "fa_icon": "fas fa-users-cog", + "help_text": "Provide git commit id for custom Institutional configs hosted at `nf-core/configs`. This was implemented for reproducibility purposes. Default: `master`.\n\n```bash\n## Download and use config file with following git commit id\n--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96\n```" + }, + "custom_config_base": { + "type": "string", + "description": "Base directory for Institutional configs.", + "default": "https://raw.githubusercontent.com/nf-core/configs/master", + "hidden": true, + "help_text": "If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option. 
For example:\n\n```bash\n## Download and unzip the config files\ncd /path/to/my/configs\nwget https://github.com/nf-core/configs/archive/master.zip\nunzip master.zip\n\n## Run the pipeline\ncd /path/to/my/data\nnextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/\n```\n\n> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.", + "fa_icon": "fas fa-users-cog" + }, + "hostnames": { + "type": "string", + "description": "Institutional configs hostname.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_description": { + "type": "string", + "description": "Institutional config description.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_contact": { + "type": "string", + "description": "Institutional config contact information.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_url": { + "type": "string", + "description": "Institutional config URL link.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + } + } + } + }, + "allOf": [ + { + "$ref": "#/definitions/input_output_options" + }, + { + "$ref": "#/definitions/main_options" + }, + { + "$ref": "#/definitions/trim_split_fastq" + }, + { + "$ref": "#/definitions/preprocessing" + }, + { + "$ref": "#/definitions/variant_calling" + }, + { + "$ref": "#/definitions/annotation" + }, + { + "$ref": "#/definitions/reference_genome_options" + }, + { + "$ref": "#/definitions/generic_options" + }, + { + "$ref": "#/definitions/max_job_request_options" + }, + { + "$ref": "#/definitions/institutional_config_options" + } + ] +} \ No newline at end of file From 6c9a2fa86b302b395d4278013d0a392d29071bfa Mon Sep 17 00:00:00 2001 From: MaxUlysse Date: Mon, 21 Sep 2020 16:09:15 +0200 Subject: [PATCH 04/11] remove empty space --- nextflow.config | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nextflow.config b/nextflow.config index 4d96e55f67..9fda24ca97 100644 --- a/nextflow.config +++ b/nextflow.config @@ -97,7 +97,7 @@ params { // Base specifications // Defaults only, expecting to be overwritten - cpus = 8 + cpus = 8 max_cpus = 16 max_memory = 128.GB max_time = 240.h From d288fe05bc4e661b9da7ae53d3879a1450fc1e0e Mon Sep 17 00:00:00 2001 From: MaxUlysse Date: Mon, 21 Sep 2020 16:10:58 +0200 Subject: [PATCH 05/11] update config --- nextflow.config | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/nextflow.config b/nextflow.config index 9fda24ca97..e19049d21e 100644 --- a/nextflow.config +++ b/nextflow.config @@ -112,6 +112,9 @@ process.container = 'nfcore/sarek:dev' // Load base.config by default for all pipelines includeConfig 'conf/base.config' +// Load modules.config by default for all pipelines +includeConfig 'conf/modules.config' + // Load nf-core custom profiles from different Institutions try { includeConfig "${params.custom_config_base}/nfcore_custom.config" @@ -145,15 +148,16 @@ profiles { singularity.autoMounts = true singularity.enabled = true } - test { includeConfig 'conf/test.config' } - test_annotation { includeConfig 'conf/test_annotation.config' } - test_no_gatk_spark { includeConfig 'conf/test_no_gatk_spark.config' } - test_split_fastq { includeConfig 'conf/test_split_fastq.config' } - test_targeted { includeConfig 'conf/test_targeted.config' } - test_tool { includeConfig 'conf/test_tool.config' } - test_trimming { includeConfig 
-    test_umi_tso { includeConfig 'conf/test_umi_tso.config' }
-    test_umi_qiaseq { includeConfig 'conf/test_umi_qiaseq.config' }
+    test                 { includeConfig 'conf/test.config' }
+    test_annotation      { includeConfig 'conf/test_annotation.config' }
+    test_no_gatk_spark   { includeConfig 'conf/test_no_gatk_spark.config' }
+    test_split_fastq     { includeConfig 'conf/test_split_fastq.config' }
+    test_targeted        { includeConfig 'conf/test_targeted.config' }
+    test_tool            { includeConfig 'conf/test_tool.config' }
+    test_trimming        { includeConfig 'conf/test_trimming.config' }
+    test_haplotypecaller { includeConfig 'conf/test_germline_variantcalling.config' }
+    test_umi_tso         { includeConfig 'conf/test_umi_tso.config' }
+    test_umi_qiaseq      { includeConfig 'conf/test_umi_qiaseq.config' }
 }
 
 // Load genomes.config or igenomes.config

From e9a3eb08b9e64e5b1e55d522689b18fb8cc61e1e Mon Sep 17 00:00:00 2001
From: MaxUlysse
Date: Mon, 21 Sep 2020 16:12:51 +0200
Subject: [PATCH 06/11] update CI tests

---
 .github/workflows/ci.yml | 75 +++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 17 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index ecaf2237ae..8138cb1002 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,30 +1,50 @@
 name: nf-core CI
-# This workflow is triggered on pushes and PRs to the repository.
-# It runs the pipeline with the minimal test dataset to check that it completes without any syntax errors.
-on: [push, pull_request]
+# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
+on:
+  push:
+    branches:
+      - dev
+  pull_request:
+  release:
+    types: [published]
 
 jobs:
   test:
+    name: Run workflow tests
+    # Only run on push if this is the nf-core dev branch (merged PRs)
+    if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/sarek') }}
+    runs-on: ubuntu-latest
     env:
       NXF_VER: ${{ matrix.nxf_ver }}
       NXF_ANSI_LOG: false
-    runs-on: ubuntu-latest
     strategy:
       matrix:
         # Nextflow versions: check pipeline minimum and current latest
-        nxf_ver: ['20.07.0-RC1', '']
+        nxf_ver: ['20.07.1', '']
     steps:
-      - uses: actions/checkout@v2
+      - name: Check out pipeline code
+        uses: actions/checkout@v2
+      - name: Check if Dockerfile or Conda environment changed
+        uses: technote-space/get-diff-action@v1
+        with:
+          PREFIX_FILTER: |
+            Dockerfile
+            environment.yml
+      - name: Build new docker image
+        if: env.GIT_DIFF
+        run: docker build --no-cache . -t nfcore/sarek:dev
+      - name: Pull docker image
+        if: ${{ !env.GIT_DIFF }}
+        run: |
+          docker pull nfcore/sarek:dev
+          docker tag nfcore/sarek:dev nfcore/sarek:dev
       - name: Install Nextflow
         run: |
          wget -qO- get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/
-      - name: Pull docker image
-        run: |
-          docker pull nfcore/sarek:dev
-          docker tag nfcore/sarek:dev nfcore/sarek:dev
-      - name: Run test
-        run: nextflow run ${GITHUB_WORKSPACE} -profile test,docker
+      - name: Run pipeline with test data
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker
 
   annotation:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
         env:
           # Only check Nextflow pipeline minimum version
-          NXF_VER: '20.07.0-RC1'
+          NXF_VER: '20.07.1'
       - name: Pull docker image
         run: |
           docker pull nfcore/sarek:dev
           docker tag nfcore/sarek:dev nfcore/sarek:dev
 
   germline:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
         env:
           # Only check Nextflow pipeline minimum version
-          NXF_VER: '20.07.0-RC1'
+          NXF_VER: '20.07.1'
       - name: Pull docker image
         run: docker pull nfcore/sarek:dev
       - name: Get test data
 
   minimal:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
         env:
           # Only check Nextflow pipeline minimum version
-          NXF_VER: '20.07.0-RC1'
+          NXF_VER: '20.07.1'
       - name: Pull docker image
         run: docker pull nfcore/sarek:dev
       - name: Run test for minimal genomes
 
   profile:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        profile: [test_split_fastq, test_targeted, test_trimming, test_no_gatk_spark]
+        profile: [test_split_fastq, test_targeted, test_trimming, test_no_gatk_spark, test_umi_tso, test_umi_qiaseq]
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
         env:
           # Only check Nextflow pipeline minimum version
-          NXF_VER: '20.07.0-RC1'
+          NXF_VER: '20.07.1'
       - name: Pull docker image
         run: docker pull nfcore/sarek:dev
       - name: Run ${{ matrix.profile }} test
         run: nextflow run ${GITHUB_WORKSPACE} -profile ${{ matrix.profile }},docker
 
+  aligner:
+    env:
+      NXF_ANSI_LOG: false
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        aligner: [bwa-mem, bwa-mem2]
+    steps:
+      - uses: actions/checkout@v2
+      - name: Install Nextflow
+        run: |
+          wget -qO- get.nextflow.io | bash
+          sudo mv nextflow /usr/local/bin/
+        env:
+          # Only check Nextflow pipeline minimum version
+          NXF_VER: '20.07.1'
+      - name: Pull docker image
+        run: docker pull nfcore/sarek:dev
+      - name: Run ${{ matrix.aligner }} test
+        run: nextflow run ${GITHUB_WORKSPACE} -profile test,docker --aligner ${{ matrix.aligner }}
 
   tools:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
         env:
           # Only check Nextflow pipeline minimum version
-          NXF_VER: '20.07.0-RC1'
+          NXF_VER: '20.07.1'
       - name: Pull docker image
         run: docker pull nfcore/sarek:dev
       - name: Run ${{ matrix.tool }} test

From 8568f725601dae7fdf32574ef2d88a3616d86411 Mon Sep 17 00:00:00 2001
From: MaxUlysse
Date: Mon, 21 Sep 2020 17:04:46 +0200
Subject: [PATCH 07/11] add --aligner to choose between bwa-mem and bwa-mem2

---
 conf/modules.config                         | 14 +++++
 main.nf                                     | 14 ++++-
 modules/local/process/bwa_mem.nf            | 44 ++++++++++++++
 modules/local/subworkflow/build_indices.nf  |  6 +-
 .../nf-core/software/bwa/index/functions.nf | 59 +++++++++++++++++++
 modules/nf-core/software/bwa/index/main.nf  | 31 ++++++++++
 modules/nf-core/software/bwa/index/meta.yml | 52 ++++++++++++++++
 7 files changed, 217 insertions(+), 3 deletions(-)
 create mode 100644 modules/local/process/bwa_mem.nf
 create mode 100644 modules/nf-core/software/bwa/index/functions.nf
 create mode 100644 modules/nf-core/software/bwa/index/main.nf
 create mode 100644 modules/nf-core/software/bwa/index/meta.yml

diff --git a/conf/modules.config b/conf/modules.config
index 6d1df81e7d..98e5fb54a3 100644
--- a/conf/modules.config
+++ b/conf/modules.config
@@ -18,6 +18,20 @@ params {
         publish_dir = "trimgalore"
         publish_results = "all"
     }
+    'bwa_index' {
+        args = ""
+        suffix = ""
+        publish_dir = "genome/bwa_index"
+        publish_results = "all"
+    }
+    'bwa_mem' {
+        args = "-K 100000000 -M"
+        args2 = ""
+        extra = ""
+        suffix = ""
+        publish_dir = ""
+        publish_results = "all"
+    }
     'bwamem2_index' {
         args = ""
         suffix = ""
diff --git a/main.nf b/main.nf
index e2c0ff5eca..1772ae9902 100644
--- a/main.nf
+++ b/main.nf
@@ -261,6 +261,7 @@ if (params.sentieon) log.warn "[nf-core/sarek] Sentieon will be used, only works
 */
 
 include { BWAMEM2_MEM } from './modules/local/process/bwamem2_mem'
+include { BWA_MEM } from './modules/local/process/bwa_mem'
 include { GET_SOFTWARE_VERSIONS } from './modules/local/process/get_software_versions'
 include { OUTPUT_DOCUMENTATION } from './modules/local/process/output_documentation'
 include { MERGE_BAM as MERGE_BAM_MAPPED } from './modules/local/process/merge_bam'
@@ -390,9 +391,18 @@ workflow {
 
     // STEP 1: MAPPING READS TO REFERENCE GENOME WITH BWA MEM
 
-    BWAMEM2_MEM(reads_input, bwa, fasta, fai, params.modules['bwamem2_mem'])
+    bam_bwamem2 = Channel.empty()
+    bwa_mem_out = Channel.empty()
 
-    bam_bwamem2 = BWAMEM2_MEM.out
+    if (params.aligner == "bwa-mem") {
+        BWA_MEM(reads_input, bwa, fasta, fai, params.modules['bwa_mem'])
+        bwa_mem_out = BWA_MEM.out.bam
+    } else {
+        BWAMEM2_MEM(reads_input, bwa, fasta, fai, params.modules['bwamem2_mem'])
+        bam_bwamem2 = BWAMEM2_MEM.out
+    }
+
+    bam_bwamem2 = bam_bwamem2.mix(bwa_mem_out)
 
     bam_bwamem2.map{ meta, bam ->
         patient = meta.patient
diff --git a/modules/local/process/bwa_mem.nf b/modules/local/process/bwa_mem.nf
new file mode 100644
index 0000000000..5a7c38ab08
--- /dev/null
+++ b/modules/local/process/bwa_mem.nf
@@ -0,0 +1,44 @@
+process BWA_MEM {
+    tag "${meta.id}"
+    label 'process_high'
+
+    publishDir "${params.outdir}/bwa/${meta.sample}",
+        mode: params.publish_dir_mode,
+        saveAs: { filename ->
+            if (filename.endsWith('.version.txt')) null
+            else filename }
+
+    container "quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:eabfac3657eda5818bae4090db989e3d41b01542-0"
+    //container "https://depot.galaxyproject.org/singularity/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:eabfac3657eda5818bae4090db989e3d41b01542-0"
+
+    conda (params.conda ? "bioconda::bwa=0.7.17 bioconda::samtools=1.10" : null)
+
+    input:
+        tuple val(meta), path(reads)
+        path bwa
+        path fasta
+        path fai
+        val options
+
+    output:
+        tuple val(meta), path("*.bam"), emit: bam
+        path "*.version.txt"          , emit: version
+
+    script:
+    CN = params.sequencing_center ? "CN:${params.sequencing_center}\\t" : ""
+    readGroup = "@RG\\tID:${meta.run}\\t${CN}PU:${meta.run}\\tSM:${meta.sample}\\tLB:${meta.sample}\\tPL:ILLUMINA"
+    // tumour samples (status == 1) are mapped with a lower mismatch penalty
+    extra = meta.status == 1 ? "-B 3" : ""
+    """
+    bwa mem \
+        ${options.args} \
+        -R \"${readGroup}\" \
+        ${extra} \
+        -t ${task.cpus} \
+        ${fasta} ${reads} | \
+        samtools sort --threads ${task.cpus} -m 2G - > ${meta.id}.bam
+
+    # samtools index ${meta.id}.bam
+
+    echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//' > bwa.version.txt
+    """
+}
diff --git a/modules/local/subworkflow/build_indices.nf b/modules/local/subworkflow/build_indices.nf
index 2e5724c446..f31128dd9b 100644
--- a/modules/local/subworkflow/build_indices.nf
+++ b/modules/local/subworkflow/build_indices.nf
@@ -7,6 +7,7 @@
 // And then initialize channels based on params or indexes that were just built
 
 include { BUILD_INTERVALS } from '../process/build_intervals.nf'
+include { BWA_INDEX } from '../../nf-core/software/bwa/index/main.nf'
 include { BWAMEM2_INDEX } from '../../nf-core/software/bwamem2_index.nf'
 include { CREATE_INTERVALS_BED } from '../process/create_intervals_bed.nf'
 include { GATK_CREATESEQUENCEDICTIONARY as GATK_DICT } from '../../nf-core/software/gatk_createsequencedictionary.nf'
@@ -29,8 +30,10 @@ workflow BUILD_INDICES{
 
     main:
     result_bwa = Channel.empty()
+    version_bwa = Channel.empty()
     if (!(params.bwa) && params.fasta && 'mapping' in step)
-        result_bwa = BWAMEM2_INDEX(fasta)
+        if (params.aligner == "bwa-mem") (result_bwa, version_bwa) = BWA_INDEX(fasta, params.modules['bwa_index'])
+        else result_bwa = BWAMEM2_INDEX(fasta)
 
     result_dict = Channel.empty()
     if (!(params.dict) && params.fasta && !('annotate' in step) && !('controlfreec' in step))
@@ -86,6 +89,7 @@ workflow BUILD_INDICES{
 
     emit:
         bwa = result_bwa
+        bwa_version = version_bwa
        dbsnp_tbi = result_dbsnp_tbi
        dict = result_dict
        fai = result_fai
diff --git a/modules/nf-core/software/bwa/index/functions.nf b/modules/nf-core/software/bwa/index/functions.nf
new file mode 100644
index 0000000000..b3ac38015b
--- /dev/null
+++ b/modules/nf-core/software/bwa/index/functions.nf
@@ -0,0 +1,59 @@
+/*
+ * -----------------------------------------------------
+ *  Utility functions used in nf-core DSL2 module files
+ * -----------------------------------------------------
+ */
+
+/*
+ * Extract name of software tool from process name using $task.process
+ */
+def getSoftwareName(task_process) {
+    return task_process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()
+}
+
+/*
+ * Function to initialise default values and to generate a Groovy Map of available options for nf-core modules
+ */
+def initOptions(Map args) {
+    def Map options = [:]
+    options.args          = args.args ?: ''
+    options.args2         = args.args2 ?: ''
+    options.publish_by_id = args.publish_by_id ?: false
+    options.publish_dir   = args.publish_dir ?: ''
+    options.publish_files = args.publish_files
+    options.suffix        = args.suffix ?: ''
+    return options
+}
+
+/*
+ * Tidy up and join elements of a list to return a path string
+ */
+def getPathFromList(path_list) {
+    def paths = path_list.findAll { item -> !item?.trim().isEmpty() }  // Remove empty entries
+    paths = paths.collect { it.trim().replaceAll("^[/]+|[/]+\$", "") } // Trim whitespace and trailing slashes
+    return paths.join('/')
+}
+
+/*
+ * Function to save/publish module results
+ */
+def saveFiles(Map args) {
+    if (!args.filename.endsWith('.version.txt')) {
+        def ioptions = initOptions(args.options)
+        def path_list = [ ioptions.publish_dir ?: args.publish_dir ]
+        if (ioptions.publish_by_id) {
+            path_list.add(args.publish_id)
+        }
+        if (ioptions.publish_files instanceof Map) {
+            for (ext in ioptions.publish_files) {
+                if (args.filename.endsWith(ext.key)) {
+                    def ext_list = path_list.collect()
+                    ext_list.add(ext.value)
+                    return "${getPathFromList(ext_list)}/$args.filename"
+                }
+            }
+        } else if (ioptions.publish_files == null) {
+            return "${getPathFromList(path_list)}/$args.filename"
+        }
+    }
+}
diff --git a/modules/nf-core/software/bwa/index/main.nf b/modules/nf-core/software/bwa/index/main.nf
new file mode 100644
index 0000000000..7dbdbd3158
--- /dev/null
+++ b/modules/nf-core/software/bwa/index/main.nf
@@ -0,0 +1,31 @@
+// Import generic module functions
+include { initOptions; saveFiles; getSoftwareName } from './functions'
+
+process BWA_INDEX {
+    tag "$fasta"
+    label 'process_high'
+    publishDir "${params.outdir}",
+        mode: params.publish_dir_mode,
+        saveAs: { filename -> saveFiles(filename:filename, options:options, publish_dir:getSoftwareName(task.process), publish_id:'') }
+
+    container "biocontainers/bwa:v0.7.17_cv1"
+    //container "https://depot.galaxyproject.org/singularity/bwa:0.7.17--hed695b0_7"
+
+    conda (params.conda ? "bioconda::bwa=0.7.17" : null)
+
+    input:
+        path fasta
+        val options
+
+    output:
+        path "${fasta}.*"   , emit: index
+        path "*.version.txt", emit: version
+
+    script:
+    def software = getSoftwareName(task.process)
+    def ioptions = initOptions(options)
+    """
+    bwa index $ioptions.args $fasta
+    echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//' > ${software}.version.txt
+    """
+}
diff --git a/modules/nf-core/software/bwa/index/meta.yml b/modules/nf-core/software/bwa/index/meta.yml
new file mode 100644
index 0000000000..a2f5b1ed66
--- /dev/null
+++ b/modules/nf-core/software/bwa/index/meta.yml
@@ -0,0 +1,52 @@
+name: bwa_index
+description: Create BWA index for reference genome
+keywords:
+  - index
+  - fasta
+  - genome
+  - reference
+tools:
+  - bwa:
+      description: |
+        BWA is a software package for mapping DNA sequences against
+        a large reference genome, such as the human genome.
+      homepage: http://bio-bwa.sourceforge.net/
+      documentation: http://bio-bwa.sourceforge.net/bwa.shtml
+      arxiv: arXiv:1303.3997
+params:
+  - outdir:
+      type: string
+      description: |
+        The pipeline's output directory. By default, the module will
+        output files into `$params.outdir/`
+  - publish_dir_mode:
+      type: string
+      description: |
+        Value for the Nextflow `publishDir` mode parameter.
+        Available: symlink, rellink, link, copy, copyNoFollow, move.
+  - conda:
+      type: boolean
+      description: |
+        Run the module with Conda using the software specified
+        via the `conda` directive
+input:
+  - fasta:
+      type: file
+      description: Input genome fasta file
+  - options:
+      type: map
+      description: |
+        Groovy Map containing module options for passing command-line arguments and
+        output file paths.
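+# Note: in this pipeline the options map is supplied from conf/modules.config
+# (BUILD_INDICES passes params.modules['bwa_index'] to this module).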
+output:
+  - index:
+      type: file
+      description: BWA genome index files
+      pattern: "*.{amb,ann,bwt,pac,sa}"
+  - version:
+      type: file
+      description: File containing software version
+      pattern: "*.{version.txt}"
+authors:
+  - "@drpatelh"
+  - "@maxulysse"

From ed3e680147948a3e58b2ada3934ccb08e3f65b92 Mon Sep 17 00:00:00 2001
From: Maxime Garcia
Date: Tue, 22 Sep 2020 10:58:30 +0200
Subject: [PATCH 08/11] Apply suggestions from code review

Co-authored-by: FriederikeHanssen
---
 docs/output.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/output.md b/docs/output.md
index ce5758a240..b44f7d7ce3 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -11,7 +11,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 - [Preprocessing](#preprocessing)
   - [Map to Reference](#map-to-reference)
-    - [bwa](#bwa)
+    - [BWA](#bwa)
     - [BWA-mem2](#bwa-mem2)
   - [Mark Duplicates](#mark-duplicates)
     - [GATK MarkDuplicates](#gatk-markduplicates)
@@ -65,9 +65,9 @@
 
 ### Map to Reference
 
-#### bwa
+#### BWA
 
-[bwa](https://github.com/lh3/bwa) is a software package for mapping low-divergent sequences against a large reference genome.
+[BWA](https://github.com/lh3/bwa) is a software package for mapping low-divergent sequences against a large reference genome.
 
 Such files are intermediate and not kept in the final files delivered to users.

From 7e111dfa955470f79db2441ec2f7d944e9f8e309 Mon Sep 17 00:00:00 2001
From: Maxime Garcia
Date: Tue, 22 Sep 2020 11:02:26 +0200
Subject: [PATCH 09/11] Update docs/usage.md

---
 docs/usage.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/usage.md b/docs/usage.md
index 6a75ae6ba2..5136b5fdbf 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -926,6 +926,10 @@ Example:
 > **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.
 > Use `--bwa=false` to have `Sarek` build them automatically.
 
+> **WARNING** BWA-mem2 is in active development.
+> Sarek might not be able to request the right amount of resources for it at the moment.
+> We recommend using pre-built indices.
+
 Example:

From 7bbffd652ec584620b6468f591ab9d2052a5fde2 Mon Sep 17 00:00:00 2001
From: Maxime Garcia
Date: Tue, 22 Sep 2020 11:16:11 +0200
Subject: [PATCH 10/11] Update docs/usage.md

---
 docs/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/usage.md b/docs/usage.md
index 5136b5fdbf..0d058cd8f6 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -925,7 +925,7 @@ Example:
 
 > **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.
 > Use `--bwa=false` to have `Sarek` build them automatically.
-
+>
 > **WARNING** BWA-mem2 is in active development.
 > Sarek might not be able to request the right amount of resources for it at the moment.
 > We recommend using pre-built indices.

From b86d5435874287a2ef299f3ee3deeff0b8776eac Mon Sep 17 00:00:00 2001
From: Maxime Garcia
Date: Tue, 22 Sep 2020 11:18:55 +0200
Subject: [PATCH 11/11] Apply suggestions from code review

---
 docs/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/usage.md b/docs/usage.md
index 0d058cd8f6..520fea039a 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -925,7 +925,7 @@ Example:
 
 > **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.
 > Use `--bwa=false` to have `Sarek` build them automatically.
-
+>
 > **WARNING** BWA-mem2 is in active development.
 > Sarek might not be able to request the right amount of resources for it at the moment.
 > We recommend using pre-built indices.
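
For quick reference, the `--aligner` parameter introduced in PATCH 07 can be exercised locally the same way the new CI `aligner` job does. The commands below are a sketch only: `nf-core/sarek` stands in for the CI checkout (`${GITHUB_WORKSPACE}`), and nothing beyond the `test`/`docker` profiles and the two matrix values shown in the patches is assumed.

```bash
# Map the bundled test dataset with classic BWA (bwa mem)
nextflow run nf-core/sarek -profile test,docker --aligner bwa-mem

# Same run mapped with BWA-mem2 instead; per the warning above, pre-built indices are recommended
nextflow run nf-core/sarek -profile test,docker --aligner bwa-mem2
```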