Merge pull request #208 from MaxUlysse/dev

Getting dev up to date
nf-core · May 26, 2020 · 25dd584
2 parents e047b7d + 6da920a

Showing 22 changed files with 349 additions and 267 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,7 +4,22 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
-## [2.6dev] - Piellorieppe
+## [dev]
+
+### Added
+
+### Changed
+
+- [#208](https://github.com/nf-core/sarek/pull/208) - Merge changes from the release PR
+- [#208](https://github.com/nf-core/sarek/pull/208) - Bump version to `3.0dev`
+
+### Fixed
+
+### Deprecated
+
+### Removed
+
+## [2.6] - Piellorieppe
 
 Piellorieppe is one of the main massif in the Sarek National Park.
 
@@ -63,8 +78,8 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
 - [#180](https://github.com/nf-core/sarek/pull/180), [#195](https://github.com/nf-core/sarek/pull/195) - Improve minimal setting
 - [#183](https://github.com/nf-core/sarek/pull/183), [#204](https://github.com/nf-core/sarek/pull/204) - Update `input.md` documentation
-- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by`DuplicatesMarked`
-- [#204](https://github.com/nf-core/sarek/pull/204) - Output directory `controlFREEC` is now replaced by`Control-FREEC`
+- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by `DuplicatesMarked`
+- [#204](https://github.com/nf-core/sarek/pull/204) - Output directory `controlFREEC` is now replaced by `Control-FREEC`
 
 ### Fixed
 

diff --git a/Dockerfile b/Dockerfile
@@ -7,7 +7,7 @@ COPY environment.yml /
 RUN conda env create -f /environment.yml && conda clean -a
 
 # Add conda installation dir to PATH (instead of doing 'conda activate')
-ENV PATH /opt/conda/envs/nf-core-sarek-2.6dev/bin:$PATH
+ENV PATH /opt/conda/envs/nf-core-sarek-3.0dev/bin:$PATH
 
 # Dump the details of the installed packages to a file for posterity
-RUN conda env export --name nf-core-sarek-2.6dev > nf-core-sarek-2.6dev.yml
+RUN conda env export --name nf-core-sarek-3.0dev > nf-core-sarek-3.0dev.yml
diff --git a/README.md b/README.md
@@ -18,7 +18,9 @@
 
 ## Introduction
 
-Sarek is a workflow designed to run analyses on whole genome or targeted sequencing data from regular samples or tumour / normal pairs and could include additional relapses.
+Sarek is a workflow designed to detect variants on whole genome or targeted sequencing data.
+Initially designed for Human, and Mouse, it can work on any species with a reference genome.
+Sarek can also handle tumour / normal pairs and could include additional relapses.
 
 It's built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
 It comes with docker containers making installation trivial and results highly reproducible.
@@ -93,16 +95,19 @@ Helpful contributors:
 * [Francesco L](https://github.com/nibscles)
 * [Friederike Hanssen](https://github.com/FriederikeHanssen)
 * [Gisela Gabernet](https://github.com/ggabernet)
+* [Harshil Patel](https://github.com/drpatelh)
+* [James A. Fellows Yates](https://github.com/jfy133)
 * [Jesper Eisfeldt](https://github.com/J35P312)
 * [Johannes Alneberg](https://github.com/alneberg)
-* [Tobias Koch](https://github.com/KochTobi)
 * [Lucia Conde](https://github.com/lconde-ucl)
 * [Malin Larsson](https://github.com/malinlarsson)
 * [Marcel Martin](https://github.com/marcelm)
 * [Nilesh Tawari](https://github.com/nilesh-tawari)
+* [Olga Botvinnik](https://github.com/olgabot)
 * [Phil Ewels](https://github.com/ewels)
 * [Sabrina Krakau](https://github.com/skrakau)
 * [Sebastian-D](https://github.com/Sebastian-D)
+* [Tobias Koch](https://github.com/KochTobi)
 * [Winni Kretzschmar](https://github.com/winni2k)
 * [arontommi](https://github.com/arontommi)
 * [bjornnystedt](https://github.com/bjornnystedt)
@@ -130,7 +135,7 @@ For further information or help, don't hesitate to get in touch on [Slack](https
 ## Citation
 
 If you use `nf-core/sarek` for your analysis, please cite the `Sarek` article as follows:
-> Garcia M, Juhos S, Larsson M et al. **Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 1; peer review: 2 approved]** *F1000Research* 2020, 9:63 [doi: 10.12688/f1000research.16665.1](https://f1000research.com/articles/9-63/v1).
+> Garcia M, Juhos S, Larsson M et al. **Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 1; peer review: 2 approved]** *F1000Research* 2020, 9:63 [doi: 10.12688/f1000research.16665.1](http://dx.doi.org/10.12688/f1000research.16665.1).
 
 You can cite the sarek zenodo record for a specific version using the following [doi: 10.5281/zenodo.3476426](https://zenodo.org/badge/latestdoi/184289291)
 

diff --git a/bin/scrape_software_versions.py b/bin/scrape_software_versions.py
@@ -8,6 +8,8 @@
     'ASCAT': ['v_ascat.txt', r"Version:       (\S+)"],
     'bcftools': ['v_bcftools.txt', r"bcftools (\S+)"],
     'BWA': ['v_bwa.txt', r"Version: (\S+)"],
+    'CNVkit': ['v_cnvkit.txt', r"(\S+)"],
+    'Control-FREEC': ['v_controlfreec.txt', r"Control-FREEC\s(\S+)"],
     'FastQC': ['v_fastqc.txt', r"FastQC v(\S+)"],
     'FreeBayes': ['v_freebayes.txt', r"version:  v(\d\.\d\.\d+)"],
     'GATK': ['v_gatk.txt', r"Version:(\S+)"],
@@ -17,31 +19,33 @@
     'MultiQC': ['v_multiqc.txt', r"multiqc, version (\S+)"],
     'Nextflow': ['v_nextflow.txt', r"(\S+)"],
     'nf-core/sarek': ['v_pipeline.txt', r"(\S+)"],
-    'Qualimap': ['v_qualimap.txt', r"QualiMap v.(\S+)"],
+    'QualiMap': ['v_qualimap.txt', r"QualiMap v.(\S+)"],
     'R': ['v_r.txt', r"R version (\S+)"],
     'samtools': ['v_samtools.txt', r"samtools (\S+)"],
-    'SnpEff': ['v_snpeff.txt', r"version SnpEff (\S+)"],
+    'SnpEff': ['v_snpeff.txt', r"SnpEff\s(\S+)"],
     'Strelka': ['v_strelka.txt', r"([0-9.]+)"],
     'TIDDIT': ['v_tiddit.txt', r"TIDDIT-(\S+)"], 
     'Trim Galore': ['v_trim_galore.txt', r"version (\S+)"], 
     'vcftools': ['v_vcftools.txt', r"([0-9.]+)"],
-    'VEP': ['v_vep.txt', r"ensembl-vep          : (\S+)"],
+    'VEP': ['v_vep.txt', r"ensembl-vep          : (\S+)"]
 }
 results = OrderedDict()
 results['nf-core/sarek'] = '<span style="color:#999999;\">N/A</span>'
 results['Nextflow'] = '<span style="color:#999999;\">N/A</span>'
-results['AlleleCount'] = '<span style="color:#999999;\">N/A</span>'
 results['ASCAT'] = '<span style="color:#999999;\">N/A</span>'
+results['AlleleCount'] = '<span style="color:#999999;\">N/A</span>'
 results['bcftools'] = '<span style="color:#999999;\">N/A</span>'
 results['BWA'] = '<span style="color:#999999;\">N/A</span>'
+results['CNVkit'] = '<span style="color:#999999;\">N/A</span>'
+results['Control-FREEC'] = '<span style="color:#999999;\">N/A</span>'
 results['FastQC'] = '<span style="color:#999999;\">N/A</span>'
 results['FreeBayes'] = '<span style="color:#999999;\">N/A</span>'
 results['GATK'] = '<span style="color:#999999;\">N/A</span>'
 results['htslib'] = '<span style="color:#999999;\">N/A</span>'
 results['Manta'] = '<span style="color:#999999;\">N/A</span>'
 results['msisensor'] = '<span style="color:#999999;\">N/A</span>'
 results['MultiQC'] = '<span style="color:#999999;\">N/A</span>'
-results['Qualimap'] = '<span style="color:#999999;\">N/A</span>'
+results['QualiMap'] = '<span style="color:#999999;\">N/A</span>'
 results['R'] = '<span style="color:#999999;\">N/A</span>'
 results['samtools'] = '<span style="color:#999999;\">N/A</span>'
 results['SnpEff'] = '<span style="color:#999999;\">N/A</span>'
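
Note on the hunks above: each entry in the dictionary pairs a `v_*.txt` file with a regular expression whose first capture group is taken as the tool version. The sketch below shows how the new and changed patterns behave. The patterns are copied from the diff, but the sample file contents and the `extract_version` helper are assumptions made for illustration, not real tool output or code from the script.

```python
import re

# Patterns copied verbatim from the diff above.
regexes = {
    'CNVkit': r"(\S+)",
    'Control-FREEC': r"Control-FREEC\s(\S+)",
    'SnpEff': r"SnpEff\s(\S+)",
}

# Hypothetical contents of the v_*.txt files (assumed, not real tool output).
sample_outputs = {
    'CNVkit': "0.9.6\n",
    'Control-FREEC': "Control-FREEC v11.5 : a method for copy number analysis\n",
    'SnpEff': "SnpEff 4.3t 2017-11-24\n",
}


def extract_version(pattern, text):
    """Return the first capture group of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Apply each pattern to its sample text.
versions = {
    name: extract_version(pattern, sample_outputs[name])
    for name, pattern in regexes.items()
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

Because `re.search` scans the whole string, the relaxed `SnpEff\s(\S+)` pattern matches whether or not the line begins with `version`. The previous pattern, `version SnpEff (\S+)`, required that prefix.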

diff --git a/conf/base.config b/conf/base.config
@@ -15,7 +15,7 @@ process {
   time = {check_resource(24.h * task.attempt)}
   shell = ['/bin/bash', '-euo', 'pipefail']
 
-  errorStrategy = {task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
+  errorStrategy = {task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish'}
   maxErrors = '-1'
   maxRetries = 3
 

diff --git a/conf/test.config b/conf/test.config
@@ -23,15 +23,18 @@ params {
   igenomes_ignore = true
   genome = 'smallGRCh37'
   genomes_base = "https://raw.githubusercontent.com/nf-core/test-datasets/sarek/reference"
+  snpeff_db         = 'WBcel235.86'
+  species           = 'caenorhabditis_elegans'
+  vep_cache_version = '99'
 }
 
 process {
   withName:Snpeff {
-    container = 'nfcore/sareksnpeff:dev.GRCh37'
+    container = 'nfcore/sareksnpeff:dev.WBcel235'
     maxForks = 1
   }
   withLabel:VEP {
-    container = 'nfcore/sarekvep:dev.GRCh37'
+    container = 'nfcore/sarekvep:dev.WBcel235'
     maxForks = 1
   }
 }
diff --git a/conf/test_annotation.config b/conf/test_annotation.config
@@ -10,7 +10,5 @@
 includeConfig 'test.config'
 
 params {
-  igenomes_ignore   = false
-  genome            = 'WBcel235'
   input             = 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/testdata/vcf/Strelka_1234N_variants.vcf.gz'
 }
diff --git a/containers/snpeff/Dockerfile b/containers/snpeff/Dockerfile
@@ -9,7 +9,7 @@ COPY environment.yml /
 RUN conda env create -f /environment.yml && conda clean -a
 
 # Add conda installation dir to PATH (instead of doing 'conda activate')
-ENV PATH /opt/conda/envs/nf-core-sarek-snpeff-2.6dev/bin:$PATH
+ENV PATH /opt/conda/envs/nf-core-sarek-snpeff-3.0dev/bin:$PATH
 
 # Setup default ARG variables
 ARG GENOME=GRCh38
@@ -19,4 +19,4 @@ ARG SNPEFF_CACHE_VERSION=86
 RUN snpEff download -v ${GENOME}.${SNPEFF_CACHE_VERSION}
 
 # Dump the details of the installed packages to a file for posterity
-RUN conda env export --name nf-core-sarek-snpeff-2.6dev > nf-core-sarek-snpeff-2.6dev.yml
+RUN conda env export --name nf-core-sarek-snpeff-3.0dev > nf-core-sarek-snpeff-3.0dev.yml
diff --git a/containers/snpeff/environment.yml b/containers/snpeff/environment.yml
@@ -1,6 +1,6 @@
 # You can use this file to create a conda environment for this pipeline:
 # conda env create -f environment.yml
-name: nf-core-sarek-snpeff-2.6dev
+name: nf-core-sarek-snpeff-3.0dev
 channels:
   - conda-forge
   - bioconda

diff --git a/containers/vep/Dockerfile b/containers/vep/Dockerfile
@@ -9,7 +9,7 @@ COPY environment.yml /
 RUN conda env create -f /environment.yml && conda clean -a
 
 # Add conda installation dir to PATH (instead of doing 'conda activate')
-ENV PATH /opt/conda/envs/nf-core-sarek-vep-2.6dev/bin:$PATH
+ENV PATH /opt/conda/envs/nf-core-sarek-vep-3.0dev/bin:$PATH
 
 # Setup default ARG variables
 ARG GENOME=GRCh38
@@ -27,4 +27,4 @@ RUN vep_install \
   --NO_BIOPERL --NO_HTSLIB --NO_TEST --NO_UPDATE
 
 # Dump the details of the installed packages to a file for posterity
-RUN conda env export --name nf-core-sarek-vep-2.6dev > nf-core-sarek-vep-2.6dev.yml
+RUN conda env export --name nf-core-sarek-vep-3.0dev > nf-core-sarek-vep-3.0dev.yml
diff --git a/containers/vep/environment.yml b/containers/vep/environment.yml
@@ -1,6 +1,6 @@
 # You can use this file to create a conda environment for this pipeline:
 # conda env create -f environment.yml
-name: nf-core-sarek-vep-2.6dev
+name: nf-core-sarek-vep-3.0dev
 channels:
   - conda-forge
   - bioconda

diff --git a/docs/annotation.md b/docs/annotation.md
@@ -28,12 +28,12 @@ The main Sarek container has also `snpEff` and `VEP` installed, but without the
 
 ## Download cache
 
-A Nextflow helper script has been designed to help downloading `snpEff` and `VEP` cache.
+A Nextflow helper script has been designed to help downloading `snpEff` and `VEP` caches.
 Such files are meant to be shared between multiple users, so this script is mainly meant for people administrating servers, clusters and advanced users.
 
 ```bash
-nextflow run download_cache.nf --snpeff_cache </Path/To/snpEffCache> --snpeff_db <snpEff DB version> --genome <GENOME>
-nextflow run download_cache.nf --vep_cache </Path/To/VEPcache> --species <species> --vep_cache_version <VEP cache version> --genome <GENOME>
+nextflow run download_cache.nf --snpeff_cache </path/to/snpEff/cache> --snpeff_db <snpEff DB version> --genome <GENOME>
+nextflow run download_cache.nf --vep_cache </path/to/VEP/cache> --species <species> --vep_cache_version <VEP cache version> --genome <GENOME>
 ```
 
 ## Using downloaded cache
@@ -46,8 +46,8 @@ The cache will only be used when `--annotation_cache` and cache directories are
 Example:
 
 ```bash
-nextflow run nf-core/sarek --tools snpEff --step annotate --sample file.vcf.gz --snpeff_cache </Path/To/snpEffCache> --annotation_cache
-nextflow run nf-core/sarek --tools VEP --step annotate --sample file.vcf.gz --vep_cache </Path/To/vepCache> --annotation_cache
+nextflow run nf-core/sarek --tools snpEff --step annotate --sample <file.vcf.gz> --snpeff_cache </path/to/snpEff/cache> --annotation_cache
+nextflow run nf-core/sarek --tools VEP --step annotate --sample <file.vcf.gz> --vep_cache </path/to/VEP/cache> --annotation_cache
 ```
 
 ## Using VEP CADD plugin
@@ -61,11 +61,11 @@ To enable the use of the VEP CADD plugin:
 Example:
 
 ```bash
-nextflow run nf-core/sarek --step annotate --tools VEP --sample file.vcf.gz --cadd_cache \
-    --cadd_InDels </PathToCADD/InDels.tsv.gz> \
-    --cadd_InDels_tbi </PathToCADD/InDels.tsv.gz.tbi> \
-    --cadd_WG_SNVs </PathToCADD/whole_genome_SNVs.tsv.gz> \
-    --cadd_WG_SNVs_tbi </PathToCADD/whole_genome_SNVs.tsv.gz.tbi>
+nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --cadd_cache \
+    --cadd_indels </path/to/CADD/cache/InDels.tsv.gz> \
+    --cadd_indels_tbi </path/to/CADD/cache/InDels.tsv.gz.tbi> \
+    --cadd_wg_snvs </path/to/CADD/cache/whole_genome_SNVs.tsv.gz> \
+    --cadd_wg_snvs_tbi </path/to/CADD/cache/whole_genome_SNVs.tsv.gz.tbi>
 ```
 
 ### Downloading CADD files
@@ -74,7 +74,7 @@ An helper script has been designed to help downloading CADD files.
 Such files are meant to be share between multiple users, so this script is mainly meant for people administrating servers, clusters and advanced users.
 
 ```bash
-nextflow run download_cache.nf --cadd_cache </Path/To/CADDcache> --cadd_version <CADD version> --genome <GENOME>
+nextflow run download_cache.nf --cadd_cache </path/to/CADD/cache> --cadd_version <CADD version> --genome <GENOME>
 ```
 
 ## Using VEP GeneSplicer plugin
@@ -86,5 +86,5 @@ To enable the use of the VEP GeneSplicer plugin:
 Example:
 
 ```bash
-nextflow run nf-core/sarek --step annotate --tools VEP --sample file.vcf.gz --genesplicer
+nextflow run nf-core/sarek --step annotate --tools VEP --sample <file.vcf.gz> --genesplicer
 ```
diff --git a/docs/ascat.md b/docs/ascat.md
@@ -7,7 +7,7 @@ ASCAT is written in R and available here: [github.com/Crick-CancerGenomics/ascat
 
 To run ASCAT on NGS data we need BAM files for the tumor and normal samples, as well as a loci file with SNP positions.
 If ASCAT is run on SNP array data, the loci file contains the SNPs on the chip.
-When runnig ASCAT on NGS data we can use the same loci file, for exampe the one corresponding to the AffymetrixGenome-Wide Human SNP Array 6.0, but we can also choose a loci file of our choice with i.e. SNPs detected in the 1000 Genomes project.
+When runnig ASCAT on NGS data we can use the same loci file, for example the one corresponding to the AffymetrixGenome-Wide Human SNP Array 6.0, but we can also choose a loci file of our choice with i.e. SNPs detected in the 1000 Genomes project.
 
 ### BAF and LogR values
 
@@ -132,29 +132,28 @@ Names of the chromosomes in chrom.sizes file must be the same as in the genome r
 
 This created GC correction files with the following column headers:
 
-```text
-Chr Position    25  50  100 200 500 1000    2000    5000    10000   20000   50000   100000  200000  500000  1M  2M  5M  10M
-```
+| | | | | | | | | | | | | | | | | | | | |
+|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
+|Chr|Position|25|50|100|200|500|1000|2000|5000|10000|20000|50000|100000|200000|500000|1M|2M|5M|10M|
 
 This file gave an error when running ASCAT, and the error message suggested that it had to do with the column headers.
 The Readme.txt in <https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing> suggested that the column headers should be:
 
-```text
-Chr Position    25bp    50bp    100bp   200bp   500bp   1000bp  2000bp  5000bp  10000bp 20000bp 50000bp 100000bp    200000bp    500000bp    1M  2M  5M  10M
-```
+| | | | | | | | | | | | | | | | | | | | |
+|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
+|Chr|Position|25bp|50bp|100bp|200bp|500bp|1000bp|2000bp|5000bp|10000bp|20000bp|50000bp|100000bp|200000bp|500000bp|1M|2M|5M|10M|
 
 The column headers headers of the generated GC correction files were therefore manually edited.
 
 #### Format of GC correction file
 
 The final files are tab-delimited with the following columns (and some example data):
 
-```text
-Chr Position    25bp    50bp    100bp   200bp   500bp   1000bp  2000bp  5000bp  10000bp 20000bp 50000bp 100000bp    200000bp    500000bp    1M  2M  5M  10M
-snp1    1   14930   0.541667    0.58    0.61    0.585   0.614   0.62    0.6 0.5888  0.588   0.4277  0.395041    0.380702    0.383259    0.341592    0.339747    0.386343    0.500537    0.511514
-snp2    1   15211   0.625   0.64    0.67    0.63    0.61    0.612   0.6135  0.591   0.5922  0.4358  0.39616 0.380411    0.383167    0.34163 0.339771 0.386417   0.500558    0.511511
-snp3    1   15820   0.541667    0.56    0.62    0.655   0.65    0.612   0.5885  0.5936  0.5797  0.4511  0.397771    0.379945    0.382999    0.341791    0.339832    0.386554    0.500579    0.511504
-```
+|Chr|Position|25bp|50bp|100bp|200bp|500bp|1000bp|2000bp|5000bp|10000bp|20000bp|50000bp|100000bp|200000bp|500000bp|1M|2M|5M|10M|
+|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
+|snp1|1|14930|0.541667|0.58|0.61|0.585|0.614|0.62|0.6|0.5888|0.588|0.4277|0.395041|0.380702|0.383259|0.341592|0.339747|0.386343|0.500537|0.511514
+|snp2|1|15211|0.625|0.64|0.67|0.63|0.61|0.612|0.6135|0.591|0.5922|0.4358|0.39616|0.380411|0.383167|0.34163|0.339771|0.386417|0.500558|0.511511
+|snp3|1|15820|0.541667|0.56|0.62|0.655|0.65|0.612|0.5885|0.5936|0.5797|0.4511|0.397771|0.379945|0.382999|0.341791|0.339832|0.386554|0.500579|0.511504
 
 ### Output
 

diff --git a/docs/images/sarek_workflow.png b/docs/images/sarek_workflow.png