nf-core · FriederikeHanssen · Nov 11, 2022 · Nov 9, 2022 · Nov 9, 2022 · Nov 10, 2022
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#735](https://github.com/nf-core/sarek/pull/735) - GATK Markduplicates now natively supports CRAM output
 - [#774](https://github.com/nf-core/sarek/pull/774) - Add logo for Danish National Genome Center
 - [#783](https://github.com/nf-core/sarek/pull/783) - Add paths for chr length used by controlfreec to GRCh38 config
+- [#820](https://github.com/nf-core/sarek/pull/820) - Improve documentation on scatter/gather effects
 
 ### Changed
 

@@ -619,6 +619,35 @@ For mapping, sarek follows the parameter suggestions provided in this [paper](ht
 
 In addition, reads with tumor status in the sample sheet are currently mapped with a mismatch penalty of `-B 3`.
 
+## How to manage scatter/gathering (parallelization within each sample)
+
+While Nextflow ensures all samples are run in parallel, the pipeline can split input files for each sample into smaller chunks which are processed in parallel.
+This speeds up the analysis of each sample, but might occupy more storage space.
+
+The user can therefore adjust the following scatter/gather options:
+
+### Split Fastq files
+
+By default, the input FASTQ files are split into smaller chunks with FastP, mapped in parallel, and then merged and duplicate marked. This can be customized by setting the parameter `--split_fastq`.
+This parameter determines how many reads are within each split. Setting it to `0` will turn off any splitting, so that only one mapping process is run per input FASTQ file.
+
+> FastP creates as many chunks as CPUs are specified (by default 12) and subdivides them further if the number of reads in a chunk is larger than the value specified in `--split_fastq`. Thus, the parameter `--split_fastq` is an upper bound, e.g. if 1/12th of the FASTQ file exceeds the provided value, another FASTQ file will be generated.
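+
+As a minimal sketch of the two cases described above, the commands below first disable splitting and then cap the number of reads per chunk. The file name `samplesheet.csv`, the output directory `results`, and the value `20000000` are placeholders chosen for illustration, not recommendations; only `--split_fastq` is taken from this section.
+
+```bash
+# No splitting: one mapping process per input FASTQ file
+nextflow run nf-core/sarek --input samplesheet.csv --outdir results --split_fastq 0
+
+# Upper bound of 20 million reads per chunk (illustrative value)
+nextflow run nf-core/sarek --input samplesheet.csv --outdir results --split_fastq 20000000
+```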
+
+### Intervals for Base Quality Score Recalibration and Variant Calling
+
+The pipeline can parallelize base quality score recalibration and variant calling across genomic chunks of roughly similar sizes.
+For this, a BED file containing the genomic regions of interest is used: the intervals file.
+By default, the intervals file used for WGS is the one provided by GATK (details [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035889551-When-should-I-restrict-my-analysis-to-specific-intervals-)).
+When running a targeted analysis, it is recommended to use the BED file containing the targeted regions.
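+
+For a targeted analysis, a command along the following lines could be used. This is a sketch that assumes the BED file is passed via the pipeline's `--intervals` parameter; `targets.bed`, `samplesheet.csv` and `results` are placeholder names.
+
+```bash
+# Restrict recalibration and variant calling to the targeted regions
+# (targets.bed is a hypothetical file name)
+nextflow run nf-core/sarek --input samplesheet.csv --outdir results --intervals targets.bed
+```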
+
+The amount of scatter/gathering can be customized by adjusting the parameter `--nucleotides_per_second`.
+
+> **NB:** The _same_ intervals are processed regardless of the number of groups. The number of groups, however, determines how many compute nodes the analysis is scattered over.
+
+The default value is `1000`; increasing this value will _reduce_ the number of groups that are processed in parallel.
+Generally, the fewer groups there are (each group then contains more regions), the slower the processing and the less storage space is consumed.
+In particular, in a cloud computing setting, it is often advisable to reduce the number of groups run in parallel in order to reduce data staging steps.
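+
+As an illustration, the following sketch raises `--nucleotides_per_second` above the default so that fewer, larger groups are created. The value `5000` is arbitrary and only shows the direction of the change; `samplesheet.csv` and `results` are placeholder names.
+
+```bash
+# Larger value than the default of 1000: fewer groups, fewer staging steps, slower per-sample processing
+nextflow run nf-core/sarek --input samplesheet.csv --outdir results --nucleotides_per_second 5000
+```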
+
 ## How to create a panel-of-normals for Mutect2
 
 For a detailed tutorial on how to create a panel-of-normals, see [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531132).