add more docs about scatter/gather #820

Merged
merged 8 commits into from
Nov 11, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#735](https://github.com/nf-core/sarek/pull/735) - GATK Markduplicates now natively supports CRAM output
- [#774](https://github.com/nf-core/sarek/pull/774) - Add logo for Danish National Genome Center
- [#783](https://github.com/nf-core/sarek/pull/783) - Add paths for chr length used by controlfreec to GRCh38 config
- [#820](https://github.com/nf-core/sarek/pull/820) - Improve documentation on scatter/gather effects

### Changed

29 changes: 29 additions & 0 deletions docs/usage.md
Expand Up @@ -619,6 +619,35 @@ For mapping, sarek follows the parameter suggestions provided in this [paper](ht

In addition, reads with tumor status in the sample sheet are currently mapped with a mismatch penalty of `-B 3`.

## How to manage scatter/gathering (parallelization within each sample)

While Nextflow ensures all samples are run in parallel, the pipeline can also split the input files for each sample into smaller chunks that are processed in parallel.
This speeds up the analysis of each sample, but might occupy more storage space.

The following scatter/gather options can therefore be set by the user:

### Split FASTQ files

By default, the input FASTQ files are split into smaller chunks with FastP, mapped in parallel, and then merged and duplicate-marked. This can be customized by setting the parameter `--split_fastq`.
This parameter determines how many reads are within each split. Setting it to `0` turns off splitting, so that only one mapping process is run per input FASTQ file.

> FastP creates as many chunks as CPUs are specified (by default 12) and subdivides them further if the number of reads in a chunk is larger than the value specified in `--split_fastq`. Thus, the parameter `--split_fastq` is an upper bound: for example, if 1/12th of the FASTQ file exceeds the provided value, another FASTQ file will be generated.
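
To get a feel for how `--split_fastq` interacts with the number of CPUs, the description above can be turned into a rough back-of-the-envelope estimate. The following Python sketch is only an illustration: the function `estimate_fastq_chunks` and the read counts are hypothetical, and the rounding is an assumption based on the wording above, not on how FastP or the pipeline actually computes the splits.

```python
import math


def estimate_fastq_chunks(total_reads: int, split_fastq: int, cpus: int = 12) -> int:
    """Rough estimate of the number of chunks created per input FASTQ file.

    Assumption: the file is first divided evenly across `cpus` chunks, and each
    of those is subdivided until no piece holds more than `split_fastq` reads.
    """
    if split_fastq == 0:
        # Splitting is turned off: one mapping process per input FASTQ file.
        return 1
    reads_per_cpu_chunk = math.ceil(total_reads / cpus)
    pieces_per_chunk = max(1, math.ceil(reads_per_cpu_chunk / split_fastq))
    return cpus * pieces_per_chunk


# Hypothetical read counts, chosen only to show the effect of the upper bound.
print(estimate_fastq_chunks(600_000_000, split_fastq=50_000_000))
print(estimate_fastq_chunks(1_200_000_000, split_fastq=50_000_000))
print(estimate_fastq_chunks(1_200_000_000, split_fastq=0))
```

Under these assumptions, doubling the number of reads doubles the estimated number of chunks once 1/12th of the file exceeds the `--split_fastq` value, while `--split_fastq 0` always yields a single chunk.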

### Intervals for Base Quality Score Recalibration and Variant Calling

The pipeline can parallelize base quality score recalibration and variant calling across genomic chunks of roughly similar sizes.
For this, a BED file containing the genomic regions of interest is used; this is the intervals file.
By default, the intervals file used for WGS is the one provided by GATK (details [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035889551-When-should-I-restrict-my-analysis-to-specific-intervals-)).
When running targeted analysis, it is recommended to use the bed file containing the targeted regions.

The amount of scatter/gathering can be customized by adjusting the parameter `--nucleotides_per_second`.

> **NB:** The _same_ intervals are processed regardless of the number of groups. The number of groups, however, determines how many compute nodes the analysis is scattered over.

The default value is `1000`; increasing this value will _reduce_ the number of groups that are processed in parallel.
Generally, the fewer groups there are (each group then contains more regions), the slower the processing and the less storage space is consumed.
In particular, in cloud computing settings it is often advisable to reduce the number of groups run in parallel in order to reduce data staging steps.
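
The relationship between `--nucleotides_per_second` and the number of groups can be illustrated with a simplified model. The Python sketch below is not the pipeline's actual grouping code: the function `group_intervals`, the `budget_seconds` threshold, and the interval coordinates are all hypothetical. It only assumes that each interval's runtime is estimated as its length divided by `nucleotides_per_second`, and that intervals are packed into a group until an assumed time budget is reached.

```python
def group_intervals(intervals, nucleotides_per_second=1000, budget_seconds=600):
    """Greedily pack (chrom, start, end) intervals into groups by estimated runtime.

    `budget_seconds` is a hypothetical per-group time budget used for illustration.
    """
    groups, current, elapsed = [], [], 0.0
    for chrom, start, end in intervals:
        # Estimated runtime of this interval under the assumed processing speed.
        duration = (end - start) / nucleotides_per_second
        if current and elapsed + duration > budget_seconds:
            # The current group is full: close it and start a new one.
            groups.append(current)
            current, elapsed = [], 0.0
        current.append((chrom, start, end))
        elapsed += duration
    if current:
        groups.append(current)
    return groups


# Six hypothetical intervals of 300,000 bp each.
intervals = [("chr1", i * 300_000, (i + 1) * 300_000) for i in range(6)]

for speed in (1000, 2000, 10000):
    groups = group_intervals(intervals, nucleotides_per_second=speed)
    print(speed, len(groups))
```

In this model, a larger `nucleotides_per_second` makes every interval look cheaper, so more intervals fit into one group and fewer groups are created. The set of intervals that is processed stays the same in every case.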

## How to create a panel-of-normals for Mutect2

For a detailed tutorial on how to create a panel-of-normals, see [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531132).