Added parameters to ensure reproducible results from tools #65

skrakau · 2020-07-10T14:00:38Z

Until now, the tools MEGAHIT, SPAdes and MetaBAT2 did not produce reproducible results within this pipeline. For MetaBAT2 this is fixed simply by setting its seed parameter: the seed now defaults to 1 and is exposed as the pipeline parameter --metabat_rng_seed.

For MEGAHIT and SPAdes there is unfortunately no seed option. MEGAHIT can only ensure reproducible results when run single-threaded (voutcn/megahit#170). SPAdes on the other hand was designed to generate reproducible results for a given number of threads (ablab/spades#111).

For this reason I added parameters to fix the number of cpus for those processes: to 1 for MEGAHIT (--megahit_fix_cpu_1) and to a user-chosen fixed number for SPAdes (--spades_fix_cpus, --spadeshybrid_fix_cpus). For the sake of runtime efficiency, these parameters are not set by default, so MEGAHIT can still run multi-threaded.

When these parameters are used, the number of cpus is also not increased on retries. This matters both for reproducing previous results and for generating reproducible results in the first place, because otherwise different samples could end up running with different numbers of cpus.
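The retry behaviour described above can be sketched as follows. This is only an illustration in Python, not the Groovy code from nextflow.config: the function name `cpus_for_attempt` and its arguments are hypothetical, and it assumes that, without the fix parameters, the cpu request scales with the retry attempt and is capped at max_cpus.

```python
from typing import Optional


def cpus_for_attempt(base_cpus: int, attempt: int, max_cpus: int,
                     fixed_cpus: Optional[int] = None) -> int:
    """Return the cpu count to request for a given retry attempt.

    Hypothetical helper; the names are not taken from the pipeline.
    """
    if fixed_cpus is not None:
        # Fixed mode: the same cpu count is used for every attempt,
        # so a retried sample runs with the same number of threads.
        return fixed_cpus
    # Default mode (assumption): scale with the attempt, cap at max_cpus.
    return min(base_cpus * attempt, max_cpus)


# --megahit_fix_cpu_1 would correspond to fixed_cpus=1 in this sketch.
print([cpus_for_attempt(8, a, 32, fixed_cpus=1) for a in (1, 2, 3)])  # [1, 1, 1]
print([cpus_for_attempt(8, a, 32) for a in (1, 2, 3)])                # [8, 16, 24]
```

In the default mode a sample that needs a retry would run with more threads than one that succeeds on the first attempt, which is exactly the situation the fix parameters avoid.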

I wrote new functions (defined in nextflow.config) that set the cpus accordingly for those processes; they are called in the base.config file. In theory, a user can still override those settings with an additional custom config file (-c) that changes the cpus of the corresponding processes, and I did not find a way to detect this up front. To make sure that, for example, only the cpus specified with --spades_fix_cpus are actually used, the process script itself checks this: if task.cpus does not match the parameter, an error is raised.
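The in-process guard might look roughly like this. It is a Python sketch of the idea only; the real check lives in the Nextflow process script, and the function name `check_task_cpus` and the error message are assumptions made for illustration.

```python
from typing import Optional


def check_task_cpus(task_cpus: int, fixed_cpus: Optional[int], tool: str) -> None:
    """Raise if the cpus actually assigned to the task differ from the fixed value.

    Hypothetical helper: fixed_cpus=None stands for "parameter not set".
    """
    if fixed_cpus is not None and task_cpus != fixed_cpus:
        raise ValueError(
            f"{tool}: task.cpus is {task_cpus} but {fixed_cpus} cpus were requested; "
            "a custom config may have overridden the process cpus"
        )


check_task_cpus(4, 4, "SPAdes")     # matches the fixed value: passes
check_task_cpus(8, None, "SPAdes")  # parameter not set: nothing to enforce

try:
    check_task_cpus(8, 4, "SPAdes")  # overridden by a custom config: rejected
except ValueError as err:
    print(f"error: {err}")
```

Checking inside the process catches an override regardless of which config file introduced it, since it compares against the value the task actually received.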

In addition, the pipeline checks at startup that the requested number of cpus is available, i.e. that --spades_fix_cpus and --spadeshybrid_fix_cpus are <= max_cpus (instead of allowing the number of cpus to be reduced to max_cpus in check_max()).
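A minimal sketch of this startup validation, again in Python rather than the pipeline's Groovy. The function `validate_fixed_cpus` and the dictionary representation of the parameters are hypothetical, and unset parameters are modelled as `None`.

```python
from typing import Dict, List, Optional


def validate_fixed_cpus(params: Dict[str, Optional[int]], max_cpus: int) -> List[str]:
    """Collect an error for every fixed-cpu parameter that exceeds max_cpus."""
    errors = []
    for name in ("spades_fix_cpus", "spadeshybrid_fix_cpus"):
        value = params.get(name)
        if value is not None and value > max_cpus:
            # Fail early instead of capping the value, which would change
            # the thread count and therefore the results.
            errors.append(f"--{name} ({value}) exceeds max_cpus ({max_cpus})")
    return errors


print(validate_fixed_cpus({"spades_fix_cpus": 10, "spadeshybrid_fix_cpus": None}, 16))
# []
print(validate_fixed_cpus({"spades_fix_cpus": 32, "spadeshybrid_fix_cpus": 8}, 16))
# ['--spades_fix_cpus (32) exceeds max_cpus (16)']
```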

The parameters are added to the summary.

Any feedback welcome :)

PR checklist

This comment contains a description of changes (with reason)
If you've fixed a bug or added code that should be tested, add tests!
If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

Learn more about contributing: https://github.com/nf-core/mag/tree/master/.github/CONTRIBUTING.md

docs/usage.md

main.nf

Add MetaBAT2 RNG seed parameter

9b2c39d

skrakau requested a review from d4straub July 10, 2020 14:00