Added parameters to ensure reproducible results from tools #65
Until now it was not possible to generate reproducible results with the tools MEGAHIT, SPAdes and MetaBAT2 within this pipeline. For MetaBAT2 this is easily fixed by using its seed parameter: I set it to 1 by default and exposed it as the pipeline parameter `--metabat_rng_seed`. For MEGAHIT and SPAdes there is unfortunately no seed option. MEGAHIT can only ensure reproducible results when run single-threaded (voutcn/megahit#170). SPAdes, on the other hand, was designed to generate reproducible results for a given number of threads (ablab/spades#111).
For this reason I added parameters to fix the number of CPUs for those processes: for MEGAHIT to 1 (`--megahit_fix_cpu_1`) and for SPAdes to a fixed number (`--spades_fix_cpus`, `--spadeshybrid_fix_cpus`). For the sake of runtime efficiency, these parameters are not set by default, which allows MEGAHIT to run multi-threaded. When these parameters are used, the number of CPUs is also not increased for retries. This matters both for reproducing previous results and for generating reproducible results in the first place, because otherwise different samples could be run with different numbers of CPUs.
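
To illustrate the intended behaviour, here is a minimal Python sketch of the CPU selection. It is not the Groovy code from this PR: the function name, its arguments and the `base_cpus * attempt` scaling rule are assumptions made for the example.

```python
from typing import Optional


def resolve_cpus(base_cpus: int, attempt: int, max_cpus: int,
                 fixed_cpus: Optional[int] = None) -> int:
    """Return the CPU count for one task attempt (hypothetical helper).

    fixed_cpus models a value given via --spades_fix_cpus or
    --spadeshybrid_fix_cpus, or 1 when --megahit_fix_cpu_1 is set.
    """
    if fixed_cpus is not None:
        # Fixed request: identical on every attempt, never scaled up on retry.
        return fixed_cpus
    # Assumed default: grow with the retry attempt, capped at max_cpus.
    return min(base_cpus * attempt, max_cpus)
```

With `fixed_cpus` set, every attempt and every sample gets the same thread count, which is the property that MEGAHIT (one thread) and SPAdes (a constant thread count) need for reproducible output.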
I wrote new functions (defined in `nextflow.config`) that set the CPUs accordingly for those processes; they are called in the `base.config` file. In theory, a user can override these settings with an additional custom config file (`-c`) that changes the CPUs of the corresponding processes (I did not find a way to check for this). To make sure that exactly the number of CPUs specified with, for example, `--spades_fix_cpus` is used, this is checked in the actual process script: if `task.cpus` does not match the parameter, an error is returned. In addition, the pipeline checks at startup whether the requested number of CPUs is available, i.e. whether `--spades_fix_cpus`, `--spadeshybrid_fix_cpus` <= `max_cpus` (instead of allowing the number of CPUs to be reduced to `max_cpus` in `check_max()`). The parameters are added to the summary.
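
The two checks can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and error messages are hypothetical and do not mirror the actual Groovy code or process scripts.

```python
from typing import Optional


def check_fixed_cpus_available(fixed_cpus: Optional[int], max_cpus: int,
                               flag: str) -> None:
    """Startup check: fail instead of silently reducing to max_cpus."""
    if fixed_cpus is not None and fixed_cpus > max_cpus:
        raise ValueError(
            f"{flag}={fixed_cpus} exceeds max_cpus={max_cpus}")


def check_task_cpus(task_cpus: int, fixed_cpus: Optional[int],
                    flag: str) -> None:
    """Process-level check: catches a custom config that changed the CPUs."""
    if fixed_cpus is not None and task_cpus != fixed_cpus:
        raise ValueError(
            f"task.cpus={task_cpus} does not match {flag}={fixed_cpus}")


# Passes silently: the requested 8 CPUs are available and actually used.
check_fixed_cpus_available(8, max_cpus=16, flag="--spades_fix_cpus")
check_task_cpus(8, fixed_cpus=8, flag="--spades_fix_cpus")
```

Failing loudly in both places means a run never proceeds with a thread count other than the requested one.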
Any feedback welcome :)
PR checklist
- `nextflow run . -profile test,docker`
- `nf-core lint .`
- `docs` is updated
- `CHANGELOG.md` is updated
- `README.md` is updated

Learn more about contributing: https://github.com/nf-core/mag/tree/master/.github/CONTRIBUTING.md