Added parameters to ensure reproducible results from tools #65

Merged (3 commits) on Jul 13, 2020

skrakau (Member) commented Jul 10, 2020

Until now, the tools MEGAHIT, SPAdes and MetaBAT2 could not produce reproducible results within this pipeline. For MetaBAT2 this is easily fixed by using its seed parameter: I set it to 1 by default and exposed it as the pipeline parameter --metabat_rng_seed.

Unfortunately, MEGAHIT and SPAdes have no seed option. MEGAHIT can only ensure reproducible results when run single-threaded (voutcn/megahit#170). SPAdes, on the other hand, was designed to generate reproducible results for a given number of threads (ablab/spades#111).

For this reason I added parameters to fix the number of CPUs for those processes: to 1 for MEGAHIT (--megahit_fix_cpu_1) and to a fixed number for SPAdes (--spades_fix_cpus, --spadeshybrid_fix_cpus). For the sake of runtime efficiency, these parameters are not set by default, so that MEGAHIT can still run multi-threaded.

When these parameters are used, the number of CPUs is also not increased on retries. This matters both for reproducing previous results and for generating reproducible results in the first place, because otherwise different samples could end up running with different numbers of CPUs.
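The retry behaviour described above can be sketched as follows. This is a language-neutral illustration in Python, not the pipeline's Nextflow code: the function name `resolve_cpus`, its arguments, and the "base CPUs times attempt number, capped at a maximum" rule for the unpinned case are all assumptions made for the example.

```python
def resolve_cpus(base_cpus, attempt, max_cpus, fixed_cpus=None):
    """Return the CPU count for one task attempt (illustrative sketch only)."""
    if fixed_cpus is not None:
        # Fixed mode: the same value on every attempt, so retries cannot change it.
        return fixed_cpus
    # Default mode (assumed rule): scale with the attempt number, capped at max_cpus.
    return min(base_cpus * attempt, max_cpus)


# A task pinned to a single CPU keeps that count across all attempts.
print([resolve_cpus(8, attempt, 32, fixed_cpus=1) for attempt in (1, 2, 3)])
# Without pinning, the count may differ between a first run and a retry.
print([resolve_cpus(8, attempt, 32) for attempt in (1, 2, 3)])
```

The point of the sketch is that in fixed mode the attempt number never enters the result, so a sample that needed a retry runs with the same thread count as one that did not.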

I wrote new functions (defined in nextflow.config) that set the CPUs accordingly for those processes; they are called in the base.config file. In theory, a user can still override these settings with an additional custom config file (-c) that changes the CPUs of the corresponding processes, and I did not find a way to check for this. To make sure that, for example, exactly the number of CPUs specified with --spades_fix_cpus is used, this is checked in the actual process script: if task.cpus does not match the parameter, an error is returned.
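The in-process guard could look roughly like the following. Again this is a Python sketch of the logic rather than the actual process script; `check_task_cpus` and its arguments are hypothetical names, and the error wording is invented for illustration.

```python
def check_task_cpus(task_cpus, fixed_cpus, flag_name):
    """Raise if a fixed CPU count was requested but the task received another one."""
    if fixed_cpus is None:
        # Flag not set: any CPU count is acceptable.
        return
    if task_cpus != fixed_cpus:
        # Mismatch, e.g. because a custom config overrode the process CPUs.
        raise ValueError(
            f"{flag_name} is set to {fixed_cpus}, "
            f"but the task was given {task_cpus} CPUs"
        )


# Passes silently: the task runs with exactly the requested number of CPUs.
check_task_cpus(task_cpus=4, fixed_cpus=4, flag_name="--spades_fix_cpus")
```

Checking inside the task is a last line of defence: it runs after all config files have been merged, so it also catches overrides that cannot be detected when the parameters are parsed.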

In addition, the pipeline checks at startup whether the requested number of CPUs is available, i.e. whether the values of --spades_fix_cpus and --spadeshybrid_fix_cpus are <= max_cpus (instead of allowing the number of CPUs to be reduced to max_cpus in check_max()).
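A minimal sketch of this startup check, written in Python for illustration only: `validate_fixed_cpus` and the dictionary layout are assumptions, and only the flag names and `max_cpus` come from the description above.

```python
def validate_fixed_cpus(requested, max_cpus):
    """Fail early if any fixed CPU request exceeds max_cpus (no silent clamping)."""
    for flag_name, value in requested.items():
        if value is None:
            continue  # flag not set, nothing to validate
        if value > max_cpus:
            raise ValueError(
                f"{flag_name}={value} exceeds max_cpus={max_cpus}"
            )


# Both requests fit within the limit, so validation passes.
validate_fixed_cpus(
    {"--spades_fix_cpus": 8, "--spadeshybrid_fix_cpus": None},
    max_cpus=16,
)
```

Raising an error instead of reducing the value keeps the run honest: a silently lowered thread count would produce results that no longer correspond to the parameter the user recorded.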

The parameters are added to the summary.

Any feedback welcome :)

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Learn more about contributing: https://github.com/nf-core/mag/tree/master/.github/CONTRIBUTING.md

skrakau requested a review from d4straub on July 10, 2020 14:00
skrakau merged commit 41cdd2a into nf-core:dev on Jul 13, 2020
skrakau deleted the handle_metabat_seed branch on November 4, 2020 10:30