Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor checkm2/databasedownload using aria2 #6654

Merged
merged 11 commits into from
Sep 25, 2024

Conversation

dialvarezs
Copy link
Contributor

@dialvarezs dialvarezs commented Sep 17, 2024

  • Use aria2 to download CheckM2 database
  • Update snapshots for both DATABASEDOWNLOAD and PREDICT

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@dialvarezs dialvarezs requested a review from a team as a code owner September 17, 2024 02:30
@dialvarezs dialvarezs requested review from SPPearce and removed request for a team September 17, 2024 02:30
Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I've not seen this before, I will admit I'm not really happy that this is an official module as it's not actually a subcommand of the tool (despite the comment)... but I guess it's already here.

The changes generally look OK though (and clever of the remote md5 extraction1) - however have you considered having the zenodo_id as an input channel rather than hardcoding this? This would allow user choice of which version of the datbaase to download

@dialvarezs
Copy link
Contributor Author

Thanks for the review James.

I completely agree that the ideal approach would be to use the tool command to download the database. However, this is a design limitation, even the developer recommends downloading the database externally. That’s why I found it cleaner to implement a module that handles the download automatically.

I’ve added the zenodo_id as an input parameter, with a default value if none is provided.
While doing so, I considered whether using ext might be a better option. There are currently only two database versions, and users typically want to use the latest one. In a pipeline, this input would require an explicit parameter, but considering that a parameter to specify an already downloaded database could also be necessary for offline usage, we’d end up with two parameters serving nearly the same purpose. With ext this option will remain as an advanced configuration, and it will not require a param. What do you think about this?

@dialvarezs dialvarezs changed the title Refactor CHECKM2_DATABASEDOWNLOAD using aria2 Refactor checkm2/databasedownload using aria2 Sep 18, 2024
@jfy133
Copy link
Member

jfy133 commented Sep 18, 2024

Thanks for the review James.

I completely agree that the ideal approach would be to use the tool command to download the database. However, this is a design limitation, even the developer recommends downloading the database externally. That’s why I found it cleaner to implement a module that handles the download automatically.

Fair enough - in this case I would've just said use a aria2c module and supply a URL - but like I said, too late now!

I’ve added the zenodo_id as an input parameter, with a default value if none is provided. While doing so, I considered whether using ext might be a better option. There are currently only two database versions, and users typically want to use the latest one. In a pipeline, this input would require an explicit parameter, but considering that a parameter to specify an already downloaded database could also be necessary for offline usage, we’d end up with two parameters serving nearly the same purpose. With ext this option will remain as an advanced configuration, and it will not require a param. What do you think about this?

Mm, I might be getting confused with some of your terminology - you've added an input channel not a parameter.

Can you give examples of what you mean by 'the two parameters serving the same purpose'?

Do you mean e.g.:

--checkm2_db_version 123456
--checkm2_db_localpath /path/to/localcopy.tar.gz

or something? If so I think based on your implementation, this these two are not necessarily required.

A pipeline developer could chose to force a user to download the latest version by hardcoding the latest zenodo ID in the input channel of the module. But wrap that in an if statement to say if a path is given to --checkm2_db_localpath then don't donwload it.

I don't really follow what you mean with the ext example, could you provide some pseudo code...

@dialvarezs
Copy link
Contributor Author

dialvarezs commented Sep 18, 2024

Sorry for the lack of clarity.

Do you mean e.g.:

--checkm2_db_version 123456
--checkm2_db_localpath /path/to/localcopy.tar.gz

Yes, this is what I mean. And I know both are not required, but in this case, it's not exactly the database version, it's the Zenodo ID where the database is stored. And because of that, I'm wondering if the best idea to leave it as a param, because it could be a little confusing. I'm thinking in mag, because I plan to implement the migration from CheckM to CheckM2.

So, I was thinking on using ext like this, and removing the input:

zenodo_id = ext.zenodo_id ?: 5571251

This way, if the user want to change the database version, it would be possible as an advanced option via config

process {
  withName: CHECKM2_DATABASEDOWNLOAD {
    ext.zenodo_id = 54321 
  }
}

@dialvarezs
Copy link
Contributor Author

Another option could be implementing a function to get the zenodo id from the version, like this one:

def get_zenodo_id(db_version) {
    zenodo_ids = [
        '2'     : 5571251,
        '1.1'   : 4671167,
        '1'     : 4626519
    ]
    if (!zenodo_ids.containsKey(db_version)) {
        error("Error: Invalid database version '${db_version}'")
    }
    return zenodo_ids[db_version]
}

This way the input value could be the database version itself, and it would make more sense to me to have the param checkm2_db_version.

@jfy133
Copy link
Member

jfy133 commented Sep 19, 2024

OK, I think I'm following a little more:

  1. I dunno, I equate the Zenodo ID as an alternative ID for the database version, so it doesn't really matter
  2. I think the function while clever is unnecessary overkill :) In most cases a user won't touch this and just rely on the default (and someone who wants to change can put in the effort to understand that ;) )
  3. We don't like custom ext. in nf-core guidelines because the information can stored in different object names, in different pipelines, and documentation is not supported for it so becomes a mess etc. etc.

Overall, I don't think the two parameter thing is that much of an issue to be honest, this is very common in many cases already. I think your default Zenodo ID implementation is fine plus an override with the input channel 👍 (and then it's up to the pipeline develop to allow users to customise this or not).

Once the tests are passing I think I can give this a ✔️

@jfy133
Copy link
Member

jfy133 commented Sep 19, 2024

(oh and very excitied to see this in MAG!)

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh poop I realised I didn't finish my previous review properly (pending comments): I noticed that while there is a stub block, there is no stub test.

Please add that and then you can merge!

modules/nf-core/checkm2/databasedownload/meta.yml Outdated Show resolved Hide resolved
@dialvarezs dialvarezs added this pull request to the merge queue Sep 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 25, 2024
@dialvarezs dialvarezs added this pull request to the merge queue Sep 25, 2024
Merged via the queue into nf-core:master with commit e176526 Sep 25, 2024
16 checks passed
@dialvarezs dialvarezs deleted the dev-checkm2 branch September 25, 2024 19:30
herpov pushed a commit to herpov/modules that referenced this pull request Oct 2, 2024
* Use aria2 for CHECKM2_DATABASEDOWNLOAD, update snaps

* Fix linting

* Add bioconda to environment.yml

* Add zenodo_id as input to select which version to download

* Add input param to the test

* Add process input for predict

* Fix databasedownload meta

* Improve field description

---------

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants