
Bin QC Improvements #707

Open · dialvarezs wants to merge 18 commits into dev from dev-checkm2

Conversation

@dialvarezs (Contributor) commented Oct 27, 2024

This PR adds:

  • CheckM2 as an alternative tool for bin QC
  • A new BIN_QC subworkflow integrating CHECKM, CHECKM2, BUSCO and GUNC.

Closes #607.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

- Update modules
- Update integration in mag and with other tools (bin_summary, gtdb-tk)
- Update test
- Update schema
@dialvarezs marked this pull request as draft on October 27, 2024 21:31
@jfy133 (Member) commented Oct 27, 2024

@nf-core-bot fix linting

@jfy133 (Member) commented Oct 27, 2024

Before you continue (sorry this is a bit late):

I generally prefer not to deprecate old versions of tools for a while, but rather to keep them as alternative tools.

In some cases people want to stick with the original version for compatibility with previous runs.

Could you 'revert' (or reinstall) the old CheckM module and wrap it in an if/else statement (but within the subworkflow :) )?

@muabnezor did a similar thing when adding porechop_ABI here: #674
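For illustration, a minimal sketch of the kind of if/else wrapping meant here (the subworkflow name, parameter name, and module inputs/emits are simplified assumptions, not the PR's actual code):

```nextflow
// Hypothetical sketch only -- module I/O is simplified; check the real
// nf-core module signatures before reusing any of this.
include { CHECKM_LINEAGEWF } from '../../modules/nf-core/checkm/lineagewf/main'
include { CHECKM2_PREDICT  } from '../../modules/nf-core/checkm2/predict/main'

workflow BIN_QC {
    take:
    ch_bins      // channel: [ val(meta), path(bins) ]
    ch_checkm_db // channel: path(db)

    main:
    // keep both tools available and let the user pick via a parameter
    if (params.binqc_tool == 'checkm2') {
        CHECKM2_PREDICT ( ch_bins, ch_checkm_db )
        ch_summary = CHECKM2_PREDICT.out.checkm2_tsv
    }
    else {
        CHECKM_LINEAGEWF ( ch_bins, 'fa', ch_checkm_db )
        ch_summary = CHECKM_LINEAGEWF.out.checkm_tsv
    }

    emit:
    summary = ch_summary
}
```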

@dialvarezs (Contributor, Author) commented:

@jfy133 that makes sense, I will revert the CheckM removal. My bad for not asking before 😅.

@dialvarezs changed the title from "feat: Replace from CheckM to CheckM2" to "feat: Add CheckM2" on Oct 27, 2024
@dialvarezs marked this pull request as ready for review on October 28, 2024 10:30
@dialvarezs (Contributor, Author) commented Oct 28, 2024

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well and just make a "bin_qc" subworkflow?

@jfy133 (Member) commented Oct 29, 2024

> Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well and just make a "bin_qc" subworkflow?

Yes that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!!

@dialvarezs force-pushed the dev-checkm2 branch 2 times, most recently from a4f42ef to da52285 on November 1, 2024
@dialvarezs changed the title from "feat: Add CheckM2" to "Bin QC Improvements" on Nov 1, 2024
@dialvarezs (Contributor, Author) commented Nov 1, 2024

It's ready for review now, @jfy133. I’ve added GUNC to the subworkflow to keep mag.nf cleaner.

The only issue I found is that the GUNC_MERGECHECKM outputs are not saved correctly: since the process runs per bin but uses the sample as its ID, the output files overwrite each other. However, once this PR is merged, we'll be able to run GUNC on all bins simultaneously, which should solve the issue.
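To illustrate the overwrite with a generic, hypothetical process (not the actual GUNC_MERGECHECKM module): when a per-bin task names its published output after the sample-level meta.id, every bin of that sample publishes to the same path and the last one wins.

```nextflow
// Hypothetical process illustrating the name collision -- not the real module.
process MERGE_PER_BIN {
    tag "${meta.id}"
    publishDir "${params.outdir}/gunc", mode: 'copy'

    input:
    tuple val(meta), path(bin_table)

    output:
    // one task per *bin*, but the filename only encodes the *sample* id,
    // so all bins of a sample publish to the same file
    tuple val(meta), path("${meta.id}.merged.tsv")

    script:
    """
    cp ${bin_table} ${meta.id}.merged.tsv
    """
}
```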

@jfy133 (Member) commented Nov 7, 2024

Hi @dialvarezs, just FYI: I've been at another conference since yesterday and didn't have a chance at the beginning of this week to look at your PRs, sorry about that!

My response time should speed up from next week :)

@prototaxites (Contributor) commented:

I think this looks good, though I don't have the latitude to run any manual tests of it myself at the moment - so probably wait for someone else to sign off on it!

My only question is whether it might be good to move all the bin QC database preparation steps (e.g. CHECKM2_DATABASEDOWNLOAD, and the bits that initialise the files/channels) inside the BINQC workflow where they are consumed? I think keeping conceptually-related code together will probably help with maintenance down the line, and the main mag.nf workflow is already over-full. Plus, there would be fewer subworkflow inputs and outputs to keep track of. Might be a job for another PR though!

@dialvarezs (Contributor, Author) commented:

@prototaxites Absolutely, that makes sense to me. It would help to simplify the mag workflow a bit. What do you think about this, @jfy133?

@jfy133 (Member) commented Nov 12, 2024

I've pondered that a few times. However, many of these download steps take a very long time, so from a user PoV I think it makes sense to trigger them right at the beginning of the pipeline: by the time assembly and binning are done, the databases are ready to go, rather than getting all the way to binning and then having to wait the same length of time again before the binning QC can start.

In my mind, to clean up the code I would rather have a DB download subworkflow for all DB downloads, to make the code clearer. Maybe from a related-modules PoV it's not as efficient, but functionally they are related.

Any counterarguments?

@dialvarezs
Copy link
Contributor Author

In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are in placed in the code. Or am I missing something?

@prototaxites (Contributor) commented Nov 12, 2024

> In most cases, database downloads don't depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?

Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).
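A minimal, self-contained sketch of that behaviour (hypothetical names throughout): the download process starts as soon as the run launches, even though it lives inside a subworkflow, because its input value is available immediately from a parameter.

```nextflow
// Hypothetical sketch of Nextflow's eager dataflow semantics.
process DOWNLOAD_DB {
    input:
    val url

    output:
    path 'db.tar.gz'

    script:
    """
    wget -O db.tar.gz '${url}'
    """
}

workflow NESTED_PREP {
    take:
    ch_url

    main:
    DOWNLOAD_DB ( ch_url )

    emit:
    db = DOWNLOAD_DB.out
}

workflow {
    // the URL is known at launch, so DOWNLOAD_DB is submitted right away;
    // its depth in the workflow graph does not delay it
    NESTED_PREP ( Channel.value(params.db_url) )
}
```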

Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.
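For reference, a sketch of that pattern moved into the consuming subworkflow (the subworkflow name here is an assumption; params.phix_reference is mag's existing parameter):

```nextflow
// Hypothetical sketch: the database channel is initialised inside the
// subworkflow that consumes it, instead of in the main mag.nf.
workflow SHORTREAD_PREPROCESSING {
    take:
    ch_reads  // channel: [ val(meta), path(reads) ]

    main:
    // set up the PhiX reference right where it is used
    ch_phix = params.phix_reference
        ? Channel.value(file(params.phix_reference, checkIfExists: true))
        : Channel.empty()

    emit:
    reads = ch_reads
    phix  = ch_phix
}
```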

@jfy133 (Member) commented Nov 13, 2024

> In most cases, database downloads don't depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?

> Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).

Hmm, fair. I might be feeling overly defensive due to the huge number of conditions mag has... so maybe this is indeed the case.

> Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.

Fair point.

Ok - if you're feeling up to it @dialvarezs, go ahead and move the relevant database downloads to the BINQC subworkflow(s) :)

Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead.

I added the separate CheckM CI tests originally due to unstable downloads when connecting to the servers in Australia, but as these databases are now on Zenodo this should rarely happen.
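A rough sketch of what that could look like with the nf-core UNTAR module (the wrapper workflow and channel handling are assumptions, not the PR's actual code):

```nextflow
// Hypothetical sketch: unpack a user-supplied CheckM database archive
// inside the pipeline with the nf-core UNTAR module.
include { UNTAR } from '../modules/nf-core/untar/main'

workflow PREPARE_CHECKM_DB {
    main:
    if (params.checkm_db.endsWith('.tar.gz')) {
        UNTAR ( [ [ id: 'checkm_db' ], file(params.checkm_db, checkIfExists: true) ] )
        ch_checkm_db = UNTAR.out.untar.map { meta, db_dir -> db_dir }
    }
    else {
        ch_checkm_db = Channel.value(file(params.checkm_db, checkIfExists: true))
    }

    emit:
    db = ch_checkm_db
}
```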
