Bin QC Improvements #707
base: dev
- Update modules
- Update integration in mag and with other tools (bin_summary, gtdb-tk)
- Update test
- Update schema
@nf-core-bot fix linting
Before you continue (sorry this is a bit late): I generally prefer not to deprecate old versions of tools for a while, but rather keep them as alternative tools. In some cases people want to stick with the original version for compatibility with previous runs. Could you 'revert' (or reinstall) the old checkm module and wrap it in an if/else statement (but within the subworkflow :) )? @muabnezor did a similar thing when adding porechop_ABI here: #674
@jfy133 that makes sense, I will revert the CheckM removal. My bad for not asking first 😅.
c63e084 to b1b6518
Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?
Also, simplify bin_summary regarding bin qc
Yes, that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!!
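For readers following the thread, here is a minimal sketch of what "keep the old tool and wrap it in an if/else inside the subworkflow" could look like. It is not the code from this PR: the parameter name `params.binqc_tool`, its values, and the three `*_STUB` processes are hypothetical stand-ins for the real CHECKM, CHECKM2 and BUSCO modules.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical parameter: name and allowed values are assumptions for this sketch.
params.binqc_tool = 'checkm2'

// Toy stand-ins for the real QC modules; each writes one summary file per input.
process CHECKM_STUB {
    input:
    tuple val(meta), val(bin_name)

    output:
    tuple val(meta), path("*.checkm.tsv"), emit: summary

    script:
    """
    echo "${bin_name}" > ${meta.id}.checkm.tsv
    """
}

process CHECKM2_STUB {
    input:
    tuple val(meta), val(bin_name)

    output:
    tuple val(meta), path("*.checkm2.tsv"), emit: summary

    script:
    """
    echo "${bin_name}" > ${meta.id}.checkm2.tsv
    """
}

process BUSCO_STUB {
    input:
    tuple val(meta), val(bin_name)

    output:
    tuple val(meta), path("*.busco.tsv"), emit: summary

    script:
    """
    echo "${bin_name}" > ${meta.id}.busco.tsv
    """
}

workflow BIN_QC {
    take:
    ch_bins // channel: [ val(meta), val(bin_name) ]

    main:
    ch_summary = Channel.empty()

    // The tool choice lives inside the subworkflow, so the caller only sees one output.
    if (params.binqc_tool == 'checkm') {
        CHECKM_STUB(ch_bins)
        ch_summary = CHECKM_STUB.out.summary
    } else if (params.binqc_tool == 'checkm2') {
        CHECKM2_STUB(ch_bins)
        ch_summary = CHECKM2_STUB.out.summary
    } else {
        BUSCO_STUB(ch_bins)
        ch_summary = BUSCO_STUB.out.summary
    }

    emit:
    summary = ch_summary
}

workflow {
    BIN_QC(Channel.of([[id: 'sampleA'], 'bin.1']))
    BIN_QC.out.summary.view()
}
```

Because every branch assigns the same `ch_summary` channel, downstream steps such as a bin summary would not need to know which tool ran.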
a4f42ef to da52285
4007932 to 0eb167a
It's ready for review now, @jfy133. I’ve added GUNC to the subworkflow to keep mag.nf cleaner. The only issue I found is that the GUNC_MERGECHECKM outputs are not saved correctly: since the process runs per bin but uses the sample as ID, the output files overwrite each other. However, once this PR is merged, we’ll be able to run GUNC with all bins simultaneously, which should solve this issue.
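The overwrite described above can be reproduced with a toy process that names its published file after `meta.id`. This is a sketch under assumptions, not the actual GUNC_MERGECHECKM module: `MERGE_STUB`, the `results/merge` directory and the sample and bin names are all hypothetical. The `map` step shows one possible workaround (making the ID unique per bin), which is different from the fix proposed in the comment (running GUNC on all bins at once).

```nextflow
nextflow.enable.dsl = 2

// Toy stand-in for a per-bin process whose output file is named after meta.id.
process MERGE_STUB {
    publishDir "results/merge", mode: 'copy'

    input:
    tuple val(meta), val(bin_name)

    output:
    tuple val(meta), path("${meta.id}.tsv")

    script:
    """
    echo "${bin_name}" > ${meta.id}.tsv
    """
}

workflow {
    // Two bins from the same sample share the same meta.id.
    ch_bins = Channel.of(
        [[id: 'sampleA'], 'bin.1'],
        [[id: 'sampleA'], 'bin.2']
    )

    // Passing ch_bins directly would publish results/merge/sampleA.tsv twice,
    // so whichever task finishes last silently replaces the other file.

    // Possible workaround (an assumption, not the PR's fix): fold the bin name into the ID.
    ch_unique = ch_bins.map { meta, bin_name ->
        [meta + [id: "${meta.id}-${bin_name}"], bin_name]
    }

    MERGE_STUB(ch_unique)
}
```

With the unique IDs, the two tasks publish `sampleA-bin.1.tsv` and `sampleA-bin.2.tsv` instead of competing for a single file name.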
Hi @dialvarezs, just FYI: I've been at another conference since yesterday and didn't have a chance at the beginning of this week to look at your PRs, sorry about that! My response time should speed up from next week :)
I think this looks good, though I don't have the latitude to run any manual tests of it myself at the moment - so probably wait for someone else to sign off on it! My only question is whether it might be good to move all the binqc database preparation steps (e.g. CHECKM2_DATABASEDOWNLOAD, and the bits that initialise the files/channels) inside the BINQC workflow where they are consumed? I think keeping conceptually-related code together will probably help with maintenance down the line, and the main mag.nf workflow is already over-full. Plus, there are fewer subworkflow inputs and outputs to keep track of. Might be a job for another PR though!
@prototaxites Absolutely, that makes sense to me. That would help to simplify the mag workflow a bit. What do you think about this, @jfy133?
I've pondered that a few times; however, many of these download steps take a very long time, and thus from a user PoV I think it makes sense to have them triggered right at the beginning of the pipeline, so that by the time assembly and binning are done the databases are ready to go, rather than getting all the way to binning and then having to wait the same length of time again before you can start the binning QC. To clear up the code, I would rather have a DB download subworkflow for all DB downloads. Maybe from a related-module PoV it's not as efficient, but functionally they are related. Any counter arguments?
In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?
Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...). Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like |
Hmm, fair. I might be feeling overly defensive due to the huge number of conditions mag has... so maybe this is indeed the case.
Fair point. OK - if you're feeling up to it @dialvarezs, go ahead and move the relevant database downloads to the BINQC subworkflow(s) :) Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead. I added the separate CheckM CI tests originally due to unstable downloads when connecting to the servers in Australia, but as these databases are now on Zenodo this should rarely happen.
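As a rough illustration of the agreed direction, the sketch below keeps database preparation inside the QC subworkflow. It is a minimal example under stated assumptions, not the PR's implementation: `params.checkm2_db`, `DB_DOWNLOAD_STUB` and `QC_STUB` are hypothetical names, and the "download" just creates a placeholder directory. The download process has no inputs, so, in line with the discussion above, it does not wait on any upstream step even though it sits inside the subworkflow.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical parameter: path to a pre-downloaded database; null means "download it".
params.checkm2_db = null

// Toy stand-in for a database download module. It declares no inputs,
// so it has nothing to wait for and can be scheduled when the run starts.
process DB_DOWNLOAD_STUB {
    output:
    path "db", emit: database

    script:
    """
    mkdir db
    echo "placeholder" > db/README
    """
}

// Toy stand-in for a QC tool that needs the database for every bin.
process QC_STUB {
    input:
    tuple val(meta), val(bin_name)
    path db

    output:
    tuple val(meta), path("${meta.id}.qc.tsv"), emit: summary

    script:
    """
    echo "${bin_name} checked against ${db}" > ${meta.id}.qc.tsv
    """
}

workflow BIN_QC {
    take:
    ch_bins // channel: [ val(meta), val(bin_name) ]

    main:
    // Database handling lives next to the process that consumes it.
    if (params.checkm2_db) {
        ch_db = Channel.value(file(params.checkm2_db, checkIfExists: true))
    } else {
        DB_DOWNLOAD_STUB()
        ch_db = DB_DOWNLOAD_STUB.out.database
    }

    QC_STUB(ch_bins, ch_db)

    emit:
    summary = QC_STUB.out.summary
}

workflow {
    BIN_QC(Channel.of([[id: 'sampleA'], 'bin.1']))
}
```

With this layout the main workflow passes only the bins to `BIN_QC`, which is the reduction in subworkflow inputs and outputs mentioned earlier in the thread.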
@jfy133 I'll get back to this on Friday or Saturday.
I think this is a step in the right direction to improve the modularity of mag, so I will.
Got it, I will remove the checkm2 ci test I added. Should I remove the checkm ci test as well?
Yes you probably can! Thank you!
This PR adds:
- BIN_QC subworkflow, integrating CHECKM, CHECKM2, BUSCO and GUNC.

Closes #607.
PR checklist
- Make sure your code lints (nf-core pipelines lint).
- Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
- Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).