Add domain-level classification for bins #395
Add summary, fix classifier bug
@nf-core-bot fix linting
Created a PR for test data with eukaryotic reads to the mag branch of the test-datasets repo: nf-core/test-datasets#782
The binrefinement test seems to be failing because it's running out of space:
I don't think the outputs of the test should have changed based on the changes in the PR - anyone have any thoughts on what's up?
As far as I know, there is only a limited amount of storage space available for GitHub tests, so maybe that has now been exceeded. I just restarted the failed test; maybe it was just a temporary hiccup.
No luck! Looking at the binrefinement config, I notice that the busco_clean parameter isn't set. A BUSCO run can take up quite a lot of space, and with CONCOCT adding ~129 bins (many of them single contig), this might mean the test is using a lot more storage than is necessary, resulting in failure. It might be worth enabling it within the test config? (Though it doesn’t fix the problem if this was passing before…)
Found the problem, now that I have some time (and Gitpod credits) to check. There's an error in the BUSCO module, where the check for the busco_clean parameter merely checks whether the variable exists, rather than checking whether it is equal to "Y": https://github.com/nf-core/mag/blob/master/modules/local/busco.nf
I remember spotting the bug when making this PR, and in this branch it is fixed: https://github.com/prototaxites/mag/blob/euk_classify/modules/local/busco.nf
I'll put a wee PR in to fix the BUSCO module, and add the busco_clean parameter to the test_binrefinement config. With the large number of bins in this test following the addition of CONCOCT, the test should be failing (but isn't, because the temporary BUSCO files are being deleted by accident).
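To make the difference between the two checks concrete, here is a minimal sketch in Python. It is only an analogy for the logic described above, not the code in busco.nf; the function names are hypothetical, and "N" is an assumed value for the "do not clean" case (the thread only confirms "Y").

```python
def should_clean_buggy(busco_clean: str) -> bool:
    # Mirrors a check that only asks whether the value is set:
    # any non-empty string passes, including the assumed "N".
    return bool(busco_clean)


def should_clean_fixed(busco_clean: str) -> bool:
    # Cleans up only when the flag is explicitly "Y".
    return busco_clean == "Y"


if __name__ == "__main__":
    for flag in ("Y", "N"):
        print(flag, should_clean_buggy(flag), should_clean_fixed(flag))
```

Under these assumptions the existence-style check triggers cleanup for both values, which would explain temporary files being removed even when cleaning was never requested.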
It works! 😅 Reviews welcome 😊
Nice work!
The PR makes quite a few changes, including in files I am not particularly familiar with (the remodeled binning), so I'd appreciate another opinion, but it seems fine to me.
Co-authored-by: Daniel Straub <42973691+d4straub@users.noreply.github.com>
Does anyone else have any comments? I'd really like to work on adding a eukaryotic annotation step to the pipeline, but ideally this would build on top of this PR. Otherwise, I could submit a separate PR which adds this to run on all bins regardless of domain, if that is preferable?
Overall really nice, and some good optimisations - thanks @prototaxites !
Mostly minor changes and questions:
- As you're adding new functionality, you need to update the workflow diagram SVG!
- Do you need to add ${meta.domains} to more of the prefixes in the post-domain classification steps in modules.conf, for cases where it's turned on (basically for any downstream step that consumes ch_input_for_postbinning_bins_* on input)? It may not be necessary, but I want to double check.
Thanks for the comments all! I'll work on the pipeline diagram and probably shoot a draft in Slack before committing.
I don't think this is necessary, as each bin can only have one classification, and goes into each process once - better to just filter on it internally IMO. When it's turned off, it should be set to 'unclassified' internally, which would also be ugly to print out.
Also, a heads up that the new test_domain_classification.config test depends on the following PR in the test datasets repo: (should I enable the test in CI here: https://github.com/nf-core/mag/blob/master/.github/workflows/ci.yml ?)
I was thinking about this again and realised there was the potential for input filename collisions here. The cause was the assumption that 'unknown' bins could be either eukaryotic or prokaryotic, so these bins were used in both 'halves' of the post-processing. In the specific case of using DAS Tool and keeping all bins (pre- and post-refinement), those unknown bins would collide. I've fixed this by removing this assumption and sending unknowns only down the 'prokaryote' path. Each bin should now only be represented once in each post-binning step. Do you have any other suggestions, @jfy133? Merging is currently blocked pending requested changes!
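The routing change described above can be sketched roughly as follows. This is a Python illustration of the idea only, not the pipeline's Nextflow code; the bin names, the domain labels ("prokarya", "eukarya", "unknown") and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical bins as (name, domain) pairs.
BINS = [("bin1", "prokarya"), ("bin2", "eukarya"), ("bin3", "unknown")]


def route_unknown_to_both(bins):
    """Earlier assumption: unknown bins enter both halves of post-processing."""
    prok = [name for name, domain in bins if domain in ("prokarya", "unknown")]
    euk = [name for name, domain in bins if domain in ("eukarya", "unknown")]
    return prok, euk


def route_unknown_to_prok(bins):
    """Fixed routing: anything not classified as eukaryotic goes down the prokaryote path."""
    prok = [name for name, domain in bins if domain != "eukarya"]
    euk = [name for name, domain in bins if domain == "eukarya"]
    return prok, euk


def duplicated(prok, euk):
    """Names that would appear more than once if both halves were collected together."""
    counts = Counter(prok + euk)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(duplicated(*route_unknown_to_both(BINS)))  # ['bin3']
    print(duplicated(*route_unknown_to_prok(BINS)))  # []
```

With the second routing, every bin lands in exactly one half, so a later step that gathers bins from both halves sees each filename once.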
As far as I can tell, LGTM!
…all number of test configs
edit: ready to go now!
This one is still a work-in-progress, but I'm putting this up now as it passes -profile test, with all bins from the minigut being classified as prokarya, as expected. If anyone has a dataset which produces eukaryotic MAGs, I'd be keen to hear if it works for you.
Aside from adding the domain classification subworkflow, I've split off the coverage/MAG_DEPTHS processes to a new subworkflow, which allows for a set of final bins to be determined (following refinement, classification, etc.) before calculating coverage.
PR checklist
- nf-core lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).