Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add domain-level classification for bins #395

Merged
merged 36 commits into from
Jun 9, 2023
Merged

Add domain-level classification for bins #395

merged 36 commits into from
Jun 9, 2023

Conversation

prototaxites
Copy link
Contributor

@prototaxites prototaxites commented Feb 23, 2023

edit: ready to go now!

This one is still a work-in-progress, but I'm putting this up now as it passes -profile test, with all bins from the minigut being classified as prokarya, as expected. If anyone has a dataset which produces eukaryotic MAGs, I'd be keen to hear if it works for you.

Aside from adding the domain classification subworkflow, I've split off the coverage/MAG_DEPTHS processes to a new subworkflow, which allows for a set of final bins to be determined (following refinement, classification, etc.) before calculating coverage.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Feb 24, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 7172909

+| ✅ 158 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.8
  • Run at 2023-06-09 10:14:09

@prototaxites
Copy link
Contributor Author

@nf-core-bot fix linting

@prototaxites
Copy link
Contributor Author

Created a PR for test data with eukaryotic reads to the mag branch of the test-datasets repo: nf-core/test-datasets#782

@prototaxites prototaxites marked this pull request as ready for review March 6, 2023 12:44
conf/test.config Outdated Show resolved Hide resolved
@prototaxites
Copy link
Contributor Author

The binrefinement test seems to be failing because it's running out of space:

  docker: failed to register layer: ApplyLayer exit status 1 stdout:  stderr: write /usr/local/lib/R/library/methods/help/methods.rdb: no space left on device.

I don't think the outputs of the test should have changed based on the changes in the PR - anyone have any thoughts on what's up?

@d4straub
Copy link
Collaborator

As far as I know there is only a specific amount of storage space available on github tests, maybe that was now exceeded. I just restarted the failed test, maybe it was just a temporal hiccup.

@prototaxites
Copy link
Contributor Author

prototaxites commented Mar 10, 2023

No luck! Looking at the binrefinement config, I notice that the busco_clean parameter isn't set. A BUSCO run can take up quite a lot of space, and with CONCOCT adding ~129 bins (many of them single contig), this might be mean the test is using a lot more storage than is necessary, resulting in failure.

It might be worth enabling it within the test config? (Though it doesn’t fix the problem if this was passing before…)

@prototaxites
Copy link
Contributor Author

Found the problem, now that I have some time (and Gitpod credits) to check.

There's an error in the BUSCO module, where the check for the busco_clean parameter merely checks if the variable exists, rather than checking if it is equal to "Y":

https://github.com/nf-core/mag/blob/master/modules/local/busco.nf:

    # if needed delete temporary BUSCO files
    if [ ${busco_clean} ]; then
        find . -depth -type d -name "augustus_config" -execdir rm -rf "{}" \\;
        find . -depth -type d -name "auto_lineage" -execdir rm -rf "{}" \\;
        find . -depth -type d -name "run_*" -execdir rm -rf "{}" +
    fi

I remember spotting the bug when making this PR, and in this branch, it is fixed:

https://github.com/prototaxites/mag/blob/euk_classify/modules/local/busco.nf

    # if needed delete temporary BUSCO files
    if [ ${busco_clean} = "Y" ]; then
        find . -depth -type d -name "augustus_config" -execdir rm -rf "{}" \\;
        find . -depth -type d -name "auto_lineage" -execdir rm -rf "{}" \\;
        find . -depth -type d -name "run_*" -execdir rm -rf "{}" +
    fi

I'll put a wee PR in to fix the BUSCO module, and add the busco_clean parameter to the test_binrefinement config, as the large number of bins in this test following the addition of CONCOCT means that the test should be failing (but isn't) due to the accidental deletion of temporary BUSCO files.

@prototaxites
Copy link
Contributor Author

It works! 😅 Reviews welcome 😊

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
The PR does make quite some changes, also in files I am not particular familiar with (the remodeled binning), so I'd appreciate another opinion, but it seems fine to me.

subworkflows/local/depths.nf Outdated Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
workflows/mag.nf Show resolved Hide resolved
workflows/mag.nf Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
@prototaxites
Copy link
Contributor Author

Does anyone else have any comments? I'd really like to work on adding a eukaryotic annotation step to the pipeline, but ideally this would build on top of this PR. Otherwise, I could submit a separate PR which adds this to run on all bins regardless of domain, if that is preferable?

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall really nice, and some good optimisations - thanks @prototaxites !

Mostly minor changes and questions:

  • As you're adding enw functioanlity, you need to update the workflow diagram SVG!
  • Do you need to add ${meta.domains} to more of the prefixes in the post-domain classification steps in modules.conf, in cases once it's turned on (basically for any downstream step that consumes ch_input_for_postbinning_bins_* on input? (may not be necssary, but I want to double check).

bin/domain_classification.R Show resolved Hide resolved
conf/modules.config Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
docs/output.md Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
subworkflows/local/domain_classification.nf Outdated Show resolved Hide resolved
subworkflows/local/domain_classification.nf Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
nextflow.config Show resolved Hide resolved
@prototaxites
Copy link
Contributor Author

Thanks for the comments all! I'll work on the pipeline diagram and probably shoot a draft in Slack before committing.

Do you need to add ${meta.domains} to more of the prefixes in the post-domain classification steps in modules.conf, in cases once it's turned on (basically for any downstream step that consumes ch_input_for_postbinning_bins_* on input? (may not be necssary, but I want to double check).

I don't think this is necessary, as each bin can only have one classification, and goes into each process once - better to just filter on it internally IMO. When it's turned off, it should be set to 'unclassified' internally, which would also be ugly to print out.

@prototaxites
Copy link
Contributor Author

prototaxites commented May 9, 2023

Also, a heads up that the new test_domain_classification.config test depends on the following PR in the test datasets repo:

nf-core/test-datasets#782

(should I enable the test in CI here: https://github.com/nf-core/mag/blob/master/.github/workflows/ci.yml ?)

@jfy133 jfy133 added this to the 2.4.0 milestone May 26, 2023
@prototaxites
Copy link
Contributor Author

Do you need to add ${meta.domains} to more of the prefixes in the post-domain classification steps in modules.conf, in cases once it's turned on (basically for any downstream step that consumes ch_input_for_postbinning_bins_* on input? (may not be necssary, but I want to double check).

I was thinking about this again and I realised there was the potential for input filename collisions due to this, due to the assumption that 'unknown' bins could be either eukaryotic or prokaryotic, and using these bins in both 'halves' of the post-processing - in the specific case of using DAS Tool and keeping all bins (pre- and post-refinement), those unknown bins would collide. I've fixed this by removing this assumption and sending unknowns only down the 'prokaryote' path. Each bin should now only be represented once in each post-binning step.

Do you have any other suggestions, @jfy133? Merging is currently blocked pending requested changes!

@jfy133
Copy link
Member

jfy133 commented Jun 7, 2023

As far as I can tell LGTM !

@prototaxites prototaxites merged commit 7c6a62a into nf-core:dev Jun 9, 2023
@prototaxites prototaxites deleted the euk_classify branch June 9, 2023 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants