Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DAS_Tool binning refinement #291

Merged
merged 66 commits into from
Jun 1, 2022
Merged

Add DAS_Tool binning refinement #291

merged 66 commits into from
Jun 1, 2022

Conversation

jfy133
Copy link
Member

@jfy133 jfy133 commented Apr 2, 2022

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@jfy133 jfy133 added the WIP Work in progress label Apr 2, 2022
@github-actions
Copy link

github-actions bot commented Apr 7, 2022

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit a433d97

+| ✅ 147 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

  • readme - README did not have a Nextflow minimum version badge.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.4.1
  • Run at 2022-06-01 13:55:46

@jfy133 jfy133 removed the WIP Work in progress label Apr 7, 2022
@jfy133 jfy133 requested review from skrakau and d4straub April 7, 2022 12:43
Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but do I understand correct that the refined bins will not be used in downstream analysis (Prokka, QUAST, BUSCO, GTDB-tk, CAT) but the non-refined bins? I would hope that when I choose to run DAS_Tool, that those refined bins are used for downstream analysis. Or will too many bins get lost? Is there a reason not to use the refined bins?

conf/modules.config Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
subworkflows/local/binning_refinement.nf Outdated Show resolved Hide resolved
subworkflows/local/binning_refinement.nf Outdated Show resolved Hide resolved
subworkflows/local/binning_refinement.nf Outdated Show resolved Hide resolved
subworkflows/local/binning_refinement.nf Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
workflows/mag.nf Outdated Show resolved Hide resolved
@skrakau
Copy link
Member

skrakau commented Apr 14, 2022

just realised, it's currently not handled when --skip_metabat2 is specified, e.g. in subworkflows/local/binning.nf#L110
->Access to 'METABAT2_METABAT2.out' is undefined since process doesn't declare any output

@jfy133
Copy link
Member Author

jfy133 commented Apr 14, 2022

just realised, it's currently not handled when --skip_metabat2 is specified, e.g. in subworkflows/local/binning.nf#L110 ->Access to 'METABAT2_METABAT2.out' is undefined since process doesn't declare any output

Good catch! Updating

lib/WorkflowMag.groovy Outdated Show resolved Hide resolved
@jfy133 jfy133 requested review from skrakau and d4straub May 4, 2022 13:14
Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me but I have no time at the moment to run it and have a look at output file location/content.

conf/base.config Outdated Show resolved Hide resolved
conf/base.config Outdated Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
docs/usage.md Outdated Show resolved Hide resolved
@skrakau
Copy link
Member

skrakau commented May 9, 2022

Hi @jfy133 , why did you revert merging dev into this branch? Like this it seems that .set {} is used again at multiple places. Could you rebase this? Or am I missing something?

@jfy133
Copy link
Member Author

jfy133 commented May 9, 2022

Not sure, I guess it broke something and I forgot to merge it back in 🤦‍♀. Will fix it after this meeting! Sorry!

docs/output.md Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
docs/usage.md Outdated Show resolved Hide resolved
@skrakau
Copy link
Member

skrakau commented May 23, 2022

Hi @jfy133 and @d4straub, I continued work on this PR, it would be great if you could have a look again and let me know what you think.

Most important changes:

  • So far the bins used as input for DAS Tool were published in DASTOOL/bins (specified in modules.config for RENAME_PREDASTOOL ), but not the by DAS Tool refined bins. I changed this, publishing now the DAS Tool results processed in RENAME_POSTDASTOOL.
  • the MaxBin2 bins extension specified for DASTOOL_FASTATOCONTIG2BIN_MAXBIN2 was wrong, since this is changed in RENAME_PREDASTOOL already to fa. Thus ``DASTOOL_FASTATOCONTIG2BIN_MAXBIN2` output files were empty.
  • ch_dastool_bins_newmeta contained also "unbinned.fa" files -> separated this
  • since the refined bin filenames contain the information about the original binner, e.g. "-MaxBin2Refined-", there was a problem for downstream processes which group things by binner info, causing for example separate depth or BUSCO plots for "MaxBin2Refined" and "MetaBat2", instead of creating one plot for refined bin set of DASTool. I changed meta.binner to DASTool (and adjusted everything, so that the process MAG_DEPTHS can run for each meta.binner separetely)

I hope there wasn't a reason that I forgot and didn't see, e.g. causing issues for downstream processes, why DASTOOL was not used for meta.binner consistently yet.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So RENAME_PREDASTOOL is not in the modules.config? Seems fine to me, just want to confirm that this is intended.
Looks good to me, but I did not run the code and I lost track of all the changes in this PR, meaning I might have missed something.

CHANGELOG.md Outdated Show resolved Hide resolved
conf/modules.config Show resolved Hide resolved
conf/modules.config Show resolved Hide resolved
Co-authored-by: Daniel Straub <42973691+d4straub@users.noreply.github.com>
@skrakau
Copy link
Member

skrakau commented May 25, 2022

Thanks @d4straub for reviewing! Yes, no files from RENAME_PREDASTOOL need to be published, at least from my understanding

docs/usage.md Outdated Show resolved Hide resolved
@jfy133
Copy link
Member Author

jfy133 commented May 31, 2022

Some comments from me:

For the workflow diagram, is QUAST meant to be abdunance estimation and visualisation? I would flip it if so. But I would assume QUAST is normally for Evaluation, however then it's not clear what the Estimation/Visualisation is referring to? (I have ideas to make more space if you need).

  • So far the bins used as input for DAS Tool were published in DASTOOL/bins (specified in modules.config for RENAME_PREDASTOOL ), but not the by DAS Tool refined bins. I changed this, publishing now the DAS Tool results processed in RENAME_POSTDASTOOL.

This is OK to me, the only real problem there is you would have a file name mismatch between the DASTool/bins/*.fa and whatever is recorded in the DAS_Tool log/eval/tsv files... but maybe this is not an issue (can be inferred by the user?). Might be worth a note in the docs if you think this is reasonable.

  • the MaxBin2 bins extension specified for DASTOOL_FASTATOCONTIG2BIN_MAXBIN2 was wrong, since this is changed in RENAME_PREDASTOOL already to fa. Thus ``DASTOOL_FASTATOCONTIG2BIN_MAXBIN2` output files were empty.

Whoops, good catch!

  • ch_dastool_bins_newmeta contained also "unbinned.fa" files -> separated this

I originally kept that in as IIRC the unbinned files/output files from SPLIT_FASTA were also included in the MAG_DEPTHS plots? Do you not want that here for the unrefined ones?

  • since the refined bin filenames contain the information about the original binner, e.g. "-MaxBin2Refined-", there was a problem for downstream processes which group things by binner info, causing for example separate depth or BUSCO plots for "MaxBin2Refined" and "MetaBat2", instead of creating one plot for refined bin set of DASTool. I changed meta.binner to DASTool (and adjusted everything, so that the process MAG_DEPTHS can run for each meta.binner separetely)
    I hope there wasn't a reason that I forgot and didn't see, e.g. causing issues for downstream processes, why DASTOOL was not used for meta.binner consistently yet.

I think if MAG_DEPTHS can cope with this now, that's OK. I didn't set the binner as everything for DASTool as conceptually, this is incorrect. The output remains the same as the output of MaxBin/MetaBAT2 (separately!) just with a few contigs included/excluded. Theoretically you may end up with a clash where the two files from MaxBin/MetaBAT2 (with those in the name) are then refined, but they will now have the same output name if they have DASTool instead of the original binner. But assuming the original binner is stored in the file names I guess that will be OK, given you heavily rely on this.

I will test the pipeline now to make sure everything looks OK from my perspective, but otherwise I think everything looks OK from my end now!

skrakau and others added 2 commits May 31, 2022 18:00
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@skrakau
Copy link
Member

skrakau commented May 31, 2022

For the workflow diagram, is QUAST meant to be abdunance estimation and visualisation? I would flip it if so. But I would assume QUAST is normally for Evaluation, however then it's not clear what the Estimation/Visualisation is referring to? (I have ideas to make more space if you need).

True, it's a bit confusing. "Abundance estimation and visualisation" is done by custom scripts within the pipeline, i.e. there is no tool name we could list for the figure as for the other parts (same holds for "MAG summary").

  • So far the bins used as input for DAS Tool were published in DASTOOL/bins (specified in modules.config for RENAME_PREDASTOOL ), but not the by DAS Tool refined bins. I changed this, publishing now the DAS Tool results processed in RENAME_POSTDASTOOL.

This is OK to me, the only real problem there is you would have a file name mismatch between the DASTool/bins/*.fa and whatever is recorded in the DAS_Tool log/eval/tsv files... but maybe this is not an issue (can be inferred by the user?). Might be worth a note in the docs if you think this is reasonable.

Ah true! Ok, sorry I missed that you enumerated the input bin files new. So with DASTool/bins/ you meant to store (only) the input bins? I am confused, because the docs says "Refined bins in fasta format.".
Then it is not possible for the user to know from which original bin a refined bin originates, right? (Is that alright?)
Where is the name mismatch exactly? I only see that *_allBins.eval contains the already renamed bin names ("MetaBat2Refined"), while referring to the actually not yet refined input bins. Which is also not good.

  • ch_dastool_bins_newmeta contained also "unbinned.fa" files -> separated this

I originally kept that in as IIRC the unbinned files/output files from SPLIT_FASTA were also included in the MAG_DEPTHS plots? Do you not want that here for the unrefined ones?

For the input for MAG_DEPTHS the bins in ch_dastool_bins_newmeta get anyway mixed with the unbins in RENAME_POSTDASTOOL.out.refined_unbins (here). The output channel refined_bins from BINNING_REFINEMENT on the other hand, is expected to not contain unbins for the downstream part as it will be mixed with unbins as well (see here). If I don't miss anything here.

  • since the refined bin filenames contain the information about the original binner, e.g. "-MaxBin2Refined-", there was a problem for downstream processes which group things by binner info, causing for example separate depth or BUSCO plots for "MaxBin2Refined" and "MetaBat2", instead of creating one plot for refined bin set of DASTool. I changed meta.binner to DASTool (and adjusted everything, so that the process MAG_DEPTHS can run for each meta.binner separetely)
    I hope there wasn't a reason that I forgot and didn't see, e.g. causing issues for downstream processes, why DASTOOL was not used for meta.binner consistently yet.

I think if MAG_DEPTHS can cope with this now, that's OK. I didn't set the binner as everything for DASTool as conceptually, this is incorrect. The output remains the same as the output of MaxBin/MetaBAT2 (separately!) just with a few contigs included/excluded. Theoretically you may end up with a clash where the two files from MaxBin/MetaBAT2 (with those in the name) are then refined, but they will now have the same output name if they have DASTool instead of the original binner. But assuming the original binner is stored in the file names I guess that will be OK, given you heavily rely on this.

Good that you are bringing this up, maybe I understood it wrong. Why is this conceptually incorrect? Isn't the DAS Tool output a non-redundant bin set, i.e. some bins from the individual binners are missing and only one best fitting is chosen and refined? For test_binrefinement the current output just contains MetaBAT2Refined bins, no MaxBin2Refined bins' (assumed by chance), and this is what I understood from the DAS Tool paper figure as well. Is this wrong?
As an alternative one could add a meta.orig_binner info. But so far, if I remember correctly, this information is anyway not (yet) used downstream, right? Anyway, maybe we should document it better within the code (mentioned it one in one comment, but this might be overseen).

@jfy133
Copy link
Member Author

jfy133 commented Jun 1, 2022

True, it's a bit confusing. "Abundance estimation and visualisation" is done by custom scripts within the pipeline, i.e. there is no tool name we could list for the figure as for the other parts (same holds for "MAG summary").

Ahh I see. Maybe you could wrap in brackets or something? But maybe not so important.

Ah true! Ok, sorry I missed that you enumerated the input bin files new. So with DASTool/bins/ you meant to store (only) the input bins? I am confused, because the docs says "Refined bins in fasta format.".
Then it is not possible for the user to know from which original bin a refined bin originates, right? (Is that alright?)
Where is the name mismatch exactly? I only see that *_allBins.eval contains the already renamed bin names ("MetaBat2Refined"), while referring to the actually not yet refined input bins. Which is also not good.

Ah wait, you're right actually. I got confused with the DASTOOL-POSTRENAMING (where I originally did the renaming), so this isn't so bad at that stage. (I forgot that I moved where we did that 😅). Yes, the *.eval files is indeed not good :. Unless we add the Refined in the PostRenaming step? Although again then we have a mismatch between downstream steps and DAS_Tool itself...

I think the least damaging compromise is to leave it as you have it now, and we just document it. I doubt many people would look at _allBins.eval, as we already check for quality/completeness with other metrics with QUAST/BUSCO anywya? What do you think?

For the input for MAG_DEPTHS the bins in ch_dastool_bins_newmeta get anyway mixed with the unbins in RENAME_POSTDASTOOL.out.refined_unbins (here). The output channel refined_bins from BINNING_REFINEMENT on the other hand, is expected to not contain unbins for the downstream part as it will be mixed with unbins as well (see here). If I don't miss anything here.

Oops missed that .mix! Alles klar.

Good that you are bringing this up, maybe I understood it wrong. Why is this conceptually incorrect? Isn't the DAS Tool output a non-redundant bin set, i.e. some bins from the individual binners are missing and only one best fitting is chosen and refined? For test_binrefinement the current output just contains MetaBAT2Refined bins, no MaxBin2Refined bins' (assumed by chance), and this is what I understood from the DAS Tool paper figure as well. Is this wrong?
As an alternative one could add a meta.orig_binner info. But so far, if I remember correctly, this information is anyway not (yet) used downstream, right? Anyway, maybe we should document it better within the code (mentioned it one in one comment, but this might be overseen).

Good that you are bringing this up, maybe I understood it wrong. Why is this conceptually incorrect? Isn't the DAS Tool output a non-redundant bin set, i.e. some bins from the individual binners are missing and only one best fitting is chosen and refined? For test_binrefinement the current output just contains MetaBAT2Refined bins, no MaxBin2Refined bins' (assumed by chance), and this is what I understood from the DAS Tool paper figure as well. Is this wrong?
As an alternative one could add a meta.orig_binner info. But so far, if I remember correctly, this information is anyway not (yet) used downstream, right? Anyway, maybe we should document it better within the code (mentioned it one in one comment, but this might be overseen).

Yes that is also true... I was thinking more from a file-based perspective, as in they just 'update' the existing FASTA files, and retain the same file name. I had thought I hadn't seen any of the FASTAs being entirely removed (for example), but maybe I misremembered here (I will check in in the test I'm running now).

Ok then overall I think you have corrected my misunderstandings, so I think everything looks OK now 👍 - I'll let you know on slack if the tests run as I expect it, and then you can ... I guess approve the PR on behalf of both of us 😅 👍

@jfy133
Copy link
Member Author

jfy133 commented Jun 1, 2022

no MaxBin2Refined bins' (assumed by chance)

This maybe because MAXBIN fails:

[bd/e67f57] NOTE: Process `NFCORE_MAG:MAG:BINNING:MAXBIN2 (test_minigut_sample2)` terminated with an error exit status (255) -- Error is ignored
[81/e877e8] NOTE: Process `NFCORE_MAG:MAG:BINNING_REFINEMENT:DASTOOL_DASTOOL (test_minigut_sample2)` terminated with an error exit status (1) -- Error is ignored

MaxBin2 wasn't as reliable as MetaBAT2 if I reemember correctly what my colleague told me

docs/usage.md Show resolved Hide resolved
@skrakau
Copy link
Member

skrakau commented Jun 1, 2022

no MaxBin2Refined bins' (assumed by chance)

This maybe because MAXBIN fails:

[bd/e67f57] NOTE: Process `NFCORE_MAG:MAG:BINNING:MAXBIN2 (test_minigut_sample2)` terminated with an error exit status (255) -- Error is ignored
[81/e877e8] NOTE: Process `NFCORE_MAG:MAG:BINNING_REFINEMENT:DASTOOL_DASTOOL (test_minigut_sample2)` terminated with an error exit status (1) -- Error is ignored

MaxBin2 wasn't as reliable as MetaBAT2 if I reemember correctly what my colleague told me

just for completeness here: for the MEGAHIT assembly of sample test_minigut MaxBin2 bins exist, but DAS TOOL only outputs "MetaBat2Refined" bins ...

@skrakau skrakau self-requested a review June 1, 2022 15:44
@skrakau skrakau merged commit f6b1872 into nf-core:dev Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants