Update to busco 5.1.0 and enable automated lineage selection #179

skrakau · 2021-04-08T15:15:48Z

Among other things, this prepares for adding GTDB-Tk (#178).

This updates BUSCO from 4.1.4 to 5.1.0 and enables automated lineage selection:

- By default, the BUSCO parameter --auto-lineage is now used and the data is downloaded automatically (currently also for the tests).
- The lineage can still be specified by providing a lineage dataset via the mag parameter --busco_reference (uses the BUSCO parameter --lineage_dataset). This still requires downloading a file_versions.tsv file, which is used to check whether the newest lineage dataset is used (BUSCO warns if not).
- To run BUSCO in offline mode, the mag parameter --busco_download_path can be used (BUSCO: --download_path, currently only in combination with --auto-lineage or --auto-lineage-prok).
- Additionally, the mag parameter --busco_auto_lineage_prok can be used to ignore eukaryotes (BUSCO: --auto-lineage-prok).
- The mag parameter --save_busco_reference is also used to save the lineage datasets downloaded by BUSCO. I added an extra process for this so that it is done only once and not for each BUSCO process.
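The parameter mapping described above can be sketched roughly as follows. This is an illustrative Python sketch, not the pipeline's actual Nextflow code: the function name `busco_lineage_args` is hypothetical, and both the added `--offline` flag and the rejection of `--busco_reference` together with `--busco_download_path` are assumptions based on the description above.

```python
def busco_lineage_args(busco_reference="", busco_download_path="", auto_lineage_prok=False):
    """Translate mag parameters into BUSCO lineage arguments (illustrative only)."""
    if busco_reference and busco_download_path:
        # Assumption: --download_path is only combined with automated lineage selection.
        raise ValueError("specify either busco_reference or busco_download_path, not both")

    args = []
    if busco_reference:
        # An explicitly provided lineage dataset replaces automated selection.
        args += ["--lineage_dataset", busco_reference]
    else:
        # Default: automated lineage selection, optionally restricted to prokaryotes.
        args.append("--auto-lineage-prok" if auto_lineage_prok else "--auto-lineage")

    if busco_download_path:
        # Assumption: a local download path implies running BUSCO in offline mode.
        args += ["--download_path", busco_download_path, "--offline"]
    return args
```

With no parameters set, the sketch yields only `--auto-lineage`, which matches the new default behaviour.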

Error handling:

- Failed analysis due to no matching BUSCO genes or failed placements:
  - if run with --busco_reference, the number of marker genes for the corresponding lineage is used and all output files just contain a "100% Missing" etc.
  - if run with --auto-lineage, the number of marker genes is unknown and no BUSCO results file is generated -> these contigs are missing in the MultiQC BUSCO report. In the final busco_summary.txt I put NA, since I still thought those contigs should be listed (could also be done differently)
    - since in this case the "short_summary*" file from BUSCO is missing, I added additional output files for failed analyses, which get processed in the downstream BUSCO_SUMMARY process: ${bin}_busco.failed_bins.txt, containing just the bin name
    - I added a Nextflow warning if the BUSCO analysis failed for bins due to these error types (not for kept unbinned contigs though!)
- In my test data there are two unbinned contigs for which the analysis works when selecting the lineage, but fails for auto selection during placement. Since this only happens for unbinned contigs with > 96% missing, I thought maybe it is OK and we can just catch this error as well: warn if this happens for proper bins, and report this as 100% missing with NA as total number in the busco_summary.txt. What do you think?

Open todos:

- when using --busco_reference: also add a warning if for a binned genome 100% are missing?
- Placements failed: could be memory problem?
- compress files when saving downloaded lineage datasets
- test if analysis can be reproduced with saved downloaded data?
- update docs
- update to 5.1.2: maybe in another PR, since I had problems with the current singularity images

github-actions · 2021-04-08T15:17:40Z

`nf-core lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit 5e9c00b

✅ 117 tests passed
❔   4 tests were ignored
❗   4 tests had warnings

❗ Test warnings:

files_exist - File not found: environment.yml
files_exist - File not found: Dockerfile
nextflow_config - Config variable not found: process.container
pipeline_todos - TODO string in awstest.yml: You can customise CI pipeline run tests as required

❔ Tests ignored:

files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: CHANGELOG.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File found: .github/markdownlint.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-mag_logo.png
files_exist - File found: bin/markdown_to_html.py
files_exist - File found: docs/images/nf-core-mag_logo.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/nfcore_external_java_deps.jar
files_exist - File found: lib/NfcoreSchema.groovy
files_exist - File found: main.nf
files_exist - File found: conf/base.config
files_exist - File found: .github/workflows/awstest.yml
files_exist - File found: .github/workflows/awsfulltest.yml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.version
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .svg
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: '1.3.0dev'
files_unchanged - .gitattributes matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.md matches the template
files_unchanged - .github/markdownlint.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/email_template.html matches the template
files_unchanged - assets/email_template.txt matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - assets/nf-core-mag_logo.png matches the template
files_unchanged - bin/markdown_to_html.py matches the template
files_unchanged - docs/images/nf-core-mag_logo.png matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/nfcore_external_java_deps.jar matches the template
files_unchanged - lib/NfcoreSchema.groovy matches the template
files_unchanged - .gitignore matches the template
files_unchanged - assets/multiqc_config.yaml matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml does not use -profile test
readme - README Nextflow minimum version badge matched config. Badge: 20.10.0, Config: 20.10.0
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (103 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_params - Schema matched params returned from nextflow config
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: awsfulltest.yml
actions_schema_validation - Workflow validation passed: awstest.yml
merge_markers - No merge markers found in pipeline files

Run details

nf-core/tools version 1.13.3
Run at 2021-04-26 11:43:10

d4straub · 2021-04-09T08:27:46Z

This is an awesome upgrade!

Error handling:

> if run with --busco_reference the number of marker genes for the corresponding lineage is used and all output files just contain a "100% Missing" etc.

Sounds good

> if run with --auto-lineage number of marker genes is unknown and no busco results file is generated -> these contigs are missing in the MultiQC BUSCO report. In the final busco_summary.txt I put a NA since I still thought those contigs should be listed (could also be done differently)

I guess you mean "bin" instead of "contig" here. Missing in the MultiQC report is fine. I absolutely agree that in busco_summary.txt bins with failed BUSCO analysis should appear anyhow. NA is ok, but if easily possible 100% Missing would also be great, e.g. by producing a dummy "short_summary*" file (and using NA only in the field for number of BUSCOs, i.e. "Total number"), but that's probably too much work for such a small improvement.

> since in this case the "short_summary*" file from busco is missing, I added additional output files for failed analyses which get processed in the downstream BUSCO_SUMMARY process: ${bin}_busco.failed_bins.txt containing just the bin name

Not sure if this is needed since busco_summary.txt already contains that info, if I understood correctly? Nevertheless, one more file is fine as well. Just the bin name is fine.

> I added a Nextflow warning if busco analysis failed for bins due to these error types (not for kept unbinned contigs though!)

I agree that this isn't really relevant for kept unbinned contigs. However, I'd hope that those still appear in the busco_summary.txt?

Open todos:

> when using --busco_reference: add also a warning if for a binned genome 100% are missing?

Yes that would be great.

edit:

> Placements failed: could be memory problem?

I think this would be great to check out.

d4straub · 2021-04-09T08:32:05Z

lib/Completion.groovy

+            for (bin in busco_failed_bins) {
+                failed_bins += "    ${bin}\n"
+            }
+            log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} For ${busco_failed_bins.size()} bin(s) the BUSCO analysis failed because no genes where found or placements failed:\n${failed_bins}See ${params.outdir}/GenomeBinning/QC/BUSCO/[bin]_busco.err for further information.${colors.reset}-"


Hm so this is also when placements failed, is that the same problem? I am uncertain.

yes, in both cases bins would be listed here.

The problem with the placement error is it says:
ERROR: Placements failed. Try to rerun increasing the memory or select a lineage manually.

In my case it was definitely not a memory problem. However, I am afraid that this could be caused by memory issues and then we would block retrying (such error messages do not really help...). That is why I pointed to the ${bin}_busco.err file, so the user could at least discover this. It would of course not be nice, but I also don't know how else to handle this.
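One way to find the affected bins after a run is to scan the per-bin `.err` files for the message quoted above. The following is a minimal Python sketch, assuming the logs have already been read into a dict keyed by bin name; the function name and data layout are hypothetical and not part of the pipeline.

```python
def bins_with_failed_placement(err_logs):
    """Return the sorted names of bins whose BUSCO log reports a failed placement.

    err_logs maps a bin name to the text of its [bin]_busco.err file.
    """
    failed = []
    for bin_name, text in err_logs.items():
        # Only ERROR lines containing the placement message quoted in this thread count.
        error_lines = (line for line in text.splitlines() if line.startswith("ERROR"))
        if any("Placements failed" in line for line in error_lines):
            failed.append(bin_name)
    return sorted(failed)
```

Such a scan cannot tell whether memory was the cause, which is the concern raised here. It only points to the bins worth inspecting.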

d4straub · 2021-04-09T08:41:32Z

modules/local/busco.nf

+        if [ \${#summaries[@]} -ne 1 ]; then
+            echo "ERROR: none or multiple 'BUSCO/short_summary.specific.*.BUSCO.txt' files found. Expected one."
+            exit 1
+        fi


Curious: in what case can there be several specific summary files?

I don't know :D I think I just wanted to check if it is exactly 1
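For illustration, the guard amounts to counting glob matches and failing unless there is exactly one. Zero matches is the realistic case, for example when BUSCO aborts before writing a summary. Below is a Python sketch of the same idea. The function name is hypothetical and only the file pattern is taken from the diff above.

```python
from pathlib import Path


def find_specific_summary(busco_dir):
    """Return the single specific-lineage summary file in busco_dir, or raise."""
    matches = sorted(Path(busco_dir).glob("short_summary.specific.*.BUSCO.txt"))
    if len(matches) != 1:
        # Covers both "none found" and "more than one found".
        raise RuntimeError(
            f"expected exactly one specific summary file, found {len(matches)}"
        )
    return matches[0]
```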

skrakau · 2021-04-09T14:08:01Z

> I guess you mean "bin" instead of "contig" here. Missing in the MultiQC report is fine. I absolutely agree that in busco_summary.txt bins with failed BUSCO analysis should appear anyhow. NA is ok, but if easily possible 100% Missing would also be great, e.g. by producing a dummy "short_summary*" file (and using NA only in the field for number of BUSCOs, i.e. "Total number"), but that's probably too much work for such a small improvement.

Currently the busco_summary.txt looks like this when using auto selection:

GenomeBin       %Complete       %Complete and single-copy       %Complete and duplicated        %Fragmented     %Missing        Total number
SPAdes-test_minigut.1.fa        17.7    17.7    0.0     0.8     81.5    124
SPAdes-test_minigut.2.fa        13.7    13.7    0.0     3.2     83.1    124
MEGAHIT-test_minigut.1.fa       12.1    12.1    0.0     3.2     84.7    124
MEGAHIT-test_minigut.2.fa       18.5    18.5    0.0     0.8     80.7    124
SPAdes-test_minigut_sample2.unbinned.1.fa       0.0%    0.0%    0.0%    0.0%    100.0%  NA
SPAdes-test_minigut_sample2.unbinned.2.fa       0.0%    0.0%    0.0%    0.0%    100.0%  NA
...

I adjusted the summary.busco.py script for this.
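The row logic can be sketched as follows. This is not the actual `summary.busco.py`: the function name is hypothetical, and the regular expression assumes BUSCO's one-line result has the form `C:17.7%[S:17.7%,D:0.0%],F:0.8%,M:81.5%,n:124`. Unlike the table above, the sketch writes the failed rows without percent signs so that all rows share one format.

```python
import re

# Assumed shape of BUSCO's one-line result inside a short_summary file.
RESULT_LINE = re.compile(
    r"C:(?P<c>[\d.]+)%\[S:(?P<s>[\d.]+)%,D:(?P<d>[\d.]+)%\],"
    r"F:(?P<f>[\d.]+)%,M:(?P<m>[\d.]+)%,n:(?P<n>\d+)"
)


def summary_row(bin_name, short_summary_text=None):
    """Build one summary row; a bin without a summary file counts as 100% missing."""
    if short_summary_text is None:
        # Failed analysis: the total number of marker genes is unknown, hence NA.
        return [bin_name, "0.0", "0.0", "0.0", "0.0", "100.0", "NA"]
    match = RESULT_LINE.search(short_summary_text)
    if match is None:
        raise ValueError(f"no BUSCO result line found for {bin_name}")
    return [bin_name] + [match.group(key) for key in ("c", "s", "d", "f", "m", "n")]
```

Joining each returned list with tabs gives rows in the column order shown above.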

> > since in this case the "short_summary*" file from busco is missing, I added additional output files for failed analyses which get processed in the downstream BUSCO_SUMMARY process: ${bin}_busco.failed_bins.txt containing just the bin name
>
> Not sure if this is needed since busco_summary.txt already contains that info, if I understood correctly? Nevertheless, one more file is fine as well. Just the bin name is fine.

busco_summary.txt does not yet contain the info, because for those bins no summary file from BUSCO arrives. In theory the bin name would be sufficient, yes, but I did not manage to pass the bin names as val types, because collect() and passing it to the BUSCO_SUMMARY process only worked for files. Maybe there is a solution, but I didn't see a straightforward one. Do you know one?

> > I added a Nextflow warning if busco analysis failed for bins due to these error types (not for kept unbinned contigs though!)
>
> I agree that this isn't really relevant for kept unbinned contigs. However, I'd hope that those still appear in the busco_summary.txt?

sure :)

skrakau · 2021-04-22T11:39:49Z

ok, finally another update :) roughly the following changes were done:

- added Nextflow warning that BUSCO did not find any matching genes also when using --busco_reference
- added used lineage dataset to busco_summary.tsv
- when run in auto lineage selection mode: results for both the selected domain and a more specific lineage are provided if available (short_summary.domain.[lineage].[assembler]-[bin].txt, short_summary.specific_lineage.[lineage].[assembler]-[bin].txt, [assembler]-[bin]_buscos.[lineage].fna.gz, [assembler]-[bin]_buscos.[lineage].faa.gz and in busco_summary.tsv)
- domain results are now retrieved even if placement fails
- add 'Viruses' for domain in busco_summary.tsv
- fixed saving of datasets downloaded by BUSCO when using auto selection in online mode: since the complete reference data provided by BUSCO is > 100 GB (which could be used to ensure reproducibility), I wanted a setting which allows saving only the datasets actually downloaded by BUSCO. The downloaded data depends on the lineage selection. Therefore the process BUSCO_SAVE_DOWNLOAD is now run on a subset of the download folders (from different BUSCO runs), one for each selected specific lineage, to make sure all used datasets are retrieved while not overwriting already saved files. This now can be used to reproduce the BUSCO analysis (with same settings and same input data).
- regarding the saving of the downloaded files, I did not compress them yet as they did not take that much space yet, and like this the folder can directly be used as input for --busco_download_path
- docs output.md and usage.md updated
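The per-lineage selection of download folders described in the list above boils down to keeping one folder per distinct lineage. Below is a minimal Python sketch of that idea. The real logic lives in the Nextflow workflow, and the function name, input layout and example paths are hypothetical.

```python
def downloads_to_save(runs):
    """Pick one download folder per selected lineage, keeping the first one seen.

    runs is an iterable of (selected_lineage, download_folder) pairs,
    one pair per BUSCO run.
    """
    selected = {}
    for lineage, folder in runs:
        # setdefault keeps the existing entry, so later runs never overwrite it.
        selected.setdefault(lineage, folder)
    return selected


# Example: three runs, two distinct lineages, so only two folders are kept.
example = downloads_to_save([
    ("bacteria_odb10", "work/a/busco_downloads"),
    ("bacteria_odb10", "work/b/busco_downloads"),
    ("archaea_odb10", "work/c/busco_downloads"),
])
```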

and the last time I forgot to mention:

the process BUSCO_PLOT no longer outputs the sample-wise summary files [assembler]-[sample]-busco_summary.txt. If they are needed I could easily add them again (generated with the BUSCO_SUMMARY process). I just wondered whether they are used at all, since they were also not mentioned in the output.md so far?

d4straub · 2021-04-22T11:52:45Z

Really helpful changes!

> the process BUSCO_PLOT no longer outputs the sample-wise summary files [assembler]-[sample]-busco_summary.txt. If they are needed I could easily add them again (generated with the BUSCO_SUMMARY process). I just wondered whether they are used at all, since they were also not mentioned in the output.md so far?

[assembler]-[sample]-busco_summary.txt are not used at all, because all this information is supposed to be in busco_summary.tsv anyway.

skrakau · 2021-04-22T15:42:53Z

Thanks a lot @d4straub for all your feedback! Ready for re-review again

skrakau · 2021-04-26T10:44:04Z

Correction:
unfortunately there are a few differences regarding the handling of mollicutes lineages and failed placements when running BUSCO in online and offline mode. Thus the saved datasets downloaded automatically by BUSCO can currently not be used to reproduce the analysis by specifying the already downloaded files via --busco_download_path. I adjusted the docs accordingly.

d4straub

Looks good to me!

skrakau and others added 13 commits March 31, 2021 15:51

Update to Busco 5.1.0

8fd9bce

Add auto-lineage functionality for busco

42d145c

Added handling of errors (i.e. due to no found genes) and missing output files for Busco

124d1c3

Fix unnecessary path from busco_summary

341bf05

Adjust Busco cpu requirement

45c9dcc

Adjust Busco MultiQC info

4cbc498

Fix used busco --download_path

e23fd95

Add parameter checks for busco paramter

19cd80c

Adjust comments

9a569d9

Add --busco_auto_lineage_prok param

6960f6a

Add optional saving of busco_downloads

816090b

Set default for busco_reference to '' (auto lineage selection with download by default)

521c575

Add new busco params to nextflow_schema.json

fe2f1d1

skrakau added 4 commits April 8, 2021 17:32

Fix BUSCO_SUMMARY to also run if one input is empty

2ffff30

Improve error handling of placements and other BUSCO errors

c4bd1ba

Fix handling of busco_reference param within bash script

bc4f114

Fix usage of skip_busco param

104090c

skrakau marked this pull request as ready for review April 8, 2021 18:11

skrakau requested a review from d4straub April 8, 2021 18:12

d4straub reviewed Apr 9, 2021

Increase memory for busco to avoid placement memory problems

60204c6

skrakau force-pushed the update_busco_5.1.0 branch from ab890e8 to 60204c6 Compare April 9, 2021 16:21

skrakau added 4 commits April 9, 2021 19:55

Add def to variables defined within process

69a162e

Adjust warning for busco using --busco_reference

c15eecb

Rename 'failed_bins' to 'failed_bin' where necessary

cb4d890

Add lineage dataset to busco_summary.txt

34c6ccd

skrakau added 14 commits April 21, 2021 20:58

Adjust warning message for failed placements

520dd70

Adjust MultiQC BUSCO info

5fc2e5c

Change busco_summary.txt to busco_summary.tsv

14664ee

Output both domain and specific BUSCO sequences where applciable

0f44c5b

Adjust warning for failed BUSCO placemetns

6f7c97f

Fix and escape single quote in multiqc config

556c160

Add handling of busco results for mollicutes

3646a70

Fix: remove % from busco summary for failed bins

21fcc62

Fix: set nullgob for whole busco script

5837b05

Adjust busco specific output in busco_summary.tsv

b49a2c2

Add 'Viruses' domain to busco_summary.tsv

200a904

Change NA representation from 'NA' to '' in busco_summary.tsv

6795bd7

Fix: save BUSCO downloads for different lineages

94d55f5

Adjust docs for reproducible BUSCO usage

03b16bf

skrakau force-pushed the update_busco_5.1.0 branch from bd53286 to 61b541d Compare April 22, 2021 11:38

skrakau added 4 commits April 22, 2021 15:52

Update output.md for new BUSCO output files

87d0959

Change whitespace and remove comment

f872736

Change colour for warnings about handled BUSCO problems to yellow

c3723fd

Update CHANGELOG

83804e7

skrakau force-pushed the update_busco_5.1.0 branch from 61b541d to 83804e7 Compare April 22, 2021 13:54

skrakau requested a review from d4straub April 22, 2021 15:41

Update docs regarding reproducible BUSCO analysis

eb8ed75

d4straub approved these changes Apr 26, 2021

Adjust description of busco_download_path param in schema

5e9c00b

skrakau merged commit af7c17d into nf-core:dev Apr 26, 2021

skrakau deleted the update_busco_5.1.0 branch May 31, 2021 13:40
