BUSCO creates a huge number of files - add a flag to remove some of these? #341

prototaxites · 2022-10-05T13:25:32Z

Description of feature

Somewhere between a bug and a feature request - I'm running the pipeline and have generated 608 bins. I'm now at the BUSCO step and the total number of files I have on my scratch partition has jumped from approximately 90,000 to 2.1 million with the ~250 BUSCO processes that have completed so far - approximately 6-7k files for each process. Looking at the current state of the BUSCO results directory, there are only ~1,600 files in total, far fewer than 2 million.

This has caused me to hit the maximum file limit on our cluster, and I had to ask our cluster admin for a temporary increase in the limit. I understand that these output files might be useful to some, but perhaps including a parameter flag (-busco-cleanup maybe?) that deletes (for example) all the auto-lineage files that aren't included in the pipeline results would be useful for those of us who need to keep file numbers lower?

d4straub · 2022-10-05T13:34:18Z

If I understand you right, then you are talking about the entire file system, i.e. including the work folder from Nextflow, not just the results folder. If so, then I do not see this as a pipeline problem but rather a Nextflow "problem". The work folder typically stays there until the pipeline is done; it is required by Nextflow. Maybe you would be interested in https://www.nextflow.io/docs/latest/cli.html#clean? Unfortunately I do not have an idea how to reduce the file numbers: your dataset is just large (but not huge) and the pipeline requires significant resources, as metagenome assembly typically does. You can certainly skip BUSCO if you'd like with --skip_busco.

Edit:
Two more thoughts:

https://github.com/nf-core/mag/blob/a8e92af70eca59a92b72262e6cdde11e69375801/modules/local/busco.nf could maybe benefit from deleting some files during the module run; however, some files are referenced as output, see

mag/modules/local/busco.nf

Line 17 in a8e92af

tuple env(most_spec_db), path('busco_downloads/') , optional:true , emit: busco_downloads
On the other hand, you could benefit from using scratch if you have it (we use that by default, but the scratch memory was sometimes too low for nf-core/mag runs on our cluster). Scratch is typically deleted automatically and Nextflow doesn't need it. That way the work directory will contain only the input and output of processes, but not any unused files.

prototaxites · 2022-10-05T14:44:09Z

Hi, thanks for the thoughts! The issue with using the clean command is that hitting the file number limit kills the pipeline immediately, so it's impossible to get to the step where you could retroactively clean e.g. all the busco work directories.

> https://github.com/nf-core/mag/blob/a8e92af70eca59a92b72262e6cdde11e69375801/modules/local/busco.nf could maybe benefit from deleting some files during the module run; however, some files are referenced as output, see

This is what I had in mind when suggesting this - a perhaps judicious flag that activates a bunch of rm commands at the end of the process run to reduce file numbers.

> On the other hand, you could benefit from using scratch if you have it (we use that by default, but the scratch memory was sometimes too low for nf-core/mag runs on our cluster). Scratch is typically deleted automatically and Nextflow doesn't need it. That way the work directory will contain only the input and output of processes, but not any unused files.

I'm currently running the pipeline with the work/ directory set to be written to scratch using the -w option - it's file limits on the scratch partition that I'm hitting!

d4straub · 2022-10-05T15:12:04Z

Oh wow, file limits on the scratch partition, never heard of that!
Do you have an idea which directories of the process hold many files? I don't have an example at hand to check for myself right now. Then we could consider removing those directories.

prototaxites · 2022-10-05T16:04:34Z

Here's a couple of tree outputs I've run on the work directories (apologies, am stuck in the lab this week and don't have much time for anything else!). To my eye, the directories augustus_config, auto_lineage and the ones that start run_ (e.g. run_metazoa_odb10) seem to be the main culprits. I think the symlinked busco database that I have stored locally on scratch may also be contributing, but that's unavoidable as part of the way Nextflow handles resources.

tree1.txt
tree2.txt

edit: the output directories are described here: https://busco.ezlab.org/busco_userguide.html#outputs
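A per-directory file count is a quick way to confirm which directories dominate. Below is a minimal bash sketch, assuming a POSIX-like `find`, `wc` and `sort`; the function name `count_files_per_dir` is made up for illustration and is not part of the pipeline or of BUSCO.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file count><TAB><directory>" for each immediate
# subdirectory of the given directory, largest count first.
count_files_per_dir() {
    local root="${1:-.}"
    local dir
    for dir in "$root"/*/; do
        # Skip the literal pattern when there are no subdirectories
        [ -d "$dir" ] || continue
        # Count regular files anywhere below this subdirectory
        printf '%s\t%s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "${dir%/}"
    done | sort -rn
}
```

Run inside a single BUSCO task directory (for example `count_files_per_dir .`), this lists the subdirectories from most to fewest files, so the heaviest ones appear first.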

prototaxites · 2022-11-15T10:03:50Z

Have got a solution for this that works - removing the folders "augustus_config", "BUSCO/auto_lineage", and "BUSCO/run_*" at the end of the script (before writing versions) prevents the build up of millions of files:

if [ -d augustus_config ]; then
    rm -rf augustus_config
fi
if [ -d BUSCO/auto_lineage ]; then
    rm -rf BUSCO/auto_lineage
    rm -rf BUSCO/run_*
fi

If there's interest in including this as a feature, I could put together a pull request, putting the option behind a disabled-by-default parameter flag (perhaps params.clean_busco?), given that users might want to look at the HMM models in the work directory. Otherwise I'm happy to just continue re-adding it to my local copy when I update it - this issue in part occurs because I'm working with very large numbers of metagenomes (~300) at any one time.
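To illustrate the idea of putting the cleanup behind a disabled-by-default switch, here is a minimal bash sketch. The function name `busco_cleanup` and its `true`/`false` argument are hypothetical, chosen only for this example; this is not the code from the eventual pull request, and how a pipeline parameter would be passed into the module script is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove BUSCO's bulky intermediate directories from the
# current task directory, but only when cleanup is explicitly enabled.
busco_cleanup() {
    local enabled="${1:-false}"   # disabled unless the caller passes "true"
    [ "$enabled" = "true" ] || return 0

    # rm -rf is a no-op for paths that do not exist
    rm -rf augustus_config

    # Same condition as the snippet above: the run_* directories are only
    # removed when an auto-lineage run left its directory behind
    if [ -d BUSCO/auto_lineage ]; then
        rm -rf BUSCO/auto_lineage BUSCO/run_*
    fi
}
```

Calling `busco_cleanup true` at the end of the script, after any files declared as outputs have been copied out, would remove the directories; `busco_cleanup false` (or no argument) leaves everything in place for users who want to inspect the work directory.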

d4straub · 2022-11-15T10:32:33Z

Hi, that sounds good. Yes, a PR would be great! Probably params.busco_clean, to have the tool name at the start (this isn't a standard, but I increasingly think that helps). Please ping me as a reviewer when you have the PR ready.

prototaxites · 2022-11-15T11:46:48Z

Great, I've got a fork assembled with the changes - is there anywhere outside of nextflow_schema.json and nextflow.config that the parameter needs to be documented?

I'll submit the PR once I've tested the busco_clean flag with the test profile and made sure the directories are removed.

d4straub · 2022-11-15T12:27:38Z

> is there anywhere outside of nextflow_schema.json and nextflow.config that the parameter needs to be documented?

The CHANGELOG. Also activate it in one of the test profiles so that the setting gets tested.

prototaxites added the enhancement (New feature or request) label Oct 5, 2022

prototaxites mentioned this issue Nov 15, 2022

Add busco_clean parameter #353

Merged

10 tasks

prototaxites closed this as completed Dec 5, 2022
