BUSCO creates a huge number of files - add a flag to remove some of these? #341

Closed
prototaxites opened this issue Oct 5, 2022 · 8 comments
Labels
enhancement New feature or request

@prototaxites
Contributor

Description of feature

Somewhere between a bug and a feature request: I'm running the pipeline and have generated 608 bins. I'm now at the BUSCO step, and the total number of files on my scratch partition has jumped from approximately 90,000 to 2.1 million with the ~250 processes that have completed so far, approximately 6-7k files for each process. Looking at the current status of the BUSCO results directory, there are only ~1600 files in total, far fewer than 2 million.

This has caused me to hit the maximum file limit on our cluster, and I had to ask our cluster admin for a temporary increase in the limit. I understand that these output files might be useful to some, but perhaps including a parameter flag (-busco-cleanup maybe?) that deletes (for example) all the auto-lineage files that aren't included in the pipeline results would be useful for those of us who need to keep file numbers lower?

@prototaxites prototaxites added the enhancement New feature or request label Oct 5, 2022
@d4straub
Collaborator

d4straub commented Oct 5, 2022

If I understand you right, then you are talking about the entire file system, i.e. including the work folder from Nextflow, not just the results folder. If so, then I do not see this as a pipeline problem but rather a Nextflow "problem". The work folder typically stays there until the pipeline is done; it is required by Nextflow. Maybe you would be interested in https://www.nextflow.io/docs/latest/cli.html#clean? Unfortunately I do not have an idea how to reduce the file numbers; your dataset is just large (but not huge) and the pipeline requires significant resources, as metagenome assembly typically does. You can certainly skip BUSCO if you'd like with --skip_busco.

Edit:
Two more thoughts:

@prototaxites
Contributor Author

Hi, thanks for the thoughts! The issue with using the clean command is that hitting the file number limit kills the pipeline immediately, so it's impossible to get to the step where you could retroactively clean e.g. all the busco work directories.

> https://github.com/nf-core/mag/blob/a8e92af70eca59a92b72262e6cdde11e69375801/modules/local/busco.nf maybe could benefit from deleting some files during the module run, however some files are referenced as output, see

This is what I had in mind when suggesting this - a perhaps judicious flag that activates a bunch of rm commands at the end of the process run to reduce file numbers.

> On the other hand, you could benefit from using scratch if you have it (we use that by default, but the scratch memory was sometimes too low for nf-core/mag runs on our cluster). Scratch is typically deleted automatically and Nextflow doesn't need it. That way the work directory will contain only the inputs and outputs of processes, not any unused files.

I'm currently running the pipeline with the work/ directory set to be written to scratch using the -w option - it's file limits on the scratch partition that I'm hitting!

@d4straub
Collaborator

d4straub commented Oct 5, 2022

Oh wow, file limits on the scratch partition, never heard of that!
Do you have an idea which directories of the process hold many files? I don't have an example at hand to check for myself right now. Knowing that, we could consider removing those directories.

@prototaxites
Contributor Author

prototaxites commented Oct 5, 2022

Here's a couple of tree outputs I've run on the work directories (apologies, am stuck in the lab this week and don't have much time for anything else!). To my eye, the directories augustus_config, auto_lineage and the ones that start run_ (e.g. run_metazoa_odb10) seem to be the main culprits. I think the symlinked busco database that I have stored locally on scratch may also be contributing, but that's unavoidable as part of the way Nextflow handles resources.

tree1.txt
tree2.txt

edit: the output directories are described here: https://busco.ezlab.org/busco_userguide.html#outputs
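
For anyone who wants to check which directories hold the most files without reading through a full tree listing, a per-directory file count may be quicker. This is only a minimal sketch: the function name `count_files_per_dir` is made up for illustration, and it assumes a bash shell with the standard `find`, `wc`, `tr` and `sort` utilities.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file count><TAB><directory>" for each
# immediate subdirectory of the given directory, largest count first.
count_files_per_dir() {
    local root="${1:-.}"
    local dir
    for dir in "$root"/*/; do
        # Skip the unexpanded glob when there are no subdirectories.
        [ -d "$dir" ] || continue
        # Count regular files and symlinks at any depth below $dir.
        printf '%s\t%s\n' \
            "$(find "$dir" \( -type f -o -type l \) | wc -l | tr -d ' ')" \
            "${dir%/}"
    done | sort -rn
}
```

Pointing it at the work directory of a single BUSCO task should list that task's subdirectories with the heaviest ones at the top.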

@prototaxites
Contributor Author

I've got a solution for this that works: removing the folders "augustus_config", "BUSCO/auto_lineage", and "BUSCO/run_*" at the end of the script (before writing versions) prevents the build-up of millions of files:

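# Remove the augustus_config directory if it exists.
if [ -d augustus_config ]; then
    rm -rf augustus_config
fi
# Only when BUSCO/auto_lineage exists: remove it and all BUSCO/run_* directories.
if [ -d BUSCO/auto_lineage ]; then
    rm -rf BUSCO/auto_lineage
    rm -rf BUSCO/run_*
fi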

If there's interest in including this as a feature, I could put together a pull request, putting the option behind a disabled-by-default parameter flag (perhaps params.clean_busco?), given that users might want to look at the HMM models in the work directory. Otherwise I'm happy to just continue re-adding it to my local copy when I update it - this issue in part occurs because I'm working with very large numbers of metagenomes (~300) at any one time.
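
To make the proposal a little more concrete, here is a rough shell sketch of what a guarded, disabled-by-default cleanup could look like. It is not the pipeline's actual code: the function name `busco_cleanup` and its true/false argument are hypothetical stand-ins for however the parameter would be passed into the process script, and the directory names are the ones from the snippet above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: remove BUSCO's bulky intermediate directories only
# when cleanup is explicitly requested. Run from the task's work directory.
busco_cleanup() {
    # First argument is the (assumed) flag value; defaults to "false".
    local clean="${1:-false}"
    if [ "$clean" != "true" ]; then
        return 0
    fi
    if [ -d augustus_config ]; then
        rm -rf augustus_config
    fi
    # As in the snippet above, the run_* directories are removed only when
    # an auto_lineage directory is present.
    if [ -d BUSCO/auto_lineage ]; then
        rm -rf BUSCO/auto_lineage
        # The glob is left unquoted so it expands to every run_* directory.
        rm -rf BUSCO/run_*
    fi
}
```

Calling `busco_cleanup true` at the end of the script would delete the directories, while `busco_cleanup false` (or no argument) leaves everything in place for users who want to inspect the intermediate files.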

@d4straub
Collaborator

Hi, that sounds good. Yes, a PR would be great! Probably params.busco_clean, to have the tool name at the start (this isn't a standard, but I increasingly think it helps). Please ping me as a reviewer when you have the PR ready.

@prototaxites
Contributor Author

Great, I've got a fork assembled with the changes. Is there anywhere outside of nextflow_schema.json and nextflow.config that the parameter needs to be documented?

I'll submit the PR once I've tested the busco_clean flag with the test profile and made sure the directories are removed.

@d4straub
Collaborator

> Is there anywhere outside of nextflow_schema.json and nextflow.config that the parameter needs to be documented?

The CHANGELOG. Also activate it in one of the test profiles so that the setting gets tested.
