BUSCO creates a huge number of files - add a flag to remove some of these? #341
If I understand you right, then you are talking about the entire file system, i.e. including the Edit:
Hi, thanks for the thoughts! The issue with using the clean command is that hitting the file number limit kills the pipeline immediately, so it's impossible to get to the step where you could retroactively clean e.g. all the busco work directories.
This is what I had in mind when suggesting this - a perhaps judicious flag that activates a bunch of rm commands at the end of the process run to reduce file numbers.
I'm currently running the pipeline with the work/ directory set to be written to scratch using the -w option - it's file limits on the scratch partition that I'm hitting!
Oh wow, file limits on the scratch partition, never heard of that!
Here are a couple of tree outputs I've run on the work directories (apologies, am stuck in the lab this week and don't have much time for anything else!). To my eye, the directories augustus_config, auto_lineage and the ones that start with run_ (e.g. run_metazoa_odb10) seem to be the main culprits. I think the symlinked BUSCO database that I have stored locally on scratch may also be contributing, but that's unavoidable as part of the way Nextflow handles resources.
Edit: the output directories are described here: https://busco.ezlab.org/busco_userguide.html#outputs
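For anyone wanting to check where the files pile up in their own work directory, here is a minimal sketch that counts regular files under each immediate subdirectory and lists the largest first. The function name count_files_per_dir is made up for illustration, and the sketch assumes file names contain no newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file count><TAB><directory>" for each
# immediate subdirectory of the given root, largest count first.
count_files_per_dir() {
    local root="${1:-.}"
    local dir
    for dir in "$root"/*/; do
        # Skip the unexpanded glob when the root has no subdirectories.
        [ -d "$dir" ] || continue
        printf '%s\t%s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "${dir%/}"
    done | sort -rn
}
```

Calling it from inside a single BUSCO task directory, e.g. `count_files_per_dir .` or `count_files_per_dir BUSCO`, should show quickly whether augustus_config, auto_lineage or the run_ directories dominate.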
Have got a solution for this that works - removing the folders "augustus_config", "BUSCO/auto_lineage", and "BUSCO/run_*" at the end of the script (before writing versions) prevents the build up of millions of files:

    if [ -d augustus_config ]; then
        rm -rf augustus_config
    fi
    if [ -d BUSCO/auto_lineage ]; then
        rm -rf BUSCO/auto_lineage
        rm -rf BUSCO/run_*
    fi

If there's interest in including this as a feature, I could put together a pull request, putting the option behind a disabled-by-default parameter flag (perhaps
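As a rough sketch of how the cleanup could sit behind a disabled-by-default switch: the function cleanup_busco and its two arguments are hypothetical, not the pipeline's actual implementation. It also differs slightly from the snippet in this comment, since it removes the run_ directories whether or not auto_lineage exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove BUSCO's bulky intermediate directories,
# but only when cleanup is explicitly switched on.
#   $1  "true" to enable cleanup (anything else is a no-op)
#   $2  directory the BUSCO process ran in (defaults to the current one)
cleanup_busco() {
    local clean="${1:-false}"
    local workdir="${2:-.}"

    # Disabled by default: do nothing unless asked.
    [ "$clean" = "true" ] || return 0

    # rm -rf does not fail on missing paths, so no existence checks are needed.
    rm -rf -- "$workdir/augustus_config" "$workdir/BUSCO/auto_lineage"
    # The glob is left unquoted on purpose so it expands to every run_* directory.
    rm -rf -- "$workdir"/BUSCO/run_*
}
```

In a process script this would be called once at the end, before writing versions, for example `cleanup_busco "true" .`, with the first argument filled in from whatever parameter the pipeline ends up exposing.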
Hi, that sounds good. Yes, a PR would be great! Probably |
Great, I've got a fork assembled with the changes - is there anywhere outside of I'll submit the PR once I've tested the busco_clean flag with the test profile and made sure the directories are removed. |
The CHANGELOG, and activate it in one of the test profiles so that the setting gets tested.
Description of feature
Somewhere between a bug and a feature request - I'm running the pipeline and have generated 608 bins. I'm now at the BUSCO step, and the total number of files on my scratch partition has jumped from approximately 90,000 to 2.1 million with the ~250 processes that have completed so far - approximately 6-7k files for each process. Looking at the current status of the BUSCO results directory, there are only ~1,600 files in total - far fewer than 2 million.
This has caused me to hit the maximum file limit on our cluster, and I had to ask our cluster admin for a temporary increase in the limit. I understand that these output files might be useful to some, but perhaps including a parameter flag (-busco-cleanup maybe?) that deletes (for example) all the auto-lineage files that aren't included in the pipeline results would be useful for those of us who need to keep file numbers lower?