NUCmer job generation for large jobs slows down rapidly. #306
Comments
We have also discussed having the `joblist` container be a set, rather than a list (#297). I don't know if Python garbage collection would do something similar here, or if that is specifically related to the …
I don't know if it would, either. It's hard to beat O(1) for efficiency, so we'd gain nothing from changing to a set at this point, I think (though it may still be useful elsewhere).
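A quick way to check this claim (a sketch only, not a benchmark from this issue): both `list.append` and `set.add` are amortised O(1), so swapping the container shouldn't change the per-operation cost.

```python
import timeit

# Time one million insertions into each container type.
for setup, stmt in [("c = []", "c.append(0)"), ("c = set()", "c.add(0)")]:
    t = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{stmt:>12}: {t:.3f} s for 1e6 ops")
```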
I looked at …
Replacing the … did not help: the rate of appending drops by 50% between 40k items and 80k items regardless. Batching outputs may be the most usefully efficient option.
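A rough way to reproduce this kind of append-rate measurement (a sketch, not the benchmark actually used; `Job` is a hypothetical stand-in for the appended objects):

```python
import time

class Job:  # gc-tracked instance, standing in for the real job objects
    pass

items = []
t0, checkpoint = time.perf_counter(), 10_000
while len(items) < 100_000:
    items.append(Job())
    if len(items) % checkpoint == 0:  # report the rate every 10k appends
        t1 = time.perf_counter()
        print(f"{len(items):>7} items: {checkpoint / (t1 - t0):,.0f} appends/s")
        t0 = t1
```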
Batching gave an average speed-up of about 100x - very fast! - until it fell over at 400k jobs. I don't yet understand why that happened.
With batching, it appears that the script enters a … I had the function batch jobs into lists of 10k, and … It may be that, for scalability, we need to restructure how jobs are passed around the code. This may be a solved issue after the …
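A minimal sketch of the batching approach described here, assuming a `comparisons` iterable of genome pairs (a hypothetical name; pyani's real `ComparisonJob` constructor also takes different arguments):

```python
from collections import namedtuple

# Stand-in for pyani's ComparisonJob; the real class differs.
ComparisonJob = namedtuple("ComparisonJob", "query subject")

def generate_joblist_batched(comparisons, batch_size=10_000):
    """Accumulate jobs in small lists, then flatten once at the end."""
    batches, current = [], []
    for query, subject in comparisons:
        current.append(ComparisonJob(query, subject))
        if len(current) == batch_size:
            batches.append(current)  # hand off the full batch
            current = []
    if current:
        batches.append(current)  # flush the final partial batch
    # One flattening pass instead of growing a single huge list.
    return [job for batch in batches for job in batch]
```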
Summary:
For large comparison runs (e.g. 2500 input genomes) the process of generating NUCmer jobs is slow, and starts to slow rapidly after about 80k command-lines are created.
I think this may be due to the structure of `generate_joblist()` in `subcmd_anim.py`.
Description:
In `generate_joblist()`, an empty list (`joblist`) is created. This is populated with `ComparisonJob` objects in a `for` loop. If we're not in recovery mode, the only problematic-looking task is to construct a `ComparisonJob` and add it to `joblist`.
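Schematically, the hot loop looks something like this (a simplified sketch of the pattern described above, not pyani's actual code; `comparisons` is a hypothetical iterable of genome pairs):

```python
from collections import namedtuple

# Stand-in for pyani's ComparisonJob; the real class takes different arguments.
ComparisonJob = namedtuple("ComparisonJob", "query subject")

def generate_joblist(comparisons):
    """Build the full job list in one pass."""
    joblist = []
    for query, subject in comparisons:
        joblist.append(ComparisonJob(query, subject))  # amortised O(1) append
    return joblist
```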
The append itself is meant to be O(1) in Python > 3.1 (there used to be a bug causing a slowdown, but this was apparently resolved: https://bugs.python.org/issue4074). StackOverflow indicates the problem may be resolved by turning off garbage collection:
> As we're adding objects to the list, Garbage Collection is checking the entire list on each append. Wrapping the append with `gc.disable()`/`gc.enable()` might fix it.
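As a concrete illustration, here is one way the suggested workaround could be applied to the loop sketched above (a sketch only; whether it actually helps here is untested, and `ComparisonJob`/`comparisons` are the same hypothetical stand-ins as before):

```python
import gc
from collections import namedtuple

ComparisonJob = namedtuple("ComparisonJob", "query subject")  # stand-in

def generate_joblist(comparisons):
    """Build the joblist with the cyclic garbage collector paused."""
    joblist = []
    gc.disable()  # suspend cyclic GC during the hot loop
    try:
        for query, subject in comparisons:
            joblist.append(ComparisonJob(query, subject))
    finally:
        gc.enable()  # always re-enable collection, even if the loop raises
    return joblist
```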