Fixed performance bug in large directories #426
Merged
After the release, I made a first performance test of writing histograms. Here's the script:
scaling-histograms-uproot.py
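The attached script isn't reproduced here, but it's roughly this shape (a minimal sketch, not the real thing: the output path, bin-count formula, and histogram count are placeholders, and it assumes uproot's `uproot.recreate` / key-assignment writing interface):

```python
import time
import numpy as np
import uproot

data = np.random.normal(0, 1, 100000)

# write many histograms into one directory, timing the cumulative cost
with uproot.recreate("/tmp/scaling-histograms-uproot.root") as file:
    start = time.perf_counter()
    for i in range(10000):
        nbins = 10 + (i * 37) % 90          # bin count varies, but not monotonically
        file[f"h{i}"] = np.histogram(data, bins=nbins)
        if i % 1000 == 0:
            print(i, time.perf_counter() - start)
```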
The performance was terrible.
Notice that the number of bins in each test is not monotonically increasing, but the time to complete is. I thought there might have been something wrong in the cascade-writing (in which a change in state invokes a walk over changed tree nodes, writing the changes exactly once), but that wasn't it. It was purely algorithmic: the DirectoryData maintained a single flat list of keys, any kind of search iterated through all of them, and adding one key meant re-serializing the whole DirectoryData, which grows with every key added. Each of these operations is individually O(n), so doing them n times, as in this test, makes the total running time O(n²). Classic algorithms stuff!
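To make that concrete, here's a toy model of the pre-fix behavior (the class name and structure are invented for illustration; this is not uproot's actual DirectoryData):

```python
class NaiveDirectoryData:
    """Toy stand-in for a directory that keeps its keys in one flat list."""

    def __init__(self):
        self.keys = []                          # single list of (name, payload) records

    def add_key(self, name, payload):
        for existing_name, _ in self.keys:      # O(n) linear scan for a duplicate
            if existing_name == name:
                raise KeyError(f"{name} already exists")
        self.keys.append((name, payload))
        return self.serialize()                 # O(n) rewrite of the whole directory

    def serialize(self):
        # re-encodes every key on every call, even the ones already written
        return b"".join(name.encode() + payload for name, payload in self.keys)


directory = NaiveDirectoryData()
for i in range(10000):
    directory.add_key(f"h{i}", b"...")          # the k-th add costs ~k units of work
# total work ~ 1 + 2 + ... + n = O(n**2)
```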
So I replaced the list with data structures optimized for searching (O(1) per search, rather than O(n)), and I replaced the "serialize all keys" step with "serialize only the new keys" whenever keys have only been appended at the end. The other cases (replacing or deleting a key, which are rarer) still rewrite the whole set of keys. Although I'm optimizing the code for a specific performance test, "write a bunch of histograms into a directory" is a very common use case, so the extra complexity in the DirectoryData class is worth it.
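And a sketch of the idea behind the fix, using the same invented toy model (constant-time lookups plus append-only serialization, with a full rewrite reserved for the rarer replace/delete cases):

```python
class IndexedDirectoryData:
    """Toy stand-in for the fixed version: dict lookup + incremental serialization."""

    def __init__(self):
        self.payloads = {}          # name -> payload, O(1) membership tests
        self.order = []             # preserves on-disk key order
        self.num_flushed = 0        # how many keys have already been serialized
        self.append_only = True     # False once a key is replaced or deleted

    def add_key(self, name, payload):
        if name in self.payloads:                # O(1) instead of an O(n) scan
            raise KeyError(f"{name} already exists")
        self.payloads[name] = payload
        self.order.append(name)

    def replace_key(self, name, payload):
        if name not in self.payloads:
            raise KeyError(name)
        self.payloads[name] = payload
        self.append_only = False                 # next flush must rewrite everything

    def serialize_new(self):
        if self.append_only:
            to_write = self.order[self.num_flushed:]   # only keys added since last flush
        else:
            to_write = self.order                      # rare case: full rewrite
            self.append_only = True
        self.num_flushed = len(self.order)
        return b"".join(name.encode() + self.payloads[name] for name in to_write)
```

As long as nothing but appends happen, each flush touches only the new keys, so n writes do O(n) total serialization work instead of O(n²); only the rarer replace/delete path pays for a full rewrite.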
Particularly because these are the updated performance results with that one fix:
In the end, after having written 10000 histograms, it's 80× faster, and that factor would continue to grow with the number of histograms written.
It also compares favorably with PyROOT:
scaling-histograms-pyroot.py
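Again, not the attached script, just a sketch of roughly what the comparison looks like (same placeholder loop; it uses the standard PyROOT TFile/TH1D/Write calls):

```python
import time
import ROOT

file = ROOT.TFile("/tmp/scaling-histograms-pyroot.root", "RECREATE")
start = time.perf_counter()
for i in range(10000):
    nbins = 10 + (i * 37) % 90                  # same non-monotonic bin counts
    h = ROOT.TH1D(f"h{i}", "", nbins, -5.0, 5.0)
    h.FillRandom("gaus", 100000)
    h.Write()                                   # write into the current directory
    if i % 1000 == 0:
        print(i, time.perf_counter() - start)
file.Close()
```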
I don't know what's going on in PyROOT, since most of the work should be happening on the C++ side. The cost there also grows with the number of histograms already written, rather than with the number of bins, so there might be a similar story: some registry that fills up with objects to manage, or something along those lines.
Anyway, the C++ code is much faster, so most of this is in the Python-C++ bindings:
scaling-histograms.cpp
There is a time-dependent increase in C++ as well, which might be a global registry of histogram objects, but the effect in C++ is much smaller than in PyROOT, so something significant must be happening in the bindings.
Anyway, I got to tune my code to this particular performance test, so in that sense the comparison isn't fair. But it also suggests that the fastest possible time is probably around 0.17 seconds for 1000 histograms, since that's ROOT's writing rate before the O(n²) effect kicks in, and I'm just happy that Uproot's 0.65 seconds is reasonably close to that.