Slow writing with many small chunks #167
Right, this is a case that I forgot to implement. It should be straightforward to fix, and that should speed this up quite a bit.
A good chunk of time is spent in the h5py virtual dataset code (the …
As versioned HDF5 relies heavily on ndindex, my plan is to take Python's C implementation of the slice object and make it hashable and inheritable, such that ndindex can use that instead.
Is this (hashable slice object) for the construction of …?
The new object could be used in versioned HDF5 directly, but after creating it I want to first focus on using it within ndindex.
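For context, this is the limitation of the built-in slice type that the plan above works around. A quick illustration (assuming CPython before 3.12, where slice objects are unhashable; slice is not subclassable in any current CPython release):

```python
# Illustration only: why ndindex cannot simply reuse the built-in slice type.
s = slice(1, 10, 2)

try:
    hash(s)  # raises TypeError on CPython < 3.12: unhashable type: 'slice'
except TypeError as e:
    print(e)

try:
    class MySlice(slice):  # raises TypeError: not an acceptable base type
        pass
except TypeError as e:
    print(e)
```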
Unfortunately, in our use case we often end up with suboptimal chunk sizes. Unversioned `h5py` is able to handle those without issues, but with `versioned_hdf5` this turns out to be pretty slow: writing around 120 numbers this way takes about 9 seconds for me.
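The original reproducer snippet did not survive here; a rough sketch of the kind of write pattern being described might look like the following (the file name, dataset name, chunk size, and indices are assumptions, not the reporter's actual code):

```python
import numpy as np
import h5py
from versioned_hdf5 import VersionedHDF5File

with h5py.File("example.h5", "w") as f:
    vf = VersionedHDF5File(f)

    # Initial version: a dataset with deliberately small chunks.
    with vf.stage_version("r0") as group:
        group.create_dataset("values", data=np.zeros(10_000), chunks=(10,))

    # Later version: update a small, scattered subset of elements.
    indices = np.arange(0, 1200, 10)  # ~120 positions spread over many chunks
    with vf.stage_version("r1") as group:
        group["values"][indices] = 1.0  # integer-array indexing hits the slow path
```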
A little bit of profiling points to two things:

1. `as_subchunks` in `InMemoryDataset.__setitem__`, where it ends up calling `_fallback` because there is no case for `IntegerArray`. Could we not use the same code path as for `Integer`?
2. `create_virtual_dataset`: is it possible to speed this up? In this example we only change a very small subset of the data. If we could keep track of the changes, we could probably copy the old virtual dataset and modify it appropriately?
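For reference, a small ndindex-level illustration of the two index types mentioned in point 1. This is a sketch only; it assumes ndindex's public `ChunkSize.as_subchunks` API and does not touch `versioned_hdf5` internals:

```python
import numpy as np
from ndindex import ChunkSize, ndindex

chunks = ChunkSize((10,))   # small chunks, matching the scenario above
shape = (10_000,)

scalar_idx = ndindex(42)                        # ndindex.Integer
array_idx = ndindex(np.array([5, 42, 9_990]))   # ndindex.IntegerArray

# as_subchunks yields, for each chunk the index touches, the chunk itself as a
# Tuple of Slices. The question in point 1 is why IntegerArray cannot reuse the
# per-chunk handling that Integer already gets in InMemoryDataset.__setitem__.
for idx in (scalar_idx, array_idx):
    touched = list(chunks.as_subchunks(idx, shape))
    print(type(idx).__name__, "touches", len(touched), "chunk(s):", touched)
```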