Redesign InMemoryDataset data model #370

base: master

Conversation
```python
@cython.cclass
class IndexChunkMapper:
```
This class and all of its subclasses are just a refactor of the legacy code, with no logic changes. I've rewired the legacy `as_subchunk_map` to use the refactored code.

There are minor additions, which are not used by the legacy `as_subchunk_map`:

- new methods `chunks_indexer` and `whole_chunks_indexer`;
- the method `chunks_submap` allows selecting more than one contiguous chunk at a time, but `as_subchunk_map` will only ever ask for one;
- the same method raises an exception for early exit when none of the requested chunks is impacted by the index, which by definition never happens with the legacy `as_subchunk_map`.
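For intuition only, here is a toy, pure-Python sketch of what a per-axis chunk mapper conceptually does; the name and signature below are made up for illustration and are not the PR's actual API:

```python
def map_slice_to_chunks(start, stop, step, chunk_size):
    """Toy stand-in for an IndexChunkMapper: translate a basic slice
    along one axis into {chunk index: indices within that chunk}."""
    chunks = {}
    for i in range(start, stop, step):
        c, offset = divmod(i, chunk_size)
        chunks.setdefault(c, []).append(offset)
    return chunks

# The slice 3:17:2 over chunks of size 8 touches chunks 0 and 1:
print(map_slice_to_chunks(3, 17, 2, 8))
# {0: [3, 5, 7], 1: [1, 3, 5, 7]}
```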
```
@@ -336,3 +674,861 @@ def as_subchunk_map(
        chunk_subidxs = chunk_subidxs[:idx_len]

        yield Tuple(*chunk_slices), arr_subidxs, chunk_subidxs
```
Refactor ends here. Everything below this line is new code that has not been integrated yet.

I must point out that the code cythonizes very poorly. This can be improved in a later iteration.
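For context, a standard way to see where a module falls back to Python (regular Cython tooling, nothing specific to this PR) is to build it with annotation enabled and inspect the generated HTML report, where yellow lines mark Python C-API interaction:

```python
# Generates versioned_hdf5/subchunk_map.html next to the source file;
# the more yellow a line in the report, the more it leaves C space.
from Cython.Build import cythonize

cythonize("versioned_hdf5/subchunk_map.py", annotate=True)
```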
versioned_hdf5/subchunk_map.py (outdated)

```python
return [(tuple(idx), value_idx) for *idx, value_idx in zip(*indices, values_idx)]


def _hypercubes_between(
```
I'm not happy with this design. It performs particularly badly in some edge cases: e.g. it's great after you call `a[::2 * chunk_size, :] = 1`, but it's awful after `a[:, ::2 * chunk_size] = 1`, because in the latter case you'll end up loading one chunk at a time. It's also problematically slow when you have most of the dataset staged in memory, e.g. after `a[:] = 1`.

For the sake of moving things forward, I would like to deliver this as is and rework it later with a smarter algorithm that can generate long and narrow rectangles.
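To make the asymmetry of the first two cases concrete, here is a toy numpy sketch (the 8x8 chunk grid and the helper are illustrative, not the PR's code) counting how many contiguous row-major rectangles of untouched chunks each write pattern leaves behind:

```python
import numpy as np

grid = np.zeros((8, 8), dtype=bool)  # one cell per chunk; True = written to

rows = grid.copy()
rows[::2, :] = True   # after a[::2 * chunk_size, :] = 1
cols = grid.copy()
cols[:, ::2] = True   # after a[:, ::2 * chunk_size] = 1

def clean_runs(dirty):
    """Count maximal horizontal runs of clean chunks - a crude proxy for
    how many separate reads a row-major rectangle merge would issue."""
    runs = 0
    for row in ~dirty:
        padded = np.concatenate(([False], row, [False]))
        runs += int(np.count_nonzero(np.diff(padded.astype(np.int8)) == 1))
    return runs

print(clean_runs(rows))  # 4  -> one wide 1x8 read per untouched row
print(clean_runs(cols))  # 32 -> one 1x1 read per untouched chunk
```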
versioned_hdf5/subchunk_map.py (outdated)

```python
        idx_to[1:],
        shape[1:],
    ):
        yield (idx_to[0], *ni), (idx_to[0] + 1, *nj)
```
`yield`, tuples, etc. are all things that compile in Cython but give you zero performance benefit over pure Python. Hacks to make the code stay in C space for longer are very much possible, but at the cost of some complication. I would like to keep them as a later deliverable, only if we want to squeeze extra speed out of it.
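A sketch of the kind of hack meant here, written in Cython pure-Python mode (the function and buffer names are invented for illustration): replace a generator that yields Python tuples with a `@cython.cfunc` that writes into a caller-supplied typed memoryview, so the hot loop never leaves C space:

```python
import cython
import numpy as np

@cython.cfunc
@cython.boundscheck(False)
@cython.wraparound(False)
def _fill_ranges(out: cython.Py_ssize_t[:, :], n: cython.Py_ssize_t) -> cython.Py_ssize_t:
    # Write (start, stop) pairs into a preallocated buffer instead of
    # yielding one Python tuple per item; compiled, this loop is pure C.
    i: cython.Py_ssize_t
    for i in range(n):
        out[i, 0] = i
        out[i, 1] = i + 1
    return n

# The caller allocates once and reuses the buffer across calls.
buf = np.empty((1024, 2), dtype=np.intp)
m = _fill_ranges(buf, 4)
print(np.asarray(buf[:m]))  # rows: (0, 1), (1, 2), (2, 3), (3, 4)
```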
```python
    return idx, mappers


def as_subchunk_map(
```
This is just a temporary refactor, for my own sanity, to verify early on that I didn't break anything fundamental, and for the sanity of reviewers. It will be completely removed in the final version.
Note: This PR is still missing some components - see checklist below.

`versioned_hdf5.wrappers.InMemoryDataset` is plagued by slowness caused by the control code, both because the control code itself is very slow and because it calls `h5py.Dataset.__getitem__` independently for every single chunk. The new control logic introduced in this PR aims to minimise the number of calls to `h5py.Dataset.__getitem__` by selecting large rectangles of unmodified data spanning multiple chunks.

This new design introduces a helper class, `StagedChangesArray`, which handles all the control logic. This class has a minimal external API and is completely abstracted from h5py.

There are also other major design improvements. Namely, in master `__setitem__` first loads all impacted chunks into memory, on the off chance that they are not wholly covered by the parameter index, and then updates them. In this PR, this happens exclusively for the chunks that are not wholly covered by the index of `__setitem__`. In other words, `a[:] = 1` in master loads the whole dataset into memory if it's not in memory already, whereas in this PR it never touches the disk.
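As a sketch of the idea (hypothetical helper, 1-D for brevity; in the PR the real logic lives in `StagedChangesArray`), a chunk only needs a disk read when the `__setitem__` index covers it partially:

```python
def setitem_plan(size, chunk_size, start, stop):
    """Toy 1-D version of the idea: classify the chunks touched by a
    contiguous selection into 'overwrite whole chunk' (no disk read)
    vs 'read-modify-write' (must be loaded first)."""
    whole, partial = [], []
    for c in range(start // chunk_size, (stop - 1) // chunk_size + 1):
        c0 = c * chunk_size
        c1 = min(c0 + chunk_size, size)
        if start <= c0 and c1 <= stop:
            whole.append(c)    # fully covered: just overwrite in memory
        else:
            partial.append(c)  # partially covered: load, then update
    return whole, partial

# a[:] = 1 on a 100-element dataset with 10-element chunks: nothing is read.
print(setitem_plan(100, 10, 0, 100))  # ([0, 1, ..., 9], [])
# a[5:95] = 1: only the two edge chunks are loaded from disk.
print(setitem_plan(100, 10, 5, 95))   # ([1, 2, ..., 8], [0, 9])
```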
Analogously, in master `resize()` reads the whole dataset from disk. This no longer happens in this PR, which only loads the edge chunks if the shape is not exactly divisible by chunk_size.
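For illustration (the helper below is made up, not the PR's code), the chunks that such a `resize()` must read back are exactly those straddling the new boundary along each axis:

```python
def edge_chunks(new_shape, chunk_size):
    """Per axis: index of the chunk that straddles the new boundary,
    or None when the new size is an exact multiple of the chunk size."""
    return [
        None if size % chunk == 0 else size // chunk
        for size, chunk in zip(new_shape, chunk_size)
    ]

# Resizing to (25, 40) with (10, 10) chunks: only the chunks in chunk-row 2
# straddle the edge and must be read back; the second axis needs nothing.
print(edge_chunks((25, 40), (10, 10)))  # [2, None]
```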
Status

- `__getitem__`
- `__setitem__`
- `resize()`
- `__getitem__` by hand
- `__setitem__` by hand
- `resize()` by hand

Demo
See demo.ipynb.

Benchmarks

See benchmarks.ipynb.

Early benchmarks show up to 500x speedup in the control code and up to 2000x reduction in the number of calls to `h5py.Dataset.__getitem__`.