
Redesign InMemoryDataset data model #370

Draft
crusaderky wants to merge 5 commits into master from inmemorydataset_v2

Conversation

@crusaderky (Collaborator) commented Aug 26, 2024

Note: This PR is still missing some components - see checklist below.

versioned_hdf5.wrappers.InMemoryDataset is plagued by slow control code: the control logic itself is very slow, and it calls h5py.Dataset.__getitem__ separately for every single chunk.

The new control logic introduced in this PR aims to minimise the number of calls to h5py.Dataset.__getitem__ by selecting large rectangles of unmodified data spanning multiple chunks.

This new design introduces a helper class, StagedChangesArray, which handles all the control logic. This class has minimal external API and is completely abstracted from h5py.
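A rough sketch of the staging idea in one dimension (the names and structure here are illustrative only, not the actual StagedChangesArray API):

```python
import numpy as np


class StagedChunks1D:
    """1-D toy model: modified chunks are staged in memory; everything
    else stays on the base storage until it is actually read."""

    def __init__(self, base: np.ndarray, chunk_size: int):
        self.base = base              # stands in for the h5py.Dataset
        self.chunk_size = chunk_size
        self.staged = {}              # chunk index -> in-memory chunk

    def read(self) -> np.ndarray:
        # One bulk read covers all the unmodified data, instead of one
        # read per chunk; staged chunks are then overlaid in memory.
        out = np.asarray(self.base).copy()
        for i, chunk in self.staged.items():
            start = i * self.chunk_size
            out[start:start + len(chunk)] = chunk
        return out
```

The real class additionally has to select large rectangles of unmodified data in N dimensions; the point of the sketch is only that base storage is read in bulk and staged chunks are applied on top.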

There are also other major design improvements. Notably, in master, __setitem__ first loads all impacted chunks into memory, on the off chance that they are not wholly covered by the index, and then updates them. In this PR, only the chunks that are not wholly covered by the __setitem__ index are loaded.
In other words, a[:] = 1 in master loads the whole dataset into memory if it's not in memory already, whereas in this PR it never touches the disk.
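The distinction can be sketched for a 1-D contiguous slice (a hypothetical helper, not code from this PR):

```python
def chunks_needing_load(start, stop, chunk_size, size):
    """Chunk indices only partially covered by the slice [start, stop),
    i.e. the chunks that must be read from storage before being updated.
    Fully covered chunks can be overwritten without touching the disk."""
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    partial = []
    for c in range(first, last + 1):
        c_start = c * chunk_size
        c_stop = min(c_start + chunk_size, size)
        if start > c_start or stop < c_stop:
            partial.append(c)
    return partial
```

For a[:] = 1 the slice covers every chunk entirely, so the list is empty and nothing is read from disk.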

Analogously, in master resize() reads the whole dataset from disk. This no longer happens in this PR, which only loads the edge chunks if the shape is not exactly divisible by chunk_size.
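The edge-chunk condition reduces to a divisibility check per axis; a hedged sketch (hypothetical helper name):

```python
def edge_axes_to_load(old_shape, new_shape, chunks):
    """Axes along which enlarging the dataset requires reading the
    trailing edge chunks back from storage: only when the old extent
    cuts through a chunk boundary does the edge chunk hold partial
    data that must be merged with the newly added fill."""
    return [
        axis
        for axis, (old, new, c) in enumerate(zip(old_shape, new_shape, chunks))
        if new > old and old % c != 0
    ]
```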

Status

  • implement __getitem__
  • implement __setitem__
  • implement resize()
  • crudely test __getitem__ by hand
  • crudely test __setitem__ by hand
  • crudely test resize() by hand
  • integrate with wrappers.py
  • all existing unit tests pass
  • new unit tests to stress the new classes in isolation
  • clean up legacy code
  • high level documentation
  • break PR up into stages to simplify code review

Demo

See demo.ipynb.

Benchmarks

See benchmarks.ipynb.
Early benchmarks show up to 500x speedup in the control code and up to 2000x reduction in the number of calls to h5py.Dataset.__getitem__.

@crusaderky crusaderky self-assigned this Aug 26, 2024

@cython.cclass
class IndexChunkMapper:
@crusaderky (Collaborator, Author) commented:

This class and all of its subclasses are just a refactor of the legacy code, with no logic changes.
I've rewired the legacy as_subchunk_map to use the refactored code.
There are minor additions, which are not used by the legacy as_subchunk_map:

  • new methods chunks_indexer and whole_chunks_indexer
  • the method chunks_submap allows selecting more than one contiguous chunk at a time, but as_subchunk_map will only ever ask for one
  • the same method raises an exception as an early exit when none of the requested chunks is impacted by the index; by definition, this never happens with the legacy as_subchunk_map
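An illustrative 1-D counterpart to the chunks_indexer / whole_chunks_indexer pair (the signature and behaviour are guesses for illustration, not the actual methods):

```python
def chunks_indexer(sl: slice, chunk_size: int, size: int):
    """Return (touched, whole): chunk indices that a slice intersects,
    and those whose every point it selects (only possible with step 1)."""
    start, stop, step = sl.indices(size)
    touched = sorted({i // chunk_size for i in range(start, stop, step)})
    whole = [
        c for c in touched
        if step == 1
        and start <= c * chunk_size
        and stop >= min((c + 1) * chunk_size, size)
    ]
    return touched, whole
```

Wholly covered chunks matter because they can be overwritten without being loaded first.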

@@ -336,3 +674,861 @@ def as_subchunk_map(
chunk_subidxs = chunk_subidxs[:idx_len]

yield Tuple(*chunk_slices), arr_subidxs, chunk_subidxs

@crusaderky (Collaborator, Author) commented:

Refactor ends here. Everything below this line is new code that's not been integrated yet.
I must point out that the code cythonizes very poorly. This can be improved in a later iteration.

return [(tuple(idx), value_idx) for *idx, value_idx in zip(*indices, values_idx)]


def _hypercubes_between(
@crusaderky (Collaborator, Author) commented:

I'm not happy with this design. It performs particularly badly in some edge cases, e.g. it's great after you call
a[::2 * chunk_size, :] = 1
but it's awful after
a[:, ::2 * chunk_size] = 1
because in the latter case you'll end up loading one chunk at a time.

It's also problematically slow when you have most of the dataset staged in memory, e.g. after a[:] = 1

For the sake of moving things forward, I would like to deliver this as is and rework it later with a smarter algorithm that can generate long, narrow rectangles.
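The asymmetry between the two cases can be made concrete with a toy proxy: scanning the chunk grid row-major, each maximal run of unmodified chunks is roughly one read. Striding the modified chunks along axis 0 leaves a few long runs; striding along axis 1 shatters every row (hypothetical helper, not code from this PR):

```python
import numpy as np


def count_row_major_runs(modified: np.ndarray) -> int:
    """Count maximal runs of unmodified chunks in row-major order,
    a rough proxy for how many reads a naive rectangle search issues."""
    flat = modified.ravel()
    if flat.size == 0:
        return 0
    # A run starts at an unmodified cell whose predecessor is modified,
    # plus at cell 0 if it is itself unmodified.
    runs = int(not flat[0])
    runs += int(np.sum(~flat[1:] & flat[:-1]))
    return runs
```

On a 4x4 chunk grid, marking every other chunk row modified yields 2 runs, while marking every other chunk column yields 8: one tiny read per row per gap, which is the "one chunk at a time" behaviour described above.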

idx_to[1:],
shape[1:],
):
yield (idx_to[0], *ni), (idx_to[0] + 1, *nj)
@crusaderky (Collaborator, Author) commented Aug 26, 2024:

yield, tuples, etc. all compile in Cython but give you ZERO performance benefit over pure Python.
Hacks to keep execution in C space for longer are very much possible, but at the cost of some complication. I would like to keep them as a later deliverable, only if we want to squeeze extra speed out of this.

return idx, mappers


def as_subchunk_map(
@crusaderky (Collaborator, Author) commented:

This is just a temporary refactor, for my own sanity (to verify early on that I didn't break anything fundamental) and for the sanity of reviewers. It will be completely removed in the final version.

@crusaderky (Collaborator, Author) commented:
CC @ArvidJB @peytondmurray

@crusaderky crusaderky force-pushed the inmemorydataset_v2 branch 14 times, most recently from 514d33e to 96eebda Compare September 2, 2024 18:36
@crusaderky crusaderky force-pushed the inmemorydataset_v2 branch 8 times, most recently from e53cfda to 52170dc Compare September 5, 2024 14:33
@crusaderky crusaderky force-pushed the inmemorydataset_v2 branch 23 times, most recently from 84e172d to 1e58e69 Compare September 22, 2024 13:46