Suboptimal performance after `__setitem__` on a column of chunks #375

crusaderky · 2024-09-16T07:43:56Z

This is a follow-up to #370 and more specifically #371.

Downstream of #370, this is very fast:

    dset[::2 * chunk_size, :] = a
    b = dset[:]

While this is much slower:

    dset[:, ::2 * chunk_size] = a
    b = dset[:]

The difference is that in the first case `fill_hyperspace` returns contiguous slabs of chunks to read directly from the underlying HDF5 layer, so a whole row of chunks is loaded with a single call to `hdf5.Dataset.__getitem__`. In the second case, `fill_hyperspace` fails to notice that an optimization is possible, so `hdf5.Dataset.__getitem__` ends up being called once for every individual chunk.

In other words,

        Input   Current behaviour Optimal behaviour
      0123456789     0123456789      0123456789
    0 X.X.X.X.X.   0 XaXbXcXdXe    0 XaXbXcXdXe
    1 X.X.X.X.X.   1 XfXgXhXiXj    1 XaXbXcXdXe
    2 X.X.X.X.X.   2 XkXlXmXnXo    2 XaXbXcXdXe
    3 X.X.X.X.X.   3 XpXqXrXsXt    3 XaXbXcXdXe
    4 X.X.X.X.X.   4 XuXvXwXyXz    4 XaXbXcXdXe

In the above diagram, the Xs are the chunks that are already in memory because of the previous call to `dset[:, ::2 * chunk_size] = a`, whereas each lowercase letter represents a separate call to `hdf5.Dataset.__getitem__` when the user invokes `b = dset[:]`: 25 calls with the current behaviour versus 5 with the optimal one.
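To make the call counts in the diagram concrete, here is a minimal NumPy sketch. It is not the actual `fill_hyperspace` implementation. The helper `count_runs` and the boolean `missing` grid are hypothetical names introduced for illustration. The sketch assumes that each maximal run of adjacent missing chunks along one axis can be served by a single read.

```python
import numpy as np


def count_runs(mask: np.ndarray, axis: int) -> int:
    """Count maximal runs of consecutive True values along `axis`.

    Hypothetical helper: each run stands for one contiguous slab,
    i.e. one read from the underlying HDF5 dataset.
    """
    m = np.moveaxis(mask, axis, -1)
    # A run starts wherever a True value is not preceded by another True
    # value along the chosen axis.
    starts = m.copy()
    starts[..., 1:] &= ~m[..., :-1]
    return int(starts.sum())


# Chunk grid from the diagram: 5 rows x 10 columns of chunks.
# True marks a chunk that is not in memory and must be read from disk.
missing = np.zeros((5, 10), dtype=bool)
missing[:, 1::2] = True

# Merging only within each row finds no neighbours: one read per chunk.
reads_row_wise = count_runs(missing, axis=1)
# Merging within each column turns every missing column into a single slab.
reads_col_wise = count_runs(missing, axis=0)

print(reads_row_wise, reads_col_wise)
```

Under these assumptions, row-wise merging corresponds to the "Current behaviour" panel and column-wise merging to the "Optimal behaviour" panel.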

crusaderky · 2024-10-31T13:55:08Z

Closed by #386

crusaderky mentioned this issue Sep 16, 2024

[InMemoryDataset redesign] fill_hyperspace() #371

Merged

crusaderky closed this as completed Oct 31, 2024
