Slow writing with many small chunks #167

ArvidJB · 2021-03-08T22:54:31Z

Unfortunately in our use case we often end up with suboptimal chunk sizes. Unversioned h5py is able to handle those without issues, but with versioned_hdf5 this turns out to be pretty slow:

dt = np.dtype('double')
d0 = 2
d1 = 15220
d2 = 2
chunks = (600, 2, 4)
with h5py.File('foo.h5', 'w') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version('0') as sv:
        sv.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                          chunks=chunks, dtype=dt,
                          data=np.full((d0, d1, d2), 0, dtype=dt))

start = time.time()
with h5py.File('foo.h5', 'r+') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version(str(i)) as sv:
        i2 = np.random.choice(d1, 30, replace=False)
        i2 = np.sort(i2)
        sv['bar'][:, i2, :] = np.full((d0, len(i2), d2), i, dtype=dt)
end = time.time()
print('writing: {}'.format(end - start))

This takes around 9 seconds for me to write 120 numbers.

A little bit of profiling points to two things:

The call to as_subchunks in InMemoryDataset.__setitem__

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   593                                               @with_phil
   594                                               @profile
   595                                               def __setitem__(self, args, val):
...
   700        78   24219378.0 310504.8     99.0          for c in self.chunks.as_subchunks(idx, self.shape):
...

where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?

The other slow spot is this loop in create_virtual_dataset:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   170                                           @profile
   171                                           def create_virtual_dataset(f, version_name, name, shape, slices, attrs=None, fillvalue=None):
....
   192     26127      50638.0      1.9      0.2          for c, s in slices.items():
   193     26124    1592688.0     61.0      6.6              if c.isempty():
   194                                                           continue
   195                                                       # idx = Tuple(s, *Tuple(*[slice(0, i) for i in shape[1:]]).as_subindex(Tuple(*c.args[1:])).args)
   196     26124    5472288.0    209.5     22.8              S = [Slice(0, shape[i], 1).as_subindex(c.args[i]) for i in range(1, len(shape))]
   197     26123    1725495.0     66.1      7.2              idx = Tuple(s, *S)
   198                                                       # assert c.newshape(shape) == vs[idx.raw].shape, (c, shape, s)
   199     26123   12892876.0    493.5     53.8              layout[c.raw] = vs[idx.raw]
...

Is it possible to speed this up? In this example we only change some very small subset of the data. If we could keep track of the changes we could probably copy the old virtual dataset and modify it appropriately?

The text was updated successfully, but these errors were encountered:

asmeurer · 2021-03-08T23:49:39Z

where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?

Right, this is a case that I forgot to implement. It should be straightforward to fix this, and speed this up quite a bit.

Is it possible to speed this up? In this example we only change some very small subset of the data. If we could keep track of the changes we could probably copy the old virtual dataset and modify it appropriately?

A good chunk of time is spend in the h5py virtual dataset code (the layout[c.raw] = vs[idx.raw]). I previously didn't look at this too closely, I can investigate more and see if this can be improved. The issue is in the h5py code, every call to vs[idx] calls deepcopy on the vs object. It seems likely that this can be done more efficiently.

See issue deshaw#167.

ilanschnell · 2021-11-04T16:32:43Z

As versioned HDF5 relays a lot on ndindex, my plan is to take Python's C implementation of the slice object, make it hashable and inheritable such that ndindex can use that instead.

ArvidJB · 2021-11-05T15:57:03Z

Is this (hashable slice object) for the construction of data_dict? In particular the construction of slice_map seems like it would benefit from this.

ilanschnell · 2021-11-05T17:46:36Z

The new object could be used in versioned HDF5 directly, but I want to (after creating the new object) first focus on using it within ndindex.

ericdatakelly added this to the March 2021 milestone Mar 22, 2021

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Mar 22, 2021

Add new benchmarks for writing with many chunks

844b548

See issue deshaw#167.

asmeurer mentioned this issue Mar 22, 2021

More performance improvements #168

Merged

ericdatakelly modified the milestones: March 2021, April 2021 Apr 5, 2021

ericdatakelly assigned asmeurer Apr 5, 2021

ericdatakelly modified the milestones: April 2021, May 2021 May 10, 2021

ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021

ericdatakelly modified the milestones: June 2021, July 2021 Jul 8, 2021

ericdatakelly modified the milestones: July 2021, August 2021 Aug 12, 2021

ericdatakelly modified the milestones: August 2021, September 2021 Sep 29, 2021

ericdatakelly modified the milestones: September 2021, October 2021 Oct 6, 2021

ericdatakelly assigned ilanschnell and unassigned asmeurer Nov 5, 2021

ericdatakelly modified the milestones: October 2021, November 2021 Dec 2, 2021

ericdatakelly modified the milestones: November 2021, December 2021 Dec 14, 2021

ilanschnell mentioned this issue Dec 16, 2021

add optional cythonization Quansight-Labs/ndindex#127

Merged

ericdatakelly removed this from the December 2021 milestone Jan 4, 2022

ericdatakelly added this to the January 2022 milestone Jan 4, 2022

ericdatakelly modified the milestones: January 2022, February 2022 Feb 14, 2022

ericdatakelly removed this from the February 2022 milestone Mar 9, 2022

peytondmurray assigned ilanschnell and unassigned ilanschnell Jun 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow writing with many small chunks #167

Slow writing with many small chunks #167

ArvidJB commented Mar 8, 2021

asmeurer commented Mar 8, 2021

ilanschnell commented Nov 4, 2021

ArvidJB commented Nov 5, 2021

ilanschnell commented Nov 5, 2021

Slow writing with many small chunks #167

Slow writing with many small chunks #167

Comments

ArvidJB commented Mar 8, 2021

asmeurer commented Mar 8, 2021

ilanschnell commented Nov 4, 2021

ArvidJB commented Nov 5, 2021

ilanschnell commented Nov 5, 2021