create_virtual_dataset is slow #226

Open
ArvidJB opened this issue May 3, 2022 · 1 comment
ArvidJB commented May 3, 2022

Writing virtual datasets seems to be pretty slow because of the calls to deepcopy in VirtualSource.__getitem__:

In [26]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         a = np.random.rand(1, 36, 26, 19)
    ...:         f.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             f['bar'].resize((i + 1, 36, 26, 19))
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             f['bar'][i, :, :, :] = a
    ...:
    ...:
CPU times: user 129 ms, sys: 8.01 ms, total: 137 ms
Wall time: 137 ms

In [27]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             vf = VersionedHDF5File(f)
    ...:             with vf.stage_version('v{i}'.format(i=i)) as sv:
    ...:                 sv['bar'].resize((i + 1, 36, 26, 19))
    ...:                 a = np.random.rand(1, 36, 26, 19)
    ...:                 sv['bar'][[i], ...] = a
    ...:
    ...:
    ...:
CPU times: user 2.65 s, sys: 49.3 ms, total: 2.7 s
Wall time: 2.7 s
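One way to confirm where the time goes is to run the hot loop under cProfile. The sketch below is pure Python with no h5py dependency: the dict is a hypothetical stand-in for the nested per-source state that gets deep-copied per chunk, not the real VirtualSource internals.

```python
import cProfile
import copy
import io
import pstats

def hot_loop(obj, n):
    # Stand-in for the per-chunk loop that calls deepcopy once per chunk.
    for _ in range(n):
        copy.deepcopy(obj)

# Hypothetical stand-in for a VirtualSource-like object's nested state.
obj = {"shape": (100, 36, 26, 19),
       "slices": [slice(0, s) for s in (100, 36, 26, 19)]}

pr = cProfile.Profile()
pr.enable()
hot_loop(obj, 1000)
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```

In a profile of the real staging loop, copy.deepcopy shows up the same way near the top of the cumulative-time listing.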

Looking at the code, it seems there was a performance optimization there that was broken by h5py 3.3 (h5py/h5py#1905). Is it possible to work around this performance regression?

ArvidJB commented May 20, 2022

I also ran into the same problem in _recreate_virtual_dataset, where deepcopy also dominates the runtime.

We can avoid the deepcopy if we replace

                layout[c.raw] = vs[idx.raw]

by

                # Build the VirtualSource once and set its selection directly,
                # bypassing VirtualSource.__getitem__ (which deep-copies self).
                vs = VirtualSource('.', name=raw_data.name, shape=raw_data.shape, dtype=raw_data.dtype)
                key = idx.raw
                vs.sel = select(vs.shape, key, dataset=None)
                _convert_space_for_key(vs.sel.id, key)
                layout[c.raw] = vs

which does seem to be faster. In the case I am currently debugging, the time drops from 749 s to 507 s. I think there is still room for a lot of improvement, though.
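As a rough illustration of why rebuilding wins (pure Python, no h5py; DummySource is a hypothetical stand-in, not the real VirtualSource class), constructing a small object from scratch skips the recursive object walk and memo bookkeeping that copy.deepcopy performs over its nested state:

```python
import copy
import timeit

class DummySource:
    """Hypothetical stand-in for a VirtualSource-like wrapper with nested state."""
    def __init__(self, path, name, shape, dtype):
        self.path = path
        self.name = name
        self.shape = shape
        self.dtype = dtype
        # Nested mutable state is what makes deepcopy walk many objects.
        self.sel = {"shape": shape, "slices": [slice(0, s) for s in shape]}

shape = (100, 36, 26, 19)
src = DummySource(".", "/bar", shape, "f8")

t_copy = timeit.timeit(lambda: copy.deepcopy(src), number=10_000)
t_new = timeit.timeit(lambda: DummySource(".", "/bar", shape, "f8"), number=10_000)
print(f"deepcopy: {t_copy:.3f}s  rebuild: {t_new:.3f}s")
```

The two paths produce equivalent objects; the only difference is how the per-chunk copy is obtained.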
