Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of create_virtual_dataset for h5py 3 #232

Merged
merged 3 commits into from
Aug 2, 2022

Conversation

asmeurer
Copy link
Collaborator

Using VirtualLayout and VirtualSource directly is slow for a few reasons. The
main reason is that VirtualSource does a deepcopy of itself every time you
slice it. The high-level code also calls the internal helper
_convert_space_for_key(), which is unnecessary for our given slices (it only
does anything for unbounded slices), and recreates a selection object each
time, which we can do just once.

Any further performance improvements here would have to come from speeding up
the h5py select() function, which is where most of the time is now spent. I have not yet investigated if there are any obvious gains that can be made here.

I am still investigating if any of these improvements can be upstreamed to h5py itself so that these workarounds aren't necessary and we can just use the normal high-level APIs.

This takes the example in #226 from 3.66 seconds to 2.77 seconds on my machine. In h5py 2.10, the example takes 2.52 seconds via the old optimizations.

Using VirtualLayout and VirtualSource directly is slow for a few reasons. The
main reason is that VirtualSource does a deepcopy of itself every time you
slice it. The high-level code also calls the internal helper
_convert_space_for_key(), which is unnecessary for our given slices (it only
does anything for unbounded slices), and recreates a selection object each
time, which we can do just once.

Any further performance improvements here would have to come from speeding up
the h5py select() function, which is where most of the time is now spent.
@magsol
Copy link

magsol commented Jul 5, 2022

@ArvidJB @quarl Have you had a chance to test this out?

@ArvidJB
Copy link
Collaborator

ArvidJB commented Jul 11, 2022

Thanks, this looks much better. On my machine the example from #226 now runs in 1.81s instead of 2.7s.
Can we apply the same fix in replay.py (_recreate_virtual_dataset) as well?

@magsol magsol added this to the July 2022 milestone Jul 14, 2022
@asmeurer
Copy link
Collaborator Author

So now the latest version of h5py is causing the tests to segfault. I'm not sure what the h5py 3.6 failure is about. I'm not able to reproduce it locally.

@ArvidJB
Copy link
Collaborator

ArvidJB commented Jul 21, 2022

What's the error you are seeing? We noticed that there's a bug in h5py version 3.7 with fillvalue handling which is fixed by this PR: h5py/h5py#2111

@asmeurer
Copy link
Collaborator Author

The errors are on the CI builds here. h5py 3.6 is giving this error (which I cannot reproduce locally):

_________________________ test_mask_reading_read_only __________________________
tmp_path = PosixPath('/tmp/pytest-of-runner/pytest-0/test_mask_reading_read_only0')
    @mark.xfail(h5py.__version__[0] == '2'
                or h5py.__version__[0] == '3' and int(h5py.__version__[2]) < 3,
                reason='h5py 2 does not support masks on virtual datasets')
    def test_mask_reading_read_only(tmp_path):
        # Reading a virtual dataset with a mask does not work in HDF5, so make
        # sure it still works for versioned datasets.
        file_name = os.path.join(tmp_path, 'file.hdf5')
        mask = np.array([True, True, False], dtype='bool')
        with h5py.File(file_name, 'w') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version('r0') as sv:
                sv.create_dataset('bar', data=[1, 2, 3], chunks=(2,))
                b = sv['bar'][mask]
                assert_equal(b, [1, 2])
            b = vf['r0']['bar'][mask]
            assert_equal(b, [1, 2])
        with h5py.File(file_name, 'r') as f:
            vf = VersionedHDF5File(f)
            sv = vf['r0']
>           b = sv['bar'][mask]
versioned_hdf5/tests/test_api.py:1839: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/h5py/_hl/dataset.py:793: in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
h5py/h5d.pyx:192: in h5py.h5d.DatasetID.read
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   OSError: Can't read data (point selections not currently supported)
h5py/_proxy.pyx:112: OSError

h5py 3.7 is segfaulting (which I can reproduce)

versioned_hdf5/tests/test_api.py::test_InMemoryArrayDataset_chunks PASSED [  2%]
Fatal Python error: Segmentation fault
Current thread 0x00007f0eff3b5740 (most recent call first):
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/wrappers.py", line 1133 in __init__
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/wrappers.py", line 450 in __init__
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/wrappers.py", line 111 in _add_to_data
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/wrappers.py", line 100 in __setitem__
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/versions.py", line 53 in _get
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/h5py/_hl/group.py", line 635 in proxy
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/h5py/_hl/group.py", line 636 in visititems
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/versions.py", line 59 in create_version_group
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/api.py", line 196 in stage_version
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/contextlib.py", line 112 in __enter__
  File "/home/runner/work/versioned-hdf5/versioned-hdf5/versioned_hdf5/tests/test_api.py", line 1457 in test_string_dtypes
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/python.py", line 1761 in runtest
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 338 in from_call
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/main.py", line 322 in _main
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/config/__init__.py", line 165 in main
  File "/usr/share/miniconda/envs/test-environment/lib/python3.7/site-packages/_pytest/config/__init__.py", line [187](https://github.com/deshaw/versioned-hdf5/runs/7438555480?check_suite_focus=true#step:5:188) in console_main
  File "/usr/share/miniconda/envs/test-environment/bin/pytest", line 11 in <module>
/home/runner/work/_temp/2d77edc5-a551-4ff2-8ceb-1e5dd73f10c8.sh: line 25:  3551 Segmentation fault      (core dumped) pytest $PYTEST_FLAGS

@asmeurer
Copy link
Collaborator Author

Looks like that might be the same issue but I'll check to be sure.

@ArvidJB
Copy link
Collaborator

ArvidJB commented Aug 2, 2022

Hi @asmeurer: did you get to the bottom of those segfaults?
Even if not, can you create a release? I don't see any issues when I test out your changes locally so it should be okay, afaict.

@asmeurer
Copy link
Collaborator Author

asmeurer commented Aug 2, 2022

I think the issues are upstream h5py problems, so we should be OK to release. I wasn't able to take a look at them.

@asmeurer asmeurer merged commit f49da4f into deshaw:master Aug 2, 2022
@asmeurer asmeurer mentioned this pull request Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants