
h5py reference counting bug with release 1.3.7 #216

Open
ArvidJB opened this issue Feb 1, 2022 · 5 comments

@ArvidJB (Collaborator) commented Feb 1, 2022

With the changes in 1.3.7 we are sporadically getting the following error:

  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/codemill/bessen/versioned_hdf5_env/lib64/python3.7/site-packages/h5py/_hl/files.py", line 461, in __exit__
    self.close()
  File "/codemill/bessen/versioned_hdf5_env/lib64/python3.7/site-packages/h5py/_hl/files.py", line 431, in close
    id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 269, in h5py.h5f.get_obj_ids
RuntimeError: Can't increment id ref count (can't locate ID)

I have tracked this down to commit e2509ae, which changes the way Hashtable is cached.

  1. Why did we need the lru_cache in the first place? I assume this addresses some performance problem?
  2. Is caching anything containing a reference to the h5py.File f invalid anyway?
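To make question 2 concrete: here is a minimal, hypothetical sketch (invented names, not the actual versioned_hdf5 code) of how an lru_cache entry can keep h5py objects alive past the file's with-block:

from functools import lru_cache

import h5py
import numpy as np

# Hypothetical stand-in for the cached Hashtable: the cached value holds
# a reference to an h5py object that is only valid while this particular
# File instance is open.
@lru_cache(maxsize=None)
def load_hashtable(f):
    return f['x']

with h5py.File('foo.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(4))
    load_hashtable(f)

# The with-block has closed the file, but the cache still holds both the
# File and the dataset. When those stale ids get cleaned up is left to
# the garbage collector, which can race with a later File.close() walking
# open ids via h5f.get_obj_ids.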

Below is a reproducer which seems to cause this to happen with high probability:

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

ds_names = ['ds{}'.format(i) for i in range(10)]
num_versions = 101


def f1():
    # Create the file with an initial version containing ten small datasets.
    with h5py.File('foo.h5', 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('0') as sv:
            data_group = sv.create_group('data')
            for name in ds_names:
                data_group.create_dataset(name, data=np.arange(4))


def f2():
    # Stage 100 more versions, resizing and rewriting every dataset each time.
    for i in range(1, num_versions):
        with h5py.File('foo.h5', 'r+') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version(str(i)) as sv:
                data_group = sv['data']
                for name in ds_names:
                    ds = data_group[name]
                    ds.resize((i + 4,))
                    ds[:] = np.arange(i, 2 * i + 4)


def f3():
    # Collect the names of all committed versions.
    with h5py.File('foo.h5', 'r') as f:
        versions = f['_version_data/versions']
        version_names = list(versions.keys())
        return version_names


def f4(version_names):
    # Read every dataset back from every committed version.
    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version_name in version_names:
            if version_name != '__first_version__':
                cv = vf[version_name]
                data_group = cv['data']
                for name in ds_names:
                    ds = data_group[name]
                    _ = ds[:]


if __name__ == '__main__':
    f1()
    f2()
    # Alternating reads via f3/f4 repeatedly reopens the file; the
    # RuntimeError shows up sporadically on one of the file closes.
    for _ in range(10):
        version_names = f3()
        f4(version_names)
        version_names = f3()
        f4(version_names)
@asmeurer (Collaborator) commented Feb 1, 2022

Yes, it was for performance. Otherwise we have to reload the hashtable from the file every time we write to a given dataset. However, for me, I still get the error even if I remove the caching.

It looks like this is the same sort of bug as #162. The issue doesn't occur for me in h5py version 3.4.0.

IIRC, the conclusion from that issue is that the bug is harmless because it only happens when the file is being closed, so you can work around it by catching it when closing the file, similar to #178.
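For reference, that workaround would look roughly like this (a sketch of the pattern discussed in #178, not a verbatim copy):

import h5py

def close_file(f):
    # The RuntimeError is raised from File.close() itself and is harmless
    # at that point, so swallow only that specific failure.
    try:
        f.close()
    except RuntimeError as e:
        if "Can't increment id ref count" not in str(e):
            raise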

@ArvidJB (Collaborator, Author) commented Feb 1, 2022

> Yes, it was for performance. Otherwise we have to reload the hashtable from the file every time we write to a given dataset. However, for me, I still get the error even if I remove the caching.

Let me try again. The bug does not happen every time, so maybe I just got a false negative.

> It looks like this is the same sort of bug as #162. The issue doesn't occur for me in h5py version 3.4.0.

I think this might be our best option here; we should upgrade to a newer version of h5py anyway.

@asmeurer (Collaborator) commented Feb 1, 2022

With your reproducer script and h5py 2.10, I also see this in 1.3.6 and 1.3.5. I suspect the spuriousness of the issue led to an invalid git bisection. If it's the same issue as before, small changes can make the problem appear to go away, but only because they happen to change the state of the garbage collector in a way that avoids the problem (it's more or less a race condition).
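To illustrate what "change the state of the garbage collector" means in practice, a hypothetical perturbation like the following can make the error appear or disappear without fixing anything, which is exactly what makes bisecting on this symptom unreliable:

import gc

import h5py

f = h5py.File('foo.h5', 'r')
grp = f['_version_data/versions']  # an h5py object with its own id
del grp
# Forcing (or skipping) a collection here changes which ids are still
# pending cleanup when close() runs; in a race like this, that alone
# can flip the outcome.
gc.collect()
f.close()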

@asmeurer (Collaborator) commented Feb 1, 2022

If anything, we may want to bisect the change in h5py that fixes this. Let me know if you want me to do this. I don't know if the fix for the other problem is exactly what fixed this, or if it was some other change. IIRC you were backporting the relevant h5py fix from the other issue, but I could be wrong about that.

@ericdatakelly ericdatakelly added this to the February 2022 milestone Feb 14, 2022
@ericdatakelly ericdatakelly removed this from the February 2022 milestone Mar 9, 2022
@magsol magsol added this to the April 2022 milestone Mar 23, 2022
@magsol commented Nov 22, 2022

@ArvidJB @asmeurer Is this still a priority? If not, should we close this issue, or put it in the backlog?
