
h5py reference counting bug with release 1.3.7 #216

Open
ArvidJB opened this issue Feb 1, 2022 · 5 comments

@ArvidJB (Collaborator) commented Feb 1, 2022

With the changes in 1.3.7 we are sporadically getting the following error:

  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/codemill/bessen/versioned_hdf5_env/lib64/python3.7/site-packages/h5py/_hl/files.py", line 461, in __exit__
    self.close()
  File "/codemill/bessen/versioned_hdf5_env/lib64/python3.7/site-packages/h5py/_hl/files.py", line 431, in close
    id_list = h5f.get_obj_ids(self.id, ~h5f.OBJ_FILE)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 269, in h5py.h5f.get_obj_ids
RuntimeError: Can't increment id ref count (can't locate ID)

I have tracked this down to commit e2509ae, which changes the way Hashtable is cached.

  1. Why did we need the lru_cache in the first place? I assume this addresses some performance problem?
  2. Is caching anything containing a reference to the h5py.File f invalid anyway?
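To make question 2 concrete: here is a minimal, hypothetical sketch (invented names, not the actual versioned_hdf5 code) of how an lru_cache entry can keep h5py objects alive past the file's with-block:

from functools import lru_cache

import h5py
import numpy as np

# Hypothetical stand-in for the cached Hashtable: the cached value holds
# a reference to an h5py object that is only valid while this particular
# File instance is open.
@lru_cache(maxsize=None)
def load_hashtable(f):
    return f['x']

with h5py.File('foo.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(4))
    load_hashtable(f)

# The with-block has closed the file, but the cache still holds both the
# File and the dataset. When those stale ids get cleaned up is left to
# the garbage collector, which can race with a later File.close() walking
# open ids via h5f.get_obj_ids.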

Below is a reproducer which seems to cause this to happen with high probability:

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

ds_names = ['ds{}'.format(i) for i in range(10)]
num_versions = 101


def f1():
    # Create the file with an initial version containing ten small datasets.
    with h5py.File('foo.h5', 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('0') as sv:
            data_group = sv.create_group('data')
            for name in ds_names:
                data_group.create_dataset(name, data=np.arange(4))


def f2():
    # Stage 100 more versions, resizing and rewriting every dataset each time.
    for i in range(1, num_versions):
        with h5py.File('foo.h5', 'r+') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version(str(i)) as sv:
                data_group = sv['data']
                for name in ds_names:
                    ds = data_group[name]
                    ds.resize((i + 4,))
                    ds[:] = np.arange(i, 2 * i + 4)


def f3():
    # Collect the names of all committed versions.
    with h5py.File('foo.h5', 'r') as f:
        versions = f['_version_data/versions']
        version_names = list(versions.keys())
        return version_names


def f4(version_names):
    # Read every dataset back from every committed version.
    with h5py.File('foo.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        for version_name in version_names:
            if version_name != '__first_version__':
                cv = vf[version_name]
                data_group = cv['data']
                for name in ds_names:
                    ds = data_group[name]
                    _ = ds[:]


if __name__ == '__main__':
    f1()
    f2()
    # Alternating reads via f3/f4 repeatedly reopens the file; the
    # RuntimeError shows up sporadically on one of the file closes.
    for _ in range(10):
        version_names = f3()
        f4(version_names)
        version_names = f3()
        f4(version_names)
@asmeurer (Collaborator) commented Feb 1, 2022

Yes, it was for performance. Otherwise we have to reload the hashtable from the file every time we write to a given dataset. However, for me, I still get the error even if I remove the caching.

It looks like this is the same sort of bug as #162. The issue doesn't occur for me in h5py version 3.4.0.

IIRC, the conclusion from that issue is that the bug is harmless because it only happens when the file is being closed, so you can work around it by catching it when closing the file, similar to #178.
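For reference, that workaround would look roughly like this (a sketch of the pattern discussed in #178, not a verbatim copy):

import h5py

def close_file(f):
    # The RuntimeError is raised from File.close() itself and is harmless
    # at that point, so swallow only that specific failure.
    try:
        f.close()
    except RuntimeError as e:
        if "Can't increment id ref count" not in str(e):
            raise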

@ArvidJB (Collaborator, Author) commented Feb 1, 2022

> Yes, it was for performance. Otherwise we have to reload the hashtable from the file every time we write to a given dataset. However, for me, I still get the error even if I remove the caching.

Let me try again. The bug does not happen every time, so maybe I just got a false negative.

> It looks like this is the same sort of bug as #162. The issue doesn't occur for me in h5py version 3.4.0.

I think this might be our best option here; we should upgrade to a newer version of h5py anyway.

@asmeurer (Collaborator) commented Feb 1, 2022

With your reproducer script and h5py 2.10, I also see this in 1.3.6 and 1.3.5. I suspect the spuriousness of the issue led to an invalid git bisection. If it's the same issue as before, small changes can make the problem appear to go away, but only because they happen to change the state of the garbage collector in a way that avoids the problem (it's more or less a race condition).
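To illustrate what "change the state of the garbage collector" means in practice, a hypothetical perturbation like the following can make the error appear or disappear without fixing anything, which is exactly what makes bisecting on this symptom unreliable:

import gc

import h5py

f = h5py.File('foo.h5', 'r')
grp = f['_version_data/versions']  # an h5py object with its own id
del grp
# Forcing (or skipping) a collection here changes which ids are still
# pending cleanup when close() runs; in a race like this, that alone
# can flip the outcome.
gc.collect()
f.close()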

@asmeurer (Collaborator) commented Feb 1, 2022

If anything, we may want to bisect the change in h5py that fixes this. Let me know if you want me to do this. I don't know if the fix for the other problem is exactly what fixed this, or if it was some other change. IIRC you were backporting the relevant h5py fix from the other issue, but I could be wrong about that.

@ericdatakelly ericdatakelly added this to the February 2022 milestone Feb 14, 2022
@ericdatakelly ericdatakelly removed this from the February 2022 milestone Mar 9, 2022
@magsol magsol added this to the April 2022 milestone Mar 23, 2022
@magsol commented Nov 22, 2022

@ArvidJB @asmeurer Is this still a priority? If not, should we close this issue, or put it in the backlog?
