Lock-related problem on travis-ci but not on local machine #2560

Closed
horta opened this issue Nov 21, 2018 · 10 comments

Comments

@horta
Contributor

horta commented Nov 21, 2018

There is a KeyError at xarray.backends.file_manager.CachingFileManager for the acquire method on these lines:

        with self._lock:
            try:
                file = self._cache[self._key]

which does not happen when I run the tests on my macOS machine.

Let me know if you require further testing on my part.

Full log:

=================================== FAILURES ===================================
________________________________ test_qtl_xarr _________________________________

self = <xarray.backends.file_manager.CachingFileManager object at 0x7f59e2cab978>

    def acquire(self):
        """Acquiring a file object from the manager.

        A new file is only opened if it has expired from the
        least-recently-used cache.

        This method uses a reentrant lock, which ensures that it is
        thread-safe. You can safely acquire a file in multiple threads at the
        same time, as long as the underlying file object is thread-safe.

        Returns
        -------
        An open file object, as returned by ``opener(*args, **kwargs)``.
        """
        with self._lock:
            try:
>               file = self._cache[self._key]

miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/file_manager.py:137:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xarray.backends.lru_cache.LRUCache object at 0x7f5a04a9c630>
key = [<function _open_netcdf4_group at 0x7f5a04abc1e0>, ('/tmp/tmprj8lwlat/xarr.hdf5', CombinedLock([<SerializableLock: 447...>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', '/foo/chr1'), ('persist', False))]

    def __getitem__(self, key):
        # record recent use of the key by moving it to the front of the list
        with self._lock:
>           value = self._cache[key]
E           KeyError: [<function _open_netcdf4_group at 0x7f5a04abc1e0>, ('/tmp/tmprj8lwlat/xarr.hdf5', CombinedLock([<SerializableLock: 4471090f-419e-49bd-9d27-4156e09febdd>, <SerializableLock: b9eb5ba1-2830-402e-b48f-3ef45768b7c2>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', '/foo/chr1'), ('persist', False))]

miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/lru_cache.py:43: KeyError

During handling of the above exception, another exception occurred:

    def test_qtl_xarr():
        with limix.example.file_example("xarr.hdf5.bz2") as filepath:
            filepath = limix.sh.extract(filepath, verbose=False)
            sample_ids = limix.io.hdf5.fetch(filepath, "/foo/chr1/col_header/sample_ids")

            rsid = dict()
            for i in range(1, 3):
                rsid[i] = limix.io.hdf5.fetch(
                    filepath, "/foo/chr{}/row_header/rsid".format(i)
                )

            G = []
            for i in range(1, 3):
>               g = xr.open_dataset(filepath, "/foo/chr{}".format(i))["matrix"]

miniconda/envs/test-environment/lib/python3.6/site-packages/limix/qtl/test/test_qtl_xarr.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/api.py:320: in open_dataset
    filename_or_obj, group=group, lock=lock, **backend_kwargs)
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/netCDF4_.py:355: in open
    return cls(manager, lock=lock, autoclose=autoclose)
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/netCDF4_.py:314: in __init__
    self.format = self.ds.data_model
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/netCDF4_.py:359: in ds
    return self._manager.acquire().value
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/file_manager.py:143: in acquire
    file = self._opener(*self._args, **kwargs)
miniconda/envs/test-environment/lib/python3.6/site-packages/xarray/backends/netCDF4_.py:247: in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
netCDF4/_netCDF4.pyx:2135: in netCDF4._netCDF4.Dataset.__init__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: [Errno -101] NetCDF: HDF error: b'/tmp/tmprj8lwlat/xarr.hdf5'

netCDF4/_netCDF4.pyx:1752: OSError
===================== 1 failed, 50 passed in 65.82 seconds =====================

The command "bash <(curl -fsSL https://raw.githubusercontent.com/horta/ci/master/travis.sh)" exited with 1.




Done. Your build exited with 1.
@shoyer
Member

shoyer commented Nov 22, 2018

Thanks for the report!

This might be an xarray issue, or it might be a netCDF4 issue -- I wonder if you have different versions of libnetcdf installed locally and on Travis-CI?

@shoyer
Member

shoyer commented Nov 22, 2018

Could you share a link to the file in which these tests are defined? I couldn't find it on the master branch.

@horta
Contributor Author

horta commented Nov 22, 2018

@jnhansen

jnhansen commented Dec 7, 2018

I am having the exact same problem. Have you found a solution/workaround?

@shoyer
Member

shoyer commented Dec 7, 2018

If you have a simpler example, that would make things easier to debug. The error you're seeing here is basically "HDF5 encountered an error", which could happen for any number of reasons.
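
Reading the traceback, the KeyError at the top looks like an ordinary cache miss: acquire() catches it and calls the opener, and the OSError raised by the opener is the actual failure. A toy sketch of that pattern follows. ToyFileManager, failing_opener and the path are hypothetical stand-ins for illustration, not xarray's implementation.

```python
class ToyFileManager:
    """Hypothetical, simplified stand-in for a caching file manager."""

    def __init__(self, opener, *args):
        self._opener = opener
        self._args = args
        self._key = (opener, args)
        self._cache = {}

    def acquire(self):
        try:
            # A cache miss raises KeyError; this is expected on first use.
            return self._cache[self._key]
        except KeyError:
            # If the opener fails here, Python reports both exceptions,
            # joined by "During handling of the above exception...".
            file = self._opener(*self._args)
            self._cache[self._key] = file
            return file


def failing_opener(path):
    # Hypothetical opener that mimics the HDF5 failure from the log.
    raise OSError(-101, "NetCDF: HDF error", path)


manager = ToyFileManager(failing_opener, "/tmp/example.hdf5")
try:
    manager.acquire()
except OSError as err:
    # The earlier KeyError is attached as the implicit context.
    chained = err.__context__
    print(type(chained).__name__)  # KeyError
```

So the part of the log worth debugging is the final OSError, not the KeyError.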

@horta
Contributor Author

horta commented Dec 8, 2018

Sorry guys. I've found the problem and solution.

The problem is that the filesystem does not support the file-locking mechanism. The solution is to set the following environment variable: export HDF5_USE_FILE_LOCKING=FALSE.
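
For setups where editing the CI shell script is inconvenient, the same variable can be set from Python. This is a minimal sketch, assuming the variable is set before any HDF5-backed library opens a file; the helper name is hypothetical.

```python
import os


def disable_hdf5_file_locking():
    """Set HDF5_USE_FILE_LOCKING=FALSE for this process and its children."""
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"


# To be safe, call this before importing netCDF4, h5py or xarray,
# e.g. at the very top of a test configuration module.
disable_hdf5_file_locking()
```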

@shoyer
Member

shoyer commented Dec 9, 2018

OK, I think I understand what's going on here and why this issue only appears with xarray v0.11.

The problem is related to how you are opening files with xarray but not closing them:
https://github.com/limix/limix/blob/8bc0861035cc60b3ce7bcbf7f147bcc828580828/limix/qtl/test/test_qtl_xarr.py#L20

With v0.11, xarray introduced a least-recently-used cache for netCDF files. This means xarray's LRUCache is holding on to a reference to your files, so they never get garbage collected and automatically closed by netCDF4-Python. Hence the HDF5 locks never get released.
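
One way to avoid relying on garbage collection is to close each dataset explicitly. The sketch below assumes the group layout from the failing test ("/foo/chr1", "/foo/chr2", each with a "matrix" variable); the function name and the n_groups parameter are hypothetical.

```python
def load_matrices(filepath, n_groups=2):
    """Load the "matrix" variable from each group, closing each file promptly."""
    # Imported lazily so the sketch can be read without xarray installed.
    import xarray as xr

    matrices = []
    for i in range(1, n_groups + 1):
        # Leaving the with-block closes the dataset and its underlying file.
        with xr.open_dataset(filepath, group="/foo/chr{}".format(i)) as ds:
            # .load() reads the data into memory before the file is closed.
            matrices.append(ds["matrix"].load())
    return matrices
```

Loading eagerly trades memory for a file handle that is released as soon as each group has been read.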

Although it is indeed good practice to always explicitly close files, I think we should explore using weak references to keep track of files in our cache so we don't hold on to them for longer than necessary.
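
To illustrate the difference between strong and weak references in a cache, here is a toy sketch using only the standard library. FakeFile is a hypothetical stand-in for a file handle that closes itself when collected; this is not how xarray's cache is implemented, and the immediate cleanup shown relies on CPython's reference counting.

```python
import weakref


class FakeFile:
    """Hypothetical file handle that records when it is finalized."""

    closed_names = []

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Stands in for the automatic close that happens on collection.
        FakeFile.closed_names.append(self.name)


strong_cache = {}                           # keeps its values alive
weak_cache = weakref.WeakValueDictionary()  # does not keep values alive

a = FakeFile("a.nc")
b = FakeFile("b.nc")
strong_cache["a"] = a
weak_cache["b"] = b

# Drop the user's own references, as happens when a dataset goes out of scope.
del a, b

print(FakeFile.closed_names)  # on CPython: ['b.nc']
print("a" in strong_cache)    # True: the strong cache still holds the file
print("b" in weak_cache)      # False: the weak entry vanished with the file
```

Only the weakly cached file is finalized; the strongly cached one stays open until it is evicted or closed explicitly.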

@shoyer
Member

shoyer commented Dec 9, 2018

If either of you have time, it would be great if you could test out #2595 to see if that resolves your issue without requiring the environment variable.

@jnhansen

jnhansen commented Dec 9, 2018

I have done a bit more testing on this, and I believe the issue may not necessarily be with xarray but with rasterio (disclaimer: I haven't tested your pull request yet).

I can reproduce the following on Ubuntu and on Travis CI. On Mac OS none of these errors occur.

Minimum example:

import xarray as xr
import numpy as np
import rasterio  # The rasterio import makes the last line of this code fail.
ds = xr.Dataset()
ds['data'] = (('y', 'x'), np.ones((10, 10)))
ds.to_netcdf('test.nc', engine='netcdf4')

I was able to fix the error by prepending a

import netCDF4

at the very top of the script.

A very similar thing happens with engine='h5netcdf'. The same script works without rasterio, fails with rasterio, and can be fixed by inserting import h5netcdf at the top of the script.

@shoyer
Member

shoyer commented Dec 10, 2018

See also #2535 for rasterio/netCDF4 issues.
