Replace FilePattern.is_opendap with generalized FilePattern.file_type #320

Closed
cisaacstern opened this issue Mar 7, 2022 · 8 comments · Fixed by #322

@cisaacstern
Member

XarrayZarrRecipe defaults to using the "h5netcdf" xarray backend.

When we are using default tempfile (i.e. local) storage, this will not work, as noted in #315 and pangeo-forge/staged-recipes#120.

To resolve this, we should adjust this logic

if "engine" not in kw:
kw["engine"] = "h5netcdf"

to default to engine="netcdf4" if the storage_config has not been changed from the default tempfile targets.
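
For example, something like this sketch (is_default_tempfile_target is a hypothetical helper, not existing API):

if "engine" not in kw:
    if is_default_tempfile_target(config.storage_config):
        # local files: the netcdf4 engine reads both netCDF3 and netCDF4
        kw["engine"] = "netcdf4"
    else:
        kw["engine"] = "h5netcdf"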

@rabernat
Contributor

rabernat commented Mar 7, 2022

I don't understand why this is necessary. Are the files really not openable by the h5netcdf engine?

We use local tempfiles in our test suite and have not noticed any problems.

@cisaacstern
Member Author

Good question. (xref #315 (comment))

When run in the sandbox, the recipes in #315 (comment) and pangeo-forge/staged-recipes#120 (comment) both error with

Traceback
OSError                                   Traceback (most recent call last)
/tmp/ipykernel_342/3080067870.py in <module>
    115 
    116 run_function = recipe_pruned.to_function()
--> 117 run_function()
    118 
    119 # ## Check the output

/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/executors/python.py in function()
     44                         stage.function(m, config=pipeline.config)
     45                 else:
---> 46                     stage.function(config=pipeline.config)
     47 
     48         return function

/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(config)
    462 
    463         for chunk_key in init_chunks:
--> 464             with open_chunk(chunk_key, config=config) as ds:
    465                 # ds is already chunked
    466 

/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
    117         del self.args, self.kwds, self.func
    118         try:
--> 119             return next(self.gen)
    120         except StopIteration:
    121             raise RuntimeError("generator didn't yield") from None

/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in open_chunk(chunk_key, config)
    337     # need to open an unknown number of contexts at the same time
    338     with ExitStack() as stack:
--> 339         dsets = [stack.enter_context(open_input(input_key, config=config)) for input_key in inputs]
    340 
    341         # subset before chunking; hopefully lazy

/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in <listcomp>(.0)
    337     # need to open an unknown number of contexts at the same time
    338     with ExitStack() as stack:
--> 339         dsets = [stack.enter_context(open_input(input_key, config=config)) for input_key in inputs]
    340 
    341         # subset before chunking; hopefully lazy

/srv/conda/envs/notebook/lib/python3.9/contextlib.py in enter_context(self, cm)
    446         _cm_type = type(cm)
    447         _exit = _cm_type.__exit__
--> 448         result = _cm_type.__enter__(cm)
    449         self._push_cm_exit(cm, _exit)
    450         return result

/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
    117         del self.args, self.kwds, self.func
    118         try:
--> 119             return next(self.gen)
    120         except StopIteration:
    121             raise RuntimeError("generator didn't yield") from None

/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in open_input(input_key, config)
    301                     kw["engine"] = "h5netcdf"
    302                 logger.debug(f"about to enter xr.open_dataset context on {f}")
--> 303                 with xr.open_dataset(f, **kw) as ds:
    304                     logger.debug("successfully opened dataset")
    305                     ds = fix_scalar_attr_encoding(ds)

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
    372 
    373         filename_or_obj = _normalize_path(filename_or_obj)
--> 374         store = H5NetCDFStore.open(
    375             filename_or_obj,
    376             format=format,

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open(cls, filename, mode, format, group, lock, autoclose, invalid_netcdf, phony_dims, decode_vlen_strings)
    176 
    177         manager = CachingFileManager(h5netcdf.File, filename, mode=mode, kwargs=kwargs)
--> 178         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
    179 
    180     def _acquire(self, needs_lock=True):

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in __init__(self, manager, group, mode, lock, autoclose)
    121         # todo: utilizing find_root_and_group seems a bit clunky
    122         #  making filename available on h5netcdf.Group seems better
--> 123         self._filename = find_root_and_group(self.ds)[0].filename
    124         self.is_remote = is_remote_uri(self._filename)
    125         self.lock = ensure_lock(lock)

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in ds(self)
    187     @property
    188     def ds(self):
--> 189         return self._acquire()
    190 
    191     def open_store_variable(self, name, var):

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock)
    179 
    180     def _acquire(self, needs_lock=True):
--> 181         with self._manager.acquire_context(needs_lock) as root:
    182             ds = _nc4_require_group(
    183                 root, self._group, self._mode, create_group=_h5netcdf_create_group

/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
    117         del self.args, self.kwds, self.func
    118         try:
--> 119             return next(self.gen)
    120         except StopIteration:
    121             raise RuntimeError("generator didn't yield") from None

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    185     def acquire_context(self, needs_lock=True):
    186         """Context manager for acquiring a file."""
--> 187         file, cached = self._acquire_with_cache_info(needs_lock)
    188         try:
    189             yield file

/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    203                     kwargs = kwargs.copy()
    204                     kwargs["mode"] = self._mode
--> 205                 file = self._opener(*self._args, **kwargs)
    206                 if self._mode == "w":
    207                     # ensure file doesn't get overriden when opened again

/srv/conda/envs/notebook/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, **kwargs)
    710                 else:
    711                     self._preexisting_file = os.path.exists(path)
--> 712                     self._h5file = h5py.File(path, mode, **kwargs)
    713             else:  # file-like object
    714                 if h5py.__version__ < LooseVersion("2.9.0"):

/srv/conda/envs/notebook/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds)
    505                                  fs_persist=fs_persist, fs_threshold=fs_threshold,
    506                                  fs_page_size=fs_page_size)
--> 507                 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    508 
    509             if isinstance(libver, tuple):

/srv/conda/envs/notebook/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    218         if swmr and swmr_support:
    219             flags |= h5f.ACC_SWMR_READ
--> 220         fid = h5f.open(name, flags, fapl=fapl)
    221     elif mode == 'r+':
    222         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

I'd assumed (perhaps incorrectly) that this was an xarray backend issue.

As for why this wasn't caught in the test suite: perhaps it's because we create the netCDFs for our tests with xr.Dataset.to_netcdf, which may produce different formatting than the "real world" files raising these errors?

Admittedly, I'm now less confident that this is as simple as switching to "netcdf4" for tempfiles, because that raises

ValueError: Invalid dtype for data variable: <xarray.DataArray 'flag' (time: 2879)>
dask.array<concatenate, shape=(2879,), dtype=|S35, chunksize=(1440,), chunktype=numpy.ndarray>

(full traceback in #315 (comment))

... for which I am not immediately aware of a solution.

@rabernat
Contributor

rabernat commented Mar 7, 2022

Can you post a direct link to one of the offending files?

@cisaacstern
Member Author

cisaacstern commented Mar 7, 2022

http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc

And my initial suggestion on this thread was motivated by the following observation:

# save remote file to local path "ZCYL5_20210101v30001.nc"
$ wget http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc

In [1]: import xarray as xr
In [2]: ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine="h5netcdf")  # raises OSError
In [3]: ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine="netcdf4")  # opens file without error
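
A quick way to confirm what each backend is seeing is to inspect the file's magic bytes (a sketch, not recipe code; netCDF3 classic files begin with b"CDF", while HDF5/netCDF4 files begin with b"\x89HDF"):

In [4]: with open("ZCYL5_20210101v30001.nc", "rb") as f:
   ...:     print(f.read(4))  # b'CDF\x01' would indicate netCDF3 classic, which h5py cannot read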

@rabernat
Contributor

rabernat commented Mar 7, 2022

Ok great, so we have gotten down to the core issue. Charles, could you open an issue on https://github.com/h5netcdf/h5netcdf/issues so we can get to the bottom of it?

edit to clarify: the principle I am applying here is that we don't want to introduce workarounds in pangeo-forge-recipes for what is, IMO, an upstream problem. Let's go fix it at the source.

@cisaacstern cisaacstern changed the title Tempfile storage_config should default to engine="netcdf4" With "h5netcdf" backend, opening certain netcdfs raises OSError: Unable to open file (file signature not found) Mar 7, 2022
@cisaacstern
Member Author

the principle I am applying here is that we don't want to introduce workarounds in pangeo-forge-recipes for what is, IMO, an upstream problem. Let's go fix it at the source.

A good principle!

@cisaacstern
Member Author

We closed the upstream issue in h5netcdf because we realized the actual issue is that the source file is netcdf3.

This will require xarray_open_kwargs=dict(engine="scipy").
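
For example (a sketch; pattern stands in for this dataset's FilePattern):

recipe = XarrayZarrRecipe(pattern, xarray_open_kwargs=dict(engine="scipy"))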

Ryan has suggested that, rather than pointing users to this workaround directly, we can:

  • Add an annotation to the FilePattern about the file type. This would subsume the is_opendap option.
  • We could have an enum of supported file types: GRIB, netCDF3, netCDF4, opendap, etc.

Brainstorming from there, we would then be able to assign the corresponding backend dynamically and try opening here

if "engine" not in kw:
kw["engine"] = "h5netcdf"
logger.debug(f"about to enter xr.open_dataset context on {f}")
with xr.open_dataset(f, **kw) as ds:

and raise a descriptive error if this fails.
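
Something like this minimal sketch (opener_map, file_type, and the error message are placeholders for illustration; the real implementation landed in #322):

engine = opener_map.get(config.file_pattern.file_type, {}).get("engine", "h5netcdf")
kw.setdefault("engine", engine)
logger.debug(f"about to enter xr.open_dataset context on {f}")
try:
    ds = xr.open_dataset(f, **kw)
except OSError as e:
    raise OSError(
        f"Failed to open {f} with engine={kw['engine']!r}; "
        "check that the FilePattern's file_type matches the actual file format."
    ) from e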

@cisaacstern cisaacstern changed the title With "h5netcdf" backend, opening certain netcdfs raises OSError: Unable to open file (file signature not found) Replace FilePattern.is_opendap with generalized FilePattern.file_type Mar 7, 2022
@rabernat
Contributor

rabernat commented Mar 8, 2022

Brainstorming from there, we would then be able to assign the corresponding backend dynamically

exactly, we would have something like this

from enum import Enum, auto

class FileType(Enum):
    unknown = auto()
    netcdf3 = auto()
    netcdf4 = auto()
    grib = auto()
    opendap = auto()


opener_map = {
    FileType.netcdf3: {"engine": "scipy"},
    FileType.netcdf4: {"engine": "h5netcdf"},
}

etc.
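
When opening an input, the mapped kwargs would then serve as defaults that user-supplied xarray_open_kwargs can still override, e.g. (a sketch; file_type and xarray_open_kwargs stand in for recipe config):

kw = {**opener_map.get(file_type, {"engine": "h5netcdf"}), **xarray_open_kwargs}
ds = xr.open_dataset(f, **kw)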

Going even further, if we pursue the "opener refactor" (#245), we could have different opener classes for the different file types.
