# Replace `FilePattern.is_opendap` with generalized `FilePattern.file_type` (#320)
I don't understand why this is necessary. Are the files really not openable by the h5netcdf engine? We use local tempfiles in our test suite and have not noticed any problems.

---
Good question. (xref #315 (comment)) When run in the sandbox, the recipes in #315 (comment) and pangeo-forge/staged-recipes#120 (comment) both error with the following traceback:

```
OSError                                   Traceback (most recent call last)
/tmp/ipykernel_342/3080067870.py in <module>
115
116 run_function = recipe_pruned.to_function()
--> 117 run_function()
118
119 # ## Check the output
/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/executors/python.py in function()
44 stage.function(m, config=pipeline.config)
45 else:
---> 46 stage.function(config=pipeline.config)
47
48 return function
/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(config)
462
463 for chunk_key in init_chunks:
--> 464 with open_chunk(chunk_key, config=config) as ds:
465 # ds is already chunked
466
/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in open_chunk(chunk_key, config)
337 # need to open an unknown number of contexts at the same time
338 with ExitStack() as stack:
--> 339 dsets = [stack.enter_context(open_input(input_key, config=config)) for input_key in inputs]
340
341 # subset before chunking; hopefully lazy
/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in <listcomp>(.0)
337 # need to open an unknown number of contexts at the same time
338 with ExitStack() as stack:
--> 339 dsets = [stack.enter_context(open_input(input_key, config=config)) for input_key in inputs]
340
341 # subset before chunking; hopefully lazy
/srv/conda/envs/notebook/lib/python3.9/contextlib.py in enter_context(self, cm)
446 _cm_type = type(cm)
447 _exit = _cm_type.__exit__
--> 448 result = _cm_type.__enter__(cm)
449 self._push_cm_exit(cm, _exit)
450 return result
/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in open_input(input_key, config)
301 kw["engine"] = "h5netcdf"
302 logger.debug(f"about to enter xr.open_dataset context on {f}")
--> 303 with xr.open_dataset(f, **kw) as ds:
304 logger.debug("successfully opened dataset")
305 ds = fix_scalar_attr_encoding(ds)
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
493
494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
496 filename_or_obj,
497 drop_variables=drop_variables,
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
372
373 filename_or_obj = _normalize_path(filename_or_obj)
--> 374 store = H5NetCDFStore.open(
375 filename_or_obj,
376 format=format,
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open(cls, filename, mode, format, group, lock, autoclose, invalid_netcdf, phony_dims, decode_vlen_strings)
176
177 manager = CachingFileManager(h5netcdf.File, filename, mode=mode, kwargs=kwargs)
--> 178 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
179
180 def _acquire(self, needs_lock=True):
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in __init__(self, manager, group, mode, lock, autoclose)
121 # todo: utilizing find_root_and_group seems a bit clunky
122 # making filename available on h5netcdf.Group seems better
--> 123 self._filename = find_root_and_group(self.ds)[0].filename
124 self.is_remote = is_remote_uri(self._filename)
125 self.lock = ensure_lock(lock)
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in ds(self)
187 @property
188 def ds(self):
--> 189 return self._acquire()
190
191 def open_store_variable(self, name, var):
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock)
179
180 def _acquire(self, needs_lock=True):
--> 181 with self._manager.acquire_context(needs_lock) as root:
182 ds = _nc4_require_group(
183 root, self._group, self._mode, create_group=_h5netcdf_create_group
/srv/conda/envs/notebook/lib/python3.9/contextlib.py in __enter__(self)
117 del self.args, self.kwds, self.func
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
121 raise RuntimeError("generator didn't yield") from None
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
185 def acquire_context(self, needs_lock=True):
186 """Context manager for acquiring a file."""
--> 187 file, cached = self._acquire_with_cache_info(needs_lock)
188 try:
189 yield file
/srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
203 kwargs = kwargs.copy()
204 kwargs["mode"] = self._mode
--> 205 file = self._opener(*self._args, **kwargs)
206 if self._mode == "w":
207 # ensure file doesn't get overriden when opened again
/srv/conda/envs/notebook/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, **kwargs)
710 else:
711 self._preexisting_file = os.path.exists(path)
--> 712 self._h5file = h5py.File(path, mode, **kwargs)
713 else: # file-like object
714 if h5py.__version__ < LooseVersion("2.9.0"):
/srv/conda/envs/notebook/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds)
505 fs_persist=fs_persist, fs_threshold=fs_threshold,
506 fs_page_size=fs_page_size)
--> 507 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
508
509 if isinstance(libver, tuple):
/srv/conda/envs/notebook/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
218 if swmr and swmr_support:
219 flags |= h5f.ACC_SWMR_READ
--> 220 fid = h5f.open(name, flags, fapl=fapl)
221 elif mode == 'r+':
222 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
OSError: Unable to open file (file signature not found)
```

I'd assumed (perhaps incorrectly) that this was an xarray backend issue. As for why it wouldn't have been caught in the test suite, perhaps this could be explained by how we create the netcdfs for our tests.

Admittedly, I'm now less confident that this is as simple as switching engines, since doing so raises:

```
ValueError: Invalid dtype for data variable: <xarray.DataArray 'flag' (time: 2879)>
dask.array<concatenate, shape=(2879,), dtype=|S35, chunksize=(1440,), chunktype=numpy.ndarray>
```

(full traceback in #315 (comment)) ... for which I am not immediately aware of a solution.

---
Can you post a direct link to one of the offending files?

---
http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc

And my initial suggestion on this thread was motivated by the following observation:

```sh
# save remote file to local path "ZCYL5_20210101v30001.nc"
wget http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc
```

```python
In [1]: import xarray as xr

In [2]: ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine="h5netcdf")  # raises error

In [3]: ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine="netcdf4")  # opens file without error
```

---
Ok great, so we have gotten down to the core issue. Charles, could you open an issue on https://github.com/h5netcdf/h5netcdf/issues so we can get to the bottom of it?

Edit, to clarify: the principle I am applying here is that we don't want to introduce workarounds in pangeo-forge-recipes for what is, IMO, an upstream problem. Let's go fix it at the source.

---
A good principle!

---
We closed the upstream issue in h5netcdf because we realized the actual issue is that the source file is netcdf3, which will require a different engine. Ryan has suggested that, rather than suggesting this directly to users, we can replace `FilePattern.is_opendap` with a generalized `FilePattern.file_type`.

Brainstorming from there, we would then be able to assign the corresponding backend dynamically in `open_input` (pangeo_forge_recipes/recipes/xarray_zarr.py, lines 300 to 303 at de33ebc) and raise a descriptive error if this fails.
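For concreteness, a minimal sketch of that dynamic assignment, assuming a hypothetical `detect_engine` helper that sniffs the file's magic bytes (netCDF3 "classic" files begin with `CDF`; netCDF4 files are HDF5 containers beginning with `\x89HDF\r\n\x1a\n`); none of these names exist in pangeo-forge-recipes:

```python
import xarray as xr

# Magic bytes: netCDF3 ("classic") files start with b"CDF", while netCDF4
# files are HDF5 containers whose signature is the 8 bytes below.
NETCDF3_MAGIC = b"CDF"
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def detect_engine(path: str) -> str:
    """Hypothetical helper: choose an xarray engine from the file signature."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(NETCDF3_MAGIC):
        return "scipy"
    if header.startswith(HDF5_MAGIC):
        return "h5netcdf"
    raise ValueError(
        f"Could not determine file type of {path!r} from its first bytes "
        f"({header!r}); consider setting the file type explicitly."
    )


ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine=detect_engine("ZCYL5_20210101v30001.nc"))
```

---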
Exactly, we would have something like this:

```python
from enum import Enum, auto


class FileType(Enum):
    unknown = auto()
    netcdf3 = auto()
    netcdf4 = auto()
    grib = auto()
    opendap = auto()


opener_map = {
    FileType.netcdf3: {"engine": "scipy"},
    FileType.netcdf4: {"engine": "h5netcdf"},
}
```

etc. Going even further, if we pursue the "opener refactor" (#245), we could have different opener classes for the different file types.
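As a usage sketch (hypothetical; `open_with_file_type` does not exist in pangeo-forge-recipes, and this assumes the `FileType`/`opener_map` definitions above):

```python
import xarray as xr


def open_with_file_type(path: str, file_type: FileType) -> xr.Dataset:
    """Hypothetical: open `path` with the engine kwargs mapped to its file type."""
    try:
        kw = opener_map[file_type]
    except KeyError:
        raise ValueError(
            f"No opener registered for {file_type}; "
            "set FilePattern.file_type to a supported value."
        )
    return xr.open_dataset(path, **kw)


# e.g. for the SAMOS file above, which turned out to be netCDF3:
ds = open_with_file_type("ZCYL5_20210101v30001.nc", FileType.netcdf3)
```

---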
**Original issue description:**

`XarrayZarrRecipe` defaults to using the `"h5netcdf"` xarray backend. When we are using default `tempfile` (i.e. local) storage, this will not work, as noted in #315 and pangeo-forge/staged-recipes#120.

To resolve this, we should adjust this logic (pangeo_forge_recipes/recipes/xarray_zarr.py, lines 300 to 301 at de33ebc) to default to `engine="netcdf4"` if the `storage_config` has not been changed from the default `tempfile` targets.
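A minimal sketch of that conditional default, with hypothetical names (the real logic lives in `open_input` in `xarray_zarr.py`):

```python
def default_engine(storage_config_is_default: bool) -> str:
    # Proposed behavior: files cached to the default local tempfile targets
    # can be read by the "netcdf4" engine (which handles netCDF3 and netCDF4
    # alike); otherwise keep the current "h5netcdf" default.
    return "netcdf4" if storage_config_is_default else "h5netcdf"


kw = {"engine": default_engine(storage_config_is_default=True)}
```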