Help with opening netcdf4 files via HTTP #49
One other weird datapoint that I have discovered is that I can pickle the un-loadable dataset, load it from the pickle, and then call `.load()` on the result:

```python
import pickle

import fsspec
import xarray as xr

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc'  # netcdf
open_file = fsspec.open(url, mode='rb')
with open_file as fp:
    with xr.open_dataset(fp, engine='h5netcdf') as ds:
        pass
ds_pk = pickle.dumps(ds)
ds1 = pickle.loads(ds_pk)
ds1.load()
```

This suggests that the dataset is serializable in some way. But something is happening in beam that is preventing that path from being taken.
Ok, the mystery deepens even further. After running the code just above, you can then call …
I'm enjoying the conversation with myself, so I'll just keep going... 🙃 I figured out a bizarre workaround that makes it work:

```python
def open_fsspec_openfile_with_xarray(of):
    with of as fp:
        with xr.open_dataset(fp, engine='h5netcdf') as ds:
            pass
    ds1 = pickle.loads(pickle.dumps(ds))
    return ds1
```
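To be concrete, here is how I would plug the workaround into a pipeline (a hypothetical sketch; only `open_fsspec_openfile_with_xarray` comes from above, and it assumes the `pickle` and `xr` imports from the earlier snippets):

```python
import apache_beam as beam
import fsspec

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc'

with beam.Pipeline() as p:
    (
        p
        | beam.Create([url])
        # wrap each URL in an fsspec OpenFile, then apply the workaround
        | beam.Map(lambda u: fsspec.open(u, mode='rb'))
        | beam.Map(open_fsspec_openfile_with_xarray)
        | beam.Map(lambda ds: ds.load())
    )
```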
Wait, what!? I would have thought that … By the way, …
I have not tried to introspect the pickle objects. I don't know much about pickle details and internals. #49 (comment) is a fully isolated, copy-pasteable reproducer for this, so if you wanted to dig in, that would be the place to start. My best guess is that there is an … This could also be interacting with Xarray in some unexpected ways. See pydata/xarray#4242 for some discussion of that.
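If the mechanism is anything like what I suspect, it resembles this toy pattern (not xarray's actual code, just an illustration of how unpickling can hand back a live handle while the in-memory original stays closed):

```python
import pickle

class ReopenableFile:
    """Toy file wrapper that survives pickling by remembering how to reopen."""

    def __init__(self, opener, *args):
        self._opener = opener
        self._args = args
        self._file = opener(*args)

    def read(self, n=-1):
        return self._file.read(n)

    def __getstate__(self):
        # drop the live handle; keep only the recipe for reopening
        return (self._opener, self._args)

    def __setstate__(self, state):
        self._opener, self._args = state
        self._file = self._opener(*self._args)  # reopen on unpickle

f = ReopenableFile(open, __file__, 'r')
f._file.close()                     # simulate the closed-file state
f2 = pickle.loads(pickle.dumps(f))  # the round-trip reopens the file
print(f2.read(20))
```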
I got some weird conversion error. Perhaps I need later versions of things.
Just try with a different remote file, e.g.:

```python
url = 'https://power-datastore.s3.amazonaws.com/v9/climatology/power_901_rolling_zones_utc.nc'
```
Just a guess, but have you tried asking Beam to use cloudpickle rather than dill? I believe this is a recently added option. Cloudpickle is used for Dask, so this might work a bit more consistently.
Thanks for the suggestion Stephan. I tried as follows:

```python
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_pipeline import TestPipeline

options = PipelineOptions(pickle_library="cloudpickle")
with TestPipeline(options=options) as p:
    ...
```

However, it appears to have no effect. I can specify …
This is probably a red herring. The code to set the pickler doesn't raise any errors if you pass an invalid option. So I am going to assume that the option does work, and that it doesn't solve the problem.
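One way to sanity-check this (a sketch that pokes at `apache_beam.internal.pickler`, which is internal API and may well differ across Beam versions):

```python
from apache_beam.internal import pickler

# force the cloudpickle backend directly, bypassing PipelineOptions
pickler.set_library(pickler.USE_CLOUDPICKLE)

# round-trip an object through whatever pickler Beam will actually use;
# `ds` here is the dataset from the earlier snippets
restored = pickler.loads(pickler.dumps(ds))
restored.load()
```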
Some more deep introspecting into these objects:

```python
import xarray as xr
import fsspec
from cloudpickle import dumps, loads
from pprint import pprint as print

url = 'https://power-datastore.s3.amazonaws.com/v9/climatology/power_901_rolling_zones_utc.nc'

with fsspec.open(url) as fp:
    with xr.open_dataset(fp, engine='h5netcdf') as ds0:
        pass

ds_pk = dumps(ds0)
ds1 = loads(ds_pk)

# go deep inside Xarray's array wrappers to get out the
# `xarray.backends.h5netcdf_.H5NetCDFArrayWrapper` objects
wrapper0 = ds0.T_ZONES.variable._data.array.array.array.array.array
wrapper1 = ds1.T_ZONES.variable._data.array.array.array.array.array

# now go inside those and get the actual
# `fsspec.implementations.http.HTTPFile` objects
fobj0 = wrapper0.datastore._manager._args[0]
fobj1 = wrapper1.datastore._manager._args[0]

print(fobj0.__dict__)
print(fobj1.__dict__)
```
I tried manually resetting the state of the closed file object:

```python
fobj0._closed = False
fobj0.loc = 0
ds0.load()
```

However, this led to the error …
So we need a file-like where …
Perhaps we are barking up the wrong tree. Once the dataset is passed through pickle or cloudpickle, it becomes loadable again. In #49 (comment) @shoyer suggested we should be able to force beam to use cloudpickle to serialize things. So it should be working without any changes to our libraries. I am currently trying to dig deeper into the dill vs. cloudpickle issue.
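As a first step, a minimal head-to-head of the two serializers (assuming `ds0` opened as in the introspection snippet above):

```python
import cloudpickle
import dill

# round-trip the same dataset through each serializer and try to load it
for lib in (dill, cloudpickle):
    ds_rt = lib.loads(lib.dumps(ds0))
    try:
        ds_rt.load()
        print(lib.__name__, 'round-trip loads OK')
    except Exception as e:
        print(lib.__name__, 'round-trip fails:', e)
```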
I'm finally looking into this a little in detail. Why do you use context managers in the functions that you're passing into beam.Map? I would generally not expect something like this to work -- the context manager is explicitly closing the file object. Objects passed between transforms in a Beam pipeline are not necessarily serialized via pickle (which as I understand it would fix this by reopening the file), because it's unnecessary overhead if the separate map stages are evaluated on the same machine. So anyways, if I were to do this I would not use a context manager in the opener function.
Thanks a lot Stephan! I appreciate your time.
Because that's what fsspec seems to require! I went on an extremely deep dive on this in fsspec/filesystem_spec#579. In the end, the recommendation from @martindurant was to always use the context manager when opening a file-like object (see fsspec/filesystem_spec#579 (comment)). However, that requirement seems incompatible with serialization, as you noted. I would love to see an example of opening a NetCDF4 file remotely over HTTP using the h5netcdf engine without a context manager.
Ok, so I actually did test this case in fsspec/filesystem_spec#579 (comment). The following works with HTTP:

```python
def open_fsspec_openfile_with_xarray(of):
    return xr.open_dataset(of.open(), engine='h5netcdf')
```

@martindurant, is that kosher?
Yes, it's fine - the file will "close" (meaning dropping the buffer) when garbage collected. Local file instances made this way also pickle.
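In other words, something like this should both open and pickle cleanly (a sketch combining Martin's suggestion with the earlier pickle experiments; untested beyond what is described above):

```python
import pickle

import fsspec
import xarray as xr

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc'

of = fsspec.open(url, mode='rb')
ds = xr.open_dataset(of.open(), engine='h5netcdf')  # no context manager

ds2 = pickle.loads(pickle.dumps(ds))  # survives serialization
ds2.load()
```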
But if I do that, I again hit the problem (from fsspec/filesystem_spec#579 (comment)) that what works depends on how the file was opened. This works:

```python
open_file = fsspec.open(url)
ds = xr.open_dataset(open_file.open())
# but not
ds = xr.open_dataset(open_file)
# -> AttributeError: 'HTTPFile' object has no attribute '__fspath__'
```

or this works:

```python
from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
open_file = fs.open(url)
ds = xr.open_dataset(open_file)
# but not
ds = xr.open_dataset(open_file.open())
# -> AttributeError: 'HTTPFile' object has no attribute 'open'
```
Ok, I am satisfied with my workaround in pangeo-forge/pangeo-forge-recipes@c20f3fd, which is basically:

```python
if hasattr(open_file, "open"):
    open_file = open_file.open()
ds = xr.open_dataset(open_file)
```

This seems to reliably serialize with whatever fsspec can throw at us. Sorry for the noise here. I appreciate the help.
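Condensed into a single helper for anyone who finds this later (a sketch of the approach; the real change lives in the linked commit, and the helper name here is made up):

```python
import fsspec
import xarray as xr

def open_remote_netcdf(url):
    """Open a remote NetCDF4 file so the resulting dataset survives pickling."""
    open_file = fsspec.open(url, mode='rb')
    # most filesystems hand back an OpenFile wrapper; calling .open() yields
    # a file-like object that serializes cleanly. HTTPFileSystem.open returns
    # a file-like object directly, with no .open(), hence the hasattr check.
    if hasattr(open_file, "open"):
        open_file = open_file.open()
    return xr.open_dataset(open_file, engine='h5netcdf')
```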
This is not necessarily an xarray-beam specific question; however, it relates to issues here (e.g. #37, #32) as well as in Pangeo Forge, so I am asking it here. I hope people here will be able to help me. Ultimately I hope this will help us resolve pangeo-forge/pangeo-forge-recipes#373 and move forward with merging Pangeo Forge and xarray-beam.
Goal: open xarray datasets from HTTP endpoints lazily and pass them around a beam pipeline. Delay loading of data variables until later in the pipeline.
What I have tried
Here is the basic pipeline I am working with. It is a simplified, self-contained version of what we will want to do in Pangeo Forge. (Note: this probably requires installing the latest version of fsspec from master, in order to get fsspec/filesystem_spec#973.)
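In outline it looks like this (a sketch: `load_xarray_ds` is the name that appears in the traceback below, while the other names and exact transforms are stand-ins):

```python
import apache_beam as beam
import fsspec
import xarray as xr
from apache_beam.testing.test_pipeline import TestPipeline

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc'

def open_xarray_ds(url):
    # the fsspec-recommended context-manager pattern discussed below
    with fsspec.open(url, mode='rb') as fp:
        with xr.open_dataset(fp, engine='h5netcdf') as ds:
            return ds

def load_xarray_ds(ds):
    ds.load()
    return ds

with TestPipeline() as p:
    (
        p
        | beam.Create([url])
        | 'open' >> beam.Map(open_xarray_ds)
        | 'load' >> beam.Map(load_xarray_ds)
    )
```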
When I run this I get:

```
ValueError: I/O operation on closed file. [while running '[1]: Map(load_xarray_ds)']
```
Full Traceback
This is not quite the same error as I am getting in pangeo-forge/pangeo-forge-recipes#373; there it is instead:

```
OSError: Unable to open file (incorrect metadata checksum after all read attempts)
```

I have not been able to reproduce that error outside of pytest. However, my example here fails at the same point: when calling `ds.load()` on an h5netcdf-backed xarray dataset pointing at an fsspec HTTPFile object.

What is wrong
Overall my concern is with this pattern of opening the dataset inside nested context managers. It feels wrong. I should either be `yield`ing or else not using context managers. The first context manager is necessary; the second may be optional. But overall my understanding is that the outputs of the `Map` function need to be pickled, in which case the context-manager pattern doesn't make sense at all. I have tried various other flavors, like … or …, but nothing seems to work. The fundamental issue seems to be simply this: the file object is closed by the time the data are actually read.
Has anyone here managed to make something like this work? I feel like I'm missing something obvious.
cc @martindurant