Support creating DataSet from streaming object #1075
This does work for netCDF3 files, if you provide a file-like object. Unfortunately, this is a netCDF4/HDF5 file:
And as yet, there is no support for reading from file-like objects in either h5py (h5py/h5py#552) or python-netCDF4 (Unidata/netcdf4-python#295). So we're currently stuck :(. One possibility is to use the new HDF5 library pyfive with h5netcdf (h5netcdf/h5netcdf#25). But pyfive doesn't have enough features yet to read netCDF files. |
Got it. :( Thanks! |
Is this issue resolvable now that Unidata/netcdf4-python#652 has been merged? |
Yes, we could support initializing a Dataset from a netCDF4 file image. |
FWIW this would be really useful 👍 from me, specifically for the use case above of reading from s3 |
Just to clarify: I wrote above that we could support initializing a Dataset from a netCDF4 file image, but this wouldn't help yet for streaming access. Initializing a Dataset from a netCDF4 file image should actually work with the latest versions of xarray and netCDF4-python:
nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
store = xarray.backends.NetCDF4DataStore(nc4_ds)
ds = xarray.open_dataset(store) |
Thanks @shoyer. So you can download the entire object into memory and then create a file image and read that? While not a full fix, it's definitely an improvement over the download-to-disk-then-read workflow! |
@delgadom Yes, that should work (I haven't tested it, but yes in principle it should all work now). |
@delgadom - did you find a solution here? A few more references: we're exploring ways to do this in the Pangeo project using FUSE (pangeo-data/pangeo#52). There is an s3 equivalent of the gcsfs library used in that issue: https://github.com/dask/s3fs |
yes! Thanks @jhamman and @shoyer. I hadn't tried it yet, but just did. Worked great!
In [1]: import xarray as xr
...: import requests
...: import netCDF4
...:
...: %matplotlib inline
In [2]: res = requests.get(
...: 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmin/' +
...: 'r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073.nc')
In [3]: res.status_code
Out [3]: 200
In [4]: res.headers['content-type']
Out [4]: 'application/x-netcdf'
In [5]: nc4_ds = netCDF4.Dataset('tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073', memory=res.content)
In [6]: store = xr.backends.NetCDF4DataStore(nc4_ds)
In [7]: ds = xr.open_dataset(store)
In [8]: ds.tasmin.isel(time=0).plot()
/global/home/users/mdelgado/git/public/xarray/xarray/plot/utils.py:51: FutureWarning: 'pandas.tseries.converter.register' has been moved and renamed to 'pandas.plotting.register_matplotlib_converters'.
converter.register()
Out [8]: <matplotlib.collections.QuadMesh at 0x2aede3c922b0>
In [9]: ds
Out [9]:
<xarray.Dataset>
Dimensions: (lat: 720, lon: 1440, time: 365)
Coordinates:
* time (time) datetime64[ns] 2073-01-01T12:00:00 2073-01-02T12:00:00 ...
* lat (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
* lon (lon) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 1.875 ...
Data variables:
tasmin (time, lat, lon) float64 ...
Attributes:
parent_experiment: historical
parent_experiment_id: historical
parent_experiment_rip: r1i1p1
Conventions: CF-1.4
institution: NASA Earth Exchange, NASA Ames Research C...
institute_id: NASA-Ames
realm: atmos
modeling_realm: atmos
version: 1.0
downscalingModel: BCSD
experiment_id: rcp45
frequency: day
realization: 1
initialization_method: 1
physics_version: 1
tracking_id: 1865ff49-b20c-4268-852a-a9503efec72c
driving_data_tracking_ids: N/A
driving_model_ensemble_member: r1i1p1
driving_experiment_name: historical
driving_experiment: historical
model_id: BCSD
references: BCSD method: Thrasher et al., 2012, Hydro...
DOI: http://dx.doi.org/10.7292/W0MW2F2G
experiment: RCP4.5
title: CESM1-BGC global downscaled NEX CMIP5 Cli...
contact: Dr. Rama Nemani: rama.nemani@nasa.gov, Dr...
disclaimer: This data is considered provisional and s...
resolution_id: 0.25 degree
project_id: NEXGDDP
table_id: Table day (12 November 2010)
source: BCSD 2014
creation_date: 2015-01-07T19:18:31Z
forcing: N/A
product: output |
We could potentially add a from_memory() constructor to NetCDF4DataStore to
simplify this process.
|
@delgadom which version of netCDF4 are you using? I'm following your same steps but am still receiving an |
xarray==0.10.2
Just tried it again and didn't have any issues:
import os

import netCDF4
import requests
import xarray as xr

patt = (
    'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/{scen}/day/atmos/{var}/' +
    'r1i1p1/v1.0/{var}_day_BCSD_{scen}_r1i1p1_{model}_{year}.nc')

def open_url_dataset(url):
    # Arbitrary dataset name derived from the URL; it is only a label
    # when reading from memory.
    fname = os.path.splitext(os.path.basename(url))[0]
    res = requests.get(url)
    nc4_ds = netCDF4.Dataset(fname, memory=res.content)
    store = xr.backends.NetCDF4DataStore(nc4_ds)
    ds = xr.open_dataset(store)
    return ds
ds = open_url_dataset(url=patt.format(
model='GFDL-ESM2G', scen='historical', var='tasmax', year=1988))
ds |
@delgadom Ah, I see. I needed |
Is this now implemented (and hence can this issue be closed)? It appears that this works well:
import io

import boto3
import xarray as xr

boto_s3 = boto3.client('s3')
s3_object = boto_s3.get_object(Bucket=bucket, Key=key)  # bucket/key identify the target object
netcdf_bytes = s3_object['Body'].read()
netcdf_bytes_io = io.BytesIO(netcdf_bytes)
ds = xr.open_dataset(netcdf_bytes_io)
Is that the right approach to opening a NetCDF file on S3, using the latest xarray code? |
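A self-contained sketch of the io.BytesIO route, with locally generated netCDF3 bytes standing in for the S3 object body (no boto3 needed). Note that which file-like objects work depends on the installed backend: xarray dispatches netCDF3 streams to the scipy engine, while netCDF4/HDF5 streams need h5netcdf:

```python
import io

import numpy as np
import xarray as xr

# to_netcdf() with no path returns the file as bytes; the scipy engine
# writes classic netCDF3, which xarray can also read back from a stream.
ds_in = xr.Dataset({"tasmin": ("time", np.array([250.0, 251.5]))})
netcdf_bytes = ds_in.to_netcdf(engine="scipy")  # stands in for s3_object['Body'].read()

# Hand xarray a file-like object instead of a path.
ds_out = xr.open_dataset(io.BytesIO(netcdf_bytes))
print(ds_out["tasmin"].values)
```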
FWIW, I've also tested @delgadom's technique, using |
The use case is for netCDF files stored on S3 or other generic cloud storage.
Ideal would be integration with the (hopefully) soon-to-be implemented dask.distributed features discussed in #798.