Support creating DataSet from streaming object #1075

Closed
delgadom opened this issue Nov 2, 2016 · 16 comments

Comments

@delgadom
Contributor

delgadom commented Nov 2, 2016

The use case is netCDF files stored on S3 or other generic cloud storage:

import requests, xarray as xr
fp = 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_MPI-ESM-LR_2029.nc'
    
data = requests.get(fp, stream=True)
ds = xr.open_dataset(data.content)  # raises TypeError: embedded NUL character

Ideal would be integration with the (hopefully) soon-to-be implemented dask.distributed features discussed in #798.

@shoyer
Member

shoyer commented Nov 2, 2016

This does work for netCDF3 files, if you provide a file-like object (e.g., wrapped in BytesIO) or set engine='scipy'.

Unfortunately, this is a netCDF4/HDF5 file:

>>> data.raw.read(8)
'\x89HDF\r\n\x1a\n'

And as yet, there is no support for reading from file-like objects in either h5py (h5py/h5py#552) or python-netCDF4 (Unidata/netcdf4-python#295). So we're currently stuck :(.

One possibility is to use the new HDF5 library pyfive with h5netcdf (h5netcdf/h5netcdf#25). But pyfive doesn't have enough features yet to read netCDF files.
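As an illustration of the signature check shown above, here is a minimal sketch that guesses the container format from the first bytes of a file. The function name `sniff_netcdf_format` and its return labels are hypothetical, chosen only for this example. It only looks at offset 0, so an HDF5 file with a user block in front of the signature would be reported as unknown.

```python
# First 8 bytes of every HDF5 file (and therefore of netCDF4 files).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# netCDF3 files start with b"CDF" followed by a version byte:
# 1 = classic, 2 = 64-bit offset, 5 = 64-bit data (CDF-5).
NETCDF3_VERSION_BYTES = (b"\x01", b"\x02", b"\x05")


def sniff_netcdf_format(header: bytes) -> str:
    """Guess the file format from the leading bytes of a file."""
    if header.startswith(HDF5_SIGNATURE):
        return "netCDF4/HDF5"
    if header[:3] == b"CDF" and header[3:4] in NETCDF3_VERSION_BYTES:
        return "netCDF3"
    return "unknown"
```

With the response from the original report, `sniff_netcdf_format(data.raw.read(8))` would return "netCDF4/HDF5" for the bytes printed above, which is why the scipy engine cannot read that file.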

@delgadom
Contributor Author

delgadom commented Nov 2, 2016

Got it. :( Thanks!

@rabernat
Contributor

rabernat commented Jun 9, 2017

Is this issue resolvable now that Unidata/netcdf4-python#652 has been merged?

@shoyer
Member

shoyer commented Jun 9, 2017

Yes, we could support initializing a Dataset from a netCDF4 file image in a bytes object.

@niallrobinson

niallrobinson commented Nov 21, 2017

FWIW this would be really useful 👍 from me, specifically for the use case above of reading from s3

@shoyer
Member

shoyer commented Nov 22, 2017

Just to clarify: I wrote above that we could support initializing a Dataset from a netCDF4 file image. But this wouldn't help yet for streaming access.

Initializing a Dataset from a netCDF4 file image should actually work with the latest versions of xarray and netCDF4-python:

nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
store = xarray.backends.NetCDF4DataStore(nc4_ds)
ds = xarray.open_dataset(store)

@delgadom
Contributor Author

Thanks @shoyer. So you can download the entire object into memory and then create a file image and read that? While not a full fix, it's definitely an improvement over download-to-disk-then-read workflow!

@shoyer
Member

shoyer commented Nov 29, 2017

@delgadom Yes, that should work (I haven't tested it, but yes in principle it should all work now).

@jhamman
Member

jhamman commented Jan 12, 2018

@delgadom - did you find a solution here?

A few more references: we're exploring ways to do this in the Pangeo project using FUSE (pangeo-data/pangeo#52). There is an S3 equivalent of the gcsfs library used in that issue: https://github.com/dask/s3fs

@delgadom
Contributor Author

Yes! Thanks @jhamman and @shoyer. I hadn't tried it yet, but just did. Worked great!

In  [1]: import xarray as xr
    ...: import requests
    ...: import netCDF4
    ...: 
    ...: %matplotlib inline

In  [2]: res = requests.get(
    ...:     'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmin/' +
    ...:     'r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073.nc')

In  [3]: res.status_code
Out [3]: 200

In  [4]: res.headers['content-type']
Out [4]: 'application/x-netcdf'

In  [5]: nc4_ds = netCDF4.Dataset('tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073', memory=res.content)

In  [6]: store = xr.backends.NetCDF4DataStore(nc4_ds)

In  [7]: ds = xr.open_dataset(store)

In  [8]: ds.tasmin.isel(time=0).plot()
    /global/home/users/mdelgado/git/public/xarray/xarray/plot/utils.py:51: FutureWarning: 'pandas.tseries.converter.register' has been moved and renamed to 'pandas.plotting.register_matplotlib_converters'. 
      converter.register()
Out [8]: <matplotlib.collections.QuadMesh at 0x2aede3c922b0>

[Figure: plot of tasmin at the first time step (time=0)]

In  [9]: ds
Out [9]:
    <xarray.Dataset>
    Dimensions:  (lat: 720, lon: 1440, time: 365)
    Coordinates:
      * time     (time) datetime64[ns] 2073-01-01T12:00:00 2073-01-02T12:00:00 ...
      * lat      (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
      * lon      (lon) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 1.875 ...
    Data variables:
        tasmin   (time, lat, lon) float64 ...
    Attributes:
        parent_experiment:              historical
        parent_experiment_id:           historical
        parent_experiment_rip:          r1i1p1
        Conventions:                    CF-1.4
        institution:                    NASA Earth Exchange, NASA Ames Research C...
        institute_id:                   NASA-Ames
        realm:                          atmos
        modeling_realm:                 atmos
        version:                        1.0
        downscalingModel:               BCSD
        experiment_id:                  rcp45
        frequency:                      day
        realization:                    1
        initialization_method:          1
        physics_version:                1
        tracking_id:                    1865ff49-b20c-4268-852a-a9503efec72c
        driving_data_tracking_ids:      N/A
        driving_model_ensemble_member:  r1i1p1
        driving_experiment_name:        historical
        driving_experiment:             historical
        model_id:                       BCSD
        references:                     BCSD method: Thrasher et al., 2012, Hydro...
        DOI:                            http://dx.doi.org/10.7292/W0MW2F2G
        experiment:                     RCP4.5
        title:                          CESM1-BGC global downscaled NEX CMIP5 Cli...
        contact:                        Dr. Rama Nemani: rama.nemani@nasa.gov, Dr...
        disclaimer:                     This data is considered provisional and s...
        resolution_id:                  0.25 degree
        project_id:                     NEXGDDP
        table_id:                       Table day (12 November 2010)
        source:                         BCSD 2014
        creation_date:                  2015-01-07T19:18:31Z
        forcing:                        N/A
        product:                        output

@shoyer
Member

shoyer commented Jan 12, 2018 via email

@nickwg03

@delgadom which version of netCDF4 are you using? I'm following the same steps but am still getting an "[Errno 2] No such file or directory" error.

@delgadom
Contributor Author

delgadom commented Mar 15, 2018

xarray==0.10.2
netCDF4==1.3.1

Just tried it again and didn't have any issues:

import os

import netCDF4
import requests
import xarray as xr

patt = (
    'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/{scen}/day/atmos/{var}/' +
    'r1i1p1/v1.0/{var}_day_BCSD_{scen}_r1i1p1_{model}_{year}.nc')

def open_url_dataset(url):
    # Use the file name (without extension) as the in-memory dataset name
    fname = os.path.splitext(os.path.basename(url))[0]
    # Download the whole file into memory
    res = requests.get(url)
    # Build a netCDF4 dataset from the bytes; nothing is written to disk
    nc4_ds = netCDF4.Dataset(fname, memory=res.content)

    # Wrap the netCDF4 dataset so that xarray can read it
    store = xr.backends.NetCDF4DataStore(nc4_ds)
    ds = xr.open_dataset(store)

    return ds

ds = open_url_dataset(url=patt.format(
        model='GFDL-ESM2G', scen='historical', var='tasmax', year=1988))
ds
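The URL template above can also be expanded into a batch of URLs. The following is only a sketch: `build_urls` is a hypothetical helper, and which model, scenario, variable and year combinations actually exist in the bucket is not something this thread confirms.

```python
from itertools import product

# Same URL template as in the comment above.
PATT = (
    'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/{scen}/day/atmos/{var}/'
    'r1i1p1/v1.0/{var}_day_BCSD_{scen}_r1i1p1_{model}_{year}.nc')


def build_urls(models, scens, variables, years):
    """Return one URL per (model, scen, var, year) combination."""
    return [
        PATT.format(model=m, scen=s, var=v, year=y)
        for m, s, v, y in product(models, scens, variables, years)
    ]


# Three consecutive years of the dataset opened above (years are an assumption).
urls = build_urls(['GFDL-ESM2G'], ['historical'], ['tasmax'], range(1988, 1991))
```

Each entry in `urls` could then be passed to `open_url_dataset`, keeping in mind that every file is downloaded fully into memory.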

@nickwg03

@delgadom Ah, I see. I needed libnetcdf=4.5.0, I had been using an earlier version. Sounds like prior to 4.5.0 there were still some issues with the name of the file being passed into netCDF4.Dataset, as is mentioned here: Unidata/netcdf4-python#295
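A quick way to check for this before trying the in-memory approach is to compare the version of the underlying C library against 4.5.0. This is a sketch only: the helper names are hypothetical, and the 4.5.0 threshold is taken from the comment above rather than verified independently.

```python
import re

# Minimum libnetcdf version reported in this thread to work with memory=...
MIN_LIBNETCDF = (4, 5, 0)


def parse_version(version: str) -> tuple:
    """Turn a string such as '4.5.0' or '4.6.1-rc1' into a 3-tuple of ints."""
    # Ignore any suffix after a dash, then collect the numeric components.
    parts = [int(n) for n in re.findall(r"\d+", version.split("-")[0])]
    # Pad with zeros so that '4.5' compares equal to '4.5.0'.
    return tuple((parts + [0, 0, 0])[:3])


def supports_memory_datasets(libnetcdf_version: str) -> bool:
    """Return True if the given libnetcdf version is at least 4.5.0."""
    return parse_version(libnetcdf_version) >= MIN_LIBNETCDF


# With netCDF4-python installed, the library version string is available as
# netCDF4.__netcdf4libversion__, e.g.:
#     supports_memory_datasets(netCDF4.__netcdf4libversion__)
```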

@JackKelly
Contributor

JackKelly commented May 28, 2020

Is this now implemented (and hence can this issue be closed)? It appears that this works well:

    import io
    import boto3
    import xarray as xr

    # bucket and key are assumed to be defined elsewhere
    boto_s3 = boto3.client('s3')
    s3_object = boto_s3.get_object(Bucket=bucket, Key=key)
    netcdf_bytes = s3_object['Body'].read()
    netcdf_bytes_io = io.BytesIO(netcdf_bytes)
    ds = xr.open_dataset(netcdf_bytes_io)

Is that the right approach to opening a NetCDF file on S3, using the latest xarray code?

@JackKelly
Contributor

FWIW, I've also tested @delgadom's technique, using netCDF4 and it also works well (and is useful in situations where we don't want to install h5netcdf). Thanks!
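The two techniques discussed in this thread can be combined into one helper that picks whichever backend is installed. This is a rough sketch, not an xarray API: `has_module` and `open_netcdf_bytes` are hypothetical names, the choice to prefer h5netcdf is arbitrary, and xarray and netCDF4 are imported lazily so that the module itself loads without them.

```python
import importlib.util
import io


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def open_netcdf_bytes(netcdf_bytes: bytes, name: str = "in-memory"):
    """Open a netCDF4/HDF5 file image held in memory as an xarray Dataset."""
    import xarray as xr  # lazy import: only needed when the function runs

    if has_module("h5netcdf"):
        # File-like object route, as in the boto3 example above.
        return xr.open_dataset(io.BytesIO(netcdf_bytes), engine="h5netcdf")

    # Fallback: netCDF4-python's memory= argument (no h5netcdf needed).
    import netCDF4

    nc4_ds = netCDF4.Dataset(name, memory=netcdf_bytes)
    return xr.open_dataset(xr.backends.NetCDF4DataStore(nc4_ds))
```

Either branch still reads the whole object into memory first, so this does not provide the streaming access requested in the issue title.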
