Reading Dataset from memory #406

Closed
m-novikov opened this issue May 5, 2015 · 38 comments · Fixed by #652

Comments

@m-novikov

Can netcdf4 read Dataset from memory?
Something like this:

from netCDF4 import Dataset
f = open('netcdf_file.nc', 'rb')
df = Dataset(f)

It would be quite a convenient and useful API.
For now, to read a netCDF4 Dataset from a URL I need to explicitly save it as a named temporary file.

@lesserwhirls
Collaborator

Greetings!

We are working on this on the C side at Unidata, and hopefully it will be
finished soon. Once it's in the C lib, there should be very little work to
expose it in the netcdf4-python lib.

Cheers!

Sean


@dopplershift
Member

To be clear, though, this would involve passing the entire "string" of data, not a file-like object.

@jswhit
Collaborator

jswhit commented May 5, 2015

You can create a Dataset in memory using 'diskless=True'.

@lesserwhirls
Collaborator

Create, yes, but read - no.

Let's say you use the NetcdfSubset service from the THREDDS Data Server.
The server will return a netCDF file. When you use urllib2 to make the
request, you end up reading the server response, which is the bytes of a
netcdf file already in memory. The idea is to read in memory to remove the
need to write a temporary file to disk.

Sean


@jswhit
Collaborator

jswhit commented May 5, 2015

You could copy the data directly to a diskless file (without first writing to disk), couldn't you?

@dopplershift
Member

Define "diskless file".

NetCDF requires a filename to read data from.

@jswhit
Collaborator

jswhit commented May 5, 2015

from netCDF4 import Dataset
nc = Dataset(URL)
ncm = Dataset('inmemory.nc',diskless=True,mode='w')
# ... logic to copy data from nc to ncm ...

Then you have a 'diskless' (in-memory) version of the dataset at URL.

@dopplershift
Member

Is the nc = Dataset(URL) supposed to work for anything besides opendap? It didn't for me when I just tried to hit a THREDDS server using NCSS. What I'm picturing is:

from netCDF4 import Dataset
from urllib.request import urlopen
url = urlopen(URL)
ncdata = url.read()
nc = Dataset(ncdata, diskless=True, mode='r')

@m-novikov
Author

Passing a string seems like a fine idea; for my use case it would be convenient enough (I read 10 MB NetCDF files, which are not a problem to hold in Python memory).
But reading the whole file into memory will increase memory consumption, and the string will need to be garbage collected, which does not always happen quickly in CPython.
Why not stick to a file interface? That would let the developer handle the data source in whatever way is most convenient; a plain string can be wrapped with cStringIO/StringIO.

PS. I don't really know the binary structure of the NetCDF4 format. Even if it cannot be read or written incrementally, handling buffer consumption on the C side of the Python extension could be good for large files.

@dopplershift
Member

Well right now, the C netcdf library only takes a filename or an opendap URL; there's not even the option of taking any kind of file pointer. Even if the latter was possible, you still wouldn't be able to turn a StringIO instance into such a thing for standard C.

What they're adding to the netCDF C library is an API to point to an existing in-memory buffer and eliminate all file I/O; HDF5 already has such an API. It would be possible to add a Python API to netcdf4-python to take a file-like object, but at some level here all of the data needs to be read into a buffer, with a single pointer to be handed to the C-library. This is likely not to actually be a str, but I'm not sure if it's bytes, bytearray, memoryview or what (direct conversion to non-Python buffer?). I still need to learn a bit more Cython...
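As a rough illustration of the point above (everything ends up in one contiguous buffer before the C library sees it), here is a minimal sketch. The helper name read_into_buffer is hypothetical and not part of netcdf4-python; it only shows how a file-like object could be collapsed into a single bytes object.

```python
import io


def read_into_buffer(source):
    """Return the full contents of `source` as one contiguous bytes object.

    `source` may already be bytes-like, or a binary file-like object
    with a read() method (an open file, io.BytesIO, an HTTP response).
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    data = source.read()
    if not isinstance(data, bytes):
        raise TypeError("file-like object must be opened in binary mode")
    return data


# Example with an in-memory stream standing in for a download.
stream = io.BytesIO(b"CDF\x01 example payload")
buffer = read_into_buffer(stream)
print(len(buffer))
```

With the memory= keyword that netcdf4-python gained later (it is used further down this thread), the resulting bytes could then be handed over as netCDF4.Dataset('in-mem-file', mode='r', memory=buffer).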

@shoyer
Contributor

shoyer commented May 7, 2015

There's been some discussion about this over on the h5py issue tracker (h5py/h5py#552). It sounds like some changes to the HDF5 libraries may be necessary to make this work entirely smoothly.

In the meantime, if you're working with netCDF3 files, using a file like object is already possible with the scipy.io.netcdf interface.

@m-novikov
Author

Thank you. Since I read NetCDF4 files, for now I will settle for the NamedTemporaryFile workaround.
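For readers landing here, the temporary-file workaround could look roughly like the sketch below. The helper name write_temp_netcdf is made up for illustration, and the payload is a placeholder rather than a real netCDF file, so the actual netCDF4.Dataset call is only indicated in a comment.

```python
import os
import tempfile


def write_temp_netcdf(data, suffix=".nc"):
    """Write `data` (bytes) to a named temporary file and return its path.

    delete=False keeps the file on disk after the handle is closed, so
    a library that only accepts a filename can reopen it; the caller is
    responsible for removing it afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name


path = write_temp_netcdf(b"placeholder bytes, not a real netCDF file")
try:
    # With real data one would open it here, e.g. netCDF4.Dataset(path).
    print(os.path.getsize(path))
finally:
    os.remove(path)
```

Closing the handle before reopening the path avoids the sharing problems that an open NamedTemporaryFile has on some platforms, at the cost of manual cleanup.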

@thehesiod
Contributor

thehesiod commented Apr 27, 2017

I looked into using a local OPeNDAP server but couldn't find anything that worked easily for serving local netCDF files. This would be an option if anyone can get it to work.

@thehesiod
Contributor

argh, did the work to create #652, however found: Unidata/netcdf-c#394 :(

@thehesiod
Contributor

btw, I'm guessing this is a dup of #295

@thehesiod
Contributor

btw, there may be another bug in netcdf-c with in-memory files: I just tried a 2D array of data, and all rows after 100 returned garbage. Investigating this now.

@tam203

tam203 commented Mar 29, 2019

Sorry for the cross-post, but this isn't working for me as per the docs. I've tried Python 2.7 and 3.7 and get the same error:

[ec2-user@ip-172-31-12-20 project]$ python3 inmem.py
Traceback (most recent call last):
  File "inmem.py", line 5, in <module>
    netCDF4.Dataset("in-mem-file", mode='r', memory=data)
  File "netCDF4/_netCDF4.pyx", line 2285, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

code:

import netCDF4
with open('./db8d6757c80a3fa51779a325ba76336451ea0344.nc','rb') as fp:
    data = fp.read()
ds = netCDF4.Dataset("in-mem-file", mode='r', memory=data)
print(ds)

netCDF4 version '1.5.0'

a FileNotFoundError seems irrelevant since I'm trying to read from memory. Help much appreciated.

@jswhit jswhit reopened this Mar 29, 2019
@jswhit
Collaborator

jswhit commented Mar 29, 2019

@tam203 - this should work. I can only think of two reasons it might not.

  1. you have an older version of netcdf-c that either doesn't support in-memory access, or has a bug that has since been fixed, or
  2. there's something about your file that the library doesn't like.

Can you tell us what version of netcdf-c you have, and post that file somewhere? (if it's small enough you can tar/zip it and attach it to this ticket).

@jswhit
Collaborator

jswhit commented Mar 29, 2019

Could be related to Unidata/netcdf-c#394 which I believe was fixed in netcdf-c 4.5.0

@tam203

tam203 commented Apr 1, 2019

Thanks. I'm using what came from python3 -m pip install netcdf4 -t .

I think you are correct about the bug; it looks like I'm on version 4.4.1.1 of netcdf-c:

>>> netCDF4.getlibversion()
'4.4.1.1 of Mar 23 2019 19:51:19 $'

How do I go about getting version 4.5 and will the pip version be updated shortly?

I'm packaging this up on an AWS EC2 machine to use in Lambda, so I need the C library to be packaged with the Python package, not just installed somewhere on the system, if that makes sense.
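The version check discussed here can be automated. The sketch below parses the string format shown in the getlibversion() output above and compares it with 4.5.0, the release in which the bug is believed (per the earlier comment, not verified here) to have been fixed. The function names are illustrative only.

```python
import re


def parse_libversion(version_string):
    """Extract the leading dotted version as a tuple of ints."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", version_string)
    if match is None:
        raise ValueError(f"no version number in {version_string!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum_version(version_string, minimum=(4, 5, 0)):
    """True if the reported library version is at least `minimum`."""
    parsed = parse_libversion(version_string)
    # Pad short versions such as "4.5" with zeros before comparing.
    padded = parsed + (0,) * max(0, len(minimum) - len(parsed))
    return padded >= minimum


reported = "4.4.1.1 of Mar 23 2019 19:51:19 $"
print(parse_libversion(reported))
print(meets_minimum_version(reported))
```

In a real environment the input would come from netCDF4.getlibversion() rather than a hard-coded string.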

@jswhit
Collaborator

jswhit commented Apr 1, 2019

Ah - I see the linux and osx wheels are built using 4.4.1.1. I will update that and create a new release (1.5.0.1) with new binary wheels. If you have a newer version of the library on your system you can follow the build instructions in the docs to rebuild from source and link against the newer library.

@jswhit
Collaborator

jswhit commented Apr 1, 2019

wheels for 1.5.0.1 are available (using netcdf-c 4.6.3). Please let me know if this fixes the problem.

@tam203

tam203 commented Apr 2, 2019

@jswhit Perfect that's fixed it thanks.

For anyone's reference:

python3 -m pip install "netcdf4>=1.5.0.1" -t .

Is what I ran to ensure I got the new version. Ta.

@duducosmos

> @jswhit Perfect that's fixed it thanks.
>
> For anyone's reference:
>
> python3 -m pip install "netcdf4>=1.5.0.1" -t .
>
> Is what I ran to ensure I got the new version. Ta.

Solved my problem

@kmfweb

kmfweb commented Sep 11, 2019

Didn't solve my problem, unfortunately.

Installing collected packages: numpy, cftime, netcdf4
Successfully installed cftime-1.0.3.4 netcdf4-1.5.2 numpy-1.17.2

Any other ideas?

@dopplershift
Member

@kmfweb If you're installing using pip, that means you're using your system's version of netcdf-c (libnetcdf). What version of that is installed?

@kmfweb

kmfweb commented Sep 12, 2019

@dopplershift I checked using the command ncdump or nc-config --version, which gives me this last line of output: netcdf library version 4.4.1.1 of Jun 8 2018 03:08:32

I have some old netCDF data which could, and still can, be read. But with the new data I would like to read, I receive the "FileNotFoundError: [Errno 2] No such file or directory: b". The file path and name are correct, and I am able to access the file via ncview as well.

@dopplershift
Member

I'm confused. Is this data you have in a file on disk or data that's already in a buffer in memory? Can you provide sample code for what's not working?

@kmfweb

kmfweb commented Sep 12, 2019

I have been reading in decades of data, e.g. file19701979.nc, file19801989.nc, file19901999.nc, etc., using a loop. Within this loop I call a function, "New_Data,Latitudes,Longitudes = GetGrid4Slice(FileName,ReadInfo,SliceInfo,LatInfo,LonInfo)", which includes "ncf=netcdf.netcdf_file(FileName,'r')". For those decadal netCDF files it runs without any problems.

I have now removed the decades loop, as I am working with only a single netCDF file. For this file I receive the FileNotFoundError when calling "ncf=netcdf.netcdf_file(FileName,'r')". I am quite sure that it is a proper netCDF file, as I am able to view it with ncview.

I am not sure whether this is about my loop or the library's version. The file name and path are definitely correct, and the error is with the "b".

@dopplershift
Member

Ok, then I think you should open a new issue. This issue is about reading datasets from a buffer that already exists in memory, not a file on disk.

@cpaton8

cpaton8 commented Apr 20, 2020

I installed the most recent version of netcdf4 and am getting the error "OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built."

import requests
from requests.auth import HTTPBasicAuth
import netCDF4
import io  # needed for io.BytesIO below
import os
import xarray as xr

username = ""
password = ""
session = requests.session()
session.auth = HTTPBasicAuth(username, password)

link = "https://urs.earthdata.nasa.gov/oauth/authorize?scope=uid&app_type=401&client_id=ijpRZvb9qeKCK5ctsn75Tg&response_type=code&redirect_uri=https%3A%2F%2Fe4ftl01.cr.usgs.gov%2Foauth&state=aHR0cHM6Ly9lNGZ0bDAxLmNyLnVzZ3MuZ292Ly9NT0RWNl9DbXBfQi9NT0xUL01PRDEzQTEuMDA2LzIwMTkuMDIuMDIvTU9EMTNBMS5BMjAxOTAzMy5oMDF2MDcuMDA2LjIwMTkwNTAyMDA0NTMuaGRm"

bio = io.BytesIO()
with session.get(link, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=2 ** 20):
        bio.write(chunk)
bio.seek(0)
 
test_bytes = bio.read()
netcdf_files = netCDF4.Dataset('in-mem-file', mode='r', memory=test_bytes)

@dopplershift
Member

@cpaton8 That error message says:

Attempt to use feature that was not turned on when netCDF was built.

I'm not sure how you installed the netcdf-c package (libnetcdf.so or libnetcdf.dylib), but that message means it did not have the memory-based reading enabled when it was compiled.
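One way to check this from Python, sketched under an assumption: recent netcdf4-python releases are understood to expose a module-level flag named __has_nc_open_mem__ that reflects whether the linked netcdf-c was built with in-memory reading. Treat that attribute name as an assumption to verify against your installed version. The helper below is hypothetical and deliberately tolerant of the flag being absent.

```python
from types import SimpleNamespace


def has_in_memory_read(module):
    """Return True/False if the module reports the flag, None if it is absent.

    Assumes the flag is called __has_nc_open_mem__ (not verified here).
    """
    flag = getattr(module, "__has_nc_open_mem__", None)
    return None if flag is None else bool(flag)


# Stand-in object so the sketch runs without netCDF4 installed;
# in practice one would pass the imported netCDF4 module instead.
fake_build = SimpleNamespace(__has_nc_open_mem__=0)
print(has_in_memory_read(fake_build))
```

A result of None means the flag was not found, which says nothing either way about the underlying library.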

@cpaton8

cpaton8 commented Apr 30, 2020

@dopplershift all packages installed via conda-forge. libnetcdf 4.7.3 and netcdf 1.4.3
I am getting the same error when trying the above.
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

@dopplershift
Member

@cpaton8 A couple things:

  1. What OS and Python version are you on? conda (with conda-forge configured) has informed me in no uncertain terms that it will NOT install libnetcdf 4.7.3 and netcdf4 1.4.3 together.
  2. The sample code you provided does fail for me, but the message I get is OSError: [Errno -51] NetCDF: Unknown file format: b'in-mem-file', which is because in this case test_bytes is: b'HTTP Basic: Access denied.\n'

So, on macOS, in this environment:

>conda list
# packages in environment at /Users/rmay/miniconda3/envs/test-env:
#
# Name                    Version                   Build  Channel
brotlipy                  0.7.0           py37h9bfed18_1000    conda-forge
bzip2                     1.0.8                h0b31af3_2    conda-forge
ca-certificates           2020.4.5.1           hecc5488_0    conda-forge
certifi                   2020.4.5.1       py37hc8dfbb8_0    conda-forge
cffi                      1.14.0           py37h356ff06_0    conda-forge
cftime                    1.1.1.2          py37h10e2902_0    conda-forge
chardet                   3.0.4           py37hc8dfbb8_1006    conda-forge
cryptography              2.9.2            py37he655712_0    conda-forge
curl                      7.69.1               h2d98d24_0    conda-forge
hdf4                      4.2.13            h84186c3_1003    conda-forge
hdf5                      1.10.6          nompi_h3e39495_100    conda-forge
idna                      2.9                        py_1    conda-forge
jpeg                      9c                h1de35cc_1001    conda-forge
krb5                      1.17.1               h1752a42_0    conda-forge
libblas                   3.8.0               16_openblas    conda-forge
libcblas                  3.8.0               16_openblas    conda-forge
libcurl                   7.69.1               hc0b9707_0    conda-forge
libcxx                    10.0.0               h1af66ff_2    conda-forge
libedit                   3.1.20170329      hcfe32e1_1001    conda-forge
libffi                    3.2.1             h4a8c4bd_1007    conda-forge
libgfortran               4.0.0                         2    conda-forge
liblapack                 3.8.0               16_openblas    conda-forge
libnetcdf                 4.7.4           nompi_ha11d67f_102    conda-forge
libopenblas               0.3.9                h3d69b6c_0    conda-forge
libssh2                   1.8.2                hcdc9a53_2    conda-forge
llvm-openmp               10.0.0               h28b9765_0    conda-forge
ncurses                   6.1               h0a44026_1002    conda-forge
netcdf4                   1.5.3           nompi_py37hf55ae24_105    conda-forge
numpy                     1.18.1           py37h7687784_1    conda-forge
openssl                   1.1.1g               h0b31af3_0    conda-forge
pip                       20.1               pyh9f0ad1d_0    conda-forge
pycparser                 2.20                       py_0    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pysocks                   1.7.1            py37hc8dfbb8_1    conda-forge
python                    3.7.6           h90870a6_5_cpython    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
readline                  8.0                  hcfe32e1_0    conda-forge
requests                  2.23.0             pyh8c360ce_2    conda-forge
setuptools                46.1.3           py37hc8dfbb8_0    conda-forge
six                       1.14.0                     py_1    conda-forge
sqlite                    3.30.1               h93121df_0    conda-forge
tk                        8.6.10               hbbe82c9_0    conda-forge
urllib3                   1.25.9                     py_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h0b31af3_0    conda-forge
zlib                      1.2.11            h0b31af3_1006    conda-forge

this code works fine:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

@nicksilver

nicksilver commented Aug 13, 2020

@cpaton8 I am having a similar problem with netCDF version 4.6.0. When I run your above example:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

I get the error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "netCDF4/_netCDF4.pyx", line 2358, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1926, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -51] NetCDF: Unknown file format: b'in-mem-file'

My python 3.6.9 environment looks like this:

Package          Version
---------------- ------------
affine           2.3.0
asciitree        0.3.3
attrs            19.3.0
backcall         0.2.0
beautifulsoup4   4.9.1
boto3            1.14.39
botocore         1.17.39
Bottleneck       1.3.2
certifi          2020.6.20
cffi             1.14.1
cfgrib           0.9.8.4
cftime           1.2.1
chardet          3.0.4
click            7.1.2
click-plugins    1.1.1
cligj            0.5.0
cycler           0.10.0
decorator        4.4.2
docopt           0.6.2
docutils         0.15.2
fasteners        0.15
Fiona            1.8.13.post1
geopandas        0.8.1
idna             2.10
ipykernel        5.3.4
ipython          7.16.1
ipython-genutils 0.2.0
jedi             0.17.2
Jinja2           2.11.2
jmespath         0.10.0
jupyter-client   6.1.6
jupyter-core     4.6.3
kiwisolver       1.2.0
llvmlite         0.33.0
MarkupSafe       1.1.1
matplotlib       3.3.0
monotonic        1.5
munch            2.5.0
nc-time-axis     1.2.0
netCDF4          1.5.4
numba            0.50.1
numbagg          0.1
numcodecs        0.6.4
numpy            1.19.1
pandas           1.1.0
parso            0.7.1
pexpect          4.8.0
pickleshare      0.7.5
Pillow           7.2.0
pip              20.2.2
pkg-resources    0.0.0
prompt-toolkit   3.0.6
protobuf         4.0.0rc2
psycopg2-binary  2.8.5
ptyprocess       0.6.0
pycparser        2.20
Pydap            3.2.2
Pygments         2.6.1
pyparsing        2.4.7
pyproj           2.6.1.post1
python-dateutil  2.8.1
pytz             2020.1
pyzmq            19.0.2
rasterio         1.1.5
requests         2.24.0
s3transfer       0.3.3
setuptools       49.3.1
Shapely          1.7.0
siphon           0.8.0
six              1.15.0
snuggs           1.4.7
soupsieve        2.0.1
tornado          6.0.4
traitlets        4.3.3
urllib3          1.25.10
wcwidth          0.2.5
WebOb            1.8.6
xarray           0.16.0
zarr             2.4.0

Thoughts?

@dopplershift
Member

@nicksilver I find it really useful in cases like this to look at what's being returned by requests. If I take the code that's failing for you and print out the response:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    print(resp.content.decode('utf-8'))

I see:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>TDS - Error report</title>
    <link rel="stylesheet" href="/thredds/tds.css" type="text/css"/>
  </head>
  <body>
    <h1>HTTP Status 404 - Not Found</h1>
    <HR size="1" noshade="noshade">
    <p><b>Status</b> 404 - Not Found</p>
    <HR size="1" noshade="noshade">
    <h3>THREDDS Data Server Version 4.6
      -- <a href='https://www.unidata.ucar.edu/software/thredds/v4.6/tds/TDS.html'>Documentation</a></h3>
  </body>
</html>

So the original data file has aged off the server. If I update to a currently available file:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20202261740546_e20202261741003_c20202261741040.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

I get ABI L1b Radiances.

@cpaton8

cpaton8 commented Aug 13, 2020

@nicksilver Not sure if this is the issue you are running into, but the MODIS files we've been working with are HDF-EOS v2, which is based on HDF4. They would need to be converted (there's a tool called h4toh5) before they're compatible with netCDF-4.
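Several of the failures in this thread came down to the buffer not being netCDF at all (an HTML 404 page, an "Access denied" message, an HDF4 file). A quick look at the leading bytes can catch that before the buffer is handed to the library. The sketch below is a heuristic, not the library's own detection logic: sniff_format is a made-up helper, and it only checks for the HDF5 signature at offset 0, although HDF5 also permits it at later offsets.

```python
def sniff_format(data):
    """Guess the container format from the first bytes of a buffer."""
    if data.startswith(b"\x89HDF\r\n\x1a\n"):
        return "HDF5 (netCDF-4 candidate)"
    # Classic netCDF: "CDF" followed by a version byte 1, 2 or 5.
    if data[:3] == b"CDF" and data[3:4] in (b"\x01", b"\x02", b"\x05"):
        return "netCDF classic"
    if data.startswith(b"\x0e\x03\x13\x01"):
        return "HDF4"
    head = data.lstrip()[:64].lower()
    if head.startswith((b"<html", b"<!doctype", b"<?xml", b"http")):
        return "text (probably an error message)"
    return "unknown"


print(sniff_format(b"HTTP Basic: Access denied.\n"))
print(sniff_format(b"\x0e\x03\x13\x01" + b"\x00" * 16))
```

An "HDF4" or "text" result here would explain an "Unknown file format" error without involving the in-memory code path at all.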

@nicksilver

Beautiful...thank you!
