Opening Zarr files with R #1982

Closed
gdkrmr opened this issue Apr 12, 2021 · 15 comments

Comments

@gdkrmr

gdkrmr commented Apr 12, 2021

I built netcdf-c v4.8.0 on Manjaro and then the ncdf4 R package, and I get the following error:

> nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr/#mode=zarr,file")
Error in R_nc4_open: NetCDF: internal library error; Please contact Unidata support
Error in nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr/#mode=zarr,file") :
  Error in nc_open trying to open file file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr/#mode=zarr,file

A wrongly specified file type causes a hard crash:

> nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr/#mode=nczarr,file")
R: /home/gkraemer/progs/cpp/netcdf-c/libnczarr/zclose.c:223: zclose_type: Assertion `type && type->format_type_info != NULL' failed.
Aborted (core dumped)

Sorry that this is not more reproducible: the dataset is too large to put online, and so far I have not been able to build netcdf-c with S3 support.
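
For readers following along, the URLs in this report share one shape: a file:// path followed by a #mode=... fragment. A minimal Python sketch of composing and inspecting such URLs is shown here. The helper names zarr_url and url_modes are hypothetical, the assumption that a fragment may hold several &-separated key=value pairs is mine, and the sketch only manipulates strings; it does not check which mode combinations netcdf-c accepts.

```python
from urllib.parse import quote, urlsplit


def zarr_url(path: str, modes: list[str]) -> str:
    """Compose a file:// URL with a #mode=... fragment (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("an absolute POSIX path is expected")
    # quote() percent-encodes unsafe characters but leaves "/" untouched.
    return f"file://{quote(path)}#mode={','.join(modes)}"


def url_modes(url: str) -> list[str]:
    """Return the comma-separated values of the mode key in the fragment."""
    fragment = urlsplit(url).fragment
    # Assumption: the fragment may contain several &-separated key=value pairs.
    for part in fragment.split("&"):
        key, _, value = part.partition("=")
        if key == "mode":
            return value.split(",")
    return []
```

Calling zarr_url("/data/cube.zarr", ["nczarr", "file"]) yields a URL with no slash between the store name and the "#", which is the form used in the later comments of this thread.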

@DennisHeimbigner
Collaborator

Do you know by chance if the file is an xarray created file?

@DennisHeimbigner
Collaborator

Try to send me just the metadata from the file by doing something like this.

  1. find /home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr -name '.*' > tmp.txt
  2. tar -cf metadata.tar -T tmp.txt
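
A roughly equivalent sketch in Python, for anyone without find and tar at hand. It writes a zip archive instead of a tar file and, unlike find -name '.*', skips directories. The function name archive_metadata and the example paths are hypothetical.

```python
import zipfile
from pathlib import Path


def archive_metadata(store: str, archive: str) -> list[str]:
    """Zip every dot-file under a Zarr store; return the archived names."""
    root = Path(store)
    # rglob(".*") matches names starting with "." at any depth.
    names = sorted(
        p.relative_to(root.parent).as_posix()
        for p in root.rglob(".*")
        if p.is_file()
    )
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # Store each file under its path relative to the store's parent.
            zf.write(root.parent / name, name)
    return names
```

Usage would look like archive_metadata("/path/to/cube.zarr", "metadata.zip"); chunk files are left out because their names do not start with a dot.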

@gdkrmr
Author

gdkrmr commented Apr 12, 2021

The dataset is created here [1]. I think they are using xarray; it is definitely not nczarr. The metadata is in [2].

[1] https://github.com/esa-esdl/cube-generator

[2] Edit: sorry, it took me a second to understand. You get it as a zip, because GitHub doesn't allow tar files.

metadata.zip

@DennisHeimbigner
Collaborator

Did not expect nczarr :-)
We have a fix for handling xarray zarr format. But let me
look at the metadata first to see if I see any other issues.

@DennisHeimbigner
Collaborator

So a couple of questions.
First, what software was used to generate the file esdc-8d-0.083deg-184x270x270-2.1.0.zarr?

Second, in that file, the object esdc-8d-0.083deg-184x270x270-2.1.0.zarr/aerosol_optical_thickness_1600/.zattrs has this entry:
"source_attributes": "{'comment': 'Aerosol optical thickness derived from the dataset produced by the Aerosol CCI project.', 'long_name': ...}

When you print out this attribute value with the original software, what does it look like? I ask because I have no way to deal with an attribute whose value is a JSON dictionary, and I am curious how it should look.

@gdkrmr
Author

gdkrmr commented Apr 15, 2021

First, what software was used to generate the file esdc-8d-0.083deg-184x270x270-2.1.0.zarr?

The datasets are generated with this package; I think it uses xarray internally:

https://github.com/esa-esdl/esdl-core/

You can find the actual providers of the data here, along with the metadata you are asking about:

https://github.com/esa-esdl/esdl-core/tree/master/esdl/providers

When you print out this attribute value with the original software, what does it look like? I ask because I have no way to deal with an attribute whose value is a JSON dictionary, and I am curious how it should look.

I have never used Python to read these datasets. I use the Julia package ESDL.jl [1] and have never seen these attributes printed. You can open the dataset using Python:

In [1]: import xarray as xr
In [2]: c = xr.open_zarr("/path/to/cube.zarr")
In [13]: c.aerosol_optical_thickness_1600
Out[13]: 
<xarray.DataArray 'aerosol_optical_thickness_1600' (time: 1840, lat: 2160, lon: 4320)>
[17169408000 values with dtype=float64]
Coordinates:
  * lat      (lat) float64 89.96 89.88 89.79 89.71 ... -89.79 -89.87 -89.96
  * lon      (lon) float64 -180.0 -179.9 -179.8 -179.7 ... 179.8 179.9 180.0
  * time     (time) datetime64[ns] 1979-01-05 1979-01-13 ... 2018-12-31
Attributes: (12/13)
    Conventions:          CF-1.6
    easting:              -180.0 degrees
    esa_cci_path:         /neodc/esacci/aerosol/data/AATSR_SU/L3/v4.3/DAILY/
    history:              Thu May  7 16:43:23 2020 - ESDL data cube generation
    institution:          Brockmann Consult GmbH, Germany
    northing:             90.0 degrees
    ...                   ...
    source:               ESDL data cube generation, version 0.3.0.dev1
    source_attributes:    {'comment': 'Aerosol optical thickness derived from...
    time_coverage_end:    2012-04-10
    time_coverage_start:  2002-05-21
    units:                1
    url:                  http://www.esa-aerosol-cci.org/
In [14]: c.aerosol_optical_thickness_1600.source_attributes
Out[14]: "{'comment': 'Aerosol optical thickness derived from the dataset produced by the Aerosol CCI project.', 'long_name': 'Aerosol Optical Thickness at 1600 nm', 'project_name': 'ESA Aerosol CCI', 'references': 'Holzer-Popp, T., de Leeuw, G., Griesfeller, J., Martynenko, D., Klueser, L., Bevan, S., et al. (2013). Aerosol retrieval experiments in the ESA Aerosol_cci project. Atmospheric Measurement Techniques, 6, 1919-1957. doi:10.5194/amt-6-1919-2013. ', 'source_name': 'AOD1600_mean', 'standard_name': 'atmosphere_optical_thickness_due_to_aerosol_at_1600nm', 'units': '1', 'url': 'http://www.esa-aerosol-cci.org/'}"

It seems they just keep it as a string.

[1] https://github.com/esa-esdl/esdl-core/tree/master/esdl/providers
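
A small sketch of what "just a string" means in practice: the value is the text of a Python dict literal with single-quoted keys, so a strict JSON parser rejects it, while ast.literal_eval can turn it back into a dict. The raw value is abbreviated from an attribute of the same dataset quoted elsewhere in this thread, and the helper name parse_source_attributes is hypothetical.

```python
import ast
import json

# Abbreviated from a source_attributes value of this dataset.
raw = (
    "{'comment': 'Sensible heat flux from the surface', "
    "'long_name': 'Sensible Heat', 'units': 'W m-2'}"
)


def parse_source_attributes(value: str) -> dict:
    """Parse the attribute as JSON, falling back to a Python literal."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        # Single-quoted keys are not valid JSON; literal_eval accepts them
        # without executing arbitrary code.
        parsed = ast.literal_eval(value)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary-like attribute value")
    return parsed


attrs = parse_source_attributes(raw)
print(attrs["long_name"])
```

From the point of view of the .zattrs file, the attribute is an ordinary JSON string whose content merely looks like a dictionary.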

@DennisHeimbigner
Collaborator

Is there a way you can get Julia to explicitly print that attribute?

@gdkrmr
Author

gdkrmr commented Apr 16, 2021

For whatever reason, that particular variable does not show up if I read in the entire dataset (https://github.com/esa-esdl/ESDL.jl/issues/248).
Here is another one; it is also just read in as a string, same as in Python:

julia> using ESDL

julia> c = Cube("/home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr/")
YAXArray with the following dimensions
lon                 Axis with 4320 Elements from -179.95833333333331 to 179.95833333333331
lat                 Axis with 2160 Elements from 89.95833333333334 to -89.95833333333333
time                Axis with 1840 Elements from 1979-01-05T00:00:00 to 2018-12-31T00:00:00
Variable            Axis with 69 elements: leaf_area_index sensible_heat .. snow_sublimation Rg 
units: W m-2
Total size: 4.31 TB


julia> c.data.arrays[2].attrs["source_attributes"]
"{'comment': 'Sensible heat flux from the surface', 'long_name': 'Sensible Heat', 'project_name': 'FLUXCOM', 'references': 'Tramontana, Gianluca, et al. \"Predicting carbon dioxide and energy fluxes across global FLUXNET sites with regression algorithms.\" (2016).', 'source_name': 'H', 'standard_name': 'surface_upward_sensible_heat_flux', 'units': 'W m-2', 'url': 'http://www.fluxcom.org/'}"

@DennisHeimbigner
Collaborator

Ok, the current netcdf-c github master should solve this problem.

@DennisHeimbigner
Collaborator

OK, I think I have this solved. It turns out there was a bug in my JSON parser: it was not
handling one of the attribute string values correctly. I will submit a PR for this
tomorrow.

@DennisHeimbigner
Collaborator

If you want to fix the bug yourself:

  1. edit netcdf-c/libnczarr/zjson.c
  2. About line 321 change:
    if(c == NCJ_ESCAPE) c++;
    to
    if(c == NCJ_ESCAPE) parser->pos++;
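
To illustrate why advancing the position matters, here is a minimal Python sketch of a double-quoted string scanner. It is not the netcdf-c code; the function scan_string and its advance_on_escape flag are hypothetical. The buggy variant is modeled by leaving the position alone after a backslash, on the assumption that incrementing the character variable in the C code had no effect on where the parser reads next.

```python
ESCAPE = "\\"


def scan_string(text: str, start: int, advance_on_escape: bool = True) -> tuple[str, int]:
    """Scan a double-quoted string that begins at text[start].

    Returns the raw characters between the quotes and the index just past
    the closing quote.
    """
    if text[start] != '"':
        raise ValueError("expected an opening double quote")
    pos = start + 1
    while pos < len(text):
        c = text[pos]
        if c == ESCAPE:
            if advance_on_escape:
                # Step over the escaped character so that an escaped quote
                # is not mistaken for the end of the string.
                pos += 1
        elif c == '"':
            return text[start + 1:pos], pos + 1
        pos += 1
    raise ValueError("unterminated string")


sample = r'"say \"hi\" now"'
print(scan_string(sample, 0))
print(scan_string(sample, 0, advance_on_escape=False))
```

With the position advanced, the whole value is returned; without it, the scan stops at the first embedded quote and the rest of the input is misread.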

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue May 6, 2021
re: github issue Unidata#1982

The problem was that the libnczarr/zjson.c handling of strings with
embedded double quotes was wrong; a one line fix.
Also added a test case.

Misc. other changes:

1. I discovered, en passant, that the handling of 64 bit constants
had an error that was fixed.
2. cleanup of the constant conversion code to recurse on arrays of values.
@DennisHeimbigner
Collaborator

Fixed by PR #1993

@gdkrmr
Author

gdkrmr commented May 10, 2021

Thanks for working on this. On the latest master I now get a segfault when using nc_open(file:///...#mode=nczarr,zarr), and sensible error messages when using mode=nczarr,file and mode=nczarr,s3.

Using ncdump on these files, I only get the message "No such file or directory":

> library(ncdf4)
> nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file")
Error in R_nc4_open: NetCDF: Attempt to read empty NCZarr map key
Error in nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file") : 
  Error in nc_open trying to open file file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file
> nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3")
Error in R_nc4_open: NetCDF: Attempt to use feature that was not turned on when netCDF was built.
Error in nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3") : 
  Error in nc_open trying to open file file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3
> nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,zarr")

 *** caught segfault ***
address 0xc49109f38, cause 'memory not mapped'

Traceback:
 1: nc_open("file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,zarr")
$ ncdump "file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,zarr"
ncdump: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,zarr: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,zarr: No such file or directory
$ ncdump "file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file"
ncdump: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,file: No such file or directory
$ ncdump "file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3"
ncdump: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3: file:///home/gkraemer/data/DataCube/v2.1.0/esdc-8d-0.083deg-184x270x270-2.1.0.zarr#mode=nczarr,s3: No such file or directory

@DennisHeimbigner
Collaborator

Unfortunately, I cannot duplicate this failure. I will have to think about how to resolve it.

@gdkrmr
Author

gdkrmr commented Feb 24, 2022

The errors have somewhat changed, so I am closing this issue in favor of these: #2235 and #2234.

I am happy to report that opening Zarr and reading metadata on Manjaro works fine now! This is great progress.

@gdkrmr gdkrmr closed this as completed Feb 24, 2022