Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing GDAL ZARR _CRS attribute not possible #6448

Closed
wankoelias opened this issue Apr 6, 2022 · 12 comments · Fixed by #6636
Closed

Writing GDAL ZARR _CRS attribute not possible #6448

wankoelias opened this issue Apr 6, 2022 · 12 comments · Fixed by #6636

Comments

@wankoelias
Copy link

wankoelias commented Apr 6, 2022

What is your issue?

Related to #6374

Writing a ZARR which is compatible with GDAL conventions using xarray.Dataset.to_zarr requires all the data variables to have a _CRS attribute which contains the Spatial Reference System encoding (SRS).
This _CRS attribute itself is a dict in which the SRS is encoded in at least one of these keys: wkt, url, projjson

Because attribute values can't be dictionaries during serialization, it does not seem possible to write GDAL compatible zarrs using xarray.

Example:

lets assume we have a Dataset ds like this:

<xarray.Dataset>
Dimensions:  (Y: 180, X: 360)
Coordinates:
  * X        (X) float64 -179.5 -178.5 -177.5 -176.5 ... 176.5 177.5 178.5 179.5
  * Y        (Y) float64 89.5 88.5 87.5 86.5 85.5 ... -86.5 -87.5 -88.5 -89.5
Data variables:
    Band1    (Y, X) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0
    Band2    (Y, X) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0
    Band3    (Y, X) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0

lets also assume we want to encode the _CRS as wkt like so:

wkt = 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'

(encoding the _CRS in any of the other 2 formats results in the same problem at the end)

Setting the attributes of each data variable:

attributes = {
        "_ARRAY_DIMENSIONS": ['Y', 'X'],
         "_CRS": {"wkt": wkt},
        "AREA_OR_POINT": 'Area',
    }

for data_var in ds.data_vars:
    ds[data_var].attrs = attributes

no problem so far, ds.Band1.attrs results in:

{
    "_ARRAY_DIMENSIONS": ["Y", "X"],
    "_CRS": {
        "wkt": 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
    },
    "AREA_OR_POINT": "Area",
}

the problem now occurs with writing the dataset using:

ds.to_zarr("test.zarr", consolidated=True)
TypeError: Invalid value for attr '_CRS': {'wkt': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'}. 

For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple
@wankoelias wankoelias added the needs triage Issue that has not been reviewed by xarray team member label Apr 6, 2022
@rabernat
Copy link
Contributor

rabernat commented Apr 6, 2022

I think the core problem here is that Zarr itself supports arbitrary json data structures as attributes, but netCDF does not. The Zarr serialization in Xarray is designed to emulate netCDF, but we could make that optional, for example, with a flag to bypass attribute encoding / decoding and just pass the python data directly through to Zarr.

However, my concern would be that netCDF4 C library would not be able to read those files (nczarr). What happens if you try to open up a GDAL-created Zarr with netCDF4?

FWIW, the new GeoZarr Spec by @christophenoel does not use the GDAL convention for CRS. Instead, it recommends to use CF conventions for encoding CRS. This is more compatible with NetCDF, but won't be parsed correctly by GDAL.

I am a little discouraged that we have not managed to align better across projects so far (e.g. having this conversation before the GDAL Zarr CRS convention was implemented). 😞 For example, either of these two GDAL PRs:

However, it is not too late! Let's try to reach for a standard way of encoding CRS in Zarr that can be used across languages and implementations of Zarr.

My own preference would be to try to get GDAL to support the GeoZarr Spec and thus the CF-convention CRS attribute, rather than trying to get Xarray to be able to write the GDAL CRS convention.

@dcherian dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 6, 2022
@wankoelias wankoelias changed the title Writing GDAL ZARR _CRS attribute Writing GDAL ZARR _CRS attribute not possible Apr 6, 2022
@christophenoel
Copy link

Very interesting topic. I assume that GDAL proposed something as Zarr specification has no provision for spatial reference system encoding.

From my point of view, by making it close to NetCDF (which is so widely used in Earth Observation missions) and providing supporting librarires, xArray made the success of Zarr in the EO world.

Indeed, my dream would be GDAL align to the GeoZarr spec which is mostly aligned with NetCDF/xArray.

@rabernat
Copy link
Contributor

rabernat commented Apr 7, 2022

@christophenoel - I share your perspective. But there is a huge swath of the geospatial world who basically hate NetCDF and avoid it like the plague. These communities prefer to use geotiff and GDAL. We need to reach for interoperability.

@christophenoel
Copy link

christophenoel commented Apr 7, 2022

Some people will prefer Cloud-Optimised GeoTiff, others (geo)Zarr.

By the way, I use rasterio (based on GDAL), xarray, GDAL and all tools without problems with CF conventions.

The impact of GDAL using CRS attribute are only:

  • when converting a non-netcdf file to Zarr with GDAL, the CRS is encoded in the CRS attribute
  • when converting a Zarr file to a non-netcdf file which encode CRS in grid-mapping, this is not converted to the CRS attribute.

@scottyhq
Copy link
Contributor

One of the main motivations behind the the rioxarray extension is GDAL compatibility. It looks like @snowman2 and @TomAugspurger have discussed saving many geotiffs loaded into xarray as GDAL-compatible Zarr for example corteva/rioxarray#433 (comment).

While it seems that the ultimate solution is agreeing on a format standard, here is another small example using the rioxarray extension where format conversion doesn't currently work as you might expect:

# https://github.com/pydata/xarray-data
ds = xr.open_dataset('xarray-data/air_temperature.nc',
                     engine='rasterio')

# TooManyDimensions: Only 2D and 3D data arrays supported.
ds.rio.to_raster('test.zarr', driver='ZARR')

# Does not error, but output not equivalent to `gdal_translate -of ZARR xarray-data/air_temperature.nc gdal_air_temp.zarr`
# for example, `gdalinfo xarray-tutorial-airtemp.zarr` gives 
# Warning 1: Too many samples along the > 2D dimensions of /air.
ds.to_zarr('xarray-tutorial-airtemp.zarr')

@snowman2
Copy link
Contributor

Instead, it recommends to use CF conventions for encoding CRS. This is more compatible with NetCDF, but won't be parsed correctly by GDAL.

GDAL does support the CF conventions for storing the CRS.

https://gdal.org/drivers/raster/netcdf.html#georeference

"The driver first tries to follow the CF-1 Convention from UNIDATA looking for the Metadata named “grid_mapping”."

@snowman2
Copy link
Contributor

This document is also a useful reference for storing CRS in xarray: https://corteva.github.io/rioxarray/stable/getting_started/crs_management.html

@snowman2
Copy link
Contributor

Top reasons for using CF over a CRS attribute:

  • Supported by geospatial & netCDF software (GDAL/Esri/Unidata)
  • Attributes are easily lost ref
  • Encourage more usage of standard mechanisms of CRS storage

@wankoelias
Copy link
Author

so just to make it clear... GDAL doesn't actually need the CRS stored with the _CRS convention as documented in their ZARR driver specification?

@snowman2
Copy link
Contributor

GDAL doesn't actually need the CRS stored with the _CRS convention as documented in their ZARR driver specification?

GDAL may need to update the ZARR driver. The NetCDF driver does support CF. It would be good for them to be consistent.

@rabernat
Copy link
Contributor

I am guilty of sidetracking this issue into the politics of CRS encoding. That discussion is important. But in the meantime, @wankoelias's original issue reveals is narrower technical issue with Xarray's Zarr writer: Xarray won't let you serialize a dictionary attribute to zarr, even though zarr has no problem with this. That is a problem we can fix pretty easily.

The _validate_attrs helper function was just borrowed from to_netcdf:

def _validate_attrs(dataset, invalid_netcdf=False):
"""`attrs` must have a string key and a value which is either: a number,
a string, an ndarray, a list/tuple of numbers/strings, or a numpy.bool_.

We could refactor this function to be more flexible to account for zarr's broader range of allowed attribute types (as we have evidently already done for h5netcdf). Or we could just bypass it completely in the to_zarr method. That is the only real decision we need to make here right now.

@wankoelias - you seem to understand the issue pretty well. Would you be game for making a PR? We would be glad to support you along the way.

@rouault
Copy link

rouault commented Sep 4, 2022

I am a little discouraged that we have not managed to align better across projects so far (e.g. having this conversation before the GDAL Zarr CRS convention was implemented)

just seeing this conversation. I wasn't aware of GeoZarr when I implemented the _CRS attribute in the GDAL Zarr driver. There is a practical difficulty in reusing the CF reading & writing code from the GDAL netCDF driver as it is tied to that driver, but I'd guess with sufficient effort it could be made agnostic of the carrier. No opposition from me if someone wants to tackle this.
That said, the way CRS is encoding in CF conventions is not super elegant, because you need to create a 0-dim variable, define a set of attributes for each projection method, and have an attribute in the georeferenced variable to point to the variable that holds the CRS attribute. A _CRS attribute on the georeferenced variable is much more simple, and by reusing the WKT or PROJJSON encodings, this save the implementor the business of knowing how to define exactly the CRS (obviously when they have a library that does that job for them)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants