Understanding the limitations of MultiZarrToZarr #320

croth1 · 2023-03-19T17:39:03Z

With virtual datasets, HDF5 offers quite a flexible way to make multiple files appear as one huge dataset. I am exploring to what extent I can use MultiZarrToZarr to build similar joint datasets, as zarr itself does not seem to have the concept of virtual datasets.

My current understanding is that with kerchunk's MultiZarrToZarr, one can only combine chunks over a new axis, or if the axis is chunked to 1. Is that correct?

So while HDF5's virtual datasets allow combining two datasets of (899, 50, 4) and (101, 50, 4) both chunked to (50, 50, 4) into one big (1000, 50, 4) dataset, kerchunk would not allow this, because it can only deal with full chunks.

Also even if one would have full chunks, such as two datasets (900, 50, 4) and (100, 50, 4), one would still not be able to make it a (1000, 50, 4) with MultiZarrToZarr, because the axis sorting logic would prevent that. One could however combine datasets in both cases if they were chunked to (1, 50, 4).

My current understanding is that in the second case one could manually write a JSON file that achieves this and open it through the reference file system. The data could then be opened as one joint dataset with zarr, but MultiZarrToZarr would not be able to do that for you. The first case, however, cannot really be solved without rechunking. Do I have any misconceptions here?

Answered by martindurant

martindurant · 2023-03-20T15:39:35Z

Maintainer

> zarr itself does not seem to have the concept of virtual datasets

Correct, zarr is a strict hierarchy, so the only way to refer to other nodes in a tree, or to other trees, is via storage-layer tricks such as kerchunk. There are some ideas floating around for zarr V3 that could allow it, but I am not aware of a concrete proposal. Currently kerchunk does not even follow HDF references within the same file, but only for lack of a compelling need (https://github.com/fsspec/kerchunk/blob/main/kerchunk/hdf.py#L125 is a stub to eventually support this).

> with kerchunk's MultiZarrToZarr, one can only combine chunks over a new axis, or if the axis is chunked to 1

Yes, MultiZarrToZarr is geared towards uses similar to xarray's open_mfdataset, which in turn implies that the HDF5 layout is netCDF4-like.

> kerchunk would not allow this, because it can only deal with full chunks.

Again true, but see https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md, which was formulated specifically for kerchunk's needs. Without ZEP3, you do indeed need some rechunking; in other words, rewrite the data (as zarr!) and don't use kerchunk.
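To make the full-chunk constraint concrete, here is a minimal sketch using the shapes from the question. It is not kerchunk code: `can_concat_full_chunks` is a hypothetical helper, and it assumes all inputs share one chunk size along the concatenation axis, so only the last input may end in a partial chunk.

```python
from typing import Sequence


def can_concat_full_chunks(lengths: Sequence[int], chunk: int) -> bool:
    """Return True if arrays with these lengths along the concat axis can be
    joined by renumbering chunks alone, without rewriting any data.

    Every array except the last must end exactly on a chunk boundary;
    only the final array may finish with a partial chunk.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    return all(n % chunk == 0 for n in lengths[:-1])


# (899, 50, 4) + (101, 50, 4), chunked (50, 50, 4): the second array would
# have to start at row 899, in the middle of a chunk.
print(can_concat_full_chunks([899, 101], 50))

# (900, 50, 4) + (100, 50, 4): the second array starts at chunk 18.
print(can_concat_full_chunks([900, 100], 50))

# Chunked to 1 along the axis: every length is a whole number of chunks.
print(can_concat_full_chunks([899, 101], 1))
```

The first call reports that rechunking is needed, while the other two only need their chunk indices shifted.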

> Also even if one would have full chunks, such as two datasets (900, 50, 4) and (100, 50, 4), one would still not be able to make it a (1000, 50, 4) with MultiZarrToZarr

This should work, and will be the basis of upcoming explicit "append"-like operations.

But yes, you can craft a set of references in any other manner. Given that HDF5 has many more uses than netCDF4, it would be worthwhile writing more combiners, which might even be simpler for some cases (e.g., N files with identical trees). The same cannot be said of other file types: I think TIFF, GRIB, netCDF3 and FITS(?) are all much stricter about what they contain.
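As an illustration of crafting references by hand, the sketch below joins the (900, 50, 4) and (100, 50, 4) case by shifting chunk keys. It uses only plain dicts, not the kerchunk API. Several things are assumptions for illustration: the helper names `concat_refs` and `make_refs`, the array name `data`, the file names `a.h5` and `b.h5`, the byte offsets, and the use of zarr v2 metadata with "." as the chunk-key separator.

```python
import json


def concat_refs(refs_a: dict, refs_b: dict, var: str, axis: int = 0) -> dict:
    """Join two flat reference dicts for array `var` along `axis`."""
    meta_a = json.loads(refs_a[f"{var}/.zarray"])
    meta_b = json.loads(refs_b[f"{var}/.zarray"])
    if meta_a["chunks"] != meta_b["chunks"]:
        raise ValueError("chunk shapes differ; rechunking would be needed")
    chunk = meta_a["chunks"][axis]
    if meta_a["shape"][axis] % chunk:
        raise ValueError("first array does not end on a chunk boundary")
    offset = meta_a["shape"][axis] // chunk  # chunks contributed by refs_a

    out = dict(refs_a)  # keeps .zgroup and all of refs_a's chunk keys
    prefix = f"{var}/"
    for key, ref in refs_b.items():
        if not key.startswith(prefix):
            continue  # some other variable or group metadata
        name = key[len(prefix):]
        if name.startswith("."):
            continue  # .zarray / .zattrs are handled separately
        idx = [int(i) for i in name.split(".")]
        idx[axis] += offset  # shift the chunk index past refs_a's chunks
        out[prefix + ".".join(map(str, idx))] = ref

    meta = dict(meta_a)
    meta["shape"] = list(meta_a["shape"])
    meta["shape"][axis] += meta_b["shape"][axis]
    out[f"{var}/.zarray"] = json.dumps(meta)
    return out


def make_refs(url: str, n_rows: int, chunk: int = 50) -> dict:
    """Build made-up references for an uncompressed float32 array."""
    meta = {"shape": [n_rows, 50, 4], "chunks": [chunk, 50, 4],
            "dtype": "<f4", "compressor": None, "filters": None,
            "fill_value": None, "order": "C", "zarr_format": 2}
    nbytes = chunk * 50 * 4 * 4  # bytes per chunk (invented layout)
    refs = {".zgroup": json.dumps({"zarr_format": 2}),
            "data/.zarray": json.dumps(meta)}
    for i in range(n_rows // chunk):
        refs[f"data/{i}.0.0"] = [url, i * nbytes, nbytes]
    return refs


combined = concat_refs(make_refs("a.h5", 900), make_refs("b.h5", 100), "data")
print(json.loads(combined["data/.zarray"])["shape"])
print(combined["data/18.0.0"])
```

Wrapped as `{"version": 1, "refs": combined}`, a dict like this is the kind of input fsspec's reference file system reads. The sketch does not cover coordinate variables or attributes, which a real dataset would also need.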

pixalytics · 2023-04-18T14:31:55Z

I edited the ERA5_Kerchunk_tutorial by @peterm790 to use the precipitation product (precipitation_amount_1hour_Accumulation) from the ERA5 dataset hosted on AWS instead. MultiZarrToZarr worked for 2020 and earlier, but I hit the "Found chunk size mismatch" error when merging monthly data from 2021 and 2022. From other issues in this repository and the comments here, I believe I can't work around this error, because it stems from inconsistent chunking of the NetCDF files stored in the AWS S3 bucket, but I wanted to double-check.

martindurant · 2023-04-18T15:13:57Z

Maintainer

If indeed the original files are chunked differently from one year to the next, there's nothing that kerchunk can do about it (at least until https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md is implemented).
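One way to confirm where the chunking changes before combining is to group the per-file reference sets by the chunk shape recorded in their `.zarray` metadata. The sketch below is not part of kerchunk: `group_by_chunks` is a hypothetical helper, and the variable name `precip`, the month labels, and the chunk shapes are invented for illustration. It assumes each reference set is a flat dict whose metadata values are JSON strings.

```python
import json
from collections import defaultdict


def group_by_chunks(ref_sets: dict, var: str) -> dict:
    """Map each distinct chunk shape of `var` to the inputs that use it."""
    groups = defaultdict(list)
    for name, refs in ref_sets.items():
        meta = json.loads(refs[f"{var}/.zarray"])
        groups[tuple(meta["chunks"])].append(name)
    return dict(groups)


def fake_refs(chunks):
    """Made-up reference set holding only the metadata needed here."""
    return {"precip/.zarray": json.dumps({"chunks": list(chunks)})}


# Invented example: two months share one chunking, a third differs.
ref_sets = {
    "2020-11": fake_refs((24, 100, 100)),
    "2020-12": fake_refs((24, 100, 100)),
    "2021-01": fake_refs((12, 100, 100)),
}

groups = group_by_chunks(ref_sets, "precip")
for chunks, names in groups.items():
    print(chunks, names)
```

Inputs that land in the same group share a chunk shape for that variable, so each group is a candidate for combining on its own. Joining across groups would still require rewriting the data.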

1 reply

pixalytics May 18, 2023

Thanks for the reply and clarification
