Understanding the limitations of MultiZarrToZarr #320
-
With virtual datasets, HDF5 offers quite a flexible way to make multiple files appear as one huge dataset. I am exploring to what extent I can use MultiZarrToZarr to build similar joint datasets, since zarr itself does not seem to have a concept of virtual datasets.

My current understanding is that kerchunk's MultiZarrToZarr can only combine chunks along a new axis, or along an existing axis if that axis is chunked to 1. Is that correct? So while HDF5's virtual datasets allow combining two datasets of shape (899, 50, 4) and (101, 50, 4), both chunked to (50, 50, 4), into one big (1000, 50, 4) dataset, kerchunk would not allow this, because it can only deal with full chunks. Even with full chunks, such as two datasets of shape (900, 50, 4) and (100, 50, 4), one would still not be able to make a (1000, 50, 4) dataset with MultiZarrToZarr, because the axis sorting logic would prevent that. One could, however, combine the datasets in both cases if they were chunked to (1, 50, 4).

My current understanding is that in the second case one could manually write out a JSON reference file that achieves this and use it with the reference file system. One could then open the data as a joint dataset with zarr; MultiZarrToZarr, however, would not be able to do that for you. The first case cannot really be solved without rechunking. Do I have any misconceptions here?
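The manual approach described for the second case can be sketched in plain Python. This is a minimal illustration, not kerchunk's own API: `concat_refs` and `fake_refs` are hypothetical helpers, the file names `a.h5` and `b.h5` and all byte offsets are made up, and the reference dicts are assumed to be flat zarr v2 style mappings (`.zarray` stored as a JSON string, chunk keys separated by `.`). The idea is to grow the shape in `.zarray` and shift the chunk index of the second dataset along the concatenation axis. The sketch only checks the concatenation axis; a real combiner would also need to compare dtype, compressor and the other dimensions.

```python
import json


def concat_refs(refs_a, refs_b, var, axis=0):
    """Concatenate one array from two reference dicts along an existing axis.

    Hypothetical helper: works only when chunk shapes match and the first
    array ends exactly on a chunk boundary.
    """
    meta_a = json.loads(refs_a[f"{var}/.zarray"])
    meta_b = json.loads(refs_b[f"{var}/.zarray"])
    if meta_a["chunks"] != meta_b["chunks"]:
        raise ValueError("chunk shapes differ; rechunking is required")
    chunk = meta_a["chunks"][axis]
    if meta_a["shape"][axis] % chunk != 0:
        raise ValueError("first array ends in a partial chunk; rechunking is required")
    offset = meta_a["shape"][axis] // chunk  # number of chunks in the first array

    out = dict(refs_a)  # keep every key of the first reference set
    meta = dict(meta_a)
    meta["shape"] = list(meta_a["shape"])
    meta["shape"][axis] += meta_b["shape"][axis]
    out[f"{var}/.zarray"] = json.dumps(meta)

    prefix = f"{var}/"
    for key, ref in refs_b.items():
        if not key.startswith(prefix):
            continue  # other arrays and group metadata are ignored here
        name = key[len(prefix):]
        if name.startswith("."):
            continue  # skip .zarray / .zattrs of the second array
        idx = [int(i) for i in name.split(".")]
        idx[axis] += offset  # renumber the chunk along the concat axis
        out[prefix + ".".join(str(i) for i in idx)] = ref
    return out


def fake_refs(url, shape, chunks, var="data"):
    """Build a made-up reference set with uncompressed float32 chunks."""
    meta = {"zarr_format": 2, "shape": list(shape), "chunks": list(chunks),
            "dtype": "<f4", "compressor": None, "filters": None,
            "fill_value": None, "order": "C"}
    refs = {".zgroup": json.dumps({"zarr_format": 2}),
            f"{var}/.zarray": json.dumps(meta)}
    nbytes = 4 * chunks[0] * chunks[1] * chunks[2]
    n_chunks = -(-shape[0] // chunks[0])  # ceiling division
    for i in range(n_chunks):
        refs[f"{var}/{i}.0.0"] = [url, i * nbytes, nbytes]  # invented offsets
    return refs


a = fake_refs("a.h5", (900, 50, 4), (50, 50, 4))
b = fake_refs("b.h5", (100, 50, 4), (50, 50, 4))
combined = concat_refs(a, b, "data")
print(json.loads(combined["data/.zarray"])["shape"])
```

With real references, the result could presumably be wrapped as `{"version": 1, "refs": combined}` and opened through fsspec's reference file system. Calling `concat_refs` with a first dataset of shape (899, 50, 4) raises `ValueError`, which mirrors the first case in the question: a partial chunk in the middle of the axis cannot be expressed by renumbering chunks alone.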
-
Correct, zarr is a strict hierarchy, so the only way to refer to other nodes in a tree or to other trees is via storage layer tricks such as kerchunk. There are some ideas floating around zarr V3 that could allow for it, but I am not aware of a concrete proposal. Currently kerchunk does not even follow HDF references within the same file, but only for lack of compelling need (https://github.com/fsspec/kerchunk/blob/main/kerchunk/hdf.py#L125 is a stub to eventually support this).
Yes, MultiZarrToZarr is geared towards uses similar to xarray's open_mfdataset, which in turn implies that the HDF5 layout is netCDF4-like.
Again true; but see https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md, specifically formulated for kerchunk's needs. Without ZEP3, you do indeed need some rechunking: in other words, rewrite the data (as zarr!) and don't use kerchunk.
This should work, and will be the basis of upcoming explicit "append"-like operations. But yes, you can craft a set of references in any other manner. Given that HDF5 has many more uses than netCDF4, it would be worthwhile writing more combiners, which might even be simpler for some cases (e.g., N files of identical trees). The same cannot be said of other file types: I think TIFF, GRIB, netCDF3 and FITS(?) are all much more strict about what they contain.
-
I edited the ERA5_Kerchunk_tutorial by @peterm790 to use the precipitation product (precipitation_amount_1hour_Accumulation) from the ERA5 dataset hosted by AWS instead. Using MultiZarrToZarr worked for 2020 and earlier, but I hit the "Found chunk size mismatch" error message when merging monthly data from 2021 and 2022. From other issues in this repository and the comments here, I believe I can't solve this error, as it comes from inconsistent chunking of the NetCDF files saved in the AWS S3 bucket, but I wanted to double-check.
-
If indeed the original files are chunked differently from one year to the next, there's nothing that kerchunk can do about it (at least until https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md is implemented).
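To confirm whether the chunking really differs between files, one could compare the `chunks` entry of the `.zarray` metadata in each single-file reference set before attempting the merge. The following is a rough sketch under stated assumptions: `group_by_chunks` is a hypothetical helper, each value in `ref_sets` is assumed to be the flat `refs` mapping of one file with `.zarray` stored as a JSON string, and the chunk shapes in the example are invented for illustration, not the actual ERA5 values.

```python
import json
from collections import defaultdict


def group_by_chunks(ref_sets, var):
    """Group named reference sets by the chunk shape declared for `var`."""
    groups = defaultdict(list)
    for name, refs in ref_sets.items():
        meta = json.loads(refs[f"{var}/.zarray"])
        groups[tuple(meta["chunks"])].append(name)
    return dict(groups)


VAR = "precipitation_amount_1hour_Accumulation"


def stub(chunks):
    # Invented metadata; only the "chunks" field matters for this check.
    return {f"{VAR}/.zarray": json.dumps({"chunks": list(chunks)})}


ref_sets = {
    "2020-12": stub((24, 100, 100)),  # hypothetical chunk shape
    "2021-01": stub((12, 100, 100)),  # hypothetical chunk shape
    "2021-02": stub((12, 100, 100)),
}
groups = group_by_chunks(ref_sets, VAR)
for chunks, names in groups.items():
    print(chunks, names)
```

If more than one chunk shape shows up, the files within each group may still be combinable among themselves, but the groups cannot be merged into a single array without rewriting the data.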