-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading virtual references back out into VirtualiZarr Manifests #104
Comments
I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array? Edit: as a first step towards virtualizarr interoperability, I agree that just doing a bulk rewrite is fine. But longer-term, I think more efficient strategies are possible. In general, I'd like libraries using Icechunk (like Virtualizarr) to only have to update / add new or changed references, rather than just rewriting the entire array. It's Icechunk's job to decide how to split this into one or more manifests. I'm curious what sort of API we would need to expose to enable this. |
I was also wondering about this. Kerchunk does have a concept of appending to kerchunk parquet references. I think it comes down to whether or not the "appending" (which in virtualizarr space is a concatenation operation in the example above) happens in user/virtualizarr API or inside the icechunk store. My example above assumes you want to do this in virtualizarr API. But if we instead try to copy the existing xarray new_vds.virtualize.to_icechunk(icechunkstore, append_dim='time') If we had pluggable writer backends for xarray we could maybe even make this work new_vds.to_zarr(icechunkstore, append_dim='time', engine='virtualizarr') which better reflects the fact that in virtualizarr an At some point the concatenate/append operation should check that the dimensions are all consistent, but that could reasonably be done in either library. Regardless I do still think that being able to extract all the references for one snapshot might be useful. |
Should we add a note that it is not currently possible to append virtual refs to icechunk stores in https://icechunk.io/icechunk-python/virtual/ and point to this issue? cc @mpiannucci |
As far as I can tell (and this is the first time I'm reading this code so please correct me if this is wrong), the https://github.com/pydata/xarray/blob/main/xarray%2Fbackends%2Fzarr.py#L857 |
@TomNicholas is correct. Appending in zarr is really just sugar around resizing an existing array and pointing it at existing chunks. An exmaple of appending to an icechunk store is at the bottom of this notebook. |
Ok thanks for explaining how the append method is working @TomNicholas - but if I understand correctly there is no way to append virtual refs at the moment, so I would still see a benefit to others to make this clear in the documentation. Is there any plans to work on this feature in VirtualiZarr @TomNicholas @mpiannucci ? |
👍
I would very much like to see it, but don't personally need it (yet) so will be prioritizing other things. It's a good thing for somebody else to add, especially someone who is (or wants to become) familiar with xarray's existing |
I'm going to use zarr-developers/VirtualiZarr#21 to track the Separately I still think that being able to pull virtual references back out would be useful. Imagine for example someone makes a public icechunk store containing virtual references, then someone from a different organisation wants to refer to those references - they would need to pull them out. (Whether or not people creating catalogs like this is a good idea is another question, but I bet people will ask for that.) |
@rabernat just said in the Pangeo showcase that icechunk is not yet at a state where it guarantees that references written with the current spec will be readable even with future versions of this library. To me that makes the feature of reading virtual references back out into VirtualiZarr actually quite important to have sooner rather than later. If that existed, then if the icechunk store changed on disk, I could always roll back to a version of the icechunk library that could read the references in order to suck them out then re-save them in the new format. |
Can you propose what the ideal API for this would be? It gets confusing for me thinking about how there can be mixed virtual and native chunks |
Yeah good point. All variables in an icechunk store are potentially available as actual loaded data, or as references, so I would say that there should be a new virtualizarr reader, such that vds = open_virtual_dataset(icechunkstore, loadable_variables=...) returns an I thought maybe we could imagine this function returning virtual vs native variables based on how they were stored in icechunk, but IIUC that's actually not possible right now because a single zarr array could be backed by some virtual and some native chunks, which VirtualiZarr has no way to represent in one array. The API needed for icechunk-python at its most basic is just |
Cool. Would it return None if the chunk is native? Or throw an error? |
I wanted to mention that having mixed virtual and non virtual chunks in a single array, is exactly what we want to enable: people starting from a virtual dataset and, with time, update the dataset in Icechunk (which would generate non-vrtual chunks). So we expect this case to be very common. |
I support adding pub enum ChunkPayload {
Inline(Bytes),
Virtual(VirtualChunkRef),
Ref(ChunkRef),
} |
No, it would still return the virtual reference to that chunk, the reference would just refer to deep within this icechunk store. IIUC icechunk chunks are immutable once written, so referring to a chunk in an icechunk store from elsewhere is exactly as reliable as referring to a chunk in a traditional zarr store / legacy-formatted file.
That would be fine for |
Inline chunks I think are the biggest issue here. As @TomNicholas says, "normal" chunks can also be seen as virtual chunks, pointing to the icechunk store. But inline ones live only in the manifest. We could provide a way to reference them via an offset/length, but it would be pretty tricky and ugly. |
Right. Should arrays containing inlined chunks always be directly loaded then? I suppose VirtualiZarr might have to raise a warning if that happened for variables that were supposed to be virtual... Also to do that VirtualiZarr would need a way to easily ask Icechunk if an array contained any inlines chunks. Tbh this makes inlined chunks seem like a leaky abstraction. |
They are, I don't love them, they are leaky and probably a premature optimization. We could make IC not use inline chunks unless the user overrides a config, I suppose. |
Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk |
Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks. |
Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks.
It's unclear to me that they are necessary at all. Why can't you just "inline" a variable by storing it as a single native chunk? That requires an extra get request I guess? Or because then you can't cache it in the rust client? |
This. They are useful for things like coordinate arrays. |
Got itttt sorry I misunderstood :) |
I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr
ManifestArray
objects.Note this issue is the icechunk equivalent of zarr-developers/VirtualiZarr#118, which is about reading kerchunk references into
ManifestArray
objects.The main use case is appending new data to an existing store (see zarr-developers/VirtualiZarr#21 (comment)), so that when some new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot just with something like:
In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a
vz.ManifestArray
or something I can cheaply coerce to one (seeChunkManifest.from_arrays()
).Writing the updated references as a new snapshot also requires #103.
(I guess the
.virtualize.to_icechunk
method might also need to know to doarray.resize
in this example... (see the Append example in this notebook.)Running that above snippet as a cron job / event-driven serverless function should go a long way towards making ingestion of regularly-updated data archives easier. (cc @mpiannucci)
This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref zarr-developers/VirtualiZarr#123).
cc @paraseba
The text was updated successfully, but these errors were encountered: