Allow Stores to opt out of consolidated metadata. #3119
Some Stores don't benefit from Zarr's consolidated metadata mechanism. These Stores usually implement their own consolidation mechanism, or provide good performance for metadata retrieval out of the box. These Stores can now implement the `supports_consolidated_metadata` property returning `False`. In this situation, Zarr will silently ignore any requests to consolidate the metadata.
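As a rough illustration of the opt-out described above, here is a minimal, self-contained sketch. Only the property name `supports_consolidated_metadata` comes from this PR; `SketchStore`, `SelfConsolidatingStore`, and `consolidate_metadata_sketch` are hypothetical stand-ins, not the real zarr-python classes or functions.

```python
from abc import ABC
from typing import Optional


class SketchStore(ABC):
    """Hypothetical stand-in for a Store base class (not the real zarr API)."""

    @property
    def supports_consolidated_metadata(self) -> bool:
        # Assumed default: ordinary stores benefit from consolidation.
        return True


class SelfConsolidatingStore(SketchStore):
    """Hypothetical store that handles metadata retrieval efficiently itself."""

    @property
    def supports_consolidated_metadata(self) -> bool:
        # Opt out: this store does not want Zarr-level consolidated metadata.
        return False


def consolidate_metadata_sketch(store: SketchStore, metadata: dict) -> Optional[dict]:
    """Return a consolidated copy of `metadata`, or None if the store opts out."""
    if not store.supports_consolidated_metadata:
        # Silently ignore the request, as the description above states.
        return None
    return dict(metadata)
```

With this shape, callers do not need to know which kind of store they hold; the store itself decides whether consolidation happens.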
Can you give some context on why this is useful? What's an example of a store that would want to put a blanket ban on consolidating metadata?
Can you also explain what the benefit of this approach is over stores overriding `consolidate_metadata` to raise an error or warning + noop?
If this is implemented this way, I don't think trying to consolidate in this case should silently do nothing, because that provides confusion for a user who is trying to consolidate but nothing is happening. It should raise an error, or at least a warning, so the user knows what is going on, and can correct their code to remove attempts at consolidation.
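To make the alternative concrete, a warning-based variant might look roughly like the sketch below. `OptOutStore` and `consolidate_with_warning` are hypothetical names used for illustration only; the sketch assumes the store exposes the `supports_consolidated_metadata` attribute proposed in this PR and treats a missing attribute as `True`.

```python
import warnings


class OptOutStore:
    """Hypothetical store that opts out of consolidated metadata."""

    supports_consolidated_metadata = False


def consolidate_with_warning(store, metadata):
    """Consolidate `metadata`, or warn and do nothing if the store opts out."""
    # Assumption: stores without the attribute are treated as supporting it.
    if not getattr(store, "supports_consolidated_metadata", True):
        warnings.warn(
            f"{type(store).__name__} does not support consolidated metadata; "
            "the request to consolidate was ignored.",
            UserWarning,
            stacklevel=2,
        )
        return None
    return dict(metadata)
```

A warning keeps existing code running while still telling the user that the consolidation request had no effect.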
Icechunk is an example of such a Store and my main motivation for this PR. Currently we are having issues with consolidated metadata. It's the default in XArray but it produces inconsistencies. Icechunk users have very concurrent workloads, and the whole point of Icechunk is to maintain consistency. Consolidated metadata breaks consistency before information gets to the Store, so it's too late to fix it.
Our first approach in Icechunk was to error; see this PR for example. But we found that it breaks a lot of code, from the XArray backend tests (which Icechunk uses) to, more importantly, most code out there that is not passing
I like the idea of warning in
What's the proposed behavior of
Would it be fair to say that this is a case where the Note that the top-level Zarr APIs all have a
I see two benefits from consolidated metadata:
Does that sound right to you, @paraseba? Does one or the other provide particular challenges for icechunk? If icechunk has its own consolidated metadata mechanism, is there any reason not to use it to populate the in-memory
I'm basically reinterpreting
Exactly. Zarr currently assumes the Store is "dumb" in terms of metadata, which is perfectly true for most stores. But smarter Stores such as Icechunk need a way to indicate their behavior to Zarr. There is a lot more that could (and probably should) be done on this front. Zarr should be able to "reflect" on the Store capabilities to optimize its algorithms.
Yes. In the name of reducing impact I'm reinterpreting
Kind of, but is this three-state value useful? Why would XArray users distinguish between
Yes. There is some risk of people depending on this behavior, but I'd argue they are depending on an implementation detail. The right way to discover hierarchy should be using the
We analyzed that possibility. A couple of thoughts:
To make matters more complex, I understand the key can be stored at any level of the hierarchy, not only the root.
One of the main issues is there is no way to implement the current API consistently in concurrent scenarios.
In my experience, yes it's useful for "dumb" stores like blob storage. I want to know whether my I think we're saying essentially the same thing. xarray / Zarr should query the store for the "recommended" way to write stuff by default. I'm probably more bothered than you by
+100. I have a WIP branch for Stores to read (uncompressed) data into a reallocated buffer without any intermediate copies. I think we'll be seeing this pattern in a few places.
LMK if you want to move this to a separate issue, but focusing just on zarr-python's in-memory model, if I do something like

    group = await open_group(store, path)
    async for member in group.members():
        ...

do you expect
I didn't follow these two points, but I feel like we're getting a bit far afield from the PR and it's not clear whether this is worth spending time on. I might ping you on a separate issue if I have time to write up some thoughts.
Not for lack of trying: zarr-developers/zarr-specs#309. tl;dr: my only qualm right now is with
I just skimmed through earth-mover/icechunk#962 and pydata/xarray#10122 and I think I agree with pydata/xarray#10122 (comment):
This is a very valid point. I think things still work for Icechunk, if I change
Consistency issues aside, I think this is the right behavior. And this is why I think depending on the
Yep, I think that'll be perfect.
@TomAugspurger letting the Store select behavior when