Should we keep implicit groups in Zarr V3? #291
Comments
Previous discussion: #184
I don't actually have much experience with zarr groups --- I've always just used lone arrays which may be organized in a directory hierarchy, and neither tensorstore nor neuroglancer handles zarr groups. That said, here are my thoughts on the matter:
To your first point, it's helpful to know that concurrent array creation is a significant use pattern, and I see how implicit groups could be useful to avoid race conditions when creating arrays. But because concurrent hierarchy modification is so backend-dependent, I think our answer here should be something like "the design of the format is such that, within a group that exists on a conventional file system or object store, creating sub-arrays and sub-groups should be safe to perform independently", which is just another way of stating that the sub-arrays and sub-groups are specified by completely separate keys relative to the key of their parent group. This isn't so different from how we currently think about chunks: they are designed to be safe to write in parallel, but the details really depend on the storage backend you are on.

And I completely agree with your second point.

An important detail I just thought of: the spec already defines a third type of directory, namely the directories containing chunk keys. As long as we include implicit groups, chunk key directories are locally indistinguishable from implicit groups. So any Zarr client, when attempting to classify a directory that doesn't contain
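The chunk-directory ambiguity can be sketched with a toy key-value store. This is only an illustration under stated assumptions: the metadata document is taken to be `zarr.json` with a `node_type` field, chunks are assumed to live under a `c/` prefix with `/` separators, the array documents are abbreviated, and `classify_locally` and `enclosing_array` are hypothetical helpers, not part of any Zarr implementation.

```python
from __future__ import annotations

# Toy store: metadata documents are dicts, chunks are bytes.
# Array documents are abbreviated (a real one carries shape, dtype, etc.).
# "grp" has no metadata document of its own, so it is an implicit group.
store = {
    "zarr.json": {"zarr_format": 3, "node_type": "group"},
    "arr/zarr.json": {"zarr_format": 3, "node_type": "array"},
    "arr/c/0/0": b"",
    "arr/c/0/1": b"",
    "grp/nested/zarr.json": {"zarr_format": 3, "node_type": "array"},
    "grp/nested/c/0": b"",
}


def classify_locally(store: dict, prefix: str) -> str | None:
    """Classify a prefix using only its own document and the keys below it."""
    doc = store.get(f"{prefix}/zarr.json")
    if doc is not None:
        return doc["node_type"]
    if any(key.startswith(prefix + "/") for key in store):
        # No metadata document, but something lives below this prefix.
        return "implicit group"
    return None


def enclosing_array(store: dict, prefix: str) -> str | None:
    """Non-local check: walk up the ancestors looking for an owning array."""
    parts = prefix.split("/")
    for i in range(len(parts) - 1, 0, -1):
        ancestor = "/".join(parts[:i])
        doc = store.get(f"{ancestor}/zarr.json")
        if doc is not None and doc["node_type"] == "array":
            return ancestor
    return None


print(classify_locally(store, "grp"))    # a genuine implicit group
print(classify_locally(store, "arr/c"))  # a chunk directory, same local answer
print(enclosing_array(store, "arr/c"))   # only the ancestor walk tells them apart
```

In this sketch the local check gives the same answer for `grp` and `arr/c`; distinguishing them requires reading the metadata of ancestors.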
I don't have a ton more to add to this discussion but just want to give a +1 to the idea of removing implicit groups from the spec. From an implementation perspective, they are a total pain. To @jbms's first point, the fact that neither tensorstore nor neuroglancer cares about groups (implicit or otherwise) at all indicates to me that there is some value in a directory-type structure of Zarr arrays apart from the Group abstraction. This seems fine, and if you aren't reaching for groups today, then you can continue with the array-only pattern in the absence of implicit groups.
Perhaps it would be good to get input from the rest of the @zarr-developers/implementation-council here. I'm curious how other implementations are handling implicit groups at this time.
The v3 spec permits the existence of Zarr groups without any distinguishing metadata.
In the section comparing v3 with v2, the spec states
So the argument here is that we want to avoid race conditions when creating arrays in parallel. Is this a serious problem for anyone? Personally, I was not aware that parallel hierarchy mutation was a design goal of Zarr. I always thought that the only parallelism guarantees were for separate array chunks; since creating nodes in the hierarchy is so simple (just write a JSON document), there shouldn't be a motivation for parallelizing this process, at least that's how it seems to me.
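To make the race-condition argument concrete, here is a small sketch that lists which keys a writer would touch when creating one array. It assumes the metadata document is stored at `<path>/zarr.json`, and it models the case where a writer cannot rely on the parent groups already existing; `metadata_key` and `keys_written` are hypothetical names used only for illustration.

```python
from __future__ import annotations


def metadata_key(node_path: str) -> str:
    """Key of a node's metadata document (assumed to be `<path>/zarr.json`)."""
    return f"{node_path}/zarr.json" if node_path else "zarr.json"


def keys_written(array_path: str, explicit_groups: bool) -> set[str]:
    """Keys a writer touches when creating one array.

    With explicit groups, the writer also ensures that every ancestor
    group has a metadata document. With implicit groups, it writes only
    the array's own document.
    """
    keys = {metadata_key(array_path)}
    if explicit_groups:
        ancestors = array_path.split("/")[:-1]
        for i in range(len(ancestors) + 1):
            keys.add(metadata_key("/".join(ancestors[:i])))
    return keys


# Two hypothetical workers create sibling arrays under the group "g".
contended_implicit = keys_written("g/a", False) & keys_written("g/b", False)
contended_explicit = keys_written("g/a", True) & keys_written("g/b", True)

print(sorted(contended_implicit))  # keys both workers write with implicit groups
print(sorted(contended_explicit))  # keys both workers write with explicit groups
```

Under these assumptions the workers share no keys with implicit groups, and share only the ancestor group documents with explicit groups. If the parent group is created once before the workers start, the explicit case also reduces to disjoint keys.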
Later, there is a section comparing explicit and implicit groups, which states
So here we learn that implicit groups actually introduce a new type of race condition, because they make the structure of a Zarr hierarchy ambiguous, and there's a suggestion that implementations modify the Zarr hierarchies they encounter to insert explicit group metadata wherever implicit groups are detected. I don't think this is great. First, we have traded the race condition that motivated implicit groups for another one, so the net change in race conditions is zero. Second, we are encouraging implementations to mutate the hierarchies they encounter, perhaps as an admission that implicit groups might be a bit of a headache in practice.
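As a rough illustration of what "inserting" group metadata on encounter could look like, the sketch below writes a group document at every ancestor prefix that lacks one. It is not taken from the spec or from any implementation: the store is a plain dict, the group document contents are an assumption, and `materialize_implicit_groups` is a hypothetical name.

```python
from __future__ import annotations


def materialize_implicit_groups(store: dict) -> list[str]:
    """Add a group document at every ancestor prefix that lacks one.

    Returns the keys that were added, so the caller can see that a
    nominally read-only traversal has modified the store.
    """
    added = []
    for key in list(store):  # snapshot, because the store is mutated below
        if not key.endswith("/zarr.json"):
            continue
        # Drop the node's own name and "zarr.json" to get its ancestors.
        ancestors = key.split("/")[:-2]
        for i in range(len(ancestors) + 1):
            prefix = "/".join(ancestors[:i])
            group_key = f"{prefix}/zarr.json" if prefix else "zarr.json"
            if group_key not in store:
                store[group_key] = {"zarr_format": 3, "node_type": "group"}
                added.append(group_key)
    return added


# Hypothetical store holding two arrays and no group documents at all.
example_store = {
    "a/b/arr/zarr.json": {"zarr_format": 3, "node_type": "array"},
    "a/other/zarr.json": {"zarr_format": 3, "node_type": "array"},
}

first_pass = materialize_implicit_groups(example_store)
second_pass = materialize_implicit_groups(example_store)
print(first_pass)
print(second_pass)
```

The first pass writes three new keys and the second writes none. The point of the sketch is that a client that only wanted to read the hierarchy ends up writing to it, and two clients doing so at once would be writing the same keys.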
I'm honestly not sure what the advantage is of implicit groups. Here are some disadvantages, from my POV:
In zarr-python, we have an API that consumes paths on a file system / object store and attempts to infer whether that path points to a Zarr array or group. With implicit groups, literally any valid path can be interpreted as a Zarr group. This means that the boundary of a Zarr hierarchy is not well defined, and essentially includes the entire file system. It becomes impossible for a user to include an extra non-Zarr directory inside a Zarr hierarchy. Do we want this outcome?

I think we should reconsider including implicit groups in the v3 spec. Removing implicit groups would simplify some matters over in the ongoing zarr-python v3 refactoring effort. The main question I have is whether there is anyone who really needs implicit groups for some reason, in which case I am curious to learn more about that use case.
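The path-inference problem can be sketched on a real file system as follows. This is not zarr-python's API: `infer_node_type` and `build_example` are hypothetical, and the sketch assumes that a node is identified by a `zarr.json` document with a `node_type` field.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def infer_node_type(path: Path, allow_implicit_groups: bool) -> str | None:
    """Guess whether a path is a Zarr array, a Zarr group, or neither."""
    meta = path / "zarr.json"
    if meta.is_file():
        return json.loads(meta.read_text())["node_type"]
    if allow_implicit_groups and path.is_dir():
        # Without a metadata requirement, any directory qualifies.
        return "group"
    return None


def build_example(root: Path) -> None:
    """Create an explicit group `data` containing a plain, non-Zarr directory."""
    (root / "data").mkdir()
    (root / "data" / "zarr.json").write_text(
        json.dumps({"zarr_format": 3, "node_type": "group"})
    )
    (root / "data" / "notes").mkdir()
    (root / "data" / "notes" / "README.txt").write_text("not zarr")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_example(root)
    notes = root / "data" / "notes"
    explicit_group = infer_node_type(root / "data", False)
    notes_with_implicit = infer_node_type(notes, True)
    notes_without_implicit = infer_node_type(notes, False)
    # The temporary directory itself sits outside the hierarchy.
    outside_with_implicit = infer_node_type(root, True)
    outside_without_implicit = infer_node_type(root, False)

print(explicit_group, notes_with_implicit, notes_without_implicit)
print(outside_with_implicit, outside_without_implicit)
```

With implicit groups allowed, both the `notes` directory and the directory that contains the hierarchy are reported as groups. With the metadata requirement, only `data` is.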