v3 mission #140

Closed
jbms opened this issue Apr 21, 2022 · 12 comments

@jbms
Contributor

jbms commented Apr 21, 2022

From our discussions on v3 my sense is that we need a clear mission statement for v3.

Creating a new storage format provides the opportunity to address problems of existing formats in a backwards-incompatible way --- in my mind that is the only reason to have a zarr v3 rather than just extending zarr v2.

But once v3 is properly "released", that opportunity will have been spent and is no longer available for future improvements. As noted by @rabernat in the community meeting on 2022-04-20, having a large number of incompatible zarr specification versions and implementations would likely just drive people away from zarr altogether.

Reviewing the current v3 spec proposed by @alimanfoo, I see the following 3 things highlighted as the motivation for v3:

  1. Accommodate high-latency storage by making fewer metadata requests (https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#distributed-storage)
  2. Improve interoperability across implementations/programming languages (https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#interoperability)
  3. Improve extensibility (https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#extensibility)

However, in my mind it is not clear that these are the most pressing issues to address with a backwards-incompatible change, and furthermore I don't think the current v3 spec actually addresses these issues:

  1. (Accommodate high-latency storage) The v3 spec reorganizes metadata, but it is not clear what problem is solved by that reorganization or how the reorganization would be helpful for high-latency storage.
  2. (Interoperability) The only simplification made in the current v3 spec is with regard to the data types. However, this isn't a backwards-incompatible change, and it also seems there is interest in adding back the full set of data types supported by v2 for various use cases, which would undo this change.
  3. (Extensibility) This also does not require a backwards incompatible change, and furthermore per the discussion in the community meeting on 2022-04-20 it seems there is no agreement that the current mechanism is the best one, and it was proposed by @joshmoore to remove it.

Additionally, the one change that has primarily motivated the renewed interest in zarr v3, sharding, does not require a breaking change, and has been suggested to be deferred until after the initial release of the v3 spec.

I think the zarr steering council has done great work in trying to establish a process by which the v3 spec can be designed and evolved, and I think it would be a mistake to release v3 (and lose the opportunity to make backwards-incompatible changes) without actually taking advantage of that process.

I also think we can potentially distinguish between zarr v3 the format as a breaking change to zarr v2, and this new process of evolving zarr (where there is input from implementations and the user community rather than just a decision from @alimanfoo).

In the community meeting on 2022-04-20, @joshmoore proposed that ZEP 1 should be the release of zarr v3, and my impression is that this was intended to be a relatively small change to the existing spec, just the removal of the extension mechanism. I would propose instead that ZEP 1 be used to establish agreement on the problems to be solved in a backwards-incompatible way, and then in subsequent ZEPs we could modify the spec to actually address those problems.

To be clear, a lot of changes that have been discussed in the context of v3 are purely additional features that can be added on top without requiring any breaking changes to existing usage:

  • sharding
  • irregular chunk grid

Obviously new arrays using these features won't work with old implementations, but they don't require any fundamental change in the data model or API that would require a breaking change.

I don't think it is necessary to reach agreement on these types of changes before releasing v3. On the other hand, I think it is important for v3 to have at least one significant improvement as otherwise there is no reason to adopt it.

@rabernat
Contributor

Jeremy--thanks for spending time thinking through the process. I absolutely agree that it is necessary to understand why we are doing this before moving forward. And I agree 💯 with the idea that we should not "waste" the V3 opportunity to make breaking changes. So overall I support this part of your proposal:

I would propose instead that ZEP 1 be used to establish agreement on the problems to be solved in a backwards-incompatible way, and then in subsequent ZEPs we could modify the spec to actually address those problems.

I think this is the correct order of operations.

At the same time, a ton of work has already gone into the V3 spec and its multiple implementations. So we should do this while recognizing that there is already a lot of consensus and momentum around V3. I hope that the outcome of this discussion will be to formalize this consensus and move forward with V3 in a form that is very similar to how it looks now. It has to be mentioned that we have basically spent a large fraction of our EOSS grant on refactoring zarr-python to support V3. We do not have to stick with it just because of that (sunk cost fallacy); however, changing course sharply now would mean that we have wasted resources by putting the cart (V3 implementation) before the horse (clear mission and consensus). If that turns out to be the case, the blame is on us (Zarr SC).

If we agree to follow the sequence you proposed, it means we can defer the actual technical discussion of breaking changes to ZEP-1. That will be the place to go into detail on the many thought-provoking points you raised in your comment. However, I do want to push back against one thing:

The v3 spec reorganizes metadata, but it is not clear what problem is solved by that reorganization or how the reorganization would be helpful for high-latency storage.

The placement of metadata in a separate tree from chunks will have immediate and significant improvements for at least two scenarios:

  • Users of cloud storage (e.g. S3), where listing a path with millions of objects in it can be extremely slow
  • Users of HPC filesystems that have similar issues with listing directories

Yes, there are ways to work around this with V2 (consolidated metadata or separate chunk store). But IMO this improvement is significant enough to justify the breaking change. But I look forward to continuing that discussion in ZEP-1.
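
To make the listing argument concrete, here is a minimal sketch that models a flat key-value store as a Python dict. It is a simplified illustration, not taken from the spec or from any implementation: the key names only resemble the v2 and draft-v3 layouts, the helper names are hypothetical, and it ignores delimiter-based listing that some object stores offer.

```python
# Illustrative only: a flat key-value store modeled as a dict.
# Key names are assumptions that merely resemble the v2 and draft-v3 layouts.

def make_v2_store(n_chunks):
    """v2-style layout: metadata keys sit next to the chunk keys."""
    store = {".zgroup": b"{}", "temperature/.zarray": b"{}"}
    for i in range(n_chunks):
        store[f"temperature/{i}.0"] = b"chunk"
    return store


def make_v3_store(n_chunks):
    """Draft-v3-style layout: metadata and chunks live under separate prefixes."""
    store = {"zarr.json": b"{}", "meta/root/temperature.array.json": b"{}"}
    for i in range(n_chunks):
        store[f"data/root/temperature/c{i}/0"] = b"chunk"
    return store


def list_prefix(store, prefix):
    """Return every key under `prefix`, as a flat prefix listing would."""
    return [k for k in store if k.startswith(prefix)]


def find_arrays_v2(store):
    # Without a separate metadata tree, the listing also returns every chunk key.
    listed = list_prefix(store, "")
    arrays = [k for k in listed if k.endswith("/.zarray")]
    return arrays, len(listed)


def find_arrays_v3(store):
    # Only the metadata prefix is listed; chunk keys are never touched.
    listed = list_prefix(store, "meta/")
    arrays = [k for k in listed if k.endswith(".array.json")]
    return arrays, len(listed)
```

Under these assumptions, with 1,000 chunks the v2-style lookup lists 1,002 keys to find one array, while the v3-style lookup lists a single key.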

I think we should try to keep up the momentum and engagement from this week's community meeting, and try to resolve this quickly. That means getting zarr-developers/governance#16 finalized and merged asap, then moving to ZEP-1.

@jbms
Contributor Author

jbms commented Apr 22, 2022

Regarding the concerns about existing implementation effort, I think the most likely outcome of further discussion of the goals would be additional breaking changes as part of v3, and I think it is quite possible that these additional changes would not require all that much implementation effort on top of the existing work that has already been done.

Regarding the metadata organization:

Thanks for the explanation regarding the metadata reorganization. I agree that the existing zarr v2 metadata scheme is sub-optimal when used on top of a key-value store that does not support first-class directories (like S3 or GCS or a flat key-value database), since you can't efficiently obtain a list of all of the datasets/groups if there are a large number of chunks in the datasets, and that the v3 metadata organization solves that issue. In fact I had noticed this change when looking at the spec previously, but I think I was thrown off when reviewing it yesterday while writing my comment above by the fact that the motivation lists "high latency" rather than "lack of directories", and I don't see how the change particularly helps with latency issues --- that would seem to require consolidated metadata. I'm not sure I understand the issue with HPC systems --- the ones I'm familiar with do have directory support in their distributed filesystems.

The one downside to the metadata reorganization is that it is no longer possible to identify a zarr array with a single path; instead it requires two paths: a path to the root and a path within the root. Perhaps that could be solved with a required naming scheme for roots, though (#137) or with a defined URL syntax (#132).
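
As a rough sketch of how a naming scheme for roots could recover a single-path identifier: the `.zarr` suffix, the `ArrayLocation` type, and both helper functions below are hypothetical and are not taken from #137, #132, or the spec. The metadata key format is likewise only an illustration of a draft-v3-style layout.

```python
from typing import NamedTuple

ROOT_SUFFIX = ".zarr"  # hypothetical convention, assumed for illustration only


class ArrayLocation(NamedTuple):
    root: str  # path to the hierarchy root
    path: str  # path of the array within that root


def split_location(full_path: str) -> ArrayLocation:
    """Split one path into (root, path within root) at the first component
    that ends in ROOT_SUFFIX."""
    parts = full_path.rstrip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(ROOT_SUFFIX):
            return ArrayLocation("/".join(parts[: i + 1]), "/".join(parts[i + 1:]))
    raise ValueError(f"no component ending in {ROOT_SUFFIX!r} in {full_path!r}")


def metadata_key(loc: ArrayLocation) -> str:
    # Illustrative key: the metadata lives under the root, not next to the chunks,
    # which is why the root has to be known separately from the array path.
    return f"{loc.root}/meta/root/{loc.path}.array.json"
```

For example, `split_location("bucket/exp.zarr/group/temperature")` yields the root `bucket/exp.zarr` and the inner path `group/temperature`. Without some convention of this kind, the same string could not be split unambiguously.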

@joshmoore
Member

joshmoore commented Apr 24, 2022

General 👍 from me, certainly in the spirit of @rabernat's "I think we should try to keep up the momentum and engagement from this week's community meeting". I'll only add a few minor clarifications:

@jbms #140 (comment) it seems there is no agreement that the current mechanism is the best one, and it was proposed by @joshmoore to remove it.

I'll admit, I initially was proposing removing the remote extension support, but I'm certainly up for discussing both in ZEP 1, in line with keeping us moving forward.

@jbms #140 (comment) @joshmoore proposed that ZEP 1 should be the release of zarr v3, and my impression is that this was intended to be a relatively small change to the existing spec, just the removal of the extension mechanism.

In my mind, ZEP 1 is the definition of V3, including the "mission" as you put it as well as the "roadmap" as @alimanfoo put it. Basically, what is in and what is out.

@jbms #140 (comment) I would propose instead that ZEP 1 be used to establish agreement on the problems to be solved in a backwards-incompatible way, and then in subsequent ZEPs we could modify the spec to actually address those problems.

Exactly. 💯

@jbms #140 (comment) I think the most likely outcome of further discussion of the goals would be additional breaking changes as part of v3, and I think it is quite possible that these additional changes would not require all that much implementation effort on top of the existing work that has already been done.

Also agreed. I think @grlee77's work sets us up with the necessary infrastructure from a zarr-python point of view, and now the question is how much more to break in the handful of existing V3 implementations.

Edit: regarding the metadata optimizations, I think V3 is also an opportunity to integrate consolidated metadata into the spec so that the entire metadata tree could be a single file without duplication.
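
One possible reading of "a single file without duplication" is sketched below. It assumes a dict-like store, a `meta/` prefix for per-node metadata, and an output key named `consolidated.json`. All three are hypothetical, and this is not the existing v2 consolidated-metadata format.

```python
import json


def consolidate(store, meta_prefix="meta/", out_key="consolidated.json"):
    """Fold every per-node metadata document into one JSON document.

    The per-node documents are removed afterwards, so the metadata exists
    in exactly one place (the "without duplication" part). Key names are
    assumptions made for this sketch.
    """
    docs = {
        key: json.loads(store[key])
        for key in sorted(store)
        if key.startswith(meta_prefix)
    }
    for key in docs:
        del store[key]
    store[out_key] = json.dumps({"metadata": docs}).encode()
    return store[out_key]
```

A reader would then need one request to fetch the entire metadata tree, at the cost of rewriting that one document whenever any node's metadata changes.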

@joshmoore
Member

As of yesterday, @jbms, @rabernat and I were still potentially interpreting your intent here differently. In your mind & after the discussion above, would having ZEP1 contain the motivation for breaking changes in V3 be a useful first step? We all then review ZEP1 and compare against the current changes in #16 and likely #134 and iterate back to changes in ZEP1 until all the threads are synchronized.

@jbms
Contributor Author

jbms commented Apr 29, 2022

@joshmoore I think that is fine --- I thought the intention was that ZEPs would be more fine-grained / generally much smaller in scope, but given the large scope for "ZEP 1" I don't have an objection to that.

@jakirkham
Member

The original motivation was handling multilanguage support (Python/C/C++/Java/etc.) better. There are some implicit assumptions in v2 that don't always work well in other languages. So v3 presented an opportunity to fix these and clean them up.

Additionally there has been interest in having a simple implementation that many can implement while being extensible. For example having the ability to add extensions for data types, transformations, different kinds of stores, etc.

At least those were the original motivations that started us down this path. Think the topics brought up above fit within this framework.

@jbms
Contributor Author

jbms commented May 4, 2022

Regarding data types, I would agree that it would be nice to move zarr v3 away from the numpy data model used by zarr v2. However, it was not clear to me that was actually the path v3 was headed down:

  • Currently v3 basically just supports a subset of the numpy data types, but still has a number of traits inherited from numpy that may be good to change:
    • endianness is part of the data type rather than part of the codec
    • naming scheme doesn't easily allow for other floating point types like bfloat16
  • In general I am in favor of a "columnar" data model, where each zarr array is a single "field" --- the current v3 spec does that by eliminating numpy structured data types. However, there has been a proposal (Add minimal drafts of a few more dtype extensions #135) to add back in the numpy data types excluded by current v3 draft, which would bring us back to where we started in v2. Even if these are called "extensions", that doesn't really make any practical difference, because it is already easy for a v2 zarr implementation to just not support certain data types.
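
To illustrate the endianness point above, here is a minimal sketch in which the logical data type and the byte order are kept separate, with byte order treated as a codec parameter. The metadata shapes and field names (`data_type`, `codec`, `endian`) are hypothetical and not taken from the spec, and `struct` is used only to keep the example dependency-free.

```python
import struct

# Hypothetical metadata shapes, for illustration only.
numpy_style = {"dtype": "<f4"}  # byte order is baked into the type string
split_style = {
    "data_type": "float32",         # logical type only
    "codec": {"endian": "little"},  # byte order is an encoding detail
}

_STRUCT_CODES = {"float32": "f", "int32": "i", "uint8": "B"}
_ENDIAN_PREFIX = {"little": "<", "big": ">"}


def encode(values, data_type, endian):
    """Serialize `values` of a logical type using the requested byte order."""
    fmt = _ENDIAN_PREFIX[endian] + str(len(values)) + _STRUCT_CODES[data_type]
    return struct.pack(fmt, *values)


def decode(buf, data_type, endian):
    """Inverse of `encode`: the logical type stays fixed, only the codec varies."""
    code = _STRUCT_CODES[data_type]
    count = len(buf) // struct.calcsize("<" + code)
    return list(struct.unpack(_ENDIAN_PREFIX[endian] + str(count) + code, buf))
```

In this arrangement, two arrays with the same `data_type` but different byte orders are the same type to the user and differ only in how their chunks are stored.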

@jakirkham
Member

My understanding was Alistair wanted to add a minimal set of types and others could be supported by extensions. For example complex and datetime types come up and these would be handled that way. I don't recall us discussing half precision types, but these would likely be handled the same way.

@jstriebel
Member

@jbms Can this be closed? I guess we should make the main motivation clear in the ZEP itself; please feel free to update and comment on zarr-developers/zeps#33.

@jakirkham
Member

Reading back over this, think there was a point that wasn't really called out that maybe should be. Namely there were a few approaches to formats like Zarr that emerged around the same time. One notable one was N5 (though there are a few like this one). So there was interest in merging these efforts into one shared spec/format that other communities could build on.

@jbms
Contributor Author

jbms commented Feb 24, 2023

I guess we'll see how that goes :)

[Image: xkcd comic "Standards", https://xkcd.com/927/]

@jbms
Contributor Author

jbms commented Feb 24, 2023

In any case, I'm happy to close this issue.

@jbms jbms closed this as completed Feb 24, 2023
@github-project-automation github-project-automation bot moved this from Meta to Done in ZEP1 Feb 24, 2023

5 participants