[v3] remove old v2 code entirely #1849

Closed
d-v-b opened this issue May 7, 2024 · 15 comments · Fixed by #2182

@d-v-b
Contributor

d-v-b commented May 7, 2024

This is a broader alternative to #1791 -- basically we remove all the old v2 code in favor of the v3 efforts. Happy to hear thoughts on this.

@jeromekelleher
Member

What are the implications for backward compatibility?

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

@jeromekelleher the proposed change would remove API backwards compatibility, but I don't think we plan to keep the v2 API when we release v3 anyway -- v2 has a very large API surface area, it would be a burden to maintain purely for legacy reasons.

The v3 codebase does support reading zarr v2, so removing the v2 code does not change functional compatibility with zarr v2.

@jni suggested that we publish the final release of zarr-python 2.X on pypi as zarr2 or something similar, to ensure that zarr 3.x and zarr 2.x can co-exist in the same python environment. I think this is a good idea, but there might be alternatives to consider.

@jeromekelleher
Member

jeromekelleher commented May 8, 2024

I see - if there's good compatibility on v2 stores with the new v3 code, and reasonable backward compatibility with old code bases, then I agree cutting out the old v2 code entirely is the right approach.

As an invested user, I'm just a bit anxious that we're not contemplating a Python 2 to Python 3 style change here!

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

> As an invested user, I'm just a bit anxious that we're not contemplating a Python 2 to Python 3 style change here!

Can you elaborate on this?

@jeromekelleher
Member

jeromekelleher commented May 8, 2024

Well, discussions about making a new v2 package and/or maintaining the old v2 code within the Zarr-Python repo raise questions about compatibility. If good compatibility with existing v2 stores and code was a first-order concern, then these wouldn't even be on the table, I would have thought.

> but I don't think we plan to keep the v2 API when we release v3 anyway -- v2 has a very large API surface area, it would be a burden to maintain purely for legacy reasons.

That's quite a big thing, though. I agree there's a lot of sprawl around the v2 API that could be removed without breaking much, but surely there's an idiomatic core of functionality around creating/opening/reading/writing arrays that will be the same? Are there deprecations on the bits of v2 that are going to break when v3 drops?

I think you guys are doing a great job by the way, I just want to get my head around what's being planned here. I would rather not have my code broken unless it's necessary, but I understand you've got to break some eggs sometimes.

@d-v-b
Contributor Author

d-v-b commented May 10, 2024

@jeromekelleher

> I think you guys are doing a great job by the way, I just want to get my head around what's being planned here. I would rather not have my code broken unless it's necessary, but I understand you've got to break some eggs sometimes.

Your code will not break if you pin your zarr-python dependency to 2.x. However, if you upgrade to zarr-python 3.x, your code will break. You should be able to recover broken functionality by adapting your code to the new APIs in zarr-python 3.x, because 3.x will be fully compatible with the zarr v2 format, but the amount of effort required to adapt your code will depend on what zarr-python 2.x APIs you were depending on. If you do not need to use zarr v3 and you want to keep code maintenance to a minimum, then I recommend not upgrading to zarr-python 3.x.
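
To make the pinning point concrete, here is a minimal sketch of checking at runtime which major series of a distribution is installed, which is the same condition a requirement such as zarr<3 enforces at install time. The helper names `major_version` and `installed_major` are made up for this illustration and are not part of zarr-python, and the parsing assumes plain version strings of the form X.Y.Z.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.18.0'.

    Assumes a plain 'X.Y.Z...' string (no epoch prefix).
    """
    return int(version.split(".")[0])


def installed_major(dist_name: str = "zarr") -> Optional[int]:
    """Return the installed major version of dist_name, or None if it is not installed."""
    try:
        return major_version(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None
```

Code that only works with the 2.x API could call `installed_major()` at startup and fail with a clear message if the result is 3 or higher, rather than failing later on a missing attribute.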

There are two possibilities to minimize disruption. The first would be to include the old zarr-python 2.x codebase in 3.x. We rejected this possibility for reasons that were discussed earlier in this issue. The second would be to publish the last zarr-python 2.x release on pypi as zarr2 or something similar; in this case, if you want to use zarr-python 3.x alongside the zarr-python 2.x API, you would need to change `import zarr` to `import zarr2 as zarr; import zarr as zarr3` or equivalent and everything should work.
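
A rough sketch of what that import swap could look like in downstream code, assuming a zarr2 package were published (it is only a proposal in this thread, so treat the name as hypothetical). `import_first_available` is an illustrative helper, not an existing API; it just tries module names in order using the standard library.

```python
import importlib
from types import ModuleType


def import_first_available(*names: str) -> ModuleType:
    """Import and return the first module in names that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules could be imported: {names}")


# Hypothetical usage, assuming a 'zarr2' package existed on PyPI:
# zarr = import_first_available("zarr2", "zarr")
```

With something like this, code written against the 2.x API would prefer the 2.x package when it is installed and fall back to whatever `zarr` is present otherwise.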

@jeromekelleher
Member

I see - so we are doing Python 2 -> Python 3 then... I wouldn't underestimate the disruption and ill-will caused by suddenly breaking people's code with a version update. If it is a major change and widespread breakage is expected, then I think a new "zarr3" package should at least be considered.

Is there some documentation about what's going to break and not?

@d-v-b
Contributor Author

d-v-b commented May 10, 2024

> I see - so we are doing Python 2 -> Python 3 then...

I don't think breaking changes to zarr-python, signaled by a major version increment, are quite the same scale as breaking changes to python itself. A better model might be the pydantic 1 to 2 transition, which (for me) wasn't too painful (but I know other projects had a worse experience). Although people (including me) did have to change their code or pin pydantic to 1.x, I don't think there was substantial ill will towards the pydantic developers, because the changes were communicated ahead of time, and the changes were largely improvements. If anything, by publishing zarr-python 2.x as zarr2, we would be going above and beyond what the pydantic devs did for backwards compatibility -- as far as I know, the pydantic devs never published pydantic 1.x as a stand-alone package on pypi, which did create friction for people who wanted to migrate from pydantic 1 to 2 incrementally.

> If it is a major change and widespread breakage is expected, then I think a new "zarr3" package should at least be considered.

If I recall correctly, a stand-alone "zarr3" proposal did come up in past discussions, but we decided against it for a variety of reasons. First, there are problems with making a completely new package as a response to breaking changes. This would be simple for users who don't want to change their code, but it would harm many other users. New users will be confused when they see zarr and zarr3 in pypi; users who specifically need zarr v3 would be understandably annoyed when they install zarr only to find that it doesn't support zarr v3; as zarr-python 2.x will no longer be developed, pip install zarr would produce increasingly stale code with no developer support.

And, as a matter of precedent, we have signaled breaking changes with the major version number before: zarr-python 1.x supported zarr v1, but via major breaking changes, zarr-python moved to 2.x, dropping support for zarr v1 (and without splitting the project into zarr and zarr2).

> Is there some documentation about what's going to break and not?

We still need to provide resources for people migrating from zarr-python 2.x to 3.x, and a big part of that is documentation, which has not been written yet. I think #1769 specifically tracks adding docs for breaking changes.

@jeromekelleher
Member

Thanks @d-v-b, all very helpful.

Maybe I should open an issue for this, but has adding deprecation warnings to the current v2 code, for the things that are going to break in Zarr v3, been considered? This would help people who don't necessarily keep up to date with package development: they would at least get a warning that their code will break in the near future (and that they need to start pinning their deps to zarr<3).

@d-v-b
Contributor Author

d-v-b commented May 10, 2024

@jhamman has been putting some deprecation warnings in for features that will be removed completely in v3: #1801, but I don't think we have explored deprecation warnings beyond this area.
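
For anyone unfamiliar with the mechanism, a deprecation warning of this kind can be sketched roughly as below. This is a generic illustration using the standard `warnings` module, not the approach taken in #1801; the decorator `deprecated_in_v3` and the function `legacy_helper` are invented for the example and do not exist in zarr-python.

```python
import functools
import warnings


def deprecated_in_v3(reason: str):
    """Decorator that emits a FutureWarning each time the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed in 3.0: {reason}",
                FutureWarning,
                stacklevel=2,  # point the warning at the caller, not at this wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_in_v3("hypothetical example, not a real zarr function")
def legacy_helper(x):
    return x * 2
```

`FutureWarning` is used here because Python shows it to end users by default, whereas `DeprecationWarning` is hidden in most contexts outside of tests; which category is appropriate is a judgment call.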

@jni
Contributor

jni commented May 13, 2024

> I don't think there was substantial ill will towards the pydantic developers,

uuuh. 😅

> as far as I know, the pydantic devs never published pydantic 1.x as a stand-alone package on pypi, which did create friction for people who wanted to migrate from pydantic 1 to 2 incrementally.

Correct. 😉 This is why I think the zarr2 idea is the bare minimum for this migration.

In that situation, even folks who don't get the memo, and whose code inadvertently breaks when v3 is released, have a very quick pressure release valve for e.g. getting their CI green again: replace zarr with zarr2 in their dependencies (potentially (probably) also in their imports), and done. They can then migrate their code, maybe even their data, as leisurely as they like by following our migration guide.

> And, as a matter of precedent, we have signaled breaking changes with the major version number before

I wasn't around then, but I suspect the number of folks depending on zarr was much smaller then. So I don't think it's such a direct comparison.

@normanrz
Contributor

> but surely there's an idiomatic core of functionality around creating/opening/reading/writing arrays that will be the same?

I think this is the main missing piece of this discussion. We are keeping the same top level API for open and create as well as for reading and writing array data. My guess is that for >90% of users their code will still function as before. @jhamman recently did a test where he upgraded xarray to zarr-python 3 without much hassle.
Mostly, users will need to change their code if they want to use the version 3 format (because of different metadata and codecs) and if they use stores that we're deprecating.

jhamman added the V3 label May 17, 2024
jhamman added this to the 3.0.0 milestone May 17, 2024
jhamman moved this to Todo in Zarr-Python - 3.0 May 17, 2024
@d-v-b
Contributor Author

d-v-b commented May 21, 2024

see #1898

@jbms

jbms commented May 29, 2024

Davis and I discussed this in the community meeting today, where I strongly advocated for using a new name for the new API. At the time I didn't have the context from #1849 (comment) that the new API will be mostly compatible with the existing API.

However, it is still rather problematic to make API changes in Python:

In the npm and rust and go ecosystems, you can more freely make API changes when you change the major version, because you can have a single program that transitively depends on multiple major versions of the same package --- different major versions essentially count as independent packages. In Python not only is this not possible, it is also fairly common to have unversioned dependencies listed in requirements.txt files.

Creating a zarr2 module name and a zarr2 package is better than nothing but still means everything needs to be updated just to not break. zarr-python is used by a ton of scientific / research code that by its nature is unmaintained, and incompatible API changes are sure to create a lot of problems down the road.

Pros to keeping zarr as the module/package name for zarr v3:

  • Name is easier to remember and more aesthetically pleasing
  • Existing users automatically get the new API, and can start using new features without needing to make any other changes

Cons to keeping zarr as the module/package name for zarr v3:

  • Lots of existing code will break if used in a virtualenv with an upgraded zarr-python library

I think the extent of the API breakage determines which option is most favorable. If, say, 95% of existing uses of zarr-python will continue working with the new version, then keeping the same name would be a reasonable choice. But if a significantly larger fraction of existing uses are broken, then I think it would be better to use a new name.

@jhamman
Member

jhamman commented Sep 13, 2024

This issue hasn't received an update from us in a while, despite the conversation advancing considerably (especially in the weekly developer meeting). I'm about to put a PR up removing the old v2 code so I'll explain the current rationale, as I understand it:

  • The target for API compatibility is >95%. The store API is the notable exception there and we are continuing to iterate on the design to make this as simple as possible for folks to migrate to the 3.0 store API.
  • We will continue to make 2.x releases for up to 6 months following the initial 3.0 release. This will allow folks to continue to rely on zarr-python 2.x while the 3.x API matures.
  • Starting with 2.18.0, we have made every effort possible to communicate breaking changes through documentation, public communication, deprecation warnings, and pre-releases.

Why aren't we going with a new package name?

  • Development of zarr 2.x slowed significantly. Splitting efforts between two packages would mean even slower development.
  • Practically, putting the version number in the package name (e.g. zarr3) indicates a dependence on the spec or package version that is not true. With zarr-python 3, we are supporting both the v2 and v3 specifications.

I would encourage folks to read the draft migration guide (#2102) and provide feedback on how we can communicate this decision best.
