Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] Elevate codec pipeline #1932

Merged
merged 13 commits into from
Jun 23, 2024
Merged

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented May 31, 2024

This PR moves the CodecPipeline data structure off the ArrayMetadata classes and instead localizes it higher in the stack.

design for array metadata classes

In zarr-python, we have ArrayMetadata classes that are expressly designed to model the contents of zarr metadata, e.g. zarr.json or .zarray (and these classes should not do anything else). Designing the ArrayMetadata classes around this goal is very deliberate, largely due to lessons we learned in the v2 codebase, where the zarr.Array API was mixed together with the a model of the .zarray JSON document, and this led to a variety of problems.

we are breaking that design in v3

With that in mind, note that the v3 spec defines that the codecs attribute of zarr.json is a JSON array of Codec objects. But in the current version of the v3 branch, ArrayV3Metadata.codecs has the type CodecPipeline, but CodecPipeline is not a collection of Codec objects; instead, it's an object with a lot of attributes and methods. Putting such an object under the ArrayV3Metadata.codecs key violates the design principle of the ArrayMetadata class.

one way to fix it

But the fix is relatively simple: we remove CodecPipeline from ArrayV3Metadata and instead push the responsibility for creating instances of that class to objects that consume ArrayMetadata. That's what this PR does.

Specifically, I make CodecPipeline an attribute of AsyncArray; that attribute is created on initialization, by calling a function on the metadata.codecs attribute. Sharding codecs also use CodecPipeline; in this PR, they create the CodecPipeline when needed instead of keeping it as an attribute. If we don't like this, we could consider adding a _codec_pipeline attribute to the sharding codecs, as long as it's clear that this attribute is not part of the JSON serialized form of the codec.

I made a few other changes in the process, including changing the signature of the Codec.validate method to explicitly take the parameters that it needs (shape, dtype, chunk_shape), instead of an ArraySpec object that includes extra fields (specifically, ArraySpec.attributes) that no codec will ever need for validation. If that's unpopular in this PR we can spin it out into its own PR or just not do it, but I think this change is an improvement.

@normanrz this touches a lot of code you authored so I will be very interested in getting your feedback here

@d-v-b d-v-b added the V3 label May 31, 2024
@d-v-b
Copy link
Contributor Author

d-v-b commented May 31, 2024

the changes here would address some of the problems described in #1913

Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced that the class structure needs to match the json structure one-to-one. That is why we have the overridable from_dict and to_dict classes. We should be pragmatic, in particular, in places where we can make the user API more convenient.

I am fine with the changes to the CodecPipeline and can live with the changed validate signature. I am not happy with forcing the users to supply typesize and shuffle for the BloscCodec.

@d-v-b d-v-b marked this pull request as ready for review May 31, 2024 17:03
@d-v-b d-v-b requested a review from normanrz June 4, 2024 17:53
@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 4, 2024

@normanrz i reverted the changes to the blosc codec

edit: feel free to ignore this PR until I fix the merge conflicts

@d-v-b d-v-b requested a review from jhamman June 12, 2024 20:02
@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 23, 2024

the pre-commit action is flagging a mypy error that a) I can't replicate locally and b) is in a file that this PR did not touch, so I'm going to merge this despite the pre-commit failing.

@d-v-b d-v-b merged commit 8aadd15 into zarr-developers:v3 Jun 23, 2024
17 of 18 checks passed
@d-v-b d-v-b deleted the elevate_codec_pipeline branch June 23, 2024 15:59
dcherian added a commit to dcherian/zarr-python that referenced this pull request Jun 25, 2024
* v3: (22 commits)
  [v3] `Buffer` ensure correct subclass based on the `BufferPrototype` argument (zarr-developers#1974)
  Fix doc build (zarr-developers#1987)
  Fix doc build warnings (zarr-developers#1985)
  Automatically generate API reference docs (zarr-developers#1918)
  Update `RemoteStore.__str__` and add UPath tests (zarr-developers#1964)
  [v3] Elevate codec pipeline (zarr-developers#1932)
  0 dim arrays: indexing (zarr-developers#1980)
  `parse_shapelike` allows 0 (zarr-developers#1979)
  Clean up typing and docs for indexing (zarr-developers#1961)
  add json indentation to config (zarr-developers#1952)
  chore: update pre-commit hooks (zarr-developers#1973)
  Bump pypa/gh-action-pypi-publish in the actions group (zarr-developers#1969)
  chore: update pre-commit hooks (zarr-developers#1957)
  Update release.rst (zarr-developers#1960)
  doc: update release notes for 3.0.0.alpha (zarr-developers#1959)
  Basic working FsspecStore (zarr-developers#1785)
  Feature: Top level V3 API (zarr-developers#1884)
  Buffer Prototype Argument (zarr-developers#1910)
  Create issue-metrics.yml
  fixes bug in transpose (zarr-developers#1949)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants