[v3] Elevate codec pipeline #1932
the changes here would address some of the problems described in #1913
I am not convinced that the class structure needs to match the JSON structure one-to-one. That is why we have the overridable `from_dict` and `to_dict` methods. We should be pragmatic, in particular in places where we can make the user API more convenient.
I am fine with the changes to the `CodecPipeline` and can live with the changed `validate` signature. I am not happy with forcing the users to supply `typesize` and `shuffle` for the `BloscCodec`.
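To illustrate the point about `from_dict` and `to_dict`, here is a minimal sketch of a class whose constructor is more lenient than its serialized JSON form. `ToyCompressor`, its fields, and the `"noshuffle"` default are all hypothetical and are not the actual `BloscCodec` API; the only idea taken from the comment above is that the two methods can bridge a convenient user-facing class and a stricter JSON document.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToyCompressor:
    """Hypothetical codec-like class; NOT the real BloscCodec."""

    level: int = 5
    # Optional for the user; None means "fill in a default when serializing".
    shuffle: str | None = None

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> ToyCompressor:
        # Parse the JSON form; missing keys fall back to the class defaults.
        config = data.get("configuration", {})
        return cls(level=config.get("level", 5), shuffle=config.get("shuffle"))

    def to_dict(self) -> dict[str, Any]:
        # The JSON form is always fully specified, even if the user left
        # `shuffle` unset. "noshuffle" is an assumed placeholder default.
        return {
            "name": "toy_compressor",
            "configuration": {
                "level": self.level,
                "shuffle": self.shuffle if self.shuffle is not None else "noshuffle",
            },
        }
```

With this shape, `ToyCompressor()` is valid for the user, while `ToyCompressor().to_dict()` still emits every field the JSON document is assumed to require.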
@normanrz i reverted the changes to the blosc codec
edit: feel free to ignore this PR until I fix the merge conflicts
…nto elevate_codec_pipeline
…nto elevate_codec_pipeline
the pre-commit action is flagging a mypy error that a) I can't replicate locally and b) is in a file that this PR did not touch, so I'm going to merge this despite the pre-commit failing.
* v3: (22 commits)
  [v3] `Buffer` ensure correct subclass based on the `BufferPrototype` argument (zarr-developers#1974)
  Fix doc build (zarr-developers#1987)
  Fix doc build warnings (zarr-developers#1985)
  Automatically generate API reference docs (zarr-developers#1918)
  Update `RemoteStore.__str__` and add UPath tests (zarr-developers#1964)
  [v3] Elevate codec pipeline (zarr-developers#1932)
  0 dim arrays: indexing (zarr-developers#1980)
  `parse_shapelike` allows 0 (zarr-developers#1979)
  Clean up typing and docs for indexing (zarr-developers#1961)
  add json indentation to config (zarr-developers#1952)
  chore: update pre-commit hooks (zarr-developers#1973)
  Bump pypa/gh-action-pypi-publish in the actions group (zarr-developers#1969)
  chore: update pre-commit hooks (zarr-developers#1957)
  Update release.rst (zarr-developers#1960)
  doc: update release notes for 3.0.0.alpha (zarr-developers#1959)
  Basic working FsspecStore (zarr-developers#1785)
  Feature: Top level V3 API (zarr-developers#1884)
  Buffer Prototype Argument (zarr-developers#1910)
  Create issue-metrics.yml
  fixes bug in transpose (zarr-developers#1949)
  ...
This PR moves the `CodecPipeline` data structure off the `ArrayMetadata` classes and instead localizes it higher in the stack.

design for array metadata classes
In `zarr-python`, we have `ArrayMetadata` classes that are expressly designed to model the contents of zarr metadata, e.g. `zarr.json` or `.zarray` (and these classes should not do anything else). Designing the `ArrayMetadata` classes around this goal is very deliberate, largely due to lessons we learned in the v2 codebase, where the `zarr.Array` API was mixed together with a model of the `.zarray` JSON document, and this led to a variety of problems.

we are breaking that design in v3
With that in mind, note that the v3 spec defines the `codecs` attribute of `zarr.json` as a JSON array of `Codec` objects. In the current version of the v3 branch, however, `ArrayV3Metadata.codecs` has the type `CodecPipeline`, and `CodecPipeline` is not a collection of `Codec` objects; instead, it's an object with a lot of attributes and methods. Putting such an object under the `ArrayV3Metadata.codecs` key violates the design principle of the `ArrayMetadata` class.

one way to fix it
But the fix is relatively simple: we remove `CodecPipeline` from `ArrayV3Metadata` and instead push the responsibility for creating instances of that class to objects that consume `ArrayMetadata`. That's what this PR does.

Specifically, I make `CodecPipeline` an attribute of `AsyncArray`; that attribute is created on initialization, by calling a function on the `metadata.codecs` attribute. Sharding codecs also use `CodecPipeline`; in this PR, they create the `CodecPipeline` when needed instead of keeping it as an attribute. If we don't like this, we could consider adding a `_codec_pipeline` attribute to the sharding codecs, as long as it's clear that this attribute is not part of the JSON serialized form of the codec.

I made a few other changes in the process, including changing the signature of the `Codec.validate` method to explicitly take the parameters that it needs (shape, dtype, chunk_shape), instead of an `ArraySpec` object that includes extra fields (specifically, `ArraySpec.attributes`) that no codec will ever need for validation. If that's unpopular in this PR we can spin it out into its own PR or just not do it, but I think this change is an improvement.

@normanrz this touches a lot of code you authored so I will be very interested in getting your feedback here