use compressor, filters, post_compressor for Array v3 create #1944
👍 I like this as an API. |
Not sure I am a fan. Codecs are a main difference between the v2 and v3 spec. I think the v3 codec pipeline is superior to the v2 compressors+filters. I would like to see a v3-first interface instead of shoehorning the new pipeline in the legacy interface. I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually I don't mind adding |
This is a good point -- I had forgotten that the old compressors are not actually
Given that there are only 2 options for the Are there any other ideas for what a "v3 first" API would look like here? I think |
It is closest to the spec, though.
Adding
I think it could work well with some UX tweaks such as automatic |
I do not think that we need to constrain the Python API so closely to the spec. We should think about what would be most clear and convenient for our users. Specs are invisible implementation details to 99% of users. They are necessary for interoperability but not something users need to be exposed to directly. Do you think about the HTTP spec when you submit a comment on GitHub? 😆 I definitely think we need an API compatibility layer with the V2 syntax ("compressor", "codecs"). |
Might we consider experimenting with the top level API (e.g. #1884) rather than the Array class constructors? I've been thinking of separate signatures (a la mypy overloads) for v2 and v3 arrays. I suspect we may find a reasonable path there but if not, we could always provide a different API that abstracts over the two sets of spec-specific keywords. |
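To make the "separate signatures" idea concrete, here is a minimal sketch using `typing.overload`. The function name `make_array`, its return value, and the exact keyword sets are hypothetical and chosen only for illustration; this is not the library's API.

```python
from __future__ import annotations

from typing import Any, Literal, Sequence, overload


# Hypothetical v2 signature: only the legacy keywords are accepted.
@overload
def make_array(
    shape: Sequence[int],
    *,
    zarr_format: Literal[2],
    compressor: Any = ...,
    filters: Sequence[Any] | None = ...,
) -> dict[str, Any]: ...


# Hypothetical v3 signature: only the codec pipeline keyword is accepted.
@overload
def make_array(
    shape: Sequence[int],
    *,
    zarr_format: Literal[3] = ...,
    codecs: Sequence[Any] | None = ...,
) -> dict[str, Any]: ...


def make_array(shape, *, zarr_format=3, **kwargs):
    """Runtime implementation that enforces the same split as the overloads."""
    allowed = {2: {"compressor", "filters"}, 3: {"codecs"}}
    if zarr_format not in allowed:
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")
    unexpected = set(kwargs) - allowed[zarr_format]
    if unexpected:
        raise TypeError(
            f"{sorted(unexpected)} not valid for zarr_format={zarr_format}"
        )
    # A plain dict stands in for whatever array object a real API would return.
    return {"shape": tuple(shape), "zarr_format": zarr_format, **kwargs}


v2 = make_array((10, 10), zarr_format=2, compressor="zstd")
v3 = make_array((10, 10), codecs=["bytes", "zstd"])
```

With signatures like these, a type checker can flag a v2 keyword passed alongside `zarr_format=3` before the code runs, and the runtime check catches the same mistake for untyped callers.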
That sounds very reasonable to me. Maybe the class constructors can adhere more strictly to the spec and internal structure, while the top-level API provides backwards compatibility and syntactic sugar. The main downside is that these two APIs violate the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." |
I still believe that there is a model that can express v2 and v3, see the following table:
Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library. |
I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays. |
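One possible reading of this idea, sketched under assumptions: describe every array by a chunk shape plus a shard shape, and treat an unsharded array as the case where the two are equal. The class name `ChunkLayout` and its fields are hypothetical and not taken from the library.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ChunkLayout:
    """Hypothetical layout description; not an existing library class."""

    chunk_shape: tuple[int, ...]  # smallest independently decodable unit
    shard_shape: tuple[int, ...]  # shape covered by one stored object

    def __post_init__(self) -> None:
        if len(self.chunk_shape) != len(self.shard_shape):
            raise ValueError("chunk_shape and shard_shape must have equal rank")
        if any(s % c for s, c in zip(self.shard_shape, self.chunk_shape)):
            raise ValueError("shard_shape must be a multiple of chunk_shape")

    @classmethod
    def unsharded(cls, chunk_shape: tuple[int, ...]) -> "ChunkLayout":
        # An unsharded array is modelled as one chunk per stored object.
        return cls(chunk_shape=chunk_shape, shard_shape=chunk_shape)

    @property
    def chunks_per_shard(self) -> int:
        return prod(s // c for s, c in zip(self.shard_shape, self.chunk_shape))

    @property
    def is_sharded(self) -> bool:
        return self.chunks_per_shard > 1


plain = ChunkLayout.unsharded((100, 100))
sharded = ChunkLayout(chunk_shape=(100, 100), shard_shape=(400, 200))
```

Under this model, code that locates a chunk never needs a separate unsharded branch; it only asks how many chunks live in each stored object.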
What would that look like? Filters and compressor could be used for the internal codecs. How would things like I don't think the API needs to be the |
I think that's a great idea. When it's time to fetch data to satisfy a user query, we have a data structure kind of like this:

    class ChunkReference:
        store_path: StorePath
        range: tuple[int, int] | None  # optional range within the path to fetch

    ChunkRequest  # type: dict[ChunkKey, ChunkReference]

We can produce this data structure after scanning the shard index. It's also the same sort of information that is generated by kerchunk-style virtual Zarr datasets. For non-sharded data, the Once we have this data structure, we can make two potential optimizations:
Is this at all compatible with how the sharding codec is currently implemented? |
Currently, the codec issues separate partial get requests, but it could be turned into batched fetching. |
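As a rough illustration of how the `ChunkReference` mapping described above could feed batched fetching, the sketch below groups references by store path so that each path could be served by one batched request. It assumes a plain string in place of `StorePath`, a tuple of ints as `ChunkKey`, and a `(start, stop)` reading of `range`; the helper `batch_by_path` is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

ChunkKey = Tuple[int, ...]            # assumption: chunk grid coordinates
ByteRange = Optional[Tuple[int, int]]  # assumption: (start, stop), None = whole object


@dataclass(frozen=True)
class ChunkReference:
    store_path: str          # a plain string stands in for StorePath
    range: ByteRange = None  # optional range within the path to fetch


ChunkRequest = Dict[ChunkKey, ChunkReference]


def batch_by_path(request: ChunkRequest) -> Dict[str, List[Tuple[ChunkKey, ByteRange]]]:
    """Group chunk references by store path, ordered by byte offset."""
    batches: Dict[str, List[Tuple[ChunkKey, ByteRange]]] = defaultdict(list)
    for key, ref in request.items():
        batches[ref.store_path].append((key, ref.range))
    for entries in batches.values():
        # Whole-object reads (range None) sort ahead of ranged reads.
        entries.sort(key=lambda e: (-1, -1) if e[1] is None else e[1])
    return dict(batches)


request: ChunkRequest = {
    (0, 1): ChunkReference("shard/c/0/0", (100, 250)),
    (0, 0): ChunkReference("shard/c/0/0", (0, 100)),
    (1, 0): ChunkReference("plain/c/1/0"),  # non-sharded: fetch the whole object
}
batches = batch_by_path(request)
```

Here the two chunks stored in the same shard end up in one batch, while the non-sharded chunk is a single whole-object read.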
I think we could use |
I've just noticed that not all "filters" in Zarr V2 are ArrayArray codecs. For example, Perhaps it makes the most sense to:
|
I think we should move this conversation over to #2052 |
In terms of abstraction levels, this pushes the "codecs" kwarg below the array creation API. Instead, we use the kwarg "filters" to denote ArrayArray codecs, "compressor" to denote the ArrayBytes codec, and "post_compressors" to denote the BytesBytes codecs. This makes the top-level array creation API more explicit AND more similar to v2. Implementation of ideas expressed in #1943.
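The mapping described here can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `Codec` class and the `build_pipeline` helper are hypothetical stand-ins, and the codec names are used only as labels.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Sequence


@dataclass(frozen=True)
class Codec:
    """Hypothetical stand-in for a codec object; real codecs carry configuration."""

    name: str
    kind: Literal["array_array", "array_bytes", "bytes_bytes"]


def build_pipeline(
    compressor: Codec,
    filters: Sequence[Codec] = (),
    post_compressors: Sequence[Codec] = (),
) -> tuple[Codec, ...]:
    """Assemble one ordered codec pipeline from the three keyword arguments."""
    for codec in filters:
        if codec.kind != "array_array":
            raise TypeError(f"filter {codec.name!r} is not an ArrayArray codec")
    if compressor.kind != "array_bytes":
        raise TypeError(f"compressor {compressor.name!r} is not an ArrayBytes codec")
    for codec in post_compressors:
        if codec.kind != "bytes_bytes":
            raise TypeError(f"{codec.name!r} is not a BytesBytes codec")
    # Order matters: array -> array, then array -> bytes, then bytes -> bytes.
    return (*filters, compressor, *post_compressors)


pipeline = build_pipeline(
    filters=[Codec("transpose", "array_array")],
    compressor=Codec("bytes", "array_bytes"),
    post_compressors=[Codec("zstd", "bytes_bytes")],
)
```

Splitting the arguments this way lets each codec be validated against its position up front, rather than discovering a misplaced codec while walking a flat list.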