Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use compressor, filters, post_compressor for Array v3 create #1944

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Jun 2, 2024

In terms of abstraction levels, this pushes the codecs kwarg below the array creation API. Instead, we use the kwarg "filters" to denote ArrayArray codecs, "compressor" to denote the ArrayBytes codec, and "post_compressors" to denote the BytesBytesCodecs. This makes the top-level array creation API more explicit AND more similar to v2. Implementation of ideas expressed in #1943.

@rabernat
Copy link
Contributor

rabernat commented Jun 2, 2024

👍 I like this as an API.

@d-v-b d-v-b requested review from normanrz and jhamman and removed request for normanrz June 4, 2024 18:29
@normanrz
Copy link
Contributor

normanrz commented Jun 4, 2024

Not sure I am a fan. Codecs are a main difference between the v2 and v3 spec. I think the v3 codec pipeline is superior to the v2 compressors+filters. I would like to see a v3-first interface instead of shoehorning the new pipeline in the legacy interface.

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays? Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

I don't mind adding compressor and filters as optional kwargs to the v3 create interface. If set, they could be used to build a codec pipeline: filters + [BytesCodec(), compressor]. But, I would like to keep the codecs kwarg (which should take precedence over compressor+filters) and promote that as the recommended way of setting codecs for v3.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 4, 2024

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays?

This is a good point -- I had forgotten that the old compressors are not actually ArrayBytesCodec, and I do think this fact would be a source of confusion in the proposed API. But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.

Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used). I'm not sure exactly how users would customize the BytesCodec here, but this would at least guide users to sharding more easily than the opaque codecs list.

Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.

@normanrz
Copy link
Contributor

normanrz commented Jun 4, 2024

But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.

It is closest to the spec, though.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used).

Adding sharding as a kwarg sounds interesting.

Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.

I think it could work well with some with some UX tweaks such as automatic BytesCodecs insertion and single value support. That would automatically turn codecs=BloscCodec() into codecs=(BytesCodec(), BloscCodec()) or codecs=TransposeCodec() into codecs=(TransposeCodec(), BytesCodec()) or codecs=(TransposeCodec(), ZstdCodec()) into codecs=(TransposeCodec(), BytesCodecs(), ZstdCodec()).

@rabernat
Copy link
Contributor

rabernat commented Jun 5, 2024

It is closest to the spec, though.

I do not think that we need to constrain the Python API so closely to the spec. We should think about what would be most clear and convenient for our users. Specs are invisible implementation details to 99% of users. They are necessary for interoperability but not something users need to be exposed to directly. Do you think about the HTTP spec when you submit a comment on GitHub? 😆

I definitely think we need an API compatibility layer with the V2 syntax ("compressor", "codecs").

@jhamman
Copy link
Member

jhamman commented Jun 5, 2024

Might we consider experimenting with the top level API (e.g. #1884) rather than the Array class constructors? I've been thinking of separate signatures (a la mypy overloads) for v2 and v3 arrays. I suspect we may find a reasonable path there but if not, we could always provide a different API that abstracts over the two sets of spec-specific keywords.

@rabernat
Copy link
Contributor

rabernat commented Jun 5, 2024

That sounds very reasonable to me. Maybe the class constructors can adhere more strictly to the spec and internal structure, while the top-level API provides backwards compatibility and syntactic sugar.

The main downside is that these two APIs violate the Zen of Python: "There should be one-- and preferably only one --obvious way to do it."

@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 5, 2024

I still believe that there is a model that can express v2 and v3, see the following table:

v2 v3
filters : tuple[ArrayArrayCodec] filters: tuple[ArrayArrayCodec]
N/A pre_compressor: ArrayBytesCodec
compressor : BytesBytesCodec compressor: tuple[BytesBytesCodec]
  • as @normanrz pointed out, "compressor" actually maps better to BytesBytesCodec, so we go with that, and remove the "post_compressor" kwarg.
  • V2 compressor will be internally normalized to a tuple with a single element, so that it plays nice with the v3 compression machinery
  • We introduce "pre_compressor", which we can rename if we want. "pre_compressor" must be empty for v2 arrays. Otherwise, this API handles both v2 and v3.

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 5, 2024

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

@normanrz
Copy link
Contributor

normanrz commented Jun 5, 2024

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.

pre_compressor could just be array_bytes_codec. I doubt most users will care about it, if there is a dedicated way to set sharding.

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

How would that look like? filters and compressor could be used for the internal codecs. How would things like index_location or index_codecs (or chunk_layout in the future) be controlled? While a bit of an edge case, there is also the possibility to have nested sharding.

I don't think the API needs to be the codecs tuple we have right now, but it shouldn't be more confusing and less expressive.

@rabernat
Copy link
Contributor

rabernat commented Jun 5, 2024

One idea would be to express unsharded arrays as simply a special case of sharded arrays.

I think that's a great idea.

When it's time to fetch data to satisfy a user query, we have a data structures kind of like this:

class ChunkReference:
    store_path: StorePath
    range: tuple[int, int] | None  # optional range within the path to fetch

ChunkRequest  # type: dict[ChunkKey, ChunkReference]

We can produce this data structure after scanning the shard index. It's also the same sort of information that is generated by kerchunk-style virtual Zarr datasets. For non-sharded data, the range would be unknown.

Once we have this data structure, we can make two potential optimizations:

  • Coalescing: combining multiple small requests for the same StorePath into a single request
  • Splitting: taking large requests (e.g. > 8 MB) and splitting them into smaller requests (to improve throughput)

Is this at all compatible with how the sharding codec is currently implemented?

@normanrz
Copy link
Contributor

normanrz commented Jun 5, 2024

Is this at all compatible with how the sharding codec is currently implemented?

Currently, the codec issues separate partial get requests, but it could be turned into batched fetching.

@LDeakin
Copy link

LDeakin commented Jun 6, 2024

I think it is a good idea to simplify the API with something like filters/compressor, but compressor should not strictly map to BytesBytesCodec. Possible future compressors like zfp and pcodec make sense as ArrayBytesCodecs, as they compress arrays/elements, rather than bytes directly.

@normanrz
Copy link
Contributor

normanrz commented Jun 6, 2024

I think we could use compressor: tuple[ArrayBytesCodec | BytesBytesCodec, ...] | ArrayBytesCodec | BytesBytesCodec | None with runtime validation that there is max 1 ArrayBytesCodec. If no ArrayBytesCodec is supplied, we can auto-add a BytesCodec.

@d-v-b d-v-b marked this pull request as draft June 14, 2024 14:31
@jhamman jhamman added the V3 Affects the v3 branch label Jul 1, 2024
@LDeakin
Copy link

LDeakin commented Jul 25, 2024

@d-v-b Instead, we use the kwarg "filters" to denote ArrayArray codecs

I've just noticed that not all "filters" in Zarr V2 are ArrayArray codecs. For example, vlen-* codecs are ArrayBytes. So, I want to echo @normanrz original sentiments that the proposed abstraction is likely to be more confusing than helpful.

Perhaps it makes the most sense to:

  • Replace codecs with array_to_array_codecs, array_to_bytes_codec, and bytes_to_bytes_codecs arguments
  • Retain filters and compressor for backwards compatibility

@normanrz
Copy link
Contributor

I think we should move this conversation over to #2052

@jhamman jhamman added this to the After 3.0.0 milestone Aug 15, 2024
@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Affects the v3 branch
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

5 participants