Remove attrs #1660
Conversation
…data to NamedConfig; turn chunk encoding into stand-alone functions; put codecs on ArrayMetadata instead of just CodecMetadata; add runtime_configuration parameter to codec encode / decode methods; add parsers to codecs
Making Codec classes self-contained
…or c / f order; add order to ArrayV2Metadata.__init__
We are using the following pattern here:

```python
from dataclasses import dataclass
from typing import Any, Literal

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class Foo:
    property: Literal["correct"]

    def __init__(self, property_in):
        property_parsed = parse_property(property_in)
        object.__setattr__(self, "property", property_parsed)
```

I am a little worried about the cost of inheritance with this approach. For example, merely changing the type of an attribute in a subclass forces us to re-implement `__init__`.
Not sure I understand what you mean w.r.t. the cost of inheritance here?
I will expand the example:

```python
from dataclasses import dataclass
from typing import Any, Literal

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class Foo:
    property_a: Literal["correct"]

    def __init__(self, property_a_in):
        property_a_parsed = parse_property(property_a_in)
        object.__setattr__(self, "property_a", property_a_parsed)

@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # we have to re-implement the __init__ method, even though we "just" added a new attribute
    def __init__(self, property_a_in, property_b_in):
        property_a_parsed = parse_property(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_a", property_a_parsed)
        object.__setattr__(self, "property_b", property_b_parsed)
```

Basically, if we subclass and add a new attribute (or subclass and change the type of an attribute), then we end up needing to re-implement the `__init__` method. To be clear, I'm not arguing against the approach we are taking in this PR; rather, I'm pointing out a drawback of that approach, and inviting criticism :)
Thanks for expanding. We could delegate to the parent class in `Bar.__init__`:

```python
@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # we still write an __init__, but the inherited attribute is handled by super().__init__
    def __init__(self, property_a_in, property_b_in):
        super().__init__(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_b", property_b_parsed)
```

The method signature would still have to deal with all arguments, but I don't think it is a bad thing to have all arguments listed. That is what users will also have to know about.
While it works if an attribute is being added, delegation with `super().__init__` doesn't help if a subclass changes the type of an inherited attribute, because the parent's `__init__` would run the wrong parser. Of course there's a direct solution for these situations -- we can just re-write `__init__` in the subclass (see the sketch below).
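To make the type-changing case concrete, here is a minimal sketch (my extrapolation from the `Foo` example above, not code from this PR):

```python
from dataclasses import dataclass
from typing import Any, Literal

def parse_property_wide(data: Any) -> Literal["correct", "also-correct"]:
    if data in ("correct", "also-correct"):
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class Baz(Foo):
    # change the type of the inherited attribute
    property_a: Literal["correct", "also-correct"]

    # super().__init__ would run the stricter parse_property and reject "also-correct",
    # so __init__ must be re-written from scratch
    def __init__(self, property_a_in):
        property_a_parsed = parse_property_wide(property_a_in)
        object.__setattr__(self, "property_a", property_a_parsed)
```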
I don't think that's a problem. Most `__init__` functions are only a few lines of code.
I fixed some typing issues to make mypy happier. It would be great to turn mypy back on soon; it really helps during development.
I think this is OK? I want to get this design right, since it feels pretty core to the library. I worry that if we sign off on a design in this PR that is provisional, then we won't have a chance to fix it in the future. In particular, one thing I'm still turning over in my head is whether we want these metadata classes to be easy to subclass. Our current design adds just enough friction to this process that I worry that it would discourage people, or lead them to make mistakes in their `__init__` methods.

A concrete example: suppose a user wants to subclass `ArrayMetadata` so that the data type is constrained to be `uint8`:

```python
import numpy as np
from dataclasses import dataclass
from typing import Any, Literal

def parse_uint8_dtype(data: Any) -> Literal[np.uint8]:
    if data == 'uint8':
        return np.uint8
    if data == np.uint8:
        return data
    raise ValueError("data wasn't uint8")

@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]

    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # this will redundantly set the `data_type` attribute
        dtype_parsed = parse_uint8_dtype(kwargs["data_type"])  # the user had to write a parser that handles this type
        object.__setattr__(self, 'data_typ', dtype_parsed)
```

I included a subtle, but realistic bug in the top example -- the last line assigns to `'data_typ'` instead of `'data_type'`, and nothing in this design will catch that typo.

An alternative design that I'm leaning towards is to have a central registry that's basically a `dict` mapping type annotations to parser functions:

```python
from dataclasses import dataclass, fields
from typing import Any

def parse_int(data: Any) -> int:
    if isinstance(data, int):
        return data
    raise TypeError(f'Expected an int, got {type(data)}')

# the real thing wouldn't be a mere dict
parser_registry = {int: parse_int}

@dataclass(frozen=True)
class Metadata:
    def __init__(self, **kwargs):
        for field in fields(self):
            parser = parser_registry.get(field.type)
            parsed_value = parser(kwargs[field.name])
            object.__setattr__(self, field.name, parsed_value)

@dataclass(frozen=True)
class Foo(Metadata):
    a: int  # because this is annotated and the type is registered, no need to implement `__init__` for this class
```

With this design, a user who wants to subclass `ArrayMetadata` only has to write:

```python
@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]
```

Provided that the type `Literal[np.uint8]` has a parser in the registry, nothing else is needed. Logically, the type annotation of a field should determine which parser is run against that field. If we can implement that constraint in code, then we should. I think our task is simplified by the finite set of input types we have to deal with -- we don't need to handle arbitrary python types, only the types that we are using for zarr metadata. I have some travel coming up but I will try to find time to push on this.
I don't think designs have to be fixed once we merge PRs into the v3 branch.

I am not a fan of a parser registry. It makes things more complex for, imo, very little gain. The current approach is much simpler and easier to reason about. Also, I find that most attributes are specific to a particular class. Where they are shared among multiple classes, we can share the function -- simple.

Using the Python type system is not sufficient. For example, how would you specify a type that is an integer between -131072 and 22 (a zstd level)? (See the sketch below.)

I am not concerned with making it easy to subclass. I don't see real use cases for subclassing the `ArrayMetadata`. Zarr 3 has explicit extension points (e.g. codecs, chunk grids, data types). We should aim to make those easier to extend with base classes and registries.
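A parser function can express such a constraint directly, where a plain type annotation cannot. A minimal sketch (the function name `parse_zstd_level` is hypothetical; the range is the zstd level range mentioned above):

```python
from typing import Any

def parse_zstd_level(data: Any) -> int:
    # a type annotation can say `int`, but not "an int between -131072 and 22";
    # a parser function can check both the type and the range
    if isinstance(data, int) and -131072 <= data <= 22:
        return data
    raise ValueError(f'Expected an int in [-131072, 22], got {data!r}')
```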
I would use
You personally may not be concerned with subclassing `ArrayMetadata`, but other users might be; for example, I can imagine someone subclassing it so that the `attributes` field has a defined structure. All that being said, I do take your point that we don't need to settle this API in this PR, so I'm happy deferring these decisions for later, and I'm happy to merge this PR as-is.
Maybe I am just not understanding it, sorry!
I also thought about the structured attributes use case. I'm not sure the only way to do that is through subclassing; I think composition would also work well. The `ArrayMetadata` class could have a hook for parsing/validating attributes, e.g. an `__init__` arg (see the sketch below). I can see the need for subclassing Arrays to implement different behavior. With the `ArrayMetadata`, I like that it is quite strict and not as easy to customize, except for the well-defined hooks.
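A minimal sketch of such a hook (the name `attributes_parser` and the shape of the class are my assumptions, not the actual zarr-python API):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

def parse_attributes_default(data: Any) -> Dict[str, Any]:
    # default hook: accept any dict of attributes
    if isinstance(data, dict):
        return data
    raise TypeError(f'Expected a dict, got {type(data)}')

@dataclass(frozen=True)
class ArrayMetadata:
    attributes: Dict[str, Any]

    def __init__(
        self,
        attributes: Any,
        attributes_parser: Callable[[Any], Dict[str, Any]] = parse_attributes_default,
    ):
        # composition instead of subclassing: callers inject their own validation
        object.__setattr__(self, "attributes", attributes_parser(attributes))
```

A user who needs structured attributes passes a stricter parser instead of defining a subclass.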
and to be fair, our baseline in
Any objections to me merging this?
Please go ahead
Thanks for your help here @normanrz!
This PR removes the use of `attrs` from the v3 branch. We were using `attrs` because it provides a convenient way to define classes with runtime type validation and serialization to / from dicts, among other features. However, we decided that, if possible, we would like to avoid the dependency on `attrs`. Therefore, we need to implement those features ourselves.

This PR uses `@dataclass(frozen=True)` to decorate metadata classes. For each attribute `a` of class `X`, there should be a parser function (typically called `parse_a`) that `X.__init__` calls; these functions either return parsed input, or raise an exception. These functions generally have the type signature `parse_x(data: JSON) -> SpecificType`, that is, they narrow the type of their input. These functions are not tightly coupled to classes, so they can be reused easily.

After parsing an attribute, `__init__` then calls `object.__setattr__` to assign the parsed attribute to `self` (we cannot do `self.a = parsed_a` because the dataclass is frozen). Writing the `parse_` functions is tedious but important, because if we make these functions strict, then we can avoid type checking elsewhere.

See an illustrative example here for v3 array metadata.
For to / from dictification, all the metadata classes inherit from this base class (right now it's abstract, but maybe this is pointless, since I define concrete methods), which defines base `to_dict` and `from_dict` methods that do what you would expect from the names. Subclasses refine these methods as needed. A sketch of what this could look like is below.
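A minimal sketch of such a base class (my reconstruction from the description above, not the actual code in this PR):

```python
from dataclasses import dataclass, fields
from typing import Any, Dict

@dataclass(frozen=True)
class Metadata:
    def to_dict(self) -> Dict[str, Any]:
        # serialize each dataclass field, recursing into nested Metadata instances
        out: Dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            out[f.name] = value.to_dict() if isinstance(value, Metadata) else value
        return out

    @classmethod
    def from_dict(cls, data: Dict[str, Any]):
        # rely on __init__ (and thus the parse_* functions) to validate the input
        return cls(**data)
```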
This is still evolving, and there are a lot of other changes in here that might get obviated by other PRs, so I'm happy to wade through messy merges on my side.

Specifically, I am working through major changes to the codec API here that were necessary to ensure that we do all of our input validation in one place (i.e., on construction of the `ArrayMetadata`). In `v3`, codecs get validated in the `CodecPipeline` object, which isn't associated with `ArrayMetadata`, but this means we are doing validation in two places, and ideally we only do it in one place. So in this branch, `ArrayMetadata.codecs` is a list of validated `Codec` instances instead of `CodecMetadata`, and the codec pipeline object is obviated, which entails a lot of changes that I'm still ironing out. A sketch of the intended direction is below.

Because of the above, tests are definitely not passing, but I'm working on it!
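For illustration only, a minimal sketch of that direction (the names `parse_codecs` and `Codec.from_dict` are my assumptions; the actual API in this branch may differ):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class Codec:
    name: str

    @classmethod
    def from_dict(cls, data: dict) -> "Codec":
        # a real implementation would dispatch on data["name"] via a codec registry
        return cls(name=data["name"])

def parse_codecs(data: Any) -> Tuple[Codec, ...]:
    # validate codecs once, at ArrayMetadata construction time,
    # instead of again in a separate CodecPipeline object
    if not isinstance(data, (list, tuple)):
        raise TypeError(f'Expected a list of codecs, got {type(data)}')
    return tuple(c if isinstance(c, Codec) else Codec.from_dict(c) for c in data)
```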