Skip to content

Dataclass descriptor behavior inconsistent #102646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ikonst opened this issue Mar 13, 2023 · 4 comments
Open

Dataclass descriptor behavior inconsistent #102646

ikonst opened this issue Mar 13, 2023 · 4 comments
Labels
stdlib Python modules in the Lib dir topic-dataclasses type-bug An unexpected behavior, bug, or error

Comments

@ikonst
Copy link

ikonst commented Mar 13, 2023

In #94424, the behavior of dataclasses for a descriptor that was assigned as a field default was defined and documented.

While I might be wrong here, a number of things lead me to believe it was an "accidental feature" being documented:

  1. It only works when the descriptor is assigned through the "Pythonic syntax" (my_field: T = descriptor), not my_field: T = field(default=descriptor).
  2. The descriptor's __get__ is called twice with instance=None, but not called for instance attribute gets.
  3. Inconsistently with that, __set__ is called for each instance attribute assignment.

We can probably address those with some amount of effort, but first I'd like to make sure this behavior is intentional and desired.

  1. What is the merit of this, vs. a property or a descriptor assigned as a ClassVar?
  2. Given that the __get__ is called only during class initialization with instance=None, then wouldn't the following be equivalent?
    -my_field: T = descriptor
    +my_field: T = field(default=descriptor.__get__())
    (Clearly the first line is shorter, but more vague - descriptors are not usually used for this purpose.)

Motivation: python/mypy#14869 brought up mypy's dataclasses plugin not properly understanding the finer semantics of assigning a descriptor, which led me to look at what exactly are the semantics.

@ikonst ikonst added the type-bug An unexpected behavior, bug, or error label Mar 13, 2023
@AlexWaygood AlexWaygood added the stdlib Python modules in the Lib dir label Mar 13, 2023
@carljm
Copy link
Member

carljm commented Mar 13, 2023

Thanks for the report! Detailed response below. In summary, I'm leaving this issue open because I think there is one potential code change follow-up (to make the behavior more consistent in the case of = field(default=ADescriptor())) and one potential docs follow-up (to clarify the difference in behavior between data and non-data descriptors.)

a number of things lead me to believe it was an "accidental feature" being documented

This is true, in the sense that the interaction with descriptors was not initially considered in the dataclasses design. But the choice to clarify and document the behaviors was made carefully and with a lot of discussion (much of it on the typing-sig mailing list IIRC.)

It only works when the descriptor is assigned through the "Pythonic syntax" (my_field: T = descriptor), not my_field: T = field(default=descriptor).

This is a good point; these two should probably be equivalent. In the latter case, the descriptor is (already today) assigned to the class and functions as a descriptor at runtime, but the default value in dataclass' internal Field object remains set to the descriptor object, instead of the value returned from its __get__ when fetched on the class. This means that on creation of a new instance, the descriptor's __set__ method is called with its own instance as the value to set. This definitely seems like broken behavior, which means I think we can fix it as a bug and shouldn't have to worry about breaking code relying on the current behavior.

The descriptor's get is called twice with instance=None

This doesn't seem like a problem; it just means that internally dataclasses is fetching the default value off the class more than once. A descriptor should be fine with having its __get__ called any number of times.

but not called for instance attribute gets

This is not true. Maybe you tested the __get__ behavior with a non-data descriptor (one that defines only __get__); these are overridden by a value in the instance attributes dict, so they would display the behavior you saw. (Even in this case you can still cause the descriptor __get__ to be called, if you first del the attribute off the instance.) But I agree that using non-data descriptors for dataclass fields doesn't have much if any use. Perhaps this should be called out more clearly in the documentation.

If you use a data descriptor (one defining __set__ and/or __delete__, as in the example added to the docs in #94424), then it takes precedence over the instance attribute dictionary, and you will see its __get__ method called on every instance attribute access.

All of this is normal Python descriptor behavior that is not particular to dataclasses.

Inconsistently with that, set is called for each instance attribute assignment.

This is true, and useful, but not inconsistent, because __get__ is also called on each attribute access.

What is the merit of this, vs. a property or a descriptor assigned as a ClassVar?

The field will be treated as a dataclass field, and thus e.g. included in the default generated __init__ etc methods.

Given that the get is called only during class initialization with instance=None, then wouldn't the following be equivalent?

No, because the premise (that __get__ is called only during class initialization) is false.

@ikonst
Copy link
Author

ikonst commented Mar 13, 2023

Agree that being called twice or any number of times is not a problem :)

Also, you are right that it does get called for instances if it's a data descriptor. It took me a while to understand why, but it checks out.

Also, I was wrong about my_field: T = field(default=descriptor) ("non Pythonic assignment") not working. It merely doesn't trigger the class __get__s, but if it's a data descriptor then the per-instance __get__s do get called.

Sorry for the confusion, and thanks for taking the time to explain point by point.

@carljm
Copy link
Member

carljm commented Mar 13, 2023

Also, I was wrong about my_field: T = field(default=descriptor) ("non Pythonic assignment") not working. It merely doesn't trigger the class __get__s, but if it's a data descriptor then the per-instance __get__s do get called.

Yes, but the problem is that we don't ever do the class __get__ and set that as the default value in the Field, and thus on instantation without a value for that field, the descriptor __set__ ends up getting called with the descriptor itself as value, which is weird and will likely break. E.g. using the example descriptor from the docs:

>>> class IntConversionDescriptor:
...     def __init__(self, *, default):
...             self._default = default
...     def __set_name__(self, owner, name):
...             self._name = "_" + name
...     def __get__(self, obj, type):
...             if obj is None:
...                     return self._default
...             return getattr(obj, self._name, self._default)
...     def __set__(self, obj, value):
...             setattr(obj, self._name, int(value))
...
>>> @dataclass
... class B:
...     x: int = field(default=IntConversionDescriptor(default=24))
...
>>> b = B()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3, in __init__
  File "<stdin>", line 11, in __set__
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'IntConversionDescriptor'
>>> B.__dataclass_fields__['x'].default
<__main__.IntConversionDescriptor object at 0x7faefd797c60>

See the confusing error message due to the IntConversionDescriptor being passed to __set__ as the value for the field. The reasion is the contents of the default attribute of the Field; contrast with this version:

>>> @dataclass
... class A:
...     x: int = IntConversionDescriptor(default=23)
...
>>> A.__dataclass_fields__['x'].default
23

Although the descriptor is attached to the class and used in both cases, I think the behavior of B in the above examples is just buggy and not useful to anyone; it would be better if it behaved like A. (The use case is that there may be a reason to supply a different argument to field, e..g init or repr or compare. And in general the two forms of supplying a default to a dataclass field should behave the same.)

@daveware-nv
Copy link

Just ran into this today, the documentation on: https://docs.python.org/3/library/dataclasses.html#descriptor-typed-fields only mentions the case when using the descriptor directly instead of via dataclasses.field(default=my_descriptor()). I'd consider the behavior of calling __set__ on the descriptor with the descriptor's instance a bug.

Following is a demonstration based off the example on the above documentation page:

from dataclasses import dataclass, field
from typing import Any


class IntConversionDescriptor:
    def __init__(self, *, default):
        self._default = default

    def __set_name__(self, owner, name):
        self._name = "_" + name

    def __get__(self, obj, type):
        if obj is None:
            return self._default

        return getattr(obj, self._name, self._default)

    def __set__(self, obj, value):
        setattr(obj, self._name, int(value))

@dataclass
class InventoryItemDirectDefault:
    quantity_on_hand: IntConversionDescriptor = IntConversionDescriptor(default=100)


@dataclass
class InventoryItemFieldDefault:
    quantity_on_hand: IntConversionDescriptor = field(default=IntConversionDescriptor(default=100))


@dataclass
class InventoryItemFieldDefaultInitFalse:
    quantity_on_hand: IntConversionDescriptor = field(init=False, default=IntConversionDescriptor(default=100))


def test(dc_type: type[Any], init_val: int | None) -> None:
    if init_val is None:
        i = dc_type()
    else:
        i = dc_type(init_val)
    print(f"dc_type: {dc_type.__name__}, init_val: {init_val}")
    print(i.quantity_on_hand)   # 100
    i.quantity_on_hand = 2.5    # calls __set__ with 2.5
    print(i.quantity_on_hand)   # 2


if __name__ == '__main__':
    test(InventoryItemFieldDefault, 9)
    test(InventoryItemDirectDefault, 9)
    test(InventoryItemDirectDefault, None)
    test(InventoryItemFieldDefaultInitFalse, None)
    test(InventoryItemFieldDefault, None)

expected output:

dc_type: InventoryItemFieldDefault, init_val: 9
9
2
dc_type: InventoryItemDirectDefault, init_val: 9
9
2
dc_type: InventoryItemDirectDefault, init_val: None
100
2
dc_type: InventoryItemFieldDefaultInitFalse, init_val: None
100
2
dc_type: InventoryItemFieldDefault, init_val: None
100
2

actual output:

dc_type: InventoryItemFieldDefault, init_val: 9
9
2
dc_type: InventoryItemDirectDefault, init_val: 9
9
2
dc_type: InventoryItemDirectDefault, init_val: None
100
2
dc_type: InventoryItemFieldDefaultInitFalse, init_val: None
100
2
Traceback (most recent call last):
  File "/redacted/scratch/bug_report.py", line 52, in <module>
    test(InventoryItemFieldDefault, None)
  File "/redacted/scratch/bug_report.py", line 38, in test
    i = dc_type()
        ^^^^^^^^^
  File "<string>", line 3, in __init__
  File "/redacted/scratch/bug_report.py", line 19, in __set__
    setattr(obj, self._name, int(value))
                             ^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'IntConversionDescriptor'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-dataclasses type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants