Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EA: _can_hold_element, _validate_fill_value #36226

Closed
jbrockmendel opened this issue Sep 8, 2020 · 10 comments
Closed

EA: _can_hold_element, _validate_fill_value #36226

jbrockmendel opened this issue Sep 8, 2020 · 10 comments
Labels
API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

TL;DR: we should add _validate_fill_value to the EA interface and define _can_hold_element in terms of it.

ATM ExtensionBlock._can_hold_element incorrectly always returns True. This needs to be an EA method that returns True if and only if self[0] = value is allowed on a non-empty array.

This method can in turn to defined in terms of a more broadly useful _validate_fill_value that we use DTA/TDA/PA/PandasArray/Categorical:

def _can_hold_element(self, value):
    try:
        self._validate_fill_value(value)
        return True
    except (ValueError, TypeError):
        return False
@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member API Design ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 8, 2020
@jorisvandenbossche
Copy link
Member

What would be the difference between _validate_fill_value and _can_hold_element? (or how it is different now for any of our arrays?)


Also, before adding it to the EA interface, I think we should discuss for a moment the use case. I think right now _can_hold_element is typically used to know if a setitem operation (or another operation like fillna) which modifies values inplace will preserve the dtype? (and thus block type in the BlockManager) Is that correct?

In practice (and for the reasons that _can_hold_element indeed always returns True for ExtensionBlock), some of the EAs currently disallow such dtype change for inplace edits:

In [16]: s = pd.Series([1, 2, 3], dtype="Int64")          

In [17]: s[0] = "a"      
...
TypeError: <U1 cannot be converted to an IntegerDtype

which contrasts the default dtypes:

In [18]: s = pd.Series([1, 2, 3], dtype="int64")  

In [19]: s[0] = "a"      

In [20]: s   
Out[20]: 
0    a
1    2
2    3
dtype: object

Personally, I kind of like this stricter behaviour of the EA dtypes. For example also numpy doesn't allow such dtype changes on setitem.

@jbrockmendel
Copy link
Member Author

What would be the difference between _validate_fill_value and _can_hold_element? (or how it is different now for any of our arrays?)

_can_hold_element would be accessed from Block, so would need to be part of the official interface. validate_fill_value is more a pattern that would be useful to provide default implementations for other methods which include can_hold_element

@jbrockmendel
Copy link
Member Author

In practice (and for the reasons that _can_hold_element indeed always returns True for ExtensionBlock), some of the EAs currently disallow such dtype change for inplace edits [...] Personally, I kind of like this stricter behaviour of the EA dtypes. For example also numpy doesn't allow such dtype changes on setitem.

I also like the dont-silently-copy/cast behavior, but right now we are getting it because of a _can_hold_element implementation that is just wrong. Moreover we are getting it inconsistently.

If you wanted to write an EA to get the raise-instead-of-cast Series behavior you could do:

def _can_hold_element(self, element):
   if self._can_hold_element_actual(element):
        return True
    raise TypeError(...)

def _can_hold_element_actual(self, element):
   [...]

@jbrockmendel
Copy link
Member Author

@jreback @jorisvandenbossche @TomAugspurger any objection to adding a _can_hold_element to the EA interface? This is a) a blocker for merging DTBlock into ExtensionBlock and b) likely the source of some other bugs

Spitballing the default implementation, we could either start with the existing (wrong) return True in ExtensionBlock, or could do something like

def _can_hold_element(self, element: Any) -> bool:
    """
    Check if the given element can be set (via setitem) into an array with this type and dtype.

    Parameters
    -----------
    element : Any

    Returns
    --------
    bool

    Notes
    -----
    This check ignores all length considerations, e.g. may return True even if len(self) == 0
    """
    if not is_list_like(element):  # <-- may need something more configurable than this
        element = [element]
    try:
        type(self)._from_sequence(element, dtype=self.dtype)
    except (ValueError, TypeError):
        return False
    return True

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 16, 2020 via email

@jbrockmendel
Copy link
Member Author

Do we know if this is going to be a scalar (as defined by the type?).

Block._can_hold_element is also used for arraylike

@jorisvandenbossche
Copy link
Member

I want to repeat my question from above #36226 (comment) related to the precise use case of this.

Assume we would decide to have EAs to be strict regarding dtype (and not let it upcast on setitem). Do we then actually need this method in the interface?

I know that the datetime-like EAs don't have this strict behaviour, and that's not something we can simply change. But if it's only for them, we could also leave that as a special case to deal with in ExtensionBlock, instead of adding it to the general EA interface.

@jbrockmendel
Copy link
Member Author

Assume we would decide to have EAs to be strict regarding dtype (and not let it upcast on setitem). [...] I know that the datetime-like EAs don't have this strict behaviour,

I think we need to distinguish between where casting can occur. DTA/TDA/PA will upcast element when __setitem__ is called, but the array itself doesn't get upcast. By contrast, Block's setitem-like methods will cast the array if not self._can_hold_element(element).

Assume we would decide to have EAs to be strict regarding dtype

Are you suggesting that what EA.__setitem__ accepts would be strict somehow?


In the status quo we have EABlock._can_hold_element always returning True, and we don't learn that it is wrong until we try to do the setitem and it raises.

Working in the core.indexing code, particularly in ILoc._setitem_with_indexer, I'm finding things would be a lot simpler if we could reliably pre-check if a setitem can be done inplace or if it will require casting.


Another frequently-discussed method/capability is deciding what dtype we need to promote to for certain operations. It might make sense to have a can_hold_element-like method that returns the upcast dtype needed when can_hold is False


Another option is adding _can_hold_element to ExtensionArray but not make it part of the interface (at least not yet).

@jorisvandenbossche
Copy link
Member

Are you suggesting that what EA.__setitem__ accepts would be strict somehow?

Yes, or at least that is what I think we should discuss / decide.

@jbrockmendel
Copy link
Member Author

We're in the process of moving away from the _can_hold_element pattern, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

5 participants