API/DEPR: int downcasting in DataFrame.where #44597

Closed
jbrockmendel opened this issue Nov 24, 2021 · 5 comments · Fixed by #45009
Labels
API Design, Deprecate (Functionality to remove in pandas)

@jbrockmendel
Member

jbrockmendel commented Nov 24, 2021

Block.where has special downcasting logic that splits blocks differently from any other Block methods. I would like to deprecate and eventually remove this bespoke logic.

The relevant logic is only reached AFAICT when we have integer dtype (non-int64) and an integer other too big for this dtype, AND the passed cond has all-True columns.

(Identifying the affected behavior is difficult in part because it relies on can_hold_element incorrectly returning True in these cases)

import numpy as np
import pandas as pd

arr = np.arange(6).astype(np.int16).reshape(3, 2)
df = pd.DataFrame(arr)

mask = np.zeros(arr.shape, dtype=bool)
mask[:, 0] = True

res = df.where(mask, 2**17)

>>> res.dtypes
0    int16
1    int32
dtype: object

The simplest thing to do would be to not do any downcasting in these cases, in which case we would end up with all-int32. The next simplest would be to downcast column-wise, which would give the same end result but with less consolidation.
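To make the two options concrete, here is a rough sketch that emulates them from user code. It does not touch Block.where or any pandas internals: `downcast_columnwise` is a hypothetical helper written for this illustration, and the explicit `astype(np.int32)` upcast is an assumption used so the result does not depend on which pandas version is installed.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).astype(np.int16).reshape(3, 2)
df = pd.DataFrame(arr)

mask = np.zeros(arr.shape, dtype=bool)
mask[:, 0] = True

# Option 1: no downcasting. Upcast up front so every column shares
# the dtype that can hold `other`.
no_downcast = df.astype(np.int32).where(mask, 2**17)


def downcast_columnwise(result, original):
    """Hypothetical helper: cast each column back to its original
    integer dtype when all of its values still fit in that dtype."""
    out = {}
    for col in result.columns:
        orig_dtype = original[col].dtype
        info = np.iinfo(orig_dtype)
        column = result[col]
        if column.min() >= info.min and column.max() <= info.max:
            column = column.astype(orig_dtype)
        out[col] = column
    return pd.DataFrame(out)


# Option 2: column-wise downcast of the upcast result.
columnwise = downcast_columnwise(no_downcast, df)

print(no_downcast.dtypes)
print(columnwise.dtypes)
```

In this sketch option 1 leaves both columns as int32, while option 2 restores int16 only for column 0, whose values were never replaced.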

We do not have any test cases that fail if I disable this downcasting (after I fix a problem with an expressions.where call that the downcasting somehow makes irrelevant). This makes me think the current behavior is not intentional, or at least not a priority.

Any objection to deprecating the integer downcasting entirely?

jbrockmendel added the Bug, Needs Triage, API Design, and Deprecate labels and removed the Bug and Needs Triage labels on Nov 24, 2021
@jorisvandenbossche
Member

I am +1 on deprecating the "try casting back to original dtype" for this specific case. We actually had a keyword for that before (try_cast, which you deprecated in #38836, and which was at that point already ignored); maybe at some point those were connected.

One note though: if we don't "cast back" for columns that were not affected, we should IMO also not preserve the original dtype if none of the columns were affected (so if the mask is fully True).

@jbrockmendel
Member Author

> we should IMO also not preserve the original dtype if none of the columns were affected (so if the mask is fully True).

if none of the columns were affected, then i'd expect this to be a no-op. You're suggesting we would do some casting instead? An example might be helpful.

@jorisvandenbossche
Member

Yes, eg your example but with the mask completely True:

In [7]: arr = np.arange(6).astype(np.int16).reshape(3, 2)
   ...: df = pd.DataFrame(arr)

In [8]: mask = np.ones(arr.shape, dtype=bool)

In [9]: df.where(mask, 2**17).dtypes
Out[9]: 
0    int16
1    int16
dtype: object

So this is indeed a no-op currently, preserving the dtype. But if we want behaviour that doesn't depend on the exact content of the mask (and only on the dtypes of the calling df and other), then the above should also give int32.

If we don't want to do that because of "let's not cast if we don't have to (i.e. in case of a no-op)", then I think we should keep the "downcast" of the original example here, as that is not actually a downcast, but undoing the upcast, so a preservation of the original dtype for a no-op (when considering just that column).

@jorisvandenbossche
Member

For reference, numpy always gives the same dtype as result regardless of the mask being fully True/False or not:

In [11]: arr = np.array([1, 2, 3], dtype="int8")

In [12]: np.where(arr < 10, arr, 2**17)
Out[12]: array([1, 2, 3], dtype=int32)

In [13]: np.where(arr > 10, arr, 2**17)
Out[13]: array([131072, 131072, 131072], dtype=int32)
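The same point can be checked across several masks in one go. This is only a sketch: giving `other` an explicit `np.int32` type is an assumption added here so the promotion does not rely on how a particular NumPy version treats a bare Python integer, and the `results` dict is just an illustrative name.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int8")
other = np.int32(2**17)  # explicitly typed replacement value (assumption)

# Conditions that are all-True, all-False, and mixed for this array.
conditions = {
    "all_true": arr < 10,
    "all_false": arr > 10,
    "mixed": arr > 1,
}

# The result dtype comes from the operand dtypes, not from the mask contents.
results = {name: np.where(cond, arr, other) for name, cond in conditions.items()}

for name, res in results.items():
    print(name, res.dtype, res)
```

Every result here should come back with the same dtype, whether or not any value was actually replaced.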

@jbrockmendel
Member Author

makes sense, thanks.

will need to look into how this would affect putmask; I'd like to keep the behaviors symmetric where possible.
