BUG?: using None as replacement value in replace() typically upcasts to object dtype #60284

Open
jorisvandenbossche opened this issue Nov 12, 2024 · 2 comments
Labels
API - Consistency Internal Consistency of API/Behavior Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate replace replace method

jorisvandenbossche commented Nov 12, 2024

I noticed that in certain cases, when replacing a value with None, we always cast to object dtype, regardless of whether the dtype of the calling Series can actually hold None (at least when treating None as a generic "missing value" indicator).

For example, a float Series can hold None in the sense of holding missing values, which is how None is treated in setitem:

>>> ser = pd.Series([1, 2, 3], dtype="float")
>>> ser[1] = None
>>> ser
0    1.0
1    NaN
2    3.0
dtype: float64

However, when using replace() to replace the value 2.0 with None, the result depends on exactly how the to_replace/value combination is specified, but it typically upcasts to object:

# with list
>>> ser.replace([1, 2], [10, None])
0    10.0
1    None
2     3.0
dtype: object

# with Series -> here it gives NaN but that is because the Series constructor already coerces the None
>>> ser.replace(pd.Series({1: 10, 2: None}))
0    10.0
1     NaN
2     3.0
dtype: float64

# with scalar replacements
>>> ser.replace(1, 10).replace(2, None)
0    10.0
1    None
2     3.0
dtype: object

In all the above cases, using np.nan instead of None as the replacement value of course just results in a float Series with NaN.
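For anyone who wants to reproduce the comparison, here is a minimal sketch that runs the list and scalar forms side by side with np.nan and with None. The variable names are only for illustration. The None results are printed rather than assumed, since they may differ between pandas versions.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="float")

# np.nan as the replacement value: the result stays float64.
nan_list = ser.replace([1, 2], [10, np.nan])
nan_scalar = ser.replace(1, 10).replace(2, np.nan)

# None as the replacement value: per the report above, these upcast to object.
none_list = ser.replace([1, 2], [10, None])
none_scalar = ser.replace(1, 10).replace(2, None)

results = {
    "list, np.nan": nan_list,
    "scalar, np.nan": nan_scalar,
    "list, None": none_list,
    "scalar, None": none_scalar,
}
for label, result in results.items():
    print(f"{label}: dtype={result.dtype}")
```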

The reason for this is two-fold. First, in Block._replace_coerce there is a check specifically for value is None and in that case we always cast to object dtype:

if value is None:
    # gh-45601, gh-45836, gh-46634
    if mask.any():
        has_ref = self.refs.has_reference()
        nb = self.astype(np.dtype(object))

The above is used when replacing with a list of values. The scalar case also casts to object dtype, but for a different reason: there we check self._can_hold_element(value) to decide whether the replacement can be done with a simple setitem (and if not, we cast to object dtype first before trying again). But it seems that can_hold_element(np.array([], dtype=float), None) gives False.
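The can_hold_element check can be tried directly. This is only a sketch: it imports from pandas.core.dtypes.cast, which is private pandas internals, so the import path and behaviour are assumptions that may change between versions.

```python
import numpy as np

# Private pandas internals: not a public API, may move or change.
from pandas.core.dtypes.cast import can_hold_element

float_arr = np.array([], dtype=float)

# None is not accepted as an element of a float array by this check ...
holds_none = can_hold_element(float_arr, None)
# ... while np.nan, being a float, is.
holds_nan = can_hold_element(float_arr, np.nan)

print(f"can hold None: {holds_none}")
print(f"can hold np.nan: {holds_nan}")
```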


Everything is tested with current main (3.0.0.dev), but I see the same behaviour on older releases (2.0 and 1.5).


Somewhat related issue:

@jorisvandenbossche jorisvandenbossche added Bug replace replace method API - Consistency Internal Consistency of API/Behavior labels Nov 12, 2024
@jorisvandenbossche
Member Author

Given that the scalar case depends on can_hold_element, which has custom logic for datetimelike data (and so does handle None as missing), I expect an inconsistency between the scalar and list cases for that kind of data. That is indeed the case:

>>> ser = pd.Series(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
>>> ser.replace(["2020-01-01", "2020-01-02"], ["2020-01-10", None])
0    2020-01-10 00:00:00
1                   None
2    2020-01-03 00:00:00
dtype: object

>>> ser.replace("2020-01-02",  None)
0   2020-01-01
1          NaT
2   2020-01-03
dtype: datetime64[ns]
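Until the two paths agree, one way to mark a specific value as missing without depending on how replace() treats None is Series.mask. To be clear, this is a different technique from the one discussed in this issue, shown only as a hedged sketch. The variable names are illustrative.

```python
import pandas as pd

# Float case: mask the value 2.0; the default fill is the dtype's missing value.
float_ser = pd.Series([1, 2, 3], dtype="float")
float_out = float_ser.mask(float_ser == 2)

# Datetime case: mask one timestamp; the missing value here is NaT.
dt_ser = pd.Series(
    ["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]"
)
dt_out = dt_ser.mask(dt_ser == pd.Timestamp("2020-01-02"))

print(float_out.dtype, dt_out.dtype)
```

Unlike replace(), this only covers the "set to missing" part; replacing other values (such as 1 with 10) would still need a separate step.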

@jorisvandenbossche
Member Author

I suppose one use case for ending up with object dtype is replacing all missing values in your DataFrame with None, for example if some next step (typically one that iterates over the values in the DataFrame or converts them to Python lists/dicts, I assume) cannot handle NaN and requires None: df.replace({np.nan: None}).

That currently somewhat works to actually get None values (xref #44485). Of course, an alternative here is to first manually cast to object dtype before replacing: df.astype(object).replace({np.nan: None})
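A small sketch of that explicit alternative, using a made-up two-column DataFrame and converting to records as an example of a next step that wants None:

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the issue.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# Cast to object first so the columns can hold None, then swap NaN for None.
out = df.astype(object).replace({np.nan: None})

# Example downstream step: plain Python dicts with None for missing values.
records = out.to_dict("records")
print(records)
```

Casting first makes the object dtype an explicit choice by the caller rather than a side effect of replace().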

@jorisvandenbossche jorisvandenbossche added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 12, 2024