BUG?: using `None` as replacement value in `replace()` typically upcasts to object dtype #60284

jorisvandenbossche · 2024-11-12T10:45:46Z

I noticed that in certain cases, when replacing a value with None, we always cast to object dtype, regardless of whether the dtype of the calling Series can actually hold None (at least, when considering None just as a generic "missing value" indicator).

For example, a float Series can hold None in the sense of holding missing values, which is how None is treated in setitem:

>>> ser = pd.Series([1, 2, 3], dtype="float")
>>> ser[1] = None
>>> ser
0    1.0
1    NaN
2    3.0
dtype: float64

However, when using replace() to replace the value 2.0 with None, the result depends on exactly how the to_replace/value combo is specified, but typically it will upcast to object:

# with list
>>> ser.replace([1, 2], [10, None])
0    10.0
1    None
2     3.0
dtype: object

# with Series -> here it gives NaN but that is because the Series constructor already coerces the None
>>> ser.replace(pd.Series({1: 10, 2: None}))
0    10.0
1     NaN
2     3.0
dtype: float64

# with scalar replacements
>>> ser.replace(1, 10).replace(2, None)
0    10.0
1    None
2     3.0
dtype: object

In all the above cases, when using np.nan instead of None as the replacement value, it of course just results in a float Series with NaN.
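For comparison, a minimal sketch of the np.nan variant of the list and scalar calls above (the variable names via_list and via_scalar are only for illustration):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="float")

# Same calls as in the examples above, but with np.nan instead of None
via_list = ser.replace([1, 2], [10, np.nan])
via_scalar = ser.replace(1, 10).replace(2, np.nan)

# Both results keep the float dtype of the original Series
print(via_list.dtype, via_scalar.dtype)
```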

The reason for this is two-fold. First, in Block._replace_coerce there is a check specifically for value is None and in that case we always cast to object dtype:

pandas/pandas/core/internals/blocks.py

Lines 906 to 910 in 5f23ace

    
           if value is None: 
        
               # gh-45601, gh-45836, gh-46634 
        
               if mask.any(): 
        
                   has_ref = self.refs.has_reference() 
        
                   nb = self.astype(np.dtype(object))

The above is used when replacing with a list of values. For the scalar case, we also cast to object dtype, because there we check self._can_hold_element(value) to decide whether the replacement can be done with a simple setitem (and if not, we cast to object dtype first before trying again). But it seems that can_hold_element(np.array([], dtype=float), None) gives False.
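A rough way to check the scalar-path observation directly. Note this imports can_hold_element from pandas.core.dtypes.cast, which is a private helper and may move or change between versions; the np.nan call is added here only as a point of comparison:

```python
import numpy as np
from pandas.core.dtypes.cast import can_hold_element  # private helper, not public API

arr = np.array([], dtype=float)

# None is not treated as a holdable (missing) value for a float array
holds_none = can_hold_element(arr, None)

# np.nan, by contrast, is a regular float and can be held
holds_nan = can_hold_element(arr, np.nan)

print(holds_none, holds_nan)
```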

Everything is tested with current main (3.0.0.dev), but I see the same behaviour on older releases (2.0 and 1.5)

Somewhat related issue:

Inconsistent behavior for df.replace() with NaN, NaT and None #29024


jorisvandenbossche · 2024-11-12T10:49:48Z

Given that the scalar case depends on can_hold_element, which has custom logic for datetimelike data (and so does handle None as missing), I expected to see an inconsistency between the scalar and list case for that kind of data. That is indeed the case:

>>> ser = pd.Series(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
>>> ser.replace(["2020-01-01", "2020-01-02"], ["2020-01-10", None])
0    2020-01-10 00:00:00
1                   None
2    2020-01-03 00:00:00
dtype: object

>>> ser.replace("2020-01-02",  None)
0   2020-01-01
1          NaT
2   2020-01-03
dtype: datetime64[ns]

jorisvandenbossche · 2024-11-12T12:45:14Z

I suppose one use case for ending up with object dtype is if you want to replace all missing values with None in your DataFrame, for example if some next step (typically one that iterates over the values in the DataFrame or converts them to Python lists/dicts, I assume) cannot handle NaN and requires None: df.replace({np.nan: None}).

That currently somewhat works to actually get None values (xref #44485). Of course, an alternative here is to first manually cast to object dtype before replacing: df.astype(object).replace({np.nan: None})
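A small sketch of that alternative, assuming a toy DataFrame with float columns (the column names and the to_dict step are only illustrative of a "next step" that wants None):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# Cast to object first so the columns can hold None, then replace NaN with None
as_obj = df.astype(object).replace({np.nan: None})

# Example downstream step: convert to plain Python dicts
records = as_obj.to_dict(orient="records")
print(records)
```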

jorisvandenbossche added the labels Bug, replace (replace method), API - Consistency (Internal Consistency of API/Behavior), and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) on Nov 12, 2024
