Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: RecursionError when attempting to replace np.nan values (#45725) #45749

Merged
merged 11 commits into from
Feb 3, 2022

Conversation

roib20
Copy link
Contributor

@roib20 roib20 commented Feb 1, 2022

@roib20
Copy link
Contributor Author

roib20 commented Feb 1, 2022

@github-actions pre-commit

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate replace replace method labels Feb 1, 2022
@mroeschke mroeschke added this to the 1.5 milestone Feb 1, 2022
@@ -248,6 +248,7 @@ Conversion
- Bug in :meth:`Series.astype` and :meth:`DataFrame.astype` from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (:issue:`45151`)
- Bug in :func:`array` with ``FloatingDtype`` and values containing float-castable strings incorrectly raising (:issue:`45424`)
- Bug when comparing string and datetime64ns objects causing ``OverflowError`` exception. (:issue:`45506`)
- Bug when attempting to replace ``numpy.nan`` values causing ``RecursionError``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i read the original issue correctly this then this was working in 1.4 so no actual change (its only a change in master), if so delete this note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right. I removed the note from the changelog.

@@ -661,6 +661,14 @@ def test_replace_simple_nested_dict_with_nonexistent_value(self):
result = df.replace({"col": {-1: "-", 1: "a", 4: "b"}})
tm.assert_frame_equal(expected, result)

@pytest.mark.parametrize("to_replace, value", [(np.nan, pd.NA), (np.nan, None)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add np.nan as a target

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, but had to change the tests to explicitly use object type even after replacing. Replacing np.nan with np.nan can cause an implicit type conversion to float64. There is still discussion going on about this, but for now this is the same behavior in pandas 1.4.0 and 1.3.5 so it shouldn't be changed in this bug fix.

def test_replace_numpy_nan(self, to_replace, value):
# GH#45725 ensure numpy.nan can be replaced with pandas.NA or None
df = DataFrame({"A": [to_replace]}, dtype=object)
result = df.replace({to_replace: value})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test usage with df.replace(to_replace, value) or is this tested close by here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes done.

@@ -36,6 +36,15 @@ def test_replace_explicit_none(self):
assert expected.iloc[-1] is None
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("to_replace, value", [(np.nan, pd.NA), (np.nan, None)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as above. consider just using the null_fixture which has all the nans

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nulls_fixture was a good solution. Implemented.

@jreback jreback merged commit d0a687c into pandas-dev:main Feb 3, 2022
@jreback
Copy link
Contributor

jreback commented Feb 3, 2022

thanks @roib20 very nice! keep em coming!

@simonjayhawkins
Copy link
Member

The tests added in this PR don't seem to fail on master with value = self._standardize_fill_value(value) # GH#45725 removed.

Also before #45568, which maybe the cause of the regression, the values were replaced.

2022-03-16T22:14:48.3750817Z 1.5.0.dev0+209.g0f5d934be7
2022-03-16T22:14:48.3751119Z 0    <NA>
2022-03-16T22:14:48.3751330Z 1    <NA>
2022-03-16T22:14:48.3751555Z dtype: object
2022-03-16T22:14:48.3751789Z       0
2022-03-16T22:14:48.3751988Z 0  None
2022-03-16T22:14:48.3752170Z 1  None

from bisect with following code sample

# BUG: RecursionError when attempting to replace np.nan values under main branch #45725


import numpy as np

import pandas as pd

print(pd.__version__)

NaN_values = [np.nan, np.nan]
ser = pd.Series(data=NaN_values)
df = pd.DataFrame(data=NaN_values)


ser = ser.replace({np.nan: pd.NA})
print(ser)

df = df.replace({np.nan: None})
print(df)

on master, same code sample gives

1.5.0.dev0+544.g49937b935d
0   NaN
1   NaN
dtype: float64
    0
0 NaN
1 NaN

not sure yet, but I think we will need to remove the ``value = self._standardize_fill_value(value) # GH#45725` added in this PR to fix #45601 and #45836

might propose reverting this PR until the regressions in 1.4.0 are fixed

@simonjayhawkins
Copy link
Member

The tests added in this PR don't seem to fail on master with value = self._standardize_fill_value(value) # GH#45725 removed.

but to be clear, the recursion error still exists when the fix added here is removed. Looks like need to revisit these tests.

For the record the traceback is currently

Traceback (most recent call last):
  File "/home/simon/45725.py", line 15, in <module>
    ser = ser.replace({np.nan: pd.NA})
  File "/home/simon/pandas/pandas/core/series.py", line 5000, in replace
    return super().replace(
  File "/home/simon/pandas/pandas/core/generic.py", line 6681, in replace
    return self.replace(
  File "/home/simon/pandas/pandas/core/series.py", line 5000, in replace
    return super().replace(
  File "/home/simon/pandas/pandas/core/generic.py", line 6730, in replace
    new_data = self._mgr.replace_list(
  File "/home/simon/pandas/pandas/core/internals/managers.py", line 421, in replace_list
    bm = self.apply(
  File "/home/simon/pandas/pandas/core/internals/managers.py", line 304, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 725, in replace_list
    result = blk._replace_coerce(
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 780, in _replace_coerce
    return self.replace(
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 605, in replace
    return blk.replace(
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 605, in replace
    return blk.replace(
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 605, in replace
    return blk.replace(
  [Previous line repeated 977 more times]
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 604, in replace
    blk = self.coerce_to_target_dtype(value)
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 449, in coerce_to_target_dtype
    return self.astype(new_dtype, copy=False)
  File "/home/simon/pandas/pandas/core/internals/blocks.py", line 523, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/simon/pandas/pandas/core/dtypes/astype.py", line 283, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/simon/pandas/pandas/core/dtypes/astype.py", line 215, in astype_array
    if is_datetime64tz_dtype(dtype) and is_datetime64_dtype(values.dtype):
  File "/home/simon/pandas/pandas/core/dtypes/common.py", line 386, in is_datetime64tz_dtype
    return DatetimeTZDtype.is_dtype(arr_or_dtype)
  File "/home/simon/pandas/pandas/core/dtypes/base.py", line 312, in is_dtype
    if isinstance(dtype, (ABCSeries, ABCIndex, ABCDataFrame, np.dtype)):
  File "/home/simon/pandas/pandas/core/dtypes/generic.py", line 45, in _check
    return getattr(inst, attr, "_typ") in comp
RecursionError: maximum recursion depth exceeded while calling a Python object

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Mar 17, 2022

Because the behavior no longer matches the pre-regression behavior and the fact that the tests do not actually catch the reported regression, I now propose to revert this PR and reopen #45725.

Even though this fix prevents the recursion, I think it will be cleaner to undo this and redo properly rather than patching it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate replace replace method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: RecursionError when attempting to replace np.nan values under main branch
4 participants