BUG: Dtypes change when using replace with nullable dtypes #40732

Closed
2 of 3 tasks
tamargrey opened this issue Apr 1, 2021 · 3 comments · Fixed by #44940
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions NA - MaskedArrays Related to pd.NA and nullable extension arrays replace replace method

@tamargrey

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

floats = pd.Series([1.0, 2.0, 3.999, 4.4], dtype=pd.Float64Dtype())
ints = pd.Series([1, 2, 3, 4], dtype=pd.Int64Dtype())

assert floats.replace({1.0: 9}).dtype == 'int64' # results in [9, 2, 3, 4]
assert floats.replace({1.0: 9.0}).dtype == 'float64'

assert ints.replace({1: 9}).dtype == 'int64'
assert ints.replace({1: 9.0}).dtype == 'float64'

Problem description

The dtype of the resulting Series changes when replace is called on a Series with Int64Dtype or Float64Dtype.

There are two problems here, from what I can tell:

  • We lose the nullable dtypes in favor of the NumPy dtypes int64 and float64.
  • When calling replace on a float Series with an integer value, the resulting Series is converted to ints, truncating the decimals.

Expected Output

assert floats.replace({1.0: 9}).dtype == pd.Float64Dtype() # Matching how an int passed to replace gets converted to a float with float64
assert floats.replace({1.0: 9.0}).dtype == pd.Float64Dtype() 

assert ints.replace({1: 9}).dtype == pd.Int64Dtype()
assert ints.replace({1: 9.0}).dtype ==  pd.Float64Dtype() # Matching how int -> float if a float is passed to replace
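
To compare the observed dtypes against the expected ones on a given pandas version, a small helper along these lines may be useful. It is only a sketch: `replace_dtype_report` is a hypothetical name, not part of pandas, and the output depends on the installed version, so no expected values are shown.

```python
import pandas as pd


def replace_dtype_report(series, mapping):
    """Run Series.replace with a dict and report the dtype before and after.

    Hypothetical helper for triage; not part of the pandas API.
    """
    result = series.replace(mapping)
    return {
        "before": str(series.dtype),
        "after": str(result.dtype),
        # Keep the extension dtype on the left-hand side of the comparison.
        "preserved": bool(series.dtype == result.dtype),
    }


floats = pd.Series([1.0, 2.0, 3.999, 4.4], dtype=pd.Float64Dtype())
ints = pd.Series([1, 2, 3, 4], dtype=pd.Int64Dtype())

if __name__ == "__main__":
    cases = [
        ("floats", floats, {1.0: 9}),
        ("floats", floats, {1.0: 9.0}),
        ("ints", ints, {1: 9}),
        ("ints", ints, {1: 9.0}),
    ]
    for name, series, mapping in cases:
        print(name, mapping, replace_dtype_report(series, mapping))
```

A `preserved` value of False for any of the four cases indicates that the nullable dtype was lost on that version.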

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Sun Jul 5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 41.2.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.7
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@tamargrey tamargrey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 1, 2021
@jorisvandenbossche
Member

@tamargrey thanks for the report!

This is indeed not the desired behaviour, and I agree that we should try to preserve the nullable dtype instead of returning the numpy dtype.

On the development branch, I get slightly different behaviour:

In [2]: floats.replace({1.0: 9})
Out[2]: 
0        9
1      2.0
2    3.999
3      4.4
dtype: object

So the result is no longer truncated to integers, but it has object dtype. That at least avoids data loss, but it should still preserve the Float64 dtype.

@jorisvandenbossche jorisvandenbossche added NA - MaskedArrays Related to pd.NA and nullable extension arrays replace replace method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 1, 2021
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Apr 1, 2021
@zkessel

zkessel commented Apr 1, 2021

Using the to_replace and value parameters seems to produce the desired behavior in 3 of the 4 cases.

assert floats.replace(to_replace=1.0, value=9).dtype == pd.Float64Dtype() #True
assert floats.replace(to_replace=1.0, value=9.0).dtype == pd.Float64Dtype()  #True

assert ints.replace(to_replace=1, value=9).dtype == pd.Int64Dtype()  #True
assert ints.replace(to_replace=1, value=9.0).dtype == pd.Int64Dtype()  #True, desired dtype is pd.Float64Dtype()

However, the undesired behavior is reproduced when passing lists for the to_replace and value parameters. For example:

assert floats.replace(to_replace=[1.0,2.0], value=[9.0,10.0]).dtype == 'float64' #True, desired dtype is pd.Float64Dtype() 
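
Until the dtype is preserved by replace itself, one possible workaround is to cast the result back to the original dtype. The following is a minimal sketch under that assumption; `replace_keep_dtype` is a hypothetical helper, not a pandas function. Note that it only restores the dtype: if the replaced values were already altered (such as the truncation to ints described in the original report), the cast does not recover them.

```python
import pandas as pd


def replace_keep_dtype(series, to_replace, value):
    """Call Series.replace, then cast back if the dtype changed.

    Hypothetical workaround helper; not part of the pandas API.
    """
    result = series.replace(to_replace=to_replace, value=value)
    # Keep the extension dtype on the left-hand side of the comparison.
    if series.dtype != result.dtype:
        result = result.astype(series.dtype)
    return result


floats = pd.Series([1.0, 2.0, 3.999, 4.4], dtype=pd.Float64Dtype())
fixed = replace_keep_dtype(floats, [1.0, 2.0], [9.0, 10.0])
print(fixed.dtype)
```

The cast is skipped when the dtype already matches, so the helper should be a no-op on versions where replace preserves the nullable dtype.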

@jorisvandenbossche
Member

The same happens for the string dtype, where replace results in object dtype:

>>> pd.__version__
'1.3.0.dev0+1219.gccc6a707e2'
>>> 
>>> s = pd.Series(["one", "two", np.nan], dtype="string")
>>> 
>>> s
0     one
1     two
2    <NA>
dtype: string
>>> 
>>> s.replace({"one": "1", "two": "2"})
0       1
1       2
2    <NA>
dtype: object
>>> 

(from #40755 (comment))
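
For the string case, a sketch of checking that the missing value and the dtype both survive once the result is cast back explicitly. This assumes that casting with astype is acceptable as a stopgap; it is not the fix that was applied in pandas.

```python
import pandas as pd

s = pd.Series(["one", "two", pd.NA], dtype="string")

# Cast back to the original dtype in case replace returned object dtype.
replaced = s.replace({"one": "1", "two": "2"}).astype(s.dtype)

print(replaced.dtype)
print(replaced.isna().tolist())
```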

@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Aug 19, 2021
jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue Dec 18, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Dec 19, 2021
5 participants