BUG: Series.astype is unable to handle NaN #46377

Open

ingted opened this issue Mar 15, 2022 · 7 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)


ingted commented Mar 15, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r)).astype(object).apply(lambda r: iif(np.isnan(r),None,r))
df

Issue Description

The result is:

0    2.0
1    NaN
2    1.0

Expected Behavior

The result should be:

0    2.0
1    None
2    1.0

I checked the issue "BUG: Replacing NaN with None in Pandas 1.3 does not work", but it seems astype doesn't work here.
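
For illustration, a minimal sketch of what astype does here (the output is what I would expect on pandas 1.4, not copied from a run): casting to object keeps the missing value as NaN, it does not replace it with None.

s = pd.Series([2.0, np.nan, 1.0])
print(s.astype(object).tolist())  # [2.0, nan, 1.0] -- the NaN survives the cast, no None appears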

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.950

pandas : 1.4.1
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : 1.4.32
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@ingted ingted added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 15, 2022
@ingted ingted changed the title BUG: Series is unable to handle NaN BUG: Series.astype is unable to handle NaN Mar 15, 2022
ingted (Author) commented Mar 15, 2022

However,

df.where(pd.notnull(df), None)

produces

0    2.0
1    None
2    1.0
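
To confirm the None really survives in the object column, a quick check (a sketch, assuming df is still object dtype as in the example above):

out = df.where(pd.notnull(df), None)
print(out["a"][1] is None)  # True -- the NaN was replaced by an actual None
print(out["a"].dtype)       # object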

@norisa118 norisa118 removed their assignment Mar 31, 2022
rhshadrach (Member) commented

Thanks for the report! I wonder if apply is coercing the result into a numeric dtype. I would agree this is not expected; further investigation and PRs to fix are welcome!
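
A minimal check of that hypothesis (a sketch; the output is what I would expect on pandas 1.4, not taken from the report):

s = pd.Series([2.0, np.nan, 1.0], dtype=object)
mapped = s.apply(lambda r: None if np.isnan(r) else r)
print(mapped.dtype)  # float64 -- apply re-infers a numeric dtype, so the None comes back as NaN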

@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Jun 28, 2022
@kapiliyer kapiliyer removed their assignment Jul 18, 2022
kapiliyer (Contributor) commented

take

kapiliyer (Contributor) commented Jul 25, 2022

Hi, @ingted. I have been looking over your example, and I think the issue may not be with .astype(). The reason this example produces a Series whose second value is NaN rather than None is that, by default, as @rhshadrach guessed, .apply() has convert_dtype=True. This coerces the result of applying the provided function to whatever dtype it deems best. If you instead change your example to the following, with convert_dtype=False, your desired output is achieved:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False)
df

Including the (now redundant) occurrences of .astype() also works:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False).astype(object)
df

Let me know if I am misunderstanding the issue and there actually is a problem with .astype().

That being said, it is interesting that the conversion chosen for .apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=True) converts the Series to float64 instead of keeping it as object. Will look into this more.

@rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?
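
For comparison, a sketch of what I would expect the two calls to return (not copied from a run; pandas 1.4 behavior assumed):

s = pd.Series([2.0, np.nan, 1.0], dtype=object)

s.apply(lambda r: None if np.isnan(r) else r, convert_dtype=False)
# 0     2.0
# 1    None
# 2     1.0
# dtype: object -- the None is preserved

s.apply(lambda r: None if np.isnan(r) else r)
# 0    2.0
# 1    NaN
# 2    1.0
# dtype: float64 -- the default convert_dtype=True re-infers float64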

kapiliyer (Contributor) commented Jul 25, 2022

In this case, .apply() calls map_infer(), which, after applying the function to all elements, calls maybe_convert_objects() when convert_dtype=True. That function decides to convert the Series to float64 because it sees that the Series consists of floats and a None. From the code, this actually looks intentional. The specific control flow I am talking about is (in pandas/_libs/lib.pyx):

    floats = cnp.PyArray_EMPTY(1, objects.shape, cnp.NPY_FLOAT64, 0)

...

    for i in range(n):
        val = objects[i]
        if itemsize_max != -1:
            itemsize = get_itemsize(val)
            if itemsize > itemsize_max or itemsize == -1:
                itemsize_max = itemsize

        if val is None:
            seen.null_ = True
            floats[i] = complexes[i] = fnan
            mask[i] = True
        elif val is NaT:
            seen.nat_ = True
            if convert_datetime:
                idatetimes[i] = NPY_NAT
            if convert_timedelta:
                itimedeltas[i] = NPY_NAT
            if not (convert_datetime or convert_timedelta or convert_period):
                seen.object_ = True
                break
        elif val is np.nan:
            seen.nan_ = True
            mask[i] = True
            floats[i] = complexes[i] = val
        elif util.is_bool_object(val):
            seen.bool_ = True
            bools[i] = val
        elif util.is_float_object(val):
            floats[i] = complexes[i] = val
            seen.float_ = True

...

    if not seen.object_:
        result = None
        if not safe:
            if seen.null_ or seen.nan_:
                if seen.is_float_or_complex:
                    if seen.complex_:
                        result = complexes
                    elif seen.float_:
                        result = floats

...

        if result is uints or result is ints or result is floats or result is complexes:
            # cast to the largest itemsize when all values are NumPy scalars
            if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
                result = result.astype(result.dtype.kind + str(itemsize_max))
            return result
        elif result is not None:
            return result
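
The same conversion can be exercised directly on the internal helper (a sketch only; pandas._libs.lib is private API, so the exact signature may differ between versions):

import numpy as np
from pandas._libs import lib

arr = np.array([2.0, None, 1.0], dtype=object)
converted = lib.maybe_convert_objects(arr)
print(converted, converted.dtype)  # [ 2. nan  1.] float64 -- the None is folded into NaN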

ingted (Author) commented Jul 25, 2022

Not sure why convert_dtype=False should be required if the user doesn't know about it...

At least a similar operation on a DataFrame doesn't need convert_dtype=False to be specified...

mroeschke pushed a commit that referenced this issue Aug 1, 2022

BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673) (#47780)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673)
* BUG: Series.astype is unable to handle pd.nan (#46377)
rhshadrach (Member) commented

> @rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?

At the current state, there are various places where we treat None as np.nan, and this behavior is consistent with that.

df = pd.DataFrame({'a': [1, 2, None]})
print(df)

#      a
# 0  1.0
# 1  2.0
# 2  NaN

df = pd.DataFrame({'a': [1, np.nan, None], 'b': [2, 3, 4]})
gb = df.groupby('a', dropna=False)
print(gb.sum())

#      b
# a     
# 1.0  2
# NaN  7

That said, I have a hunch that it would be better if pandas were to treat None as a Python object instead of np.nan. However, this issue needs more study for me to feel more certain of this.
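
A small sketch of that distinction (pandas 1.4 behavior as I understand it): today None is coerced to NaN during inference unless the user opts into object dtype explicitly.

pd.Series([1, 2, None]).dtype                      # float64 -- None is coerced to NaN
pd.Series([1, 2, None], dtype=object)[2] is None   # True -- object dtype keeps None as-is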

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022