BUG: Series.astype is unable to handle NaN #46377

Open

ingted opened this issue Mar 15, 2022 · 7 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)


ingted commented Mar 15, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r)).astype(object).apply(lambda r: iif(np.isnan(r),None,r))
df

Issue Description

The result is:

0    2.0
1    NaN
2    1.0

Expected Behavior

The result should be:

0    2.0
1    None
2    1.0

I checked the issue "BUG: Replacing NaN with None in Pandas 1.3 does not work", but it seems astype doesn't work here.
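
For illustration, a minimal sketch of what astype does here (the output is what I would expect on pandas 1.4, not copied from a run): casting to object keeps the missing value as NaN, it does not replace it with None.

s = pd.Series([2.0, np.nan, 1.0])
print(s.astype(object).tolist())  # [2.0, nan, 1.0] -- the NaN survives the cast, no None appears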

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.950

pandas : 1.4.1
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : 1.4.32
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@ingted ingted added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 15, 2022
@ingted ingted changed the title BUG: Series is unable to handle NaN BUG: Series.astype is unable to handle NaN Mar 15, 2022
ingted (Author) commented Mar 15, 2022

However,

df.where(pd.notnull(df), None)

produces

0    2.0
1    None
2    1.0
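
To confirm the None really survives in the object column, a quick check (a sketch, assuming df is still object dtype as in the example above):

out = df.where(pd.notnull(df), None)
print(out["a"][1] is None)  # True -- the NaN was replaced by an actual None
print(out["a"].dtype)       # object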

@norisa118 norisa118 removed their assignment Mar 31, 2022
rhshadrach (Member) commented

Thanks for the report! I wonder if apply is coercing the result into a numeric dtype. I would agree this is not expected; further investigation and PRs to fix are welcome!
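
A minimal check of that hypothesis (a sketch; the output is what I would expect on pandas 1.4, not taken from the report):

s = pd.Series([2.0, np.nan, 1.0], dtype=object)
mapped = s.apply(lambda r: None if np.isnan(r) else r)
print(mapped.dtype)  # float64 -- apply re-infers a numeric dtype, so the None comes back as NaN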

@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Jun 28, 2022
@kapiliyer kapiliyer removed their assignment Jul 18, 2022
kapiliyer (Contributor) commented

take

kapiliyer (Contributor) commented Jul 25, 2022

Hi, @ingted. I have been looking over your example, and I think the issue may not be with .astype(). The reason this example produces a Series whose second value is NaN rather than None is that, by default, as @rhshadrach guessed, .apply() has convert_dtype=True. This coerces the result of applying the provided function to whatever dtype it deems best. If you instead change your example to the following, with convert_dtype=False, your desired output is achieved:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False)
df

Including the (now redundant) occurrences of .astype() also works:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False).astype(object)
df

Let me know if I am misunderstanding the issue and there actually is a problem with .astype().

That being said, it is interesting that the conversion chosen for .apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=True) converts the Series to float64 instead of keeping it as object. Will look into this more.

@rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?
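
For comparison, a sketch of what I would expect the two calls to return (not copied from a run; pandas 1.4 behavior assumed):

s = pd.Series([2.0, np.nan, 1.0], dtype=object)

s.apply(lambda r: None if np.isnan(r) else r, convert_dtype=False)
# 0     2.0
# 1    None
# 2     1.0
# dtype: object -- the None is preserved

s.apply(lambda r: None if np.isnan(r) else r)
# 0    2.0
# 1    NaN
# 2    1.0
# dtype: float64 -- the default convert_dtype=True re-infers float64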

kapiliyer (Contributor) commented Jul 25, 2022

In this case, .apply() calls map_infer(), which, after applying the function to all elements, calls maybe_convert_objects() when convert_dtype=True. That function decides to convert the Series to float64 because it sees that the Series consists of floats and a None. From the code, this actually looks intentional. The specific control flow I am talking about is (in pandas/_libs/lib.pyx):

    floats = cnp.PyArray_EMPTY(1, objects.shape, cnp.NPY_FLOAT64, 0)

...

    for i in range(n):
        val = objects[i]
        if itemsize_max != -1:
            itemsize = get_itemsize(val)
            if itemsize > itemsize_max or itemsize == -1:
                itemsize_max = itemsize

        if val is None:
            seen.null_ = True
            floats[i] = complexes[i] = fnan
            mask[i] = True
        elif val is NaT:
            seen.nat_ = True
            if convert_datetime:
                idatetimes[i] = NPY_NAT
            if convert_timedelta:
                itimedeltas[i] = NPY_NAT
            if not (convert_datetime or convert_timedelta or convert_period):
                seen.object_ = True
                break
        elif val is np.nan:
            seen.nan_ = True
            mask[i] = True
            floats[i] = complexes[i] = val
        elif util.is_bool_object(val):
            seen.bool_ = True
            bools[i] = val
        elif util.is_float_object(val):
            floats[i] = complexes[i] = val
            seen.float_ = True

...

    if not seen.object_:
        result = None
        if not safe:
            if seen.null_ or seen.nan_:
                if seen.is_float_or_complex:
                    if seen.complex_:
                        result = complexes
                    elif seen.float_:
                        result = floats

...

        if result is uints or result is ints or result is floats or result is complexes:
            # cast to the largest itemsize when all values are NumPy scalars
            if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
                result = result.astype(result.dtype.kind + str(itemsize_max))
            return result
        elif result is not None:
            return result
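
The same conversion can be exercised directly on the internal helper (a sketch only; pandas._libs.lib is private API, so the exact signature may differ between versions):

import numpy as np
from pandas._libs import lib

arr = np.array([2.0, None, 1.0], dtype=object)
converted = lib.maybe_convert_objects(arr)
print(converted, converted.dtype)  # [ 2. nan  1.] float64 -- the None is folded into NaN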

ingted (Author) commented Jul 25, 2022

Not sure why convert_dtype=False should be required if the user doesn't know about it...

At least a similar operation on a DataFrame doesn't need convert_dtype=False to be specified...

mroeschke pushed a commit that referenced this issue Aug 1, 2022

BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673) (#47780)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673)
* BUG: Series.astype is unable to handle pd.nan (#46377)
rhshadrach (Member) commented

> @rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?

At the current state, there are various places where we treat None as np.nan, and this behavior is consistent with that.

df = pd.DataFrame({'a': [1, 2, None]})
print(df)

#      a
# 0  1.0
# 1  2.0
# 2  NaN

df = pd.DataFrame({'a': [1, np.nan, None], 'b': [2, 3, 4]})
gb = df.groupby('a', dropna=False)
print(gb.sum())

#      b
# a     
# 1.0  2
# NaN  7

That said, I have a hunch that it would be better if pandas were to treat None as a Python object instead of np.nan. However, this issue needs more study for me to feel more certain of this.
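
A small sketch of that distinction (pandas 1.4 behavior as I understand it): today None is coerced to NaN during inference unless the user opts into object dtype explicitly.

pd.Series([1, 2, None]).dtype                      # float64 -- None is coerced to NaN
pd.Series([1, 2, None], dtype=object)[2] is None   # True -- object dtype keeps None as-is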

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022