
BUG: series max aggregate function throws a TypeError. #37574


Closed
3 tasks done
aniaan opened this issue Nov 2, 2020 · 12 comments
Labels
Bug · Needs Triage (Issue that has not been reviewed by a pandas team member) · Reduction Operations (sum, mean, min, max, etc.)

Comments

@aniaan
Contributor

aniaan commented Nov 2, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

series = pd.Series(['2018-01-01', None])
print(series.max())

Problem description

The max function's skipna parameter defaults to True, which means that only ["2018-01-01"] should take part in the comparison, so the max should be 2018-01-01, but an exception is thrown instead.

The main error message: TypeError: '>=' not supported between instances of 'str' and 'float'

I debugged the source code and found that the line nanmax = _nanminmax("max", fill_value_typ="-inf") fills None with -inf by default, which ends up comparing values of different types (str vs. float).
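
For illustration, here is my own sketch of the failure mode, not the actual pandas code path: once the missing entry in the object array is represented as a float (whether np.nan or the -inf fill), the reduction must compare a str with a float, which is exactly the reported error.

# Hypothetical reconstruction: the missing entry is held as a float
# (np.nan, or the -inf fill), so the reduction compares str vs. float.
'2018-01-01' >= float('-inf')
# TypeError: '>=' not supported between instances of 'str' and 'float'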

Expected Output

2018-01-01

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 821f90c
python : 3.8.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+1025.g821f90c23
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.4
sphinx : 3.2.1
blosc : 1.9.2
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.4
fastparquet : 0.4.1
gcsfs : 0.7.1
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.5.1
scipy : 1.5.3
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@aniaan added the Bug and Needs Triage labels Nov 2, 2020
@GYHHAHA
Contributor

GYHHAHA commented Nov 3, 2020

If you want to treat 2018-01-01 as a datetime instead of a string, a specific dtype should be set.

s = pd.Series(['2018-01-01', None], dtype='datetime64[ns]')
s.max()
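
With the datetime dtype the missing value becomes NaT, which max() skips, so this returns Timestamp('2018-01-01 00:00:00').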

@aniaan
Contributor Author

aniaan commented Nov 4, 2020

If you want to treat 2018-01-01 as a datetime instead of a string, a specific dtype should be set.

s = pd.Series(['2018-01-01', None], dtype='datetime64[ns]')
s.max()

Maybe my example is ambiguous, but the bug I intended to report is that for a series of strings, the max function raises an error if the series contains a null value.

@GYHHAHA
Contributor

GYHHAHA commented Nov 4, 2020

String dtype works fine.

s = pd.Series(['2018-01-01', None], dtype='string')
s.max()

It's better for users to clarify what they actually want to do.
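
With dtype='string' the missing value is stored as pd.NA and masked out by StringArray's min/max, so s.max() returns '2018-01-01' here.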

@aniaan
Contributor Author

aniaan commented Nov 4, 2020

If dtype='string' is specified, it does not report an error, since the underlying array is a StringArray.
However, it would be better if missing values were skipped for all dtypes: when we manipulate a DataFrame, e.g. with unstack or stack, the dtype of a series sometimes falls back to object even though the underlying values are still strings, and in that case the max function will still report an error.

data = [
    {"area": 'bj', 'year': 2014, "value": 100},
    {"area": 'bj', 'year': 2015, "value": 101},
    {"area": 'sh', 'year': 2014, "value": 102},
    {"area": 'hz', 'year': 2014, "value": 103},
    
]
df = pd.DataFrame.from_records(data)
df['area'] = df['area'].astype('string')
df
  area  year  value
0   bj  2014    100
1   bj  2015    101
2   sh  2014    102
3   hz  2014    103
df.dtypes
area     string
year      int64
value     int64
dtype: object
pivot = df.groupby(['area', 'year']).agg("mean")
pivot = pivot.unstack(['year'])
# ... Do some specific processing
pivot.stack(['year']).reset_index().dtypes

area      object
year       int64
value    float64
dtype: object

You can see that after this sequence of operations the series has lost its original dtype. So I think a better way to implement the max function here is to really skip the NaN values and let only the elements with actual values participate in the numpy computation. That would be a better experience.
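
(Not from the thread, but a possible stopgap for object-dtype series is to drop the nulls by hand before reducing.)

import pandas as pd

# Stopgap sketch: remove missing values first, so the comparison
# only ever sees strings.
series = pd.Series(['2018-01-01', None])  # object dtype
print(series.dropna().max())  # 2018-01-01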

@aniaan
Contributor Author

aniaan commented Nov 4, 2020

I see that StringArray's max function, _minmax(func: Callable, values: np.ndarray, mask: np.ndarray, skipna: bool = True), computes subset = values[~mask], which filters out NA values, but _nanminmax(meth, fill_value_typ) does not, and I think I'd need @jbrockmendel to look at it!
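
For reference, a simplified sketch of the mask-based approach described above (my own reconstruction, not the actual pandas implementation):

import numpy as np

def minmax_skipna(func, values: np.ndarray, mask: np.ndarray):
    # Drop the positions flagged as missing instead of substituting
    # a +/-inf fill value, so mixed-type comparisons never happen.
    subset = values[~mask]
    return func(subset) if subset.size else None

values = np.array(['2018-01-01', None], dtype=object)
mask = np.array([False, True])
print(minmax_skipna(max, values, mask))  # 2018-01-01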

@jbrockmendel added the Reduction Operations label Nov 4, 2020
@evanaze

evanaze commented Nov 29, 2020

I can confirm this bug, and it is super frustrating. I have a DataFrame with some datetime64 objects in it. Since pandas defaults to making the DataFrame an object dtype, I am seeing the same bug as described above. In my case, I would like to do df.min(axis=1), but it is broken in the same way. It actually returns a series of only NaN values with dtype float64, but if you dig in line by line you can see the error described above.

It is frustrating because I cannot figure out how to fix this without doing some logical statement or calling pd.to_datetime() on every row.

I am struggling to change the DataFrame type to avoid this bug. Even doing df.astype('datetime64[ns]').dtypes returns this:

t1    datetime64[ns]
pt    datetime64[ns]
sl    datetime64[ns]
dtype: object

Is this intended? It looks like this was not as difficult in previous versions of Pandas.

Edit: It looks like the behavior is the same for np.nanmin. I get the same error if I do np.nanmin(df.iloc[0].values). I also get an error with np.isnan(df.iloc[0].values), possibly because I have a pandas NaT object in there.

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The issue is mentioned here and is about a year old.
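
(An aside, not from the thread: np.isnan only supports numeric dtypes, whereas pd.isna understands NaT and None inside object arrays.)

import numpy as np
import pandas as pd

arr = np.array([pd.Timestamp('2020-01-01'), pd.NaT], dtype=object)
# np.isnan(arr) raises the TypeError quoted above
print(pd.isna(arr))  # [False  True]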

Installed versions

INSTALLED VERSIONS

commit : 67a3d42
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-54-generic
Version : #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@jreback
Contributor

jreback commented Nov 29, 2020


@evanaze not even clear what bug you are reporting. we need a full reproducible example with the versions to even attempt a patch.

@jreback
Contributor

jreback commented Nov 29, 2020

@BEANNAN pls edit the top of this issue. I am not sure that the example reflects what the text is saying (e.g. the example uses 'hello', while the text is talking about datetimes).

@evanaze

evanaze commented Nov 29, 2020

@jreback I apologize, I am very new to this so please bear with me.

If anybody has the same problem and sees this, I was able to solve it with df.astype('datetime64[ns]').min(axis=1).
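
(A minimal illustration of that workaround; the column names here are made up.)

import pandas as pd

df = pd.DataFrame(
    {"t1": ["2020-01-01", None], "pt": ["2020-01-03", "2020-01-02"]},
    dtype=object,
)
# Cast every column to datetime64[ns] first; the row-wise min then
# skips NaT entries instead of raising on str/float comparisons.
print(df.astype("datetime64[ns]").min(axis=1))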

@jreback
Contributor

jreback commented Nov 29, 2020

@jreback I apologize, I am very new to this so please bear with me.

If anybody has the same problem and sees this, I was able to solve it with df.astype('datetime64[ns]').min(axis=1).

@evanaze this is not the issue at hand at all. Please open a new issue with a reproducible example.

@aniaan
Contributor Author

aniaan commented Nov 30, 2020

@BEANNAN pls edit the top of this issue. I am not sure that the example reflects the text you are saying (e.g. the example uses 'hello', while the text is talking about datetimes).

It has been corrected. What I meant to express is that a string series containing a None value will report an error during the max calculation.

@mroeschke
Member

Thanks for the report, but this is the expected behavior. series = pd.Series(['2018-01-01', None]) is an object series, not a string series. As mentioned, the dtype should be set explicitly in this case if you want the series to have a string dtype. Closing as expected behavior.
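
To restate the resolution with the thread's own example: under the object dtype the missing value is held as a float and the comparison raises, while under the string dtype it becomes pd.NA and is masked out.

import pandas as pd

pd.Series(['2018-01-01', None], dtype='string').max()  # '2018-01-01'
pd.Series(['2018-01-01', None]).max()  # TypeError: '>=' not supported ...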
