BUG: series max aggregate function throws a TypeError. #37574
Comments
If you want to treat 2018-01-01 as a datetime instead of a string, a specific dtype should be set:

s = pd.Series(['2018-01-01', None], dtype='datetime64[ns]')
s.max()
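For reference (this output note is mine, not part of the original comment), with the datetime dtype the missing value becomes NaT and is skipped by the reduction:

import pandas as pd

s = pd.Series(['2018-01-01', None], dtype='datetime64[ns]')  # None is stored as NaT
s.max()  # Timestamp('2018-01-01 00:00:00'); NaT is skipped because skipna defaults to True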
Maybe there is some ambiguity in my example, but the bug I intend to report is that for a series of strings, the max function raises an error when the series contains a null value.
s = pd.Series(['2018-01-01', None], dtype='string')
s.max()

It's better for users to clarify what they actually want to do.
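As a quick check of that suggestion (added by me, not part of the original comment), the string dtype returns the expected result because the missing value is skipped:

import pandas as pd

s = pd.Series(['2018-01-01', None], dtype='string')  # None is stored as <NA> in a StringArray
s.max()  # '2018-01-01'; the <NA> entry does not participate in the comparison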
If dtype='string' is specified, it does not report an error, since the underlying array is a StringArray.

data = [
{"area": 'bj', 'year': 2014, "value": 100},
{"area": 'bj', 'year': 2015, "value": 101},
{"area": 'sh', 'year': 2014, "value": 102},
{"area": 'hz', 'year': 2014, "value": 103},
]
df = pd.DataFrame.from_records(data)
df['area'] = df['area'].astype('string')
df
df.dtypes
area string
year int64
value int64
dtype: object
You can see that after this series of processing, the series has lost its original type. So I think a better behavior for the max function here would be to really skip the NaN values and let only the elements with concrete values participate in the NumPy calculation. That would be a better experience.
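To illustrate that suggestion (a workaround sketch of my own, not from the original comment), the skip-NA behavior can be emulated on a plain object-dtype series by dropping missing values explicitly before reducing:

import pandas as pd

s = pd.Series(['2018-01-01', None])  # plain object dtype, as in the report
s.dropna().max()  # '2018-01-01'; only strings are compared, so no TypeError is raised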
I see that StringArray's max function uses
subset = values[~mask]
which filters out NA values (Line 934 in e38e987), but the object-dtype path does not.
cc @jbrockmendel to look at it!
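A minimal sketch of the difference between the two code paths (my own simplification, not the actual pandas source):

import numpy as np

values = np.array(['2018-01-01', None], dtype=object)
mask = np.array([False, True])  # True marks missing entries

# StringArray-style reduction: drop the masked values, then compare only strings
subset = values[~mask]
print(subset.max())  # '2018-01-01'

# nanops-style reduction for object dtype: fill missing entries with -inf, then reduce
filled = np.where(mask, -np.inf, values)
try:
    filled.max()
except TypeError as err:
    print(err)  # comparing a str with the float -inf raises TypeError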
I can confirm this bug, and it is super frustrating. I have a DataFrame with some [...] values. It is frustrating because I cannot figure out how to fix this without doing some logical statement or calling [...]. I am struggling to change the DataFrame type to avoid this bug. Even doing [...] does not help.

Is this intended? It looks like this was not as difficult in previous versions of pandas. Edit: it looks like the functionality is the same for [...].
The issue is mentioned here and is about a year old.

Installed versions

INSTALLED VERSIONS
------------------
commit : 67a3d42
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-54-generic
Version : #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
@evanaze it is not even clear what bug you are reporting. We need a fully reproducible example, with versions, to even attempt a patch.
@BEANNAN please edit the top of this issue. I am not sure that the example reflects the text you wrote (e.g. the example uses 'hello', while the text is talking about datetimes).
@jreback I apologize, I am very new to this, so please bear with me. If anybody has the same problem and sees this, I was able to solve it with [...].
It has been corrected. What I wanted to express is that a series of strings containing a None value raises an error during the max calculation.
Thanks for the report, but this is the expected behavior. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
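(The copy-pastable snippet did not survive in this copy of the issue; the following is my reconstruction of a minimal example consistent with the description and error message below, not necessarily the exact original code.)

import pandas as pd

s = pd.Series(['2018-01-01', None])  # object dtype with a missing value
s.max()  # raises TypeError instead of returning '2018-01-01'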
Problem description
The max function's skipna parameter defaults to True, which means that by the time the comparison is actually done the series should effectively contain only ["2018-01-01"], so the max should be "2018-01-01", but an exception is thrown instead.
The main error message:
TypeError: '>=' not supported between instances of 'str' and 'float'
I debugged the source code and found that the line
nanmax = _nanminmax("max", fill_value_typ="-inf")
fills None with -inf by default, which ends up comparing incompatible types (str and float).

Expected Output
2018-01-01
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 821f90c
python : 3.8.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0.dev0+1025.g821f90c23
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.4
sphinx : 3.2.1
blosc : 1.9.2
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.4
fastparquet : 0.4.1
gcsfs : 0.7.1
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.5.1
scipy : 1.5.3
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2