
API / BUG: How do we differentiate between -9223372036854775808 and iNaT? #16674


Open
gfyoung opened this issue Jun 12, 2017 · 5 comments
Labels
Bug; Groupby; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Needs Tests (unit test(s) needed to prevent regressions); Resample (resample method)

Comments

@gfyoung
Member

gfyoung commented Jun 12, 2017

From #3707 (at 028188):

from datetime import datetime
from pandas import DataFrame

import numpy as np

max_int = np.iinfo(np.int64).max
min_int = np.iinfo(np.int64).min

df = DataFrame([max_int, min_int], index=[datetime(2013, 1, 1), datetime(2013, 1, 1)])
assert df.resample("M").apply(np.sum)[0][0] == -1
...
AssertionError

The assertion error occurs because, during the aggregation, pandas checks (in cython_operation in core/groupby.py, via _is_cython_func from core/base.py) whether the data, assuming it is integer, contains any "missing" values both before and after the aggregation. Such values are defined as iNaT = -9223372036854775808. If any are found, we automatically cast the data to float.

This logic is quite prevalent in the codebase, but it seems fraught with pitfalls. For example, what if the output of a computation happens to be -9223372036854775808? Also, what if the user intended to use -9223372036854775808 as a legitimate data point?

Unlikely, sure. But reasonable, absolutely.
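To make the ambiguity concrete, here is a minimal sketch using only NumPy. The helper `looks_missing` is hypothetical and is not the pandas implementation; it only mimics the kind of sentinel comparison described above, assuming the sentinel equals the int64 minimum as quoted in this issue.

```python
import numpy as np

# Assumed to match the iNaT sentinel value quoted in the issue.
INAT = np.iinfo(np.int64).min


def looks_missing(values):
    """Hypothetical sentinel check: flag int64 data that contains INAT."""
    return values.dtype == np.int64 and bool((values == INAT).any())


# NaT stored as datetime64[ns] is the int64 minimum when viewed as integers.
nat_view = np.array(["NaT"], dtype="datetime64[ns]").view("i8")

# A perfectly ordinary int64 array that happens to hold the int64 minimum.
legit = np.array([np.iinfo(np.int64).max, np.iinfo(np.int64).min], dtype=np.int64)

print(nat_view[0] == INAT)   # the sentinel and NaT share one bit pattern
print(looks_missing(legit))  # legitimate data is flagged as "missing"
```

A value-only check like this cannot tell the two cases apart, because the information that distinguishes them (the original dtype) is not part of the comparison.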

@gfyoung gfyoung changed the title from "API: How do we differentiate between -9223372036854775808 and iNaT?" to "API / BUG: How do we differentiate between -9223372036854775808 and iNaT?" on Jun 12, 2017
@jreback
Contributor

jreback commented Jun 12, 2017

So we already do this for transforms: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L2100. It should be easy to extend generally.

@jreback jreback added the Bug, Difficulty Advanced, Dtype Conversions (unexpected or buggy dtype conversions), Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Resample (resample method) labels on Jun 12, 2017
@jreback jreback added this to the Next Major Release milestone Jun 12, 2017
@gfyoung
Member Author

gfyoung commented Jun 13, 2017

I don't follow you here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L2102 seems to suggest otherwise.

@jreback
Contributor

jreback commented Jun 13, 2017

@gfyoung the output type of datetime/timedelta ops must be integers, though I suppose (and this seems to be the case here) that we can have a non-datetimelike op that returns an integer as well. So we may need to sort through these checks to avoid false positives.
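One way to read the suggestion about avoiding false positives is to apply the sentinel check only when the original data was datetime-like. The sketch below is an assumption about what such a guard could look like; `missing_mask` is a hypothetical helper, not a pandas function, and it works on plain NumPy arrays.

```python
import numpy as np

# Assumed to match the iNaT sentinel value quoted in the issue.
INAT = np.iinfo(np.int64).min


def missing_mask(i8_values, original_dtype):
    """Hypothetical guard: treat INAT as missing only for datetime-like data."""
    original_dtype = np.dtype(original_dtype)
    if original_dtype.kind in "mM":  # "m" = timedelta64, "M" = datetime64
        return i8_values == INAT
    # Plain integers: the int64 minimum is an ordinary value, never missing.
    return np.zeros(i8_values.shape, dtype=bool)


dates = np.array(["2013-01-01", "NaT"], dtype="datetime64[ns]")
ints = np.array([INAT, 5], dtype=np.int64)

date_mask = missing_mask(dates.view("i8"), dates.dtype)
int_mask = missing_mask(ints, ints.dtype)
print(date_mask, int_mask)
```

The design choice here is to carry the original dtype alongside the integer view, so the check never has to guess from the values alone.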

@jbrockmendel
Member

I think the example in the OP still fails, but only because of floating-point error, not iNaT ambiguity. IIRC there was a discussion about supporting int64_t directly in libgroupby.group_add, and I think that would solve that particular example.
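The floating-point effect can be reproduced with NumPy alone, independent of pandas. This is only an illustration of why summing the two values from the original post as float64 cannot give -1; it makes no claim about what any particular pandas version does internally.

```python
import numpy as np

max_int = np.iinfo(np.int64).max
min_int = np.iinfo(np.int64).min

as_int = np.array([max_int, min_int], dtype=np.int64)
as_float = as_int.astype(np.float64)

# Exact in int64: (2**63 - 1) + (-2**63) does not overflow.
int_sum = as_int.sum()

# float64 has a 53-bit significand, so 2**63 - 1 rounds up to 2**63
# and the two terms cancel exactly.
float_sum = as_float.sum()

print(int_sum, float_sum)
```

Keeping the aggregation in int64 preserves the exact result, which is consistent with the idea of supporting int64_t directly in the summation routine.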

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added the Closing Candidate label (may be closeable, needs more eyeballs) on Feb 7, 2023
@rhshadrach rhshadrach added the Needs Tests label (unit test(s) needed to prevent regressions) and removed the Closing Candidate label (may be closeable, needs more eyeballs) on Jul 15, 2023
@rhshadrach
Member

I now get the expected result on main; not sure if this needs tests or not.
