Skip to content

EA: Should IntegerNA support Inf, -Inf? #28423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Sep 13, 2019 · 16 comments
Closed

EA: Should IntegerNA support Inf, -Inf? #28423

jbrockmendel opened this issue Sep 13, 2019 · 16 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

It uses a boolean mask with 8 bits under the hood, so it wouldn't be too tough to implement. Doing this would make it easier to keep IntegerNA arithmetic in sync with Series arithmetic.

@TomAugspurger
Copy link
Contributor

inf / -inf are necessarily floating point, right? Or do you have some masked-based approach in mind (which seems hard to do if we want to switch to a bit mask)?

@jbrockmendel
Copy link
Member Author

the approach i have in mind depends on the fact that the mask is uint8. If we switch to an actual bitmask that wouldnt work

@jorisvandenbossche
Copy link
Member

I don't think integer should support this. Inf/-Inf is a float concept, it makes IMO no sense to have that in an integer array conceptually.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 13, 2019
@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

-1 on this as it would be very very odd for an integer array to support a naturaly float (inf).

@jbrockmendel
Copy link
Member Author

Inf/-Inf is a float concept
a naturaly float (inf).

There is nothing float-specific about inf conceptually/mathematically. What you're describing is an artifact of the existing implementations, and getting around those is the point of IntegerNA.

@jorisvandenbossche
Copy link
Member

For me the main point of IntegerNA is to provide support for missing values, not necessarily for adding concepts like infinity or non-existent numbers.

But I also don't really understand the reason. You say that it would make it easier to keep IntegerNA arithmetic in sync with Series arithmetic. Can you explain this? Cases where you end up with infinity (eg division) should normally end up as float anyway?

@jbrockmendel
Copy link
Member Author

You say that it would make it easier to keep IntegerNA arithmetic in sync with Series arithmetic. Can you explain this? Cases where you end up with infinity (eg division) should normally end up as float anyway?

Good question.

In IntegerArray._maybe_mask_result entries with np.inf or -np.inf are masked to be treated as nan. As a result, we get the following behavior:

>>> ser = pd.Series(range(-1, 2))
>>> (ser / 0)._values
array([-inf,  nan,  inf])

>>> ser2 = ser.astype("Int64")
>>> (ser2 / 0)._values
array([nan, nan, nan])

(See also: #27829.) If IntegerArray handled inf and -inf in the mask, then the second case here could avoid casting to float.

@jorisvandenbossche
Copy link
Member

That's a bug I would say. But note that ser2 / 0 is not a Int64 Series, but a float64 Series, so I don't see the relationship with Int64 needing to hold inf to get the same result

@jorisvandenbossche
Copy link
Member

the second case here could avoid casting to float.

Reading your comment fully now. Why would we want to avoid casting to float? I think divisions should always give float? (at least that's the rule in numpy, I think it is good to follow that to have predictable types)

@jbrockmendel
Copy link
Member Author

I think divisions should always give float?

Fair enough, then consider floordiv. ATM the Int64 case gives back an all-zero Int64 result (will have to track down whats causing that) while the Series[int64] result is the same as for the truediv example.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 13, 2019 via email

@jbrockmendel
Copy link
Member Author

I think that making _mask a bitmask is more important than supporting +/- inf.

Agreed.

@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

I had a PR (closed) which created a MaskArray backed by pyarrow which does exactly this (use a bit mask)

@jbrockmendel
Copy link
Member Author

MaskArray backed by pyarrow

If we had pyarrow wouldn't we just use their implementation of IntegerNA?

@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

MaskArray backed by pyarrow

If we had pyarrow wouldn't we just use their implementation of IntegerNA?

in theory yes, but there are too many missing methods

so the storage is useful at this point (not a lot else)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 14, 2019

Personally, I am not sure a bitmask is necessarily worth if without the advantanges of going full pyarrow memory (eg it would make compatibility with future numpy masked dtypes more difficult; will need our own implementation, ..).
But anyway, that's a different discussion (worth having somewhere else), and even without the argument of future use of bitmask, I personally don't think trying to support the concept of infinity is worth the added complexity just for the current mask-based implementation, certainly given that I don't know of any other programming language that does this.

Fair enough, then consider floordiv. ATM the Int64 case gives back an all-zero Int64 result (will have to track down whats causing that) while the Series[int64] result is the same as for the truediv example.

Yep, floordiv is indeed an example of undefined behaviour in this case for ints. The zeros come from numpy (which also raises a warning in that case), I would say that the Series[int§4] behaviour is a bug (maybe that behaviour was done intentional in the past, but I would rather keep the result integer).
But still, I don't think this corner case of floordiv by 0 is worth the added complexity of a mask a non-binary meaning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants