-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: quantile for ExtensionArray #39606
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I think this should work for arbitrary EAs, but we only have tests for sparse, dt64tz, and this adds Period (and some untested cases for dt64tz)
I suppose this will also work for the numeric EAs? Probably best to add a minimal test for it as well (I don't think we have coverage for it)
mask : np.ndarray[bool] | ||
mask = isna(values) | ||
For ExtensionArray, this is computed before calling _value_for_factorize | ||
fill_value : Scalar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also call this na_value
? (like nanpercentile does)
(I know we use both fill_value and na_value in many places, and somewhat interchangeably, but here I personally find na_value clearer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this and decided on fill_value on the theory that "na_value" means "the value that, when we see it, indicates we have an NA" and "fill_value" means "the value that we use when we need to fill in an NA value".
e.g. we get here with fill_value=iNaT, which would be weird to have as an na_value
mask = np.asarray(isna(self.values)) | ||
mask = np.atleast_2d(mask) | ||
|
||
values, fill_value = self.values._values_for_factorize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, _values_for_factorize
is not necessarily correct, but since quantile only makes sense for numeric-like dtypes (and for those dtypes, _values_for_factorize
will most likely be the underlying numbers), that's probably OK.
(eg in geopands, _values_for_factorize
returns serialized bytes, but quantile/percentile doesn't work on bytes anyway, so no direct problem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think quantile makes sense for any ordered types, and IIUC values_for_factorize is supposed to preserve ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC values_for_factorize is supposed to preserve ordering
In theory I think that's not required for factorize (since we don't sort in factorization). In practice not sure if that matters though (can't think of a situation where it wouldn't be orderable).
But looking at my overview in #33276 again, I mentioned there that in principle we shouldn't use _values_for_factorize
directly, but rather use the .factorize()
method. But of course that's not usable for this use case.
hmm Interval breaks. im inclined to leave testing the masked dtypes to @jorisvandenbossche |
marks=pytest.mark.xfail(reason="raises when trying to add Intervals"), | ||
), | ||
pd.period_range("2016-01-01", periods=9, freq="D"), | ||
pd.date_range("2016-01-01", periods=9, tz="US/Pacific"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add eg pd.array(range(9), dtype="Float64")
here?
I think it is important to add some basic testing for the different EAs we have internally. Because right now |
BTW, I think it is perfectly fine to start with something that "works for us", without having it a fully general mechanism for external ExtensionArrays) |
lgtm. thanks. |
As mentioned above, this actually breaks it for nullable float/integer (which were working before) |
I think this should work for arbitrary EAs, but we only have tests for sparse, dt64tz, and this adds Period (and some untested cases for dt64tz)
Part of the idea is to get the code out of the Block method for ArrayManager compat.