EA interface - requirements for "hashable, value+order-preserving ndarray" #33276
Comments
So you could summarize my long post as "let's put back |
Thanks for putting this together. You're definitely right that the threads have become hard to follow. Is there anything in this that is a) currently actionable and b) to the best of your knowledge we have consensus on? For the list of "places where we use _values_for_foo and maybe shouldn't": Short-term we can get rid of the Some more considerations:
Quick Hits:
|
Looking at removing the 4 _values_for_factorize usages in reshape.merge. The natural thing to do would be to use |
Not yet, I think, so let's first discuss a bit more before jumping to PRs (now, properly thinking the requirements through might need some code experiments though).
I would wait a bit with removing those, until we figure out what we want to replace it with. Yes, for our own Index subclasses, we can just redefine the engine values in the subclasses, but that's not a general solution for EAs. And ideally, our own index subclasses that are based on EAs, will be able to use this general solution (they are actually the best test cases for checking a solution).
The hashtable is only implemented for int64, so that is the reason that a cast to int64 is indeed necessary. See eg this (used by pandas/pandas/core/algorithms.py Lines 69 to 77 in 5e21be0
So since
Why do you think this would be the case? Only the non-missing values need to sort correctly within the array, which seems ok with the integer periods?
Hmm, yes. But I think when we implemented that, we didn't look very well at what
Yeah, that was not how it was originally, but changed to fix a bug, apparently (looking back for it, I found discussion about this here #27137 (comment) (and comments below)). There is an open issue about fixing this: #27218 |
Is it actually possible to use So simplified we have something like this right now in
We could do like this:
but the problem here is that the left keys and right keys get "factorized independently" (while in the actual code as above, we are using the same Factorizer object to factorize both left and right keys). |
Discussed in an unrelated PR: I see quite a lot of code that must convert an EA to something we can pass to cython, so kind of like _values_for_argsort but with NaNs filled (though if we always used masks that would maybe be better). E.g. something like vals = vals.to_numpy(dtype=float, na_value=np.nan) for nullable integers, but ideally we would just have a method on EAs (and of course we want a non-object dtype here). |
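A minimal sketch of the conversion mentioned in this comment, using only public pandas API. The variable names are illustrative, and the separate mask is just one way to picture the "if we always used masks" alternative; it is not what any particular cython routine in pandas actually receives.

```python
import numpy as np
import pandas as pd

# Nullable integer array with one missing value.
arr = pd.array([1, 2, None], dtype="Int64")

# Option discussed above: a float ndarray with NaN standing in for missing.
vals = arr.to_numpy(dtype=float, na_value=np.nan)

# Mask-based alternative: keep missingness in a separate boolean ndarray,
# so the values themselves never need a NaN sentinel.
mask = arr.isna()
```

Note that the float route silently changes the dtype (Int64 to float64), which is part of why a dedicated EA method is being asked for here.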
For future reference, the PR is this one: https://github.com/pandas-dev/pandas/pull/33138/files#r404458720, where we need to decide which values get passed into a cython algorithm (in this case a grouped one). |
AFAICT we haven't made any progress on getting rid of our own usages of _values_for_argsort and _values_for_factorize. If we don't expect that to change, we should move them from "optional convenience methods" to required methods. |
I've made some progress on _values_for_argsort. We're down to 3 places in the non-test code where we use it, only one of which is clearly A Problem and will be removed in an upcoming PR (in merge_asof). The remaining two are in
In particular for MaskedArrays this means we can safely just return |
I think for this case is also fine, because in the end this is 1) only used in the EA implementation (so it is similar as how We should probably mainly update the implementers note in the base class that if you implement
Yes, that's correct I think. And in the meantime, you already have documented this better and updated it for the masked array implementation in #45434. So with that, I think the main use case we need to resolve is its usage in joining ( |
Correct. Also one in util.hashing.hash_array |
That's reasonable. We should document that authors that override argsort to not use _values_for_argsort probably also need to implement argmin/argmax. |
Based on a review of the outstanding issues and conversation last week with @jorisvandenbossche and @TomAugspurger: We have a bunch of places where we
Some places that broadly fit this pattern:
The "pseudo" part of "pseudo-round-trip" is because some of the operations in question aren't quite dtype-preserving, but can be e.g. Int64->Float64. The "X" part of "X-preserving way" is bc we need to retain different characteristics in different contexts (could make this configurable by making an EA/EADtype method analogous to groupby.ops.WrappedCythonOp._disallow_invalid_ops)
Finally, we have a usage of
This is a natural enough assumption that I think it's worth codifying. Thoughts on a way forward. I'm optimistic-but-not-certain that we can eventually combine all of these cases into a pattern like:
Shorter-term, some steps that I think will be useful regardless of whether we can ultimately get down to a single _values_for_X:
|
We had some discussion about this at the last dev meeting. @jbrockmendel do you remember what was more or less the conclusion, or can you summarize how you are now thinking to move forward on this?
Indeed, I think we said here that it is probably fine to simply document this de-facto stability requirement (since we are already relying on that anyway, and if this resulted in wrong merges for some downstream EA, we would probably have heard about it). Although I think the more problematic aspect of this usage in the merge code is that EAs are not required to implement Making
This could probably also be done without a deprecation, as it could be relatively easy to inspect the return value to see if it includes a mask or not.
+1 on both.
I mentioned this on the call as well, but an example where
What's the goal / reason for this? |
@jorisvandenbossche thanks for the reminder. I've gone down a non-nano rabbit hole and let things slip through the cracks.
I also recall there being consensus about this on the call.
Yes, this was a compelling example and has really helped clarify my thinking on the topic. Thanks for explaining it. I'll amend the relevant docstrings to make the differences clear.
I prefer the explicit-over-implicit, but don't care enough to make a stink over it.
I find the status quo odd and complicated since AFAICT
Yah I don't think we landed on any great solutions here. IIRC there was some speculation about using factorize itself for [something that might have been merge?] in the same ballpark. Big picture, I think the discussion is going to allow us to more precisely nail down the requirements for values_for_argsort and values_for_factorize, and to do so in a way that makes them potentially reusable (the example I have in mind is reusing values_for_argsort for |
Across a set of issues/PRs (eg #32586, #32673, #33064), there has lately been quite some discussion regarding `_values_for_factorize` / `_values_for_argsort`, the need for roundtrippability, the need for order-preserving, the need for `_ndarray_values`, ...
Those different questions and topics that came up (as far as I kept track, probably not complete! but already too long .. ;)):
In EA: revisit interface #32586, the question was raised what the difference is between `_values_for_argsort` and `_values_for_factorize`, and whether we need both.
Some differences that came up:
The values from `_values_for_factorize` need to be hashable, while the ones from `_values_for_argsort` don't need to be (although this difference is not properly documented). Looking back at ENH: Sorting of ExtensionArrays #19957 from @TomAugspurger, it was mentioned that "sortable" is an easier requirement than what other algos like factorize might need.
Related to this difference is that `_values_for_factorize` needs to return a dtype supported by the hashtables (int64, uint64, float64, object), while `_values_for_argsort` can return any sortable dtype (so also int8, int32, etc).
`_values_for_factorize` also returns an `na_value` sentinel, which means you can encode missing values in a different way than a "missing value" (eg nan in float dtype). `_values_for_argsort`, on the other hand, simply returns one array (I would need to look into the details of how missing values are handled here; it seems they are filtered out in `nargsort`, so it might not matter how they are encoded in the returned array).
Is this sufficiently different to warrant two methods? Probably, with a bit of work, they could be combined in a single method. However, their original purpose was only to help implement `EA.factorize()` and `EA.argsort()`. So for that purpose only, it might not necessarily be worth trying to combine them. And see the last bullet point for a more general "hashable, orderable array".
In addition, we actually also have `_values_for_rank` for Categorical, which we probably should try to get rid of as well -> BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245
I have argued that in general, we should also look at a "masked" version of eg `_values_for_factorize`: having the option to return a (values, mask) tuple in addition to (values, na_sentinel) in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need support for masks in the factorize algos though -> eg ENH/PERF: use mask in factorize for nullable dtypes #33064).
We also had a vaguely defined `_ndarray_values` (API / internals: exact semantics of _ndarray_values #23565), that was recently removed (CLN: remove _ndarray_values #32768). It was eg used in indexing code (index engine, joining indexes), where it was replaced with `_values_for_argsort` (CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467, REF: implement _get_engine_target #32611).
What else can they be used for internally? (`_values_for_factorize` / `_values_for_argsort`)
As mentioned above, `_values_for_argsort` is since recently used for ExtensionIndex joining and engine values. Further, `_values_for_factorize` is used in the general merging code.
However, the initial purpose of `_values_for_factorize` / `_values_for_argsort` was not to be used internally in pandas, but only as a helper to `EA.factorize()` and `EA.argsort()`. So following our current EA interface spec, we should not use them internally (which means we should fix the few cases where we started using them).
The spec about `factorize` is clear that there are two ways to override its behaviour: implement `_values_for_factorize` / `_from_factorized`, or implement `factorize` itself:
pandas/pandas/core/arrays/base.py
Lines 740 to 747 in c47e9ca
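To picture the first of those two routes, here is a toy sketch. It is explicitly not the base.py excerpt referenced above and not a real ExtensionArray subclass: `ToyArray` is a hypothetical stand-in, and its `factorize` simply delegates to the public `pd.factorize` to show how the helper pair could feed a generic implementation.

```python
import numpy as np
import pandas as pd


class ToyArray:
    """Hypothetical stand-in for an EA; NaN marks a missing value."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype="float64")

    def _values_for_factorize(self):
        # Helper 1: a hashable ndarray plus the sentinel that marks missing.
        return self._data, np.nan

    @classmethod
    def _from_factorized(cls, uniques, original):
        # Helper 2: rebuild an array of the same type from the unique values.
        return cls(uniques)

    def factorize(self):
        # Generic implementation built only on the two helpers above.
        values, _na_value = self._values_for_factorize()
        codes, uniques = pd.factorize(values)  # missing (NaN) gets code -1
        return codes, self._from_factorized(uniques, self)


codes, uniques = ToyArray([2.0, np.nan, 1.0, 2.0]).factorize()
```

The second route would skip both helpers and override `factorize` directly, which is why code outside the array cannot assume the helpers are efficient or even meaningful.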
So external EAs are not guaranteed to have an efficient implementation of `_values_for_factorize` / `_values_for_argsort` (they still have the default `astype(object)` implementation).
Fletcher is an example of external EAs that implement `factorize` and not `_values_for_factorize`.
So ideally, for anything factorize/argsort-related, we should actually always call the `EA.factorize()` or `EA.argsort()` methods.
In API: should values_for_factorize and _from_factorized round-trip missing values? #32673, @jbrockmendel questioned whether the `_values_for_factorize` and `_from_factorized` combo should faithfully roundtrip. Currently, they do, but not necessarily when missing values are included.
However, when only considering them as "internal" to the `EA.factorize()` implementation, this question doesn't actually matter. But it does matter when we want to use those values more generally.
I mentioned above that ideally we should use `factorize()` or `argsort()` directly as much as possible and avoid `_values_for_factorize/argsort` (since those methods are the official EA interface). However, there are still cases where such direct usage is not sufficient, and where we actually need some "values".
For example, in the merging/joining code, you can't "just" `factorize()` the left and right array, because then the integer codes of left and right can't necessarily be matched (what those integers mean depends on the uniques that are present).
I think it is clear we have some use case for "ndarray values", so we should think about which use cases need that and what requirements we have for those.
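The merging problem described above can be reproduced with the public `pd.factorize`. This is only an illustration with made-up keys: factorizing the concatenated keys is used here as a simple stand-in for sharing one factorizer between both sides, and is not how the pandas merge code is actually written.

```python
import numpy as np
import pandas as pd

left = np.array(["b", "c", "a"], dtype=object)
right = np.array(["c", "d"], dtype=object)

# Independent factorization: codes are only meaningful relative to each
# side's own uniques, so code 0 means "b" on the left but "c" on the right.
left_codes, left_uniques = pd.factorize(left)
right_codes, right_uniques = pd.factorize(right)

# Joint factorization: one shared set of uniques, so equal keys get equal codes.
combined_codes, shared_uniques = pd.factorize(np.concatenate([left, right]))
shared_left = combined_codes[: len(left)]
shared_right = combined_codes[len(left):]
```

With the shared codes, "c" is 1 on both sides, so matching rows reduces to matching integers. Doing the same for an arbitrary EA needs either a way to concatenate and factorize across both arrays, or access to comparable "values", which is the requirement this issue is trying to pin down.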
@jbrockmendel started to list some requirements here: EA: revisit interface #32586 (comment)
Having read again through all the recent issues and having written up the above, my current take-away points are:
To start, we should maybe put the questions around `_values_for_factorize` / `_values_for_argsort` aside for a moment. In principle they are internal to `EA.factorize()` / `EA.argsort()`, and so we could also remove those methods (if we wanted that) if we "just" require EA authors to implement factorize/argsort directly instead of through the helper _values_for.. .
And if we figure out the general "ndarray values" requirements (see below), we can still come back to this to see if we can actually replace both `_values_for_factorize` / `_values_for_argsort` with this single "ndarray values" interface.
I now think that replacing `_ndarray_values` with `_values_for_argsort` to be able to remove `_ndarray_values` actually didn't solve much. We replaced one vaguely specified property (`_ndarray_values`) with another (`_values_for_argsort` for other purposes than just sorting, as there are currently also no guarantees / requirements outside of sorting specified for `_values_for_argsort`).
Instead, I would focus on figuring out what the requirements are for the "hashable / value preserving ndarray values":
It might be that this ends up being something close to what `_values_for_argsort` or `_values_for_factorize` now are. But I also think that my comment above about the possibility to include a mask in this interface is important for the nullable dtypes.
An alternative that we didn't really mention yet is adding more to the EA interface instead of requiring this "ndarray values" concept. For example, if we want external EAs to have control over joining, we could have an `EA.__join__(other_EA, how) -> Tuple[ndarray[int], ndarray[int]]` that returns indices into the left and right EAs that determine how to join them.
For joining that might be a relatively straightforward interface; for the indexing engine it looks more complex though (but let's first define the use cases).
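To make the shape of that return value concrete, here is a rough sketch of what such a hook could compute for an inner join. Everything in it is an assumption for illustration: `join_indexers` is a hypothetical free function (no `__join__` exists in pandas), it only handles `how="inner"`, it treats missing values as never matching, and it uses a quadratic comparison that would not be acceptable in a real implementation.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def join_indexers(left, right, how: str = "inner") -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical stand-in for the proposed EA.__join__(other_EA, how)."""
    if how != "inner":
        raise NotImplementedError("this sketch only covers how='inner'")
    n_left = len(left)
    # Factorize both sides together so equal values share one integer code.
    codes, _ = pd.factorize(np.concatenate([np.asarray(left), np.asarray(right)]))
    left_codes, right_codes = codes[:n_left], codes[n_left:]
    # Pair every left position with every right position holding the same
    # code; code -1 (missing) is excluded so missing never matches missing.
    matches = (left_codes[:, None] == right_codes[None, :]) & (left_codes[:, None] >= 0)
    left_idx, right_idx = np.nonzero(matches)
    return left_idx, right_idx


left_idx, right_idx = join_indexers(np.array([10, 20, 30]), np.array([30, 10, 10]))
```

The two returned arrays line up pairwise: position `left_idx[i]` on the left joins with position `right_idx[i]` on the right, which is the same contract the proposed signature describes.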