EA interface - requirements for "hashable, value+order-preserving ndarray"

Across a set of issues/PRs (eg https://github.com/pandas-dev/pandas/issues/32586, https://github.com/pandas-dev/pandas/issues/32673, https://github.com/pandas-dev/pandas/pull/33064), there has lately been quite some discussion regarding `_values_for_factorzie` / `_values_for_argsort`, the need for roundtrippability, the need for order-preserving, the need for `_ndarray_values`, ... 

Those different questions and topics that came up (as far as I kept track, probably not complete! but already too long .. ;)):

* In https://github.com/pandas-dev/pandas/issues/32586, the question was raised what the difference is between `_values_for_argsort` and `_values_for_factorize`, and whether we need both.

  Some difference that came up:

  * The main difference might be that the values in  `_values_for_factorize` need to be hashable, while the ones in `_values_for_argsort` don't need to be (although this difference is not properly documented). Looking back at https://github.com/pandas-dev/pandas/pull/19957 from @TomAugspurger, it was mentioned the that "sortable" is an easier requirement than what other algos like factorize might need.  
  Related to this difference is that `_values_for_factorize` needs to return a dtype supported by the hashtables (int64, uint64, float64, object), while `_values_for_argsort` can return any sortable dtype (so also int8, int32, etc). 
  * The return type is different: `_values_for_factorize` also returns a `na_value` sentinel, which means you can encode missing values in a different way than a "missing value" (eg nan in float dtype). While for `_values_for_argsort`, it simply returns one array (I would need to look into the details how missing values are handled here, it seems they are filtered out in `nargsort`, so it might not matter how they are encoded in the returned array).

  Is this sufficiently different to warrant two methods? Probably, with a bit work, they could be combined in a single method. However, their original purpose was *only* to help implement `EA.factorize()` and `EA.argsort()`. So for that purpose only, it might not necessarily be worth trying to combine them. And see the last bullet point for a more general "hashable, orderable array"

* In addition, we actually also have `_values_for_rank` for Categorical, which we probably should try to get rid off as well -> https://github.com/pandas-dev/pandas/issues/33245

* I have argued that in general, we should also look at a "masked" version of eg `_values_for_factorize`. Having the option to return a (values, mask) tuple in addition to (values, na_sentinel) in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need to support of the factorize algos for masks though -> eg https://github.com/pandas-dev/pandas/pull/33064)

* We also had a vaguely defined `_ndarray_values` (https://github.com/pandas-dev/pandas/issues/23565), that was recently removed (https://github.com/pandas-dev/pandas/pull/32768). It was eg used in indexing code (index engine, joining indexes), where it was replaced with `_values_for_argsort` (https://github.com/pandas-dev/pandas/pull/32467, https://github.com/pandas-dev/pandas/pull/32611). 

* *What else can they be used for internally?* (`_values_for_factorize` / `_values_for_argsort`)  
  As mentioned above, `_values_for_argsort` is since recently used for ExtensionIndex joining and engine values. Further, `_values_for_factorize` is used in the general merging code. 

  However, the initial purpose of `_values_for_factorize` / `_values_for_argsort` was *not* to be used internally in pandas, but *only* has a helper to `EA.factorize()` and `EA.argsort()`. So following our current EA interface spec, we should not use them internally (which means we should fix the few cases where we started using them). 
  The spec about `factorize` is clear that there are two ways to override its behaviour: implement `_values_for_factorize`/`_from_factorized`, or implement `factorize` itself:

  https://github.com/pandas-dev/pandas/blob/c47e9ca8b042881d44c9e679a9bf42bacabbb732/pandas/core/arrays/base.py#L740-L747

  So external EAs are not guaranteed to have an efficient implementation of `_values_for_factorize`/`_values_for_argsort` (they still have the default `astype(object)` implementation). 
  Fletcher is an example of external EAs that implement `factorize` and not `_values_for_factorize`.

  So ideally, for anything factorize/argsort-related, we should actually always call the `EA.factorize()` or `EA.argsort()` methods. 

* In https://github.com/pandas-dev/pandas/issues/32673, @jbrockmendel questioned whether the `_values_for_factorize` and `_from_factorized` combo should faithfully roundtrip? Currently, they do, but not necessarily when missing values are included. 
  However, when only considering them as "internal" to the `EA.factorize()` implementation, this question doesn't actually matter. But it *does* matter when we want to use those values more generally.

* I mentioned above that ideally we should use `factorize()` or `argsort()` directly as much as possible and avoid `_values_for_factorize/argsort` (since this is the official EA interface). However, there are still cases where such direct usage is not sufficient, and where we actually need some "values".
  
  For example, in the merging/joining code, you can't "just" `factorize()` the left and right array, because then the integer codes of left and right both don't necessarily can be matched (it depends on the uniques being present what those integers mean). 

  I think it is clear we have *some* use case for "ndarray values", but so we should think about for which use cases we need that and what requirements we have for those. 
  @jbrockmendel started to list some requirements here: https://github.com/pandas-dev/pandas/issues/32586#issuecomment-605082900

---

Having read again through all the recent issues and having written up the above, my *current* take-away point are:

- To start, we should maybe put the questions around `_values_for_factorize` / `_values_for_argsort` aside for a moment. In principle they are internal to `EA.factorize()` / `EA.argsort()`, and so we *could* also remove those methods (if we wanted that) if we "just" require EA authors to implement factorize/argsort directly instead of through the helper _values_for.. .  
  And if we figure out the general "ndarray values" requirements (see below), we can still come back to this to see if we can actually replace both `_values_for_factorize` / `_values_for_argsort` with this single "ndarray values" interface.

- I now think that replacing `_ndarray_values` with `_values_for_argsort` to be able to remove `_ndarray_values` actually didn't solve much. We replaced one vaguely specified property (`_ndarray_values`) with another (`_values_for_argsort` for *other* purposes than just sorting, as there are currnetly also no guarantees / requirements outside of sorting specified for `_values_for_argsort`). 

- Instead, I would focus on figuring out what the requirements are for the "hashable / value preserving ndarray values":
  1) What are the exact use cases we need this for?
  2) What are the exact semantics needed for those use cases? (hashable, orderable, deterministic across arrays, ..)
  3) Do those use cases need roundtripping of those values? 
  4) How would we implement those values for the internal EAs?*

  It might be that this ends up to be something close to what `_values_for_argsort` or `_values_for_factorize` now are. But I also think that my comment above about the possibility to include a mask in this interface is important for the nullable dtypes.

- An alternative that we didn't really mention yet, is adding more to the EA interface instead of requiring this "ndarray values" concept. For example, if we want that external EAs can have control over joining, we could have a `EA.__join__(other_EA, how) -> Tuple[ndarray[int], ndarray[int]]` that returns indices into left and right EAs that determine how to join them. 
  For joining that might be a relatively straightforward interface, for the indexing engine that looks more complex though (but so let's first define the use cases).



	# Implementer note: There are two ways to override the behavior of
	# pandas.factorize
	# 1. _values_for_factorize and _from_factorize.
	# Specify the values passed to pandas' internal factorization
	# routines, and how to convert from those values back to the
	# original ExtensionArray.
	# 2. ExtensionArray.factorize.
	# Complete control over factorization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions