Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support ExtensionArray (and masked EAs speficially) in indexing #39133

Open
jorisvandenbossche opened this issue Jan 12, 2021 · 15 comments
Open
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 12, 2021

One area where the general ExtensionArray support is lacking is to store them in the index (right now they get converted to ndarray when storing in an Index), and have efficient indexing operations (hashtable, index engines).

Several of our long-time extension dtypes have their own subclass (Categorical, Period, Datetime, IntervalIndex), but we need to solve this generally for ExtensionArrays (so it can also work for external EAs), and should also focus on solving it well for the new nullable ExtensionArrays (using masked arrays).

I think there are multiple aspects to this (probably more, but currently thinking of those):

1) Storing ExtensionArrays in an Index object

Supporting to "just" store EAs in the Index and support its methods (and eg falling back to ndarray for the indexing engine) is probably not that hard. There are PRs #34159 (storing EAs in base Index class) and #37869 (having specific ExtensionIndex subclass).

I think both approaches are technically not that different (put the required special cases in if blocks in the base class vs in overridden methods in the subclass), but for me it's mainly a user API design discussion (summarized as "I don't think that end users should see an "ExtensionIndex").

So for this part, we should have that API discussion.

2) A protocol for specifying the values (ndarray) used for indexing operations

While for an initial version of support, we can use np.asarray(EA) as the values passed to the IndexEngine, we should ideally have a general method in the EA interface to be able to specify which values can be used for indexing.

There is some discussion related to this in #32586 and #33276 (eg can we re-use some of the existing _values_for_.. methods? ...). And we can probably continue this aspect over there.

A general method is mostly important for external EAs, because we will probably have special support for our own EAs: the existing Index subclasses already do this, and for the nullable EAs we need to add this (see next section below).

3) Support for masked arrays in the indexing operations (IndexEngine, HashTable, etc)

Specifically to have better support for the nullable dtypes (without needing to convert to ndarray), I think we should look into adding support for using masks in the low-level index operations (IndexEngine, HashTable, etc).

Some (not-index related) hashtable methods like HashTable.unique already have optional support for masks.

I think this is technically the most challenging item, and needs to be worked out more in detail what this work item would entail.

cc @jbrockmendel @TomAugspurger @jreback

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Jan 12, 2021
@jbrockmendel
Copy link
Member

FWIW the blocker on moving forward on #37869 is #37367, which I'm hopeful will get through before too long.

@jorisvandenbossche
Copy link
Member Author

I would like to start discussing this in more detail, especially the third point from the top post:

  1. Support for masked arrays in the indexing operations (IndexEngine, HashTable, etc)
    ...

For me the end goal is that the operations for which we now use the IndexEngine can support masked arrays without any conversion (eg no conversion of nullable integer array to a float array with NaNs or object array, as we are doing now).

One option for this that I see is to add a mask keyword to several of the IndexEngine methods.
For the HashTable, the unique and factorize method already support this. But I suppose that are the easy ones to add this, since those are "one-shot" methods, that don't keep state and can be used for look-up later. Eg HashTable.map_locations would need to store (the information in) the mask, so that a lookup with a missing value can return the correct index (since the missing value itself cannot be stored in the khash table).

This might be quite invasive, though (especially since missing values in the index is often not that useful anyway .. but we still need to support it), so interested to hear your thoughts about this.

@jbrockmendel
Copy link
Member

One option for this that I see is to add a mask keyword to several of the IndexEngine methods. [...] But I suppose that are the easy ones to add this,

This seems reasonable. If the existing IndexEngine/hashtable code can be cleanly updated to handle masks, that'd be great. If that is too messy, then implementing mask-specific hashtable subclasses seems like a fine plan B.

@Hoeze
Copy link

Hoeze commented May 17, 2021

What is the current way to implement an index supporting my custom ExtensionArray?

I noticed that when calling .set_index(my_array).reset_index(), my ExtensionArray always gets converted to a NumPy array of lists and then cannot be converted back.
How do I solve this issue?

@jbrockmendel
Copy link
Member

What is the current way to implement an index supporting my custom ExtensionArray?

At the moment there isn't one. You might be able to kludge something together by patching Index._dtype_to_subclass, but I wouldn't rely on it until we get real support implemented.

@TomAugspurger
Copy link
Contributor

Brief call today, it feels like (uncertain) that it'd be hard to do this at the Hashtable level. Probably handle NA values at the Index or IndexEngine level.

@jbrockmendel
Copy link
Member

I've been looking at this and am currently thinking of 4 areas that need attention

  1. Index.putmask implicitly assumes it is backed by an ndarray, will need to be updated
  2. Not-yet-identified places in the codebase where we implicitly assume that the base Index class is ndarray[object]
  3. Methods which use _get_join_target or _get_engine_target on an other or target arg, i.e. cases where we the EA/NullableIndex is not self
    • For join_target we can route around it pretty easily, like we do for IntervalIndex in Index.intersection not using _inner_indexer.
    • _get_engine_target i don't have a good solution for (that doesn't involve casting to ndarray[object])
  4. methods that access ._engine: [is_monotonic_increasing, is_monotonic_decreasing, is_unique, get_loc, _get_indexer, _get_fill_indexer, memory_usage, __contains__, get_indexer_non_unique]
    • get_loc, _get_indexer, and get_indexer_non_unique
    • is_unique, is_monotonic_increasing, is_monotonic_decreasing could be pretty easily addressed with something like
class NullableIndex(Index):
    @cache_readonly
    def _engine(self):
        return Index(self._data[~self._mask])._engine
   
     @cache_readonly
     def _has_missing(self):
          return self._values._mask.any()

     @cache_readonly
    def is_monotonic_increasing(self):
        if self._has_missing:
            return False
        return super().is_monotonic_increasing

This introduces trouble for the other _engine-using methods bc e.g. engine.get_loc would return indices on the non-missing subset, which would need to be adjusted. This may be simpler to handle in something like a MaskedIndexEngine, not sure yet.

@Hoeze
Copy link

Hoeze commented Aug 13, 2021

Is there any chance to get this feature into v1.4?

@jreback
Copy link
Contributor

jreback commented Aug 13, 2021

@Hoeze PRs are welcome. core can provide review but features are generally done by interested parties

@jbrockmendel
Copy link
Member

I've recently been coming around towards Joris's preference of stuffing EAs into an Index object instead of into ExtensionIndex (xref #43002). But looking at what ExtensionIndex methods can be refactored away I noticed searchsorted

    def searchsorted(self, value, side="left", sorter=None) -> np.ndarray:
        # overriding IndexOpsMixin improves performance GH#38083
        return self._data.searchsorted(value, side=side, sorter=sorter)

#38103 fixed #38083 which reported a 5x slowdown in Series.asof due to sharing the implementation. Will need to see if this happens for other methods.

@jorisvandenbossche
Copy link
Member Author

@jreback are you on board now with storing an EA in the base Index class? (since you expressed objections to that before in #34159 and #37869, but that is what we are now actually planning to do)

@jreback
Copy link
Contributor

jreback commented Nov 16, 2021

well i have always been ok with a proper implementation

@jbrockmendel
Copy link
Member

@phofl did the MaskedIndexEngine PR close this?

@phofl
Copy link
Member

phofl commented Feb 12, 2023

Not sure, since we might want to support for generic eas eventually, I left this open

@jbrockmendel
Copy link
Member

since we might want to support for generic eas eventually

I'm confused by this because we do support generic EAs.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

No branches or pull requests

7 participants