ENH: ExtensionEngine #45514

jbrockmendel · 2022-01-20T21:19:54Z

Follow-up on #43930

Avoid object-dtype cast.

Next/last PR in this sequence introduces NullableEngine.

jreback

lgtm ping on green


jorisvandenbossche · 2022-01-31T11:24:14Z

(given that I asked for the status of this work, a small ping is always handy)

Repeating my question from the original PR (#43930):

Can you give a bit more high-level context on how you approached it?
For example, for the NullableEngine (here only ExtensionEngine for now), you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work on an ndarray that the EA could provide? A potential disadvantage of that approach is that such an ndarray always needs to be materialized (which matters if the EA -> ndarray conversion is costly), while the current way doesn't need that (but also cannot make use of the existing optimized engines). So for the general ExtensionEngine, this is maybe a good approach. But are you also planning to do that for the NullableEngine?
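To make the hash-table question concrete, here is a minimal pure-Python sketch of the two lookup strategies being contrasted. It is not the pandas implementation: `scan_get_indexer` and `hashed_get_indexer` are hypothetical names, a plain list stands in for the index values, and a `dict` stands in for the hash table.

```python
def scan_get_indexer(values, targets):
    """Locate each target by scanning the values: O(len(values)) per target."""
    result = []
    for target in targets:
        position = -1  # -1 marks "not found", as get_indexer does
        for i, value in enumerate(values):
            if value == target:
                position = i
                break
        result.append(position)
    return result


def hashed_get_indexer(values, targets):
    """Build a value -> position table once, then do O(1) lookups.

    Assumes the values are unique, which get_indexer also requires.
    """
    table = {value: i for i, value in enumerate(values)}
    return [table.get(target, -1) for target in targets]


values = list(range(1000))
targets = [500, 999, 5000]
print(scan_get_indexer(values, targets))
print(hashed_get_indexer(values, targets))
```

Both functions return the same positions; the difference is that the scan repeats work for every target, while the table is built once and reused.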

This might also require some performance checking to investigate which approach is preferable. For example, I did a quick test comparing the performance of the ExtensionEngine (doesn't use hash-table) vs ObjectEngine (uses hash-table, which was used as fall-back before this PR) on one use case (get_indexer, which makes use of the hashtable if possible):

In [1]: idx_ea = pd.Index(np.arange(1_000_000), dtype="Int64")

In [2]: idx_object = idx_ea.astype(object)

In [4]: indexer = np.arange(500, 1000)

In [5]: %timeit idx_ea.get_indexer(indexer)
19 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit idx_object.get_indexer(indexer)
122 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So this is a case where the custom ExtensionEngine of this PR actually caused a big slowdown compared to the object engine fallback that we used before for Index[EA].
(Now, I didn't profile this specifically; there might be some bottleneck in an EA method that gets called here, so there might also be other ways to improve the performance instead of using a hashtable.)

(And specifically for the masked arrays I used in this example, this might also be solvable in a specialized NullableEngine.)
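As a rough illustration of how a specialized engine for masked arrays could still use a hash table, the sketch below hashes only the valid entries of the underlying data and tracks the missing position separately. This is an assumption-laden toy, not the NullableEngine from this PR sequence: `MaskedLookup` is a hypothetical name, plain lists stand in for the data and mask ndarrays, and `None` stands in for the NA value.

```python
class MaskedLookup:
    """Toy lookup over a (data, mask) pair; mask[i] is True where missing."""

    def __init__(self, data, mask):
        if len(data) != len(mask):
            raise ValueError("data and mask must have the same length")
        self._table = {}        # valid value -> position
        self._na_position = -1  # position of the first missing entry, if any
        for i, (value, is_na) in enumerate(zip(data, mask)):
            if is_na:
                # Placeholder values under the mask are never hashed.
                if self._na_position == -1:
                    self._na_position = i
            else:
                self._table[value] = i

    def get_loc(self, key):
        """Return the position of key (None means NA); KeyError if absent."""
        if key is None:
            if self._na_position == -1:
                raise KeyError(key)
            return self._na_position
        return self._table[key]


lookup = MaskedLookup(data=[10, 0, 30], mask=[False, True, False])
print(lookup.get_loc(10), lookup.get_loc(None), lookup.get_loc(30))
```

Because the placeholder `0` stored under the mask is skipped, `lookup.get_loc(0)` raises `KeyError` rather than matching the missing slot.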

jorisvandenbossche · 2022-01-31T13:01:13Z

#45652 reported a performance regression which I think is essentially the same as what I showed above (or in any case, also caused by ExtensionIndex being slower than ObjectIndex).

We should probably also add benchmark cases for those new index dtypes / engines (if we don't have them yet; I didn't check).
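A benchmark case along those lines might look roughly like the sketch below, written in the asv style of a class with `params`, `setup` and `time_*` methods. The class name `GetIndexerExtension` and the size `n` are assumptions for illustration (smaller than the 1_000_000 used in the timings above to keep setup cheap); this is not an existing benchmark in the pandas suite.

```python
import numpy as np
import pandas as pd


class GetIndexerExtension:
    # asv runs each time_* method once per entry in params.
    params = ["Int64", "object"]
    param_names = ["dtype"]

    def setup(self, dtype):
        n = 100_000  # assumed size, chosen only for this sketch
        # Same values under both dtypes, so only the engine path differs.
        self.idx = pd.Index(np.arange(n), dtype="Int64").astype(dtype)
        self.target = np.arange(500, 1000)

    def time_get_indexer(self, dtype):
        self.idx.get_indexer(self.target)
```

Parametrizing over the dtype keeps the masked and object-dtype cases side by side, which is the comparison made in the timings above.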

jbrockmendel added 2 commits January 20, 2022 13:17

ENH: ExtensionEngine

2f903e1

Merge branch 'main' into ref-extensionengine

6f33578

jreback approved these changes Jan 23, 2022


jreback added this to the 1.5 milestone Jan 23, 2022

jreback added the ExtensionArray (Extending pandas with custom dtypes or arrays.) and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Jan 23, 2022

jbrockmendel added 4 commits January 22, 2022 17:33

Merge branch 'main' of https://github.com/pandas-dev/pandas into ref-…

f278ba0

…extensionengine

_hasnans->_hasna

201a524

Merge branch 'main' into ref-extensionengine

dc80a20

mypy fixup

7e9994f

jreback merged commit 4248b23 into pandas-dev:main Jan 25, 2022

jbrockmendel deleted the ref-extensionengine branch January 25, 2022 19:26

jorisvandenbossche mentioned this pull request Jan 31, 2022

PERF: Regression with indexing with ExtensionEngine #45652

Closed

3 tasks

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

ENH: ExtensionEngine (pandas-dev#45514)

7060d75

jorisvandenbossche mentioned this pull request Sep 8, 2022

PERF: keep using ObjectEngine for ExtensionArrays for 1.5 #48472

Merged

phofl mentioned this pull request Oct 24, 2022

BUG: inner join with StringDtype multi-indexes gives unexpected result in 1.5.x #49277

Closed

3 tasks
