Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ExtensionEngine #45514

Merged
merged 6 commits into from
Jan 25, 2022
Merged

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jan 20, 2022

Follow-up on #43930

Avoid object-dtype cast.

Next/last PR in this sequence introduces NullableEngine.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ping on green

@jreback jreback added this to the 1.5 milestone Jan 23, 2022
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 23, 2022
@jreback jreback merged commit 4248b23 into pandas-dev:main Jan 25, 2022
@jbrockmendel jbrockmendel deleted the ref-extensionengine branch January 25, 2022 19:26
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 31, 2022

(given that I asked for the status of this work, a small ping is always handy)

Repeating my question from on the original PR (#43930):

Can you give a bit more high-level context on how you approached it?
For example, for the NullableEngine (here only ExtensionEngine for now), you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work an an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the the current way doesn't need that (but also cannot make use of existing optimized engines). So for the general ExtensionEngine, this is maybe a good approach. But are you also planning to do that for the NullableEngine?

This might also require some performance checking to investigate which approach is preferable. For example, I did a quick test comparing the performance of the ExtensionEngine (doesn't use hash-table) vs ObjectEngine (uses hash-table, which was used as fall-back before this PR) on one use case (get_indexer, which makes use of the hashtable if possible):

In [1]: idx_ea = pd.Index(np.arange(1_000_000), dtype="Int64")

In [2]: idx_object = idx_ea.astype(object)

In [4]: indexer = np.arange(500, 1000)

In [5]: %timeit idx_ea.get_indexer(indexer)
19 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit idx_object.get_indexer(indexer)
122 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So this is a case where the custom ExtensionEngine of this PR actually caused a big slowdown compared to the object engine fallback that we used before for Index[EA].
(now, I didn't profile this specifically, there might be some bottleneck in some EA method that gets called here, so there might also be other ways to improve the performance here instead of using a hashtable

(and specifically for the masked arrays as I used in this example, this might also be solvable in a specialized NullableEngine

@jorisvandenbossche
Copy link
Member

#45652 reported a performance regression which is I think essentially the same as what I showed above (or in any case, also caused by ExtensionIndex being slower than ObjectIndex).

We should probably also add benchmark cases for those new index dtypes / engines (if we don't have that yet, didn't check)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants