Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow storing ExtensionArrays in Index #43930

Merged
merged 110 commits into from
Dec 31, 2021

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Oct 8, 2021

Still TODO

cc @jorisvandenbossche @TomAugspurger

xref #39133

@jbrockmendel
Copy link
Member Author

Woops, meant to make this a Draft PR. Is there a way to convert it?

@mzeitlin11
Copy link
Member

Woops, meant to make this a Draft PR. Is there a way to convert it?

Under the reviewers section, there should be a Convert to draft option (have gone ahead and clicked it :)

@mzeitlin11 mzeitlin11 marked this pull request as draft October 8, 2021 17:07
@pep8speaks
Copy link

pep8speaks commented Oct 10, 2021

Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-29 22:43:08 UTC

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Can you give a bit more high-level context on what is now implemented and how you approached it?
For example, for the NullableEngine, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work an an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the the current way doesn't need that (but also cannot make use of existing optimized engines).

For reviewability: I suppose that in theory some of the changes in Index class are mostly for allowing to store an EA in the Index, somewhat independent of the engine changes, and thus could be done separately? But I don't know if that would work in practice of course (eg for an initial PR to store EA in Index, the nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests).

res = np.empty(N, dtype=np.intp)

for i in range(N):
val = values[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be best to extract data/mask from the MaskedArray values and index those inside the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah, i want to see if i can do this in a way that allows for sharing code with ExtensionEngine without a big perf hit.

@jbrockmendel
Copy link
Member Author

The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work an an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the the current way doesn't need that (but also cannot make use of existing optimized engines).

I chose to keep the EA intact instead of casting to ndarray bc most cases (for get_loc, the main concern) can use EA methods (searchsorted, __eq__) directly.

I suspect many cases will be able to use NDArrayBackedExtensionArray, in which case using one of the non-EA engines would be nice. I haven't implemented a way of doing that.

For example, for the NullableEngine, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?

Right, first trying to get everything working, then will look at optimizations. (Also for NullableEngine.get_loc at least I have a different optimization in mind I want to try first).

For reviewability: I suppose that in theory some of the changes in Index class are mostly for allowing to store an EA in the Index, somewhat independent of the engine changes, and thus could be done separately?

Yep, a bunch of my recent PRs have been exactly that. More coming up, e.g. eq_NA_compat fixes problems with Index[object] containing pd.NA (though the function needs to be re-written) so i'll break that off before long. Also the float16 check in FloatingArray and the isna check in testing.pyx.

the nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests).

ATM the NullableEngine isn't a pain point. The remaining test failures are mostly in setops (xref #44000) and value_counts ordering.

pandas/core/arrays/floating.py Outdated Show resolved Hide resolved
pandas/core/arrays/floating.py Outdated Show resolved Hide resolved
pandas/core/arrays/masked.py Outdated Show resolved Hide resolved
pandas/core/indexes/base.py Show resolved Hide resolved
pandas/core/indexes/base.py Outdated Show resolved Hide resolved
pandas/core/indexes/base.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

Updated, whatsnew added, SparseArray behavior deprecated.

@jbrockmendel
Copy link
Member Author

gentle ping; the Index[SparseArray] deprecation isn't the most important, but it'd be nice to do it right.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a small question.

pandas/_libs/lib.pyx Show resolved Hide resolved
pandas/core/dtypes/common.py Show resolved Hide resolved
@jreback jreback added this to the 1.4 milestone Dec 30, 2021
@jreback
Copy link
Contributor

jreback commented Dec 30, 2021

prob can move the todo's to another issue and close the original #39133

@jreback jreback merged commit e750c94 into pandas-dev:master Dec 31, 2021
@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

thanks @jbrockmendel glad to get this in

@jorisvandenbossche
Copy link
Member

@jbrockmendel did you already open a PR with the NullableEngine / ExtensionEngine to review that part?

@jbrockmendel
Copy link
Member Author

did you already open a PR with the NullableEngine / ExtensionEngine to review that part?

No. I've got a branch near-ready, but there is a "values_for_argsort" usage that i want to get rid of before pushing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add interface for defining an ExtensionIndex
5 participants