-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: allow storing ExtensionArrays in Index #43930
Conversation
Woops, meant to make this a Draft PR. Is there a way to convert it? |
Under the reviewers section, there should be a |
Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-12-29 22:43:08 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Can you give a bit more high-level context on what is now implemented and how you approached it?
For example, for the NullableEngine
, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine
seems to work with an actual ExtensionArray. An alternative could be to have it work an an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the the current way doesn't need that (but also cannot make use of existing optimized engines).
For reviewability: I suppose that in theory some of the changes in Index class are mostly for allowing to store an EA in the Index, somewhat independent of the engine changes, and thus could be done separately? But I don't know if that would work in practice of course (eg for an initial PR to store EA in Index, the nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests).
pandas/_libs/index.pyx
Outdated
res = np.empty(N, dtype=np.intp) | ||
|
||
for i in range(N): | ||
val = values[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be best to extract data/mask from the MaskedArray values
and index those inside the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah, i want to see if i can do this in a way that allows for sharing code with ExtensionEngine without a big perf hit.
I chose to keep the EA intact instead of casting to ndarray bc most cases (for get_loc, the main concern) can use EA methods (searchsorted, I suspect many cases will be able to use NDArrayBackedExtensionArray, in which case using one of the non-EA engines would be nice. I haven't implemented a way of doing that.
Right, first trying to get everything working, then will look at optimizations. (Also for NullableEngine.get_loc at least I have a different optimization in mind I want to try first).
Yep, a bunch of my recent PRs have been exactly that. More coming up, e.g.
ATM the NullableEngine isn't a pain point. The remaining test failures are mostly in setops (xref #44000) and value_counts ordering. |
Updated, whatsnew added, SparseArray behavior deprecated. |
gentle ping; the Index[SparseArray] deprecation isn't the most important, but it'd be nice to do it right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just a small question.
prob can move the todo's to another issue and close the original #39133 |
thanks @jbrockmendel glad to get this in |
@jbrockmendel did you already open a PR with the NullableEngine / ExtensionEngine to review that part? |
No. I've got a branch near-ready, but there is a "values_for_argsort" usage that i want to get rid of before pushing. |
Still TODO
28815 not-yet-passing tests (with --skip-slow --skip-db)deprecation cycle(decided we can probably just change)cc @jorisvandenbossche @TomAugspurger
xref #39133