ENH: allow storing ExtensionArrays in Index #43930

jbrockmendel · 2021-10-08T17:02:57Z

closes Add interface for defining an ExtensionIndex #22861
whatsnew

Still TODO

~~288~~15 not-yet-passing tests (with --skip-slow --skip-db)
flesh out tests/extension/base/ea_index.py
~~deprecation cycle~~ (decided we can probably just change)
optimize EAIndexEngine/NullableIndexEngine once BUG: get_indexer_non_unique with np.datetime64("NaT") and np.timedelta64("NaT") #43870 is resolved
EA.putmask implementation ATM is just a stub (and needs tests)

cc @jorisvandenbossche @TomAugspurger

xref #39133

jbrockmendel · 2021-10-08T17:03:40Z

Woops, meant to make this a Draft PR. Is there a way to convert it?

mzeitlin11 · 2021-10-08T17:07:51Z

> Woops, meant to make this a Draft PR. Is there a way to convert it?

Under the reviewers section, there should be a Convert to draft option (have gone ahead and clicked it :)

pep8speaks · 2021-10-10T20:22:02Z

Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-29 22:43:08 UTC

jorisvandenbossche

Nice work!

Can you give a bit more high-level context on what is now implemented and how you approached it?
For example, for the NullableEngine, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work on an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the current way doesn't need that (but also cannot make use of existing optimized engines).

For reviewability: I suppose that in theory some of the changes in Index class are mostly for allowing to store an EA in the Index, somewhat independent of the engine changes, and thus could be done separately? But I don't know if that would work in practice of course (eg for an initial PR to store EA in Index, the nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests).

jorisvandenbossche · 2021-10-13T15:41:00Z

pandas/_libs/index.pyx

+        res = np.empty(N, dtype=np.intp)
+
+        for i in range(N):
+            val = values[i]


Might be best to extract data/mask from the MaskedArray values and index those inside the loop?

yah, i want to see if i can do this in a way that allows for sharing code with ExtensionEngine without a big perf hit.
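
A rough pure-Python sketch of the suggestion above, assuming a masked array can be split into a raw `data` buffer and a boolean `mask` buffer that is True where a value is missing. The names `first_positions`, `data`, and `mask` are illustrative only and are not taken from the PR's Cython code:

```python
import numpy as np


def first_positions(data, mask):
    """Map each distinct non-missing value to its first position.

    Reads data[i] and mask[i] directly inside the loop instead of
    building a scalar (or NA) from the masked array on every iteration.
    Returns (positions, na_position), where na_position is -1 if no
    element is masked.
    """
    positions = {}
    na_position = -1
    for i in range(len(data)):
        if mask[i]:
            # Missing value: remember only the first occurrence.
            if na_position == -1:
                na_position = i
        else:
            # .item() converts the NumPy scalar to a plain Python value.
            positions.setdefault(data[i].item(), i)
    return positions, na_position


# The 0 at position 2 is a placeholder hidden by the mask.
data = np.array([10, 20, 0, 20], dtype=np.int64)
mask = np.array([False, False, True, False])
positions, na_position = first_positions(data, mask)
```

The placeholder stored under a masked slot never enters the mapping, which is the main thing a loop over the two buffers has to get right.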

jbrockmendel · 2021-10-13T20:30:08Z

> The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work on an ndarray that the EA could provide? Although a potential disadvantage of that approach is then that such an ndarray needs to be materialized always (in case this EA -> ndarray conversion is costly), while the current way doesn't need that (but also cannot make use of existing optimized engines).

I chose to keep the EA intact instead of casting to ndarray bc most cases (for get_loc, the main concern) can use EA methods (searchsorted, __eq__) directly.
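
As a hedged illustration of that idea (not the engine code from this PR), a lookup can be written against the array's own `searchsorted` and `__eq__` without converting it to an ndarray first. The helper name `get_loc_sketch` and its `is_sorted` flag are hypothetical, and the return conventions are simplified:

```python
import numpy as np
import pandas as pd


def get_loc_sketch(values, key, is_sorted=False):
    """Hypothetical helper: locate `key` in an ExtensionArray."""
    if is_sorted:
        # Binary search through the array's own searchsorted.
        left = int(values.searchsorted(key, side="left"))
        right = int(values.searchsorted(key, side="right"))
        if left == right:
            raise KeyError(key)
        return left if right - left == 1 else slice(left, right)

    # Linear scan through the array's own __eq__.
    eq = values == key
    if isinstance(eq, pd.api.extensions.ExtensionArray):
        # Masked arrays return a nullable boolean array; treat NA as "no match".
        eq = eq.to_numpy(dtype=bool, na_value=False)
    hits = np.flatnonzero(eq)
    if len(hits) == 0:
        raise KeyError(key)
    return int(hits[0]) if len(hits) == 1 else hits


unsorted_values = pd.array([3, 1, None, 1], dtype="Int64")
sorted_values = pd.array([1, 2, 2, 5], dtype="Int64")

loc_unique = get_loc_sketch(unsorted_values, 3)
loc_dupes = get_loc_sketch(unsorted_values, 1)
loc_sorted = get_loc_sketch(sorted_values, 2, is_sorted=True)
```

The sorted path assumes the array contains no missing values, since a masked array with NA entries cannot be meaningfully sorted for binary search.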

I suspect many cases will be able to use NDArrayBackedExtensionArray, in which case using one of the non-EA engines would be nice. I haven't implemented a way of doing that.

> For example, for the NullableEngine, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?

Right, first trying to get everything working, then will look at optimizations. (Also for NullableEngine.get_loc at least I have a different optimization in mind I want to try first).

> For reviewability: I suppose that in theory some of the changes in Index class are mostly for allowing to store an EA in the Index, somewhat independent of the engine changes, and thus could be done separately?

Yep, a bunch of my recent PRs have been exactly that. More coming up, e.g. eq_NA_compat fixes problems with Index[object] containing pd.NA (though the function needs to be re-written) so i'll break that off before long. Also the float16 check in FloatingArray and the isna check in testing.pyx.

> the nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests).

ATM the NullableEngine isn't a pain point. The remaining test failures are mostly in setops (xref #44000) and value_counts ordering.

pandas/core/arrays/floating.py

pandas/core/arrays/masked.py

pandas/core/indexes/base.py

jbrockmendel · 2021-12-22T00:29:10Z

Updated, whatsnew added, SparseArray behavior deprecated.
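
For readers who want to see the user-facing effect, a small sketch assuming pandas 1.4 or later (the release this PR was milestoned for): an `Index` built from a nullable integer array keeps that array and its dtype rather than casting to object. The variable names are illustrative.

```python
import pandas as pd

# A nullable integer ExtensionArray with one missing value.
nullable_values = pd.array([1, 2, None], dtype="Int64")

# With EA-backed Index support, the Index stores the array as-is.
idx = pd.Index(nullable_values)

dtype_name = str(idx.dtype)
backing_is_integer_array = isinstance(idx.array, pd.arrays.IntegerArray)
```

On older pandas versions the same construction is expected to give a different dtype, so the checks above only hold under the stated version assumption.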

jbrockmendel · 2021-12-30T19:47:41Z

gentle ping; the Index[SparseArray] deprecation isn't the most important, but it'd be nice to do it right.

jreback

looks good, just a small question.

pandas/_libs/lib.pyx

pandas/core/dtypes/common.py

jreback · 2021-12-30T22:00:05Z

prob can move the todo's to another issue and close the original #39133

jreback · 2021-12-31T14:58:49Z

thanks @jbrockmendel glad to get this in

jorisvandenbossche · 2022-01-18T16:15:47Z

@jbrockmendel did you already open a PR with the NullableEngine / ExtensionEngine to review that part?

jbrockmendel · 2022-01-18T16:20:54Z

> did you already open a PR with the NullableEngine / ExtensionEngine to review that part?

No. I've got a branch near-ready, but there is a "values_for_argsort" usage that i want to get rid of before pushing.

jbrockmendel added 2 commits October 8, 2021 09:54

ENH/WIP/POC: EA-backed Index

df9c228

Merge branch 'master' into enh-nullable-index

3952027

mzeitlin11 marked this pull request as draft October 8, 2021 17:07

jbrockmendel added 2 commits October 8, 2021 12:30

BUG: NumericIndex.insert

95e0129

Merge branch 'bug-insert' into enh-nullable-index

c52d459

jbrockmendel mentioned this pull request Oct 8, 2021

REF: Share Index.delete #43934

Merged

jbrockmendel added 2 commits October 8, 2021 18:04

Merge branch 'master' into enh-nullable-index

cf0c171

Merge branch 'master' into enh-nullable-index

0a3b7d7

jbrockmendel mentioned this pull request Oct 10, 2021

ENH: support zoneinfo tzinfo objects #37654

Closed

fix a few more tests; ignoring linting for now

d53377d

jbrockmendel added 8 commits October 10, 2021 14:53

Merge branch 'master' into enh-nullable-index

69fb0bd

Merge branch 'master' into enh-nullable-index

1952cd7

fix test

1ed588a

down to 38 tests failing

34d5dde

Merge branch 'master' into enh-nullable-index

42be4e6

Merge branch 'master' into enh-nullable-index

91b3716

down to 15 tests failing

544d9fe

Merge branch 'master' into enh-nullable-index

e14d6f1

jorisvandenbossche reviewed Oct 13, 2021

Merge branch 'master' into enh-nullable-index

22a0939

jbrockmendel added 4 commits October 15, 2021 21:27

fix value_counts

900978c

Merge branch 'master' into enh-nullable-index

f9c8791

fix map test

c0ae18c

Merge branch 'master' into enh-nullable-index

4ab7f0d

jorisvandenbossche mentioned this pull request Oct 20, 2021

BUG: Index.union with both bools and ints, duplicates #44000

Open

3 tasks

jorisvandenbossche reviewed Oct 20, 2021

jbrockmendel added 5 commits December 21, 2021 15:39

revert

c8072c5

remove no-longer-necessary

7e0ac18

whatsnew

80453b4

deprecation for SparseArray

7231a9e

share _na_value method

f78aa0f

jbrockmendel added 7 commits December 22, 2021 09:51

mypy fixup, npdev catch warnings

453d6ae

mypy fixup

8750248

Merge branch 'master' into enh-nullable-index

0b01bf9

compat for older numpy

d2e0266

Merge branch 'master' into enh-nullable-index

8daa2dc

Merge branch 'master' into enh-nullable-index

cf95f32

Merge branch 'master' into enh-nullable-index

7fba6a2

jreback requested changes Dec 30, 2021

pandas/_libs/lib.pyx

pandas/core/dtypes/common.py

jreback added this to the 1.4 milestone Dec 30, 2021

jreback mentioned this pull request Dec 30, 2021

DEPR: log of deprecations in 1.x (to be removed in 2.0) #30228

Closed

jreback approved these changes Dec 31, 2021

jreback merged commit e750c94 into pandas-dev:master Dec 31, 2021

topper-123 mentioned this pull request Jan 1, 2022

API: hide NumericIndex from public top-level namespace in favor of pd.Index #44819

Merged

jbrockmendel deleted the enh-nullable-index branch January 1, 2022 17:52

jbrockmendel mentioned this pull request Jan 2, 2022

REF: Implement CFTimeIndex via pandas ExtensionArray pydata/xarray#6129

Open

jdmcbr mentioned this pull request Jan 4, 2022

TST: test_value_counts breaking against pandas latest geopandas/geopandas#2287

Closed

jorisvandenbossche mentioned this pull request Jan 31, 2022

ENH: ExtensionEngine #45514

Merged

simonjayhawkins mentioned this pull request Feb 8, 2022

BUG: Pandas 1.4; df.drop method raises an AttributeError when Int64 index is used and index is not unique #45860

Closed

3 tasks

simonjayhawkins mentioned this pull request May 9, 2022

BUG: Error writing DataFrame with categorical type column and "Int" data to a CSV file ("int" works of course) #46812

Closed

3 tasks
