Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: NumericIndex for any numpy int/uint/float dtype #41153

Merged
merged 55 commits into from
Aug 5, 2021

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Apr 25, 2021

This PR presents a unified index type called NumericIndex for numerical indexing, supporting int64/32/16/8, uint64/32/16/8 and float64/32/16 numpy dtypes. This means that indexes with other numeric numpy dtypes than int64, uint64 and float64 will become possible. This index type is therefore functionally a super type of Int64Index, UInt64Index and Float64Index and is intended to eventually replace those.

Comments very welcome, of course, as this is probably quite a fundamental change to Pandas. @pandas-dev/pandas-core.

Status

Missing is especially:

  • docs and better doc string
  • marking NumericIndex as experimental in 1.3?

Deprecation cycle

ThIs PR is intended to be fully backwards compatible.

This means that public operations that currently return e.g. a Int64Index, will continue working unchanged. My view is that NumericIndex should replace Int64Index, UInt64Index and Float64Index in Pandas 2.0 and those index types should therefore be removed by 2.0. They should therefore get a deprecation warning in timely manner, before 2.0.

Internally there are some use of especially Int64Index, for example in groupby and join ops. That could be replaced with NumricIndex straight away, giving a performance benefit in cases where casting to higher dtypes can be avoided.

Outside of scope of this PR

I do not intend to add support for nullable numeric dtypes in this PR. This is partly for technical reasons (there isn't any indexing engines for those types currently) but also to limit the scope of this PR. I'm not opposed to add support for those dtypes later, but I think adding them here would be too much for this PR. Also, it might be more sensible to have extension types in a different index class than numpy types, so probably better to look into that seperately.

Actual deprecation of Int64Index, UInt64Index and Float64Index should also be in a later PR IMO. If Pandas 2.0 is far in the future, it may be better holding off deprecating them until we see a v2.0 in the horizon instead of having a very long deprecation cycle for a relatively often used part of Pandas.

@toobaz
Copy link
Member

toobaz commented Apr 25, 2021

Probably stupid (and maybe already discussed) question, but I see this as two steps:

  1. ensure that our code supports new dtypes/fix code that doesn't; unify code in a single class
  2. change the API so that only one class is exposed (as opposed to e.g. automatically generating classes for each type, including the ones that previously weren't there)

... point 1) requires no deprecation. Do we actually need 2)? I mean: I do see we declutter the namespace; I'm just not sure this is the best thing for our users.

@topper-123
Copy link
Contributor Author

topper-123 commented Apr 25, 2021

Hey @toobaz.

I haven't discussed this with anyone, wasn't sure if I'd end up making this work or not :-)

Wrt. your first point, I'm not sure I understand, but I've seperated out fixes to bugs in the code base (in #41063, #41073 (already merged) + #41132 and #41121, (not yet merged, present in this PR, so this PR will (hopefully ) pass the CI)).

To you means any other part that should be seperated out? The changes in pandas/_libs/join.pyx fixes a bug (or a bug-like) that could be seperated out, but I don't see any other ones that should be split?

Wrt. your second point, do you mean we should hide Int64Index etc. from the public api after this is merged? Especially Int64Index is used a lot, often implicitly, so I'm worried hiding it could also confuse, if a used index type isn't accecable directly. But I'm open to discussing how to proceed with :-)

@toobaz
Copy link
Member

toobaz commented Apr 26, 2021

Wrt. your second point, do you mean we should hide Int64Index etc. from the public api after this is merged?

Sorry, I was rather suggesting the opposite: not deprecating Int64Index, actually providing in the API Int32Index and friends (to be automatically generated as subclasses of NumIndex), possibly considering NumIndex as a non-instantiable class.

My point is just: it's great to 1) have all numeric types supported by a same class and 2) supporting new types. But we can do this in a fully backwards compatible (and yes, I understand this PR is backwards compatible, I'm referring more to the longer term plan), and maybe simpler for users, way.

By the way, I think that what I'm proposing is mostly compatible with the current PR, with the possible exception of the docs and of the places where you instantiate a NumIndex (which would be replaced with something like NumIndex.new_for_type(...); but if my point of view is shared, ideally the PR could already replace all existing indexes with the NumIndex version.

(But it's an important API decision, so before getting into these details it's probably good to collect other devs' advice)

@topper-123
Copy link
Contributor Author

Ok, thanks for clarifying.

My own opinion is that I like better a common container type for content that is similar, and more specific differentiation would be done by looking at the dtype attribute. That is also mostly what pandas does, e.g.

>>> pd.array([1,2,3], dtype="float64")
<PandasArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64

Though we also have a pd.arrays.FloatingArray, so the array interface is not completely consistent. However there is no Float64Array specifically for float64 dtype, for example.

I think I will wait with further development on this PR until other devs chime in.

@jorisvandenbossche
Copy link
Member

Sidenote: you shouldn't look too much at PandasArray as pre-existing design / patterns in pandas, as that is only a "temporary" EA class in the sense that it is never actually used internally (it exists because we needed some EA that store numpy arrays to have a consistent return type in Series.array attribute)
In general for the arrays, we indeed have dtype-specialized classes, because there it is the array class that can customize behaviour in the ExtensionArray interface (although this is also not strictly needed: we already combine the different bit-sizes in a single FloatingArray or IntegerArray, and there are external packages that use a single array class for all their dtypes).

@jorisvandenbossche
Copy link
Member

In general, I am fully +1 of supporting all numeric dtypes as Index, instead of limiting it to the 64-bit version.

For how to expose this to the user, is it correct to summarize the different options as:

  1. Have a single, public NumIndex class for all numeric index types (as proposed here)
  2. Have all dtype-specific subclasses (Int8Index, Int16Index, Int32Index, etc), and those can still be backed by a single internal NumIndex and automatically generated from that, implementation-wise (as @toobaz mentions).
  3. Use the existing, general-purpose Index class for this (which is now only used for object type).

And in both the first and third option, we can initially ensure that the existing Int64Index/Uint64Index/Float64Index are still being used for the 64-bit versions for backward-compatibility, but those could be deprecated and removed long-term.


Personally, and speaking from a user perspective, and specifically for the numeric indexes, the subclasses don't add that much value. For example, the Int64Index does not have any additional public method compared to the base Index class (which is of course different for something like CategoricalIndex or DatetimeIndex).
Therefore, I think the third option would be interesting to explore, if this is possible from an implementation point of view (what is now overridden in numeric sublcasses would need to be added to the base class).

@topper-123
Copy link
Contributor Author

That is a good summary of the options, @jorisvandenbossche.

Wrt. option 3, I agee that would be the best interface, thought I have two reservations:

  1. I'm nor sure how easy it will be (but I can take a look).
  2. Backwards compatibillity. For example, pd.Index(arr, dtype="int16") currently returns a Int64Index. When a user checks the index type by doing isinstance(idx, Int64Index) they'll get a unexpected result after that change. Of course they should have done these checks by doing is_integer_dtype(idx.dtype) (or issubclass(idx.dtype.type, np.int64) for more specific dtype checks), but in many cases they probably won't have done it correctly.

What are your thoughts on backwards compatability, if we join the numeric index dtypes into Index?

@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

I think option 3, while if we started over might be fine is almost impossible from a back compat perspective. Option 2 is thus appropriate. We could possibly move to option 3 in a future version, but way better to do this in multiple steps.

@jreback jreback added Constructors Series/DataFrame/Index/pd.array Constructors Dtype Conversions Unexpected or buggy dtype conversions Enhancement Index Related to the Index class or subclasses labels Apr 26, 2021
@toobaz
Copy link
Member

toobaz commented Apr 26, 2021

I also think having less classes is not necessarily good for users - while I accept that my proposal of creating new ones looks inelegant, I do think that at least generic Index and NumIndex are different enough (e.g. performance wise, even if we were able to merge the code bases in a single class) to keep them separated.

EDIT: unless the choice was to really have only one class, Index, including dates, intervals... but I guess this is not an option considered.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 26, 2021

2. Backwards compatibillity. For example, pd.Index(arr, dtype="int16") currently returns a Int64Index. When a user checks the index type by doing isinstance(idx, Int64Index) they'll get a unexpected result after that change.

I think the same issue arises when using NumIndex ? (also in that case isinstance(idx, Int64Index) won't work anymore).
But I assume that the idea is that pd.Index(..., dtype="int16") would still return Int64Index for now, and you would need to use pd.NumIndex explicitly for getting an index with int16 values?
That indeed gives backwards compatibility, but also limits quite a bit the usage of this improvement. For example, also doing a set_index with an int32 column would still convert the data to int64.

There might also be a somewhat "in-between" option: already implement it in the Index class, but still have a very shallow Int64Index subclass to ensure such isinstance checks keep working.
We could then in theory even use the Int64Index class to store all integer bit-sizes, so that we could already stop converting to int64, while not breaking code that expects the Int64Index class (although the index.dtype would still change, but I would assume less people to strictly rely on that being "int64")

I do think that at least generic Index and NumIndex are different enough (e.g. performance wise, ...) to keep them separated.

@toobaz how are they different for a user? The only difference would be the value of .dtype (the numeric index has no additional methods). And the fact that there is a performance difference is also mostly determined by that dtype (of course, since we only use Index for object dtype right now, users might indirectly also relate it to the class).

(it's true that it would be somewhat inconsistent with still having dtype-specific sublcasses for DatetimeIndex, IntervalIndex etc. But having a single NumericIndex that covers several dtypes is already inconsistent with that pattern in itself as well .. so personally I don't see that as a problem)

@toobaz
Copy link
Member

toobaz commented Apr 26, 2021

And the fact that there is a performance difference is also mostly determined by that dtype

True, but anyway, the consequence is that Index is a clear signal to users that they are using unoptimized code, and if they are handling numbers, they could do much better!

By the way, an Index could contain (at least in perverse enough codebases) only intervals, only dates, and combinations thereof. The "message" above is general.

So for me seeing and Index class immediately means two things:

  • "beware: even if I contain data types that pandas is able to handle well, I'm not actually handling them well"
  • "beware: even if I contain data that appears of a given type, I may hide surprises (e.g. I'm thinking about the string "NULL" I found lost somewhere in an Index of mostly integer numbers)"

True, both of these things could be conveyed by the Index.dtype, it just seems less clear to me, and it seems inconsistent that other types of data do have their own index classes. And "OK, we only merge numbers in Index because... they are numbers" does not look to me like a good reason, because Index can contain much more than numbers.

@topper-123
Copy link
Contributor Author

I've made a updated version, where Int64Index etc. are now thin child classes of NumIndex.

NumIndex and NumericIndex should probably be combined also, but RangeIndex is a child of NumericIndex, so I'll have to untangle that, before combining.

I'll look into the architecture issue later tonight.

@topper-123
Copy link
Contributor Author

topper-123 commented Apr 27, 2021

Thanks for the comments. My views:

I think the same issue arises when using NumIndex ? (also in that case isinstance(idx, Int64Index) won't work anymore). But I assume that the idea is that pd.Index(..., dtype="int16") would still return Int64Index for now, and you would need to use > pd.NumIndex explicitly for getting an index with int16 values?

Yes, that was my idea. That would keep this addition 100 % backwards compatible. Users who want e.g. int16, would do the explicit pd.NumIndex(..., dtype="int16").

That indeed gives backwards compatibility, but also limits quite a bit the usage of this improvement. For example, also doing a set_index with an int32 column would still convert the data to int64.

I agree, but that is a trade-off between keeping compatability and improving the architecture. Not easy, but I think this is a relatively large change, so my preference is keeping backward compat untill pandas 2.0.

There might also be a somewhat "in-between" option: already implement it in the Index class, but still have a very shallow Int64Index subclass to ensure such isinstance checks keep working.
We could then in theory even use the Int64Index class to store all integer bit-sizes, so that we could already stop converting to int64, while not breaking code that expects the Int64Index class (although the index.dtype would still change, but I would assume less people to strictly rely on that being "int64").

This would be easy to implement. Most people wouldn't be affected, but there will always be someone who do checks by doing e.g. issubclass(idx.dtype.type, np.int64), which would break. Do we accept that?

So if we change how Int64Index or the base Index work, it would have to be a concius decision to accept some breakage. My own view is that I prefer keeping backwards compat, as I mentioned above. I wouldn't mind accelerating the release of pandas 2.0, so we can accept breaking changes to the index classes, but I suspect others may not agree with that.

BTW, I like the suggestion from @jreback to do this stepwise. That would also make working on this more "relaxed", (The exact public API wouldn't have to be finished in this one PR).

For example, the NumIndex class doesn't actually need to be in the public namespace right away. Could we agree that this PR is internal only (not available outside of pd.core)? That would make it available for e.g. merging and groupbys internally, which would be a win already. The public API could come in a later PR, either as:

  1. new Int32Index etc. classes,
  2. more flexible existing classes (Int64Index can hold int32 dtypes etc.),
  3. a new public NumIndex class or
  4. merge this functionality into the base Index?

No matter which choice is made, this PR will be a step in the right direction.

I could open an issue, where we further discuss the choice for public API.

@topper-123
Copy link
Contributor Author

BTW, I still need to look at some tests for Int64Index etc., if they should also be used for NumIndex. I'll remove the draft marker, when that is finished.

@jorisvandenbossche
Copy link
Member

For example, the NumIndex class doesn't actually need to be in the public namespace right away. Could we agree that this PR is internal only .... The public API could come in a later PR

+1

@topper-123
Copy link
Contributor Author

topper-123 commented May 2, 2021

Parametrized simple_index in tests.indexes.numeric. to get better test coverage.

ATM this PR includes #41254 and #41263, in order to pass the CI. Will rebase again, when those PRs are accepted seperately.

PR still not ready for commiting, as some more tests still to include this new index.

@topper-123
Copy link
Contributor Author

I’ve updated and rebased.

@topper-123
Copy link
Contributor Author

Ping.

@jreback jreback merged commit fcadfb8 into pandas-dev:master Aug 5, 2021
@jreback
Copy link
Contributor

jreback commented Aug 5, 2021

thanks @topper-123 really nice, sorry for the delay. if you can rebase the doc one we can merge that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Dtype Conversions Unexpected or buggy dtype conversions Enhancement Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index with dtype int32
6 participants