Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Indexes with any numpy int/uint/float dtype #41272

Closed
topper-123 opened this issue May 2, 2021 · 21 comments · Fixed by #42706
Closed

ENH: Indexes with any numpy int/uint/float dtype #41272

topper-123 opened this issue May 2, 2021 · 21 comments · Fixed by #42706
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Index Related to the Index class or subclasses
Milestone

Comments

@topper-123
Copy link
Contributor

I've made a (draft) PR in #41153 that implements an index class for all the normal numpy numeric dtypes (int64/32/16/8, uint64/32/16/8 an dfloat65/32). There was some discussion in that PR how the public API for this should be, so I'm opening this issue for discussing that.

My plan is to get #41153 merged when its ready, possibly without any public-facing API changes. The public-facing changes would then come afterwards, and after there is an conclusion on how the API should be.

Summary of the options

As I see it the options for the API is:

Nr. Option Is backwards compat. complexity
1. Have a single, public NumericIndex class for all numeric index types Yes Easy
2. Have dtype-specific subclasses (Int8Index, Int16Index, Int32Index, etc), and those can still be backed by a single internal NumIndex Yes Easy
3. Use the existing, general-purpose Index class for these numeric numpy dtypes No Harder
4. Use the existing, numeric index classes for these numeric numpy dtypes, so e.g. Int64Index can take int32 etc. No Easy

Options 1. and 3. would mean that the existing Int64Index, UInt64Index and Float64Index would be deprecated and removed in pandas 2.0, because their functionality would bw covered by the other index classes.

Option 2. would increase the number of numeric index classes.

Options 4. would extend the functionality of the existing numeric index classes.

So there are quite a few possibilities. Hopefully we can come to a common conclusion.

@pandas-dev/pandas-core

@topper-123 topper-123 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2021
@topper-123
Copy link
Contributor Author

topper-123 commented May 2, 2021

My opinion:

The current index types are very ingrained in pandas, and I think we should prioritize backwards compatibility in pandas 1.x, even at the possible cost of having a worse API short-term. The "optimal" API could then be implemented for pandas 2.0.

In that light I would prefer different short-term and long-term choices:

Short-term: Implement a seperate NumericIndex, i.e. keeping Int64Index etc. working unchanged, but deprecate them in due time for pandas 2.0.
Long-term: For pandas 2.0, integrate the numeric indexes into the base Index and removing Int64Index, NumericIndex etc.

I think new index classes should only be implemented for differing API and there will not be API differences between working on the various indexes with different numpy numeric dtypes. So having a index class for each dtype subtype is IMO excessive, and in e.g. in Float32Index(..., dtype="float32") the dtype parameter would be superflous and doesn't add value to the API because the dtype is encoded in the index class.

@topper-123 topper-123 added API Design Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2021
@toobaz
Copy link
Member

toobaz commented May 3, 2021

Short-term: Implement a seperate NumericIndex, i.e. keeping Int64Index etc. working unchanged, but deprecate them in due time for pandas 2.0.

I agree, with the existing numeric indexes being actually implemented as NumericIndex

Long-term: For pandas 2.0, integrate the numeric indexes into the base Index and removing Int64Index, NumericIndex etc.

I still fail to see why e.g. NumericIndex should be integrated inside Index but not RangeIndex, IntervalIndex... maybe there is a clear (and easily documentable) API distinction... but I'm failing to see it, and hence I don't like this idea. Sure, one could argue that IntervalIndex is different because its elements are (in some sense) "not scalar", but already the distinction with RangeIndex is more in terms of implementation... and what about strings?!

In general, while I understand that aggregating in Index more functionalities allows to clean the API, i think it's not a consistent approach in particular now that types and implementations are becoming more modular. What about alternative (even third-party) indexes that store numbers in different ways (like EA for values)? What becomes the rule of "what Index does?" if it's not just about object, nor about "indexing with numbers" (RangeIndex does too and possibly more in the future)?

@jbrockmendel
Copy link
Member

Use the existing, numeric index classes for these numeric numpy dtypes, so e.g. Int64Index can take int32 etc.

My kneejerk reaction is very negative on the idea of Int64Index holding anything other than int64.

Use the existing, general-purpose Index class for these numeric numpy dtypes

In 2.0, if we could get combine all Index subclasses back into Index, that'd be pretty neat. Failing that, having only a handful of dtypes return the base class seems klunky.

Have dtype-specific subclasses (Int8Index, Int16Index, Int32Index, etc), and those can still be backed by a single internal NumIndex

The proliferation of classes is unfortunate, but this is the most straightforward option.

Have a single, public NumericIndex class for all numeric index types

I'd be fine with this.

@jorisvandenbossche
Copy link
Member

I still fail to see why e.g. NumericIndex should be integrated inside Index but not RangeIndex, IntervalIndex... maybe there is a clear (and easily documentable) API distinction... but I'm failing to see it

@toobaz there is indeed a very clear API distinction between Index and IntervalIndex: IntervalIndex has additional public methods and properties compared the base Index class, and it wouldn't make sense to add those interval-specific methods and properties to the base class.
The different numeric Index classes (Int64Index, UInt64Index, Float64Index) have no such API distinction with the base Index class. So basically swapping NumericIndex with Index doesn't have any user-visible impact, except for the repr being "Index([..., dtype='int64')" instead of "NumericIndex([...], dtype='int64')".

What about alternative (even third-party) indexes that store numbers in different ways (like EA for values)?

IMO that's actually a good reason to use the base Index class for other dtypes than just object dtype. That way, a third-party EA can actually be used as Index using the Index class (currently this is not possible, and we convert the EA to a numpy array when storing it in the index)

@toobaz
Copy link
Member

toobaz commented May 14, 2021

@toobaz there is indeed a very clear API distinction between Index and IntervalIndex: IntervalIndex has additional public methods and properties compared the base Index class, and it wouldn't make sense to add those interval-specific methods and properties to the base class.

You have a point, and one that in principle is also easy to document ("it's a separate class iff it provides additional methods").

I'm a bit afraid (in terms of implementation, but also of users expectations) of deciding that whether a given type/storage mechanism is supported by Index has nothing to do with the code actually implementing it below the surface. But it's true that as long as we state that any type/storage mechanism will be supported only via EA, it's just a matter of supporting EA.

a third-party EA can actually be used as Index using the Index class (currently this is not possible, and we convert the EA to a numpy array when storing it in the index)

I'm a bit confused by what pd.core.indexes.extension.ExtensionIndex is for. Isn't it able to keep the EA as such?

Notice that if Index will need to support EA, some Index subclasses might also have to (think about an EA providing an alternative implementation of datetimes, or intervals). But maybe this is not too difficult by doing inheritance right.

@jorisvandenbossche
Copy link
Member

I'm a bit confused by what pd.core.indexes.extension.ExtensionIndex is for. Isn't it able to keep the EA as such?

It's (currently) only meant as a base class for a few of our own existing Index subclasses that are backed by an ExtensionArray (i.e. CategoricalIndex, IntervalIndex, Datetime/Timedelta/PeriodIndex), but you currently can't use it for any EA (it could be extended for this use case, or we could simply use Index for that, which relates to point 1) in #39133)

@toobaz
Copy link
Member

toobaz commented May 14, 2021

you currently can't use it for any EA

I see. And now that I think about it,

some Index subclasses might also have to (think about an EA providing an alternative implementation of datetimes, or intervals). But maybe this is not too difficult by doing inheritance right.

Inheritance must be done right, because we might legitimally want to have different subclasses (i.e., with specific methods) to support EA. So whether or not we merge as much as possible in Index, we would like Index and any of its subclasses to support EA.

Still not a fan of the merge, but I finished my arguments ;-) @jorisvandenbossche your general API distinction makes sense, I still think it complicates the codebase but it might be worth it. Bonus points if Index learns to manage non-predefined numpy dtypes (e.g. fixed length strings - clearly not as an alternative to #35169 ).

@jbrockmendel
Copy link
Member

there is indeed a very clear API distinction between Index and IntervalIndex: IntervalIndex has additional public methods and properties compared the base Index class, and it wouldn't make sense to add those interval-specific methods and properties to the base class.

If we wanted to get down to a single Index class, one option for these methods/properties would be to hide them behind an accessor.

@toobaz
Copy link
Member

toobaz commented May 15, 2021

If we wanted to get down to a single Index class, one option for these methods/properties would be to hide them behind an accessor.

I guess we would loose in terms of simplicity of implementation and syntax... but I see the possible important gain in writing all dtype-specific attributes/methods for Series which work on Index without any further effort (and vice-versa) by just registering the accessor on both.

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 4, 2021

Having thought about this a bit, I think having the base Index hold numeric dtypes is not possible short-term, as the return value of pd.Index(arr, dtype=“int32”) is a Int64Index. That can’t be changed without breaking the API or adding a key word, which is ugly IMO (requires the keyword peppered around the pandas code base).

I suggest we short term (pandas 1.x) go with NumericIndex and for pandas 2.0 we can merge that into Index(or not, if we don’t like that and linke to keep NumericIndex).

@jorisvandenbossche
Copy link
Member

I would like to reopen this issue to discuss the "public API" aspect of it (and sorry in advance for the long post coming). To be clear I fully support the goal of being able to use all numeric dtypes in the index (instead of the int64/uint64/float64 we have now), I am only questioning how want to expose this in the future in the public API.

The original PR implementing NumericIndex had quite some discussion about the different ways this could be implemented, which is summarized in the table at the top post of this issue. But at some point in that PR we decided to move forward with the implementation of that PR without directly adding it to the public namespace, because we could decide on how to publicly expose it later (at the bottom of #41153 (comment)). But then later we actually added NumericIndex to the public namespace in #42706 without any further discussion. Anyway, how it happened is not important (I also didn't react to the last comment about this above in this issue), but I was thinking again about this while having to deal with the deprecation warning in downstream packages, and I would like to reconsider this before the final 1.4 release.


The table in the top post lists 4 options. The relevant ones are:

  • Option 1 is about adding a single NumericIndex class, which is what is in master right now.
  • Option 3 is about using the generic Index class for this instead.

For both options, the existing Int64Index, UInt64Index and Float64Index would be deprecated and removed in pandas 2.0.

However, the table lists Option 1 as backwards compatible, while Option 3 as not backwards compatible. And I think that's the reason @topper-123 went with option 1, according to the last comment #41272 (comment) above.

But personally, I am not convinced there needs to be a difference in backwards compatibility between both options.

As far as I understand, the reason for indicating Option 3 using Index as less backwards compatible, is that doing pd.Index([1, 2, 3], dtype=int8"), which currently casts to int64 and returns an Int64Index, would then no longer return an Int64Index, but an Index with dtype int8.
While with the NumericIndex class, only directly usage of this new class will be able to actually instantiate an int8 index object (while Index calls would continue to return the current subclasses), making this (for now) fully backwards-compatible.

But I would like to argue that we can also keep Index([1, 2, 3], dtype="int8") returning an Int64Index for now when going for Option 3.
We can keep the breaking change of changing this return value for pandas 2.0, which is similar for option 1 with NumericIndex, as also for that option Index([1, 2, 3], dtype="int8") will start returning a different class in pandas 2.0 (so this is no different between both options).

It might be a bit more complex internally for the implementation compared to the separate NumericIndex, but I think still doable, and IMO in that case we should mainly consider the user API to decide on this. What's the value for the user of this NumericIndex class?

For me, some reasons to prefer the base Index for this:

I think the only actual value that I see NumericIndex providing is that it makes it possible to already use those other numeric dtypes for the index right now (because you can call pd.NumericIndex(...) explicitly), instead of having to wait for pandas 2.0.
But personally, I think this is only a minor use case, since by far most users get their index from either reading data from some data source that already sets the index or from doing a set_index on your dataframe. And for those cases we currently (for backwards compatibility) don't yet use NumericIndex anyway.

(and if we find it important to be able to use the other numeric dtypes right now, we could also consider other ways, like an (ugly) keyword to force this in the Index(..) call)


Short term, we don't necessarily need to do big changes. If others agree with the reasoning above, the minimal change that would be required for 1.4 is 1) remove the public exposure of NumericIndex in the public top-level namespace and 2) update the deprecation warning to not mention NumericIndex (but instead refer to Index class + refer to checking .dtype instead of isinstance checks).
And then longer term we would of course still see how to implement this (how to move the functionality of NumericIndex into the base class Index).

@jreback
Copy link
Contributor

jreback commented Oct 22, 2021

I disagree that NumericIndex doesn't have value ; it's name is the value. Furthermore this index type can allow operations such as + - for example while we should ban these for Index

-1 on jamming even more in Index
-1 on adding even more complexity

we have a good balance now

@jorisvandenbossche
Copy link
Member

I disagree that NumericIndex doesn't have value ; it's name is the value

I personally think it is the dtype that has this value (indicating to users what kind of index they have), or at least is sufficient for this IMO.
As a user, I think this is mostly relevant when checking the index interactively, so you get something like this:

>>> df.index
NumericIndex([0, 1, 2], dtype='int64')
# vs
Index([0, 1, 2], dtype='int64')

Personally, while the first gives a bit more direct hint that it is a numeric index, I think this is also obvious enough from the values and the dtype in the second case (in the end, this is very similar for a Series, where you also only see it based on the dtype in the repr).

-1 on adding even more complexity

Note that my proposal reduces the complexity for the user.

@jorisvandenbossche
Copy link
Member

@topper-123 do you have thoughts on the issue?

@topper-123
Copy link
Contributor Author

topper-123 commented Nov 9, 2021

I'm ok with both. I'm -1 on implenting it myself, as I'm mostly interested in having this functionality and am ok with NumericIndex. If someone else wants to integrate it into Index, I would be ok with that :-)

More generally though, I'm thinking if all the index classes are really needed. It's probably messy having CategoricalIndex, TimeDeltaIndex and all the other variants. Couldn't we make something smarter, so all array types could be the base Index class? All array types can already be contained in series, so why not in Index?

@jbrockmendel
Copy link
Member

@topper-123 what you're describing sounds a lot like #43002

@jorisvandenbossche
Copy link
Member

I'm mostly interested in having this functionality

And is it important to have it "right now", or are you fine with waiting until 2.0 (until after the deprecation cycle)?

Because AFAIK that's the main difference: with the NumericIndex approach you will be able to call pd.NumericIndex(..) in 1.4 to create an index with specific dtype, while pd.Index(..) will continue to return (u)int64/float64 for backward compatibility. So if we remove the public exposure of NumericIndex, there is no longer a way to get the behaviour now.
As I mentioned above, I personally find this only a minor inconvenience (most of the time, you get your index from reading data, or from calling set_index implicitly or explicitly, .., and all those cases will also keep using the int64/float64 indices for now). But if people find this important, there are other ways to expose the new functionality (such as a keyword in pd.Index(..))

@topper-123
Copy link
Contributor Author

I think the end game should be something like decribed by @jbrockmendel, where we have one index container for many array types. Currently, we already have many different index classes (CategoricalIndex etc.) and I don't see a big issue having NumericIndex also until / if(?) #43002 gets implemented. I think we should have indexes of all numpy int/uint/float types in 1.4.

@jorisvandenbossche
Copy link
Member

I don't see a big issue having NumericIndex also until / if(?) #43002 gets implemented

My issue with this is that we are now pointing users to switch to a class that we are not yet sure about whether we want to keep it (and moreover for a class that doesn't add any value on itself; there are no additional methods like other index subclasses).

Also note that the Index class will already be able to hold multiple dtypes with #43930 as well. So it's not that this would be the only case of that.

I think we should have indexes of all numpy int/uint/float types in 1.4.

An alternative would be a keyword in the Index constructor to indicate to preserve the dtype of the input data?

@jorisvandenbossche
Copy link
Member

To bump this discussion, I opened a PR to start removing NumericIndex just from the public top-level namespace (IMO the minimal short-term change for 1.4): #44819

@jreback
Copy link
Contributor

jreback commented Jan 1, 2022

closing this, the actual conversions can be handled in other issues.

@jreback jreback closed this as completed Jan 1, 2022
@simonjayhawkins simonjayhawkins removed the Blocker for rc Blocking issue or pull request for release candidate label Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants