ENH: Indexes with any numpy int/uint/float dtype #41272
Comments
My opinion: The current index types are very ingrained in pandas, and I think we should prioritize backwards compatibility in pandas 1.x, even at the possible cost of having a worse API short-term. The "optimal" API could then be implemented for pandas 2.0. In that light I would prefer different short-term and long-term choices: Short-term: Implement a separate I think new index classes should only be implemented for differing API and there will not be API differences between working on the various indexes with different numpy numeric dtypes. So having an index class for each dtype subtype is IMO excessive, and in e.g. in |
I agree, with the existing numeric indexes being actually implemented as
I still fail to see why e.g. In general, while I understand that aggregating in |
My kneejerk reaction is very negative on the idea of Int64Index holding anything other than int64.
In 2.0, if we could combine all Index subclasses back into Index, that'd be pretty neat. Failing that, having only a handful of dtypes return the base class seems klunky.
The proliferation of classes is unfortunate, but this is the most straightforward option.
I'd be fine with this. |
@toobaz there is indeed a very clear API distinction between
IMO that's actually a good reason to use the base |
You have a point, and one that in principle is also easy to document ("it's a separate class iff it provides additional methods"). I'm a bit afraid (in terms of implementation, but also of users' expectations) of deciding that whether a given type/storage mechanism is supported by
I'm a bit confused by what Notice that if |
It's (currently) only meant as a base class for a few of our own existing Index subclasses that are backed by an ExtensionArray (i.e. CategoricalIndex, IntervalIndex, Datetime/Timedelta/PeriodIndex), but you currently can't use it for any EA (it could be extended for this use case, or we could simply use |
I see. And now that I think about it,
Inheritance must be done right, because we might legitimately want to have different subclasses (i.e., with specific methods) to support EA. So whether or not we merge as much as possible in Still not a fan of the merge, but I finished my arguments ;-) @jorisvandenbossche your general API distinction makes sense, I still think it complicates the codebase but it might be worth it. Bonus points if |
If we wanted to get down to a single Index class, one option for these methods/properties would be to hide them behind an accessor. |
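A rough sketch of what the accessor idea could look like, using a toy class rather than pandas internals. The names `ToyIndex`, `NumericAccessor`, the `num` accessor and its `total` method are all hypothetical and chosen only for illustration; nothing here is existing or proposed pandas API.

```python
class NumericAccessor:
    """Hypothetical accessor that groups numeric-only helpers."""

    NUMERIC_DTYPES = ("int32", "int64", "float64")

    def __init__(self, index):
        # Refuse to attach to an index whose dtype is not numeric.
        if index.dtype not in self.NUMERIC_DTYPES:
            raise AttributeError("numeric accessor requires a numeric dtype")
        self._index = index

    def total(self):
        # A dtype-specific helper that lives on the accessor, not on the index.
        return sum(self._index.values)


class ToyIndex:
    """Toy stand-in for a single Index class; not the pandas implementation."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    @property
    def num(self):
        # Built on access, so the check against the dtype happens here
        # instead of being encoded in a separate index subclass.
        return NumericAccessor(self)


ints = ToyIndex([0, 1, 2], dtype="int32")
print(ints.num.total())

labels = ToyIndex(["a", "b"], dtype="object")
try:
    labels.num
except AttributeError as exc:
    print("no numeric accessor:", exc)
```

In this toy version the class stays the same for every dtype, and whether the numeric helpers are reachable depends only on the dtype check inside the accessor.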
I guess we would lose in terms of simplicity of implementation and syntax... but I see the possible important gain in writing all dtype-specific attributes/methods for |
Having thought about this a bit, I think having the base Index hold numeric dtypes is not possible short-term, as the return value of pd.Index(arr, dtype="int32") is an Int64Index. That can't be changed without breaking the API or adding a keyword, which is ugly IMO (requires the keyword peppered around the pandas code base). I suggest that short term (pandas 1.x) we go with NumericIndex, and for pandas 2.0 we can merge that into Index (or not, if we don't like that and would like to keep NumericIndex). |
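To make the compatibility concern concrete, here is a toy model of the two behaviours being weighed. It does not call pandas; `make_index`, the `keep_dtype` keyword and the `WIDENED` table are hypothetical names used only to illustrate why an opt-in keyword would have to be passed at every call site that wants the new behaviour.

```python
# Toy model only: these names are hypothetical and are not pandas API.
WIDENED = {
    "int8": "int64", "int16": "int64", "int32": "int64",
    "uint8": "uint64", "uint16": "uint64", "uint32": "uint64",
    "float32": "float64",
}

CLASS_FOR_DTYPE = {
    "int64": "Int64Index",
    "uint64": "UInt64Index",
    "float64": "Float64Index",
}


def make_index(values, dtype, keep_dtype=False):
    """Return the (class name, dtype) pair the toy constructor would produce."""
    if keep_dtype:
        # Opt-in path: the requested dtype is preserved on the base class.
        return ("Index", dtype)
    # Default path: mimic the existing behaviour of widening to 64 bits.
    widened = WIDENED.get(dtype, dtype)
    return (CLASS_FOR_DTYPE.get(widened, "Index"), widened)


print(make_index([0, 1, 2], "int32"))
print(make_index([0, 1, 2], "int32", keep_dtype=True))
```

Existing callers keep getting the widened result unless they pass the extra keyword, which is the "peppered around the code base" cost mentioned above.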
I would like to reopen this issue to discuss the "public API" aspect of it (and sorry in advance for the long post coming). To be clear, I fully support the goal of being able to use all numeric dtypes in the index (instead of the int64/uint64/float64 we have now); I am only questioning how we want to expose this in the future in the public API. The original PR implementing NumericIndex had quite some discussion about the different ways this could be implemented, which is summarized in the table at the top post of this issue. But at some point in that PR we decided to move forward with the implementation of that PR without directly adding it to the public namespace, because we could decide on how to publicly expose it later (at the bottom of #41153 (comment)). But then later we actually added NumericIndex to the public namespace in #42706 without any further discussion. Anyway, how it happened is not important (I also didn't react to the last comment about this above in this issue), but I was thinking again about this while having to deal with the deprecation warning in downstream packages, and I would like to reconsider this before the final 1.4 release. The table in the top post lists 4 options. The relevant ones are:
For both options, the existing Int64Index, UInt64Index and Float64Index would be deprecated and removed in pandas 2.0. However, the table lists Option 1 as backwards compatible, while Option 3 as not backwards compatible. And I think that's the reason @topper-123 went with option 1, according to the last comment #41272 (comment) above. But personally, I am not convinced there needs to be a difference in backwards compatibility between both options. As far as I understand, the reason for indicating Option 3 using But I would like to argue that we can also keep It might be a bit more complex internally for the implementation compared to the separate For me, some reasons to prefer the base
I think the only actual value that I see (and if we find it important to be able to use the other numeric dtypes right now, we could also consider other ways, like an (ugly) keyword to force this in the Short term, we don't necessarily need to do big changes. If others agree with the reasoning above, the minimal change that would be required for 1.4 is 1) remove the public exposure of |
I disagree that NumericIndex doesn't have value; its name is the value. Furthermore, this index type can allow operations such as + and -, for example, while we should ban these for Index. -1 on jamming even more into Index; we have a good balance now. |
I personally think it is the dtype that has this value (indicating to users what kind of index they have), or at least is sufficient for this IMO.
>>> df.index
NumericIndex([0, 1, 2], dtype='int64')
# vs
Index([0, 1, 2], dtype='int64')
Personally, while the first gives a bit more direct hint that it is a numeric index, I think this is also obvious enough from the values and the dtype in the second case (in the end, this is very similar for a Series, where you also only see it based on the dtype in the repr).
Note that my proposal reduces the complexity for the user. |
@topper-123 do you have thoughts on the issue? |
I'm ok with both. I'm -1 on implementing it myself, as I'm mostly interested in having this functionality and am ok with More generally though, I'm thinking if all the index classes are really needed. It's probably messy having |
@topper-123 what you're describing sounds a lot like #43002 |
And is it important to have it "right now", or are you fine with waiting until 2.0 (until after the deprecation cycle)? Because AFAIK that's the main difference: with the |
I think the end game should be something like described by @jbrockmendel, where we have one index container for many array types. Currently, we already have many different index classes ( |
My issue with this is that we are now pointing users to switch to a class that we are not yet sure about whether we want to keep it (and moreover for a class that doesn't add any value on itself; there are no additional methods like other index subclasses). Also note that the
An alternative would be a keyword in the |
To bump this discussion, I opened a PR to start removing NumericIndex just from the public top-level namespace (IMO the minimal short-term change for 1.4): #44819 |
closing this, the actual conversions can be handled in other issues. |
I've made a (draft) PR in #41153 that implements an index class for all the normal numpy numeric dtypes (int64/32/16/8, uint64/32/16/8 and float64/32). There was some discussion in that PR about what the public API for this should be, so I'm opening this issue to discuss that.
My plan is to get #41153 merged when it's ready, possibly without any public-facing API changes. The public-facing changes would then come afterwards, once there is a conclusion on what the API should be.
Summary of the options
As I see it the options for the API are:
1. NumericIndex class for all numeric index types
2. A separate class per dtype (Int8Index, Int16Index, Int32Index, etc), and those can still be backed by a single internal NumIndex
3. Index class for these numeric numpy dtypes
4. Int64Index can take int32 etc.
Options 1 and 3 would mean that the existing Int64Index, UInt64Index and Float64Index would be deprecated and removed in pandas 2.0, because their functionality would be covered by the other index classes.
Option 2 would increase the number of numeric index classes.
Option 4 would extend the functionality of the existing numeric index classes.
So there are quite a few possibilities. Hopefully we can come to a common conclusion.
@pandas-dev/pandas-core