Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: hide NumericIndex from public top-level namespace in favor of pd.Index #44819

Merged
merged 8 commits into from
Jan 1, 2022

Conversation

jorisvandenbossche
Copy link
Member

Starting a PR for this to bump the discussion in #41272 (comment)

This PR already removes NumericIndex from the top-level namespace, but does not yet enable getting a NumericIndex otherwise. I we want to go forward with this proposal, we will have to choose between either a keyword in pd.Index to preserve the passed dtype (pd.Index([..], dtype="int32") currently gives an int64 index), or accept that this will only be possible in pandas 2.0. Let's have that high-level discussion at #41272.

@jorisvandenbossche jorisvandenbossche added the Index Related to the Index class or subclasses label Dec 8, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.4 milestone Dec 8, 2021
@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Dec 8, 2021
@bashtage
Copy link
Contributor

bashtage commented Dec 8, 2021

statsmodels often needs an easy way to detect index types. This is relied heavily upon when working with time series data where we are strict about the types of indices that are allowed, and if an end user passes a disallowed index, it is replaced by an Int64Index. I spent a reasonable amount of time working around the move from Int64Index to NumericIndex. I am somewhat worried that needing to detect that an Index is in fact equivalent to an Int64Index might not be simple.

@bashtage
Copy link
Contributor

bashtage commented Dec 8, 2021

IIRC one aspect that made it harder was that some attributes were not always defined for all possible index types. Maybe this would simplify things so that indices like PeriodIndex will have the dtype attribute.

@jorisvandenbossche
Copy link
Member Author

I am somewhat worried that needing to detect that an Index is in fact equivalent to an Int64Index might not be simple.

Can you try to explain why you think it is not that simple (I think you can check the dtype of the index object?)
Maybe a pointer to how you dealt with this in Statsmodels would be interesting.

IIRC one aspect that made it harder was that some attributes were not always defined for all possible index types

This specific discussion is about numeric Index classes, and the Int64Index and Float64Index have no additional attributes compared to the base Index class.

Maybe this would simplify things so that indices like PeriodIndex will have the dtype attribute.

All index classes (including PeriodIndex) should have a .dtype attribute.

@bashtage
Copy link
Contributor

I'm not sure why I was complaining so much. At the time it felt painful, but it seems it wasn't that bad

@jreback
Copy link
Contributor

jreback commented Dec 20, 2021

@jorisvandenbossche if you can rebase this and will look as needed for the RC

@simonjayhawkins simonjayhawkins added the Blocker for rc Blocking issue or pull request for release candidate label Dec 21, 2021
@topper-123
Copy link
Contributor

@bashtage , I looked at your PR at bit.

In general, it's not necessary to use the more specific index constructors if you just instead supply the dtype. So, instead of NumericIndex(np.arange(nobs)), None) you could to Index(np.arange(nobs)), dtype="int"), which will resolve to Int64Index in Pandas 1.x and to a NumericIndexin Pandas 2.0. Same goes for (most?) other index types, just supply the correct dtype to Index and the correct index type is used.

Same goes for instance checks. Instead of isinstance(idx, Int64Index), you should to pd.api.types.is_integer_dtype(idx.dtype) etc. like youød do for Series (it wouldn't make sense to do instance(ser, Series) for most part.

I think this is part of what @jorisvandenbossche is trying to do: The Index subclasses may not be necessary to have in the public namespace and people just rely on dtype more that index type (not sure if that discussion is 100 % settled, but that's the direction).

@topper-123
Copy link
Contributor

(I'll comment on the PR later today)

@jorisvandenbossche
Copy link
Member Author

@bashtage as a specific example, I think you could replace the whole is_int_index with something like:

def is_int_index(index: pd.Index):
    return isinstance(index, pd.Index) and isinstance(index.dtype, np.dtype) and np.issubdtype(index.dtype, np.integer)

which doesn't check for specific subclasses and should also already work for older and newer pandas versions alike (or use the pd.api.types.is_integer_dtype as mentioned above, but that is a bit more flexible, it will eg also allow the nullable int dtypes).

@jorisvandenbossche
Copy link
Member Author

For this PR, as mentioned in the top post (and being discussed a bit in #41272), the main issue we need to discuss is how to enable using different dtypes (the non 64-bit dtypes):

  • Live with the fact that this will only be possible in pandas 2.0
  • Add a keyword to pd.Index to enforce that the specified dtype is respected (currently pd.Index([1, 2], dtype="int32") works but returns int64 index)
  • Actually start respecting the dtype keyword already (but only if specified explicitly, so not when just passing an np.ndarray to pd.Index). But this is not fully backwards compatible.

@topper-123
Copy link
Contributor

My preferred option is to keep this as it is. There are already something like 12 Index types in the main namespace and I don't see the big deal with having this one also.

@jorisvandenbossche
Copy link
Member Author

@topper-123 could you also say something about your preference what to do would be if we would decide not exposing it publicly (i.e. related to my last comment above)?

@bashtage
Copy link
Contributor

bashtage commented Dec 23, 2021

if we would decide not exposing it publicly (i.e. related to my last comment above)?

IMO, it seems strange to not publically expose the concrete type of the most common Index.

This is different from saying that it should be in the top level of pandas. I am arguing that it should be in a location that is considered public.

@jorisvandenbossche
Copy link
Member Author

IMO, it seems strange to not publically expose the concrete type of the most common Index.

The idea is that the concrete type will be pd.Index, so we will be exposing that

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

@jorisvandenbossche if you do have a chance to rebase as this is blocking the RC

# Conflicts:
#	pandas/tests/indexes/test_numpy_compat.py
@phofl
Copy link
Member

phofl commented Dec 31, 2021

I've rebased, was trivial fortunately.

@jreback jreback removed the Needs Discussion Requires discussion from core team before further action label Dec 31, 2021
@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

the main issue we need to discuss is how to enable using different dtypes (the non 64-bit dtypes):
Live with the fact that this will only be possible in pandas 2.0

this is the current way and so that is fine.

I agree that this is a fairly big change and likely better wait till 2.0

@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

@phofl
Copy link
Member

phofl commented Dec 31, 2021

Hm yeah this was missing. Had to remove thet whatsnew regarding NumericIndex

@phofl
Copy link
Member

phofl commented Dec 31, 2021

Should also remove from userguide

@jreback jreback merged commit abd7436 into pandas-dev:master Jan 1, 2022
@jreback
Copy link
Contributor

jreback commented Jan 1, 2022

thanks @jorisvandenbossche and @phofl

@topper-123
Copy link
Contributor

topper-123 commented Jan 1, 2022

I just noticed #43930, so it seems there is excellent progress towards making Index accept all array types. In that case I think this PR goes in the same direction, which is great.

Shouldn't CategoricalIndex et. al. likewise be pulled into Index also, so all the index sub classes in the main namespace can be avoided?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2022

Shouldn't CategoricalIndex et. al. likewise be pulled into Index also, so all the index sub classes in the main namespace can be avoided?

yep that's the end goal here

@topper-123
Copy link
Contributor

yep that's the end goal here

👍

@simonjayhawkins simonjayhawkins removed the Blocker for rc Blocking issue or pull request for release candidate label Jan 5, 2022
@jorisvandenbossche jorisvandenbossche deleted the numericindex-private branch January 21, 2022 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants