API: return value of .values for Series with the future string dtype (numpy array vs extension array) #60301

Open
jorisvandenbossche opened this issue Nov 13, 2024 · 10 comments
Labels
API Design Strings String extension data type and string data
Milestone

Comments

@jorisvandenbossche
Member

Historically, the .values attribute returned a numpy array (except for categoricals). When we added more ExtensionArrays, for certain dtypes (e.g. tz-aware timestamps or periods) the EA could represent the underlying values more faithfully than a lossy conversion to numpy (e.g. for tz-aware timestamps we decided to return a numpy object dtype array instead of "datetime64[ns]" to not lose the timezone information). At that point, instead of "breaking" the behaviour of .values, we decided to add an .array attribute that always returns the EA.

But for generic ExtensionArrays (external, or non-default EAs like the masked ones or the Arrow ones), .values has always directly returned the EA as well. So in those cases, there is no difference between .values and .array.
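To make the distinction above concrete, here is a minimal sketch comparing a numpy-backed integer Series with one using the masked "Int64" dtype. The variable names are illustrative only, and the exact array classes may differ between pandas versions, so the checks rely only on `isinstance`.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Default numpy-backed dtype: .values is a plain ndarray,
# while .array wraps the same data in an ExtensionArray.
numeric = pd.Series([1, 2, 3])
numeric_values_is_ndarray = isinstance(numeric.values, np.ndarray)
numeric_array_is_ea = isinstance(numeric.array, ExtensionArray)

# Masked (nullable) dtype: both .values and .array return the EA.
masked = pd.Series([1, 2, None], dtype="Int64")
masked_values_is_ea = isinstance(masked.values, ExtensionArray)
masked_array_is_ea = isinstance(masked.array, ExtensionArray)

print(type(numeric.values).__name__, type(numeric.array).__name__)
print(type(masked.values).__name__, type(masked.array).__name__)
```

For the masked Series the two attributes give the same kind of object, which is the "no difference" case described above.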

Now to the point: with the new default StringDtype, the current behaviour is indeed to also always return the EA for both .values and .array.

This makes it one of the breaking changes for users upgrading to pandas 3.0: for a column that is inferred as string data, .values no longer returns a numpy array.
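A rough illustration of the change, assuming the explicit "string" dtype as a stand-in for the inferred default in pandas 3.0 (the inferred dtype and its storage backend may differ from this opt-in dtype, so treat this as a sketch rather than a reproduction of the 3.0 behaviour):

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Object dtype: what string columns were historically inferred as.
as_object = pd.Series(["a", "b", "c"], dtype=object)
# Explicit string dtype, used here as a stand-in for the new default.
as_string = pd.Series(["a", "b", "c"], dtype="string")

object_values = as_object.values   # a numpy object array
string_values = as_string.values   # an ExtensionArray, not an ndarray
string_array = as_string.array     # the same kind of ExtensionArray

print(type(object_values).__name__)
print(type(string_values).__name__)
```

Code that only iterates over the result keeps working in both cases; code that checks for `np.ndarray` or calls ndarray-only methods is what would notice the difference.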

Are we OK with this breaking change now?
Or, we could also decide to keep .values returning the numpy array, with .array returning the EA.

Of course, if we move to using EAs for all dtypes (which is being considered in the logical dtypes and missing values PDEP discussions), we would have this breaking change as well (or at least need to make a decision about it). But that could also be a reason not to do it for the string dtype yet, if we would change it for all dtypes later.

cc @pandas-dev/pandas-core

@jorisvandenbossche jorisvandenbossche added API Design Strings String extension data type and string data labels Nov 13, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 13, 2024
@jbrockmendel
Member

I'm fine with it. Longer-term I'd like to deprecate/change the dt64tz behavior and return the EA there too (making .values match ._values so we can get rid of the latter).

@Dr-Irv
Contributor

Dr-Irv commented Nov 13, 2024

I think this is fine. It's hard to imagine someone depending on a feature of .values returning a numpy array versus any other iterable.

I should note that in pandas-stubs, Series.values returns ExtensionArray | ndarray[Any, Any] so that wouldn't change under this proposal. So if people are using any form of type checking on their code, and their code was dependent on .values returning a numpy array, then the type checker should have picked that up.

@jorisvandenbossche
Member Author

> It's hard to imagine someone depending on a feature of .values returning a numpy array versus any other iterable.

I don't think it is that hard. For example, a user might use .values instead of np.asarray(..) or .to_numpy() with the intention of converting to a numpy array, and pass the result on to some API that requires a numpy array (and doesn't call asarray itself).
The API of a numpy.ndarray and an EA is also not exactly the same.
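The scenario described above could look roughly like the following sketch. `require_ndarray` is a hypothetical downstream function invented for illustration, and the explicit "string" dtype again stands in for the inferred default in pandas 3.0.

```python
import numpy as np
import pandas as pd


def require_ndarray(data):
    """Hypothetical downstream API that accepts only numpy arrays."""
    if not isinstance(data, np.ndarray):
        raise TypeError(f"expected numpy.ndarray, got {type(data).__name__}")
    return data.shape


ser = pd.Series(["x", "y"], dtype="string")

# Passing .values hands over an ExtensionArray, which this API rejects.
try:
    require_ndarray(ser.values)
    values_accepted = True
except TypeError:
    values_accepted = False

# Explicit conversions return a numpy array regardless of the dtype.
via_to_numpy = ser.to_numpy()
via_asarray = np.asarray(ser)
shape = require_ndarray(via_to_numpy)

print(values_accepted, shape)
```

Both `.to_numpy()` and `np.asarray(..)` state the intent to get a numpy array explicitly, so they do not depend on what `.values` happens to return for a given dtype.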

@WillAyd
Member

WillAyd commented Nov 13, 2024

.values has been highly discouraged for many years now, with the warning that it does not have a stable interface, so I think this is also OK.

Tangentially, I think we need to get more serious about deprecating / removing it.

@attack68
Contributor

I have used .values quite a lot in some code to get a numpy array when working exclusively with numeric data. I have shifted to to_numpy, since I have been aware of this for a while, but have by no means gone through legacy code.

@WillAyd
Member

WillAyd commented Nov 13, 2024

@attack68 that's definitely valid for numeric data; do we think the same applies to strings?

@bashtage
Contributor

I think .values has been problematic over the years, and so it would be preferred to move users to a stable interface that will be more predictable.

@bashtage
Contributor

As @attack68 says, I also think this will be a pretty big change for downstream. Fairly certain there is a lot of use in statsmodels. Still support cleaning it up.

@attack68
Contributor

Personally tended to avoid numpy for strings. Doubt the use case is as prevalent, would guess an order of magnitude smaller.

@rhshadrach
Member

+1 on .values returning EA for strings.

7 participants