API: return value of `.values` for Series with the future string dtype (numpy array vs extension array) #60301
Comments
I'm fine with it. Longer-term I'd like to deprecate/change the dt64tz behavior and return the EA there too (making .values match ._values so we can get rid of the latter)
I think this is fine. It's hard to imagine someone depending on a feature of I should note that in
I don't think it is that hard. For example if the user uses
Tangentially I think we need to get more serious about deprecating / removing it
I have used
@attack68 that's definitely valid for numeric data; do we think the same applies to strings?
I think
As @attack68 says, I also do think this will be a pretty big change for downstream. Fairly certain there is a lot of use in statsmodels. Still support cleaning it up.
Personally tended to avoid numpy for strings. Doubt the use case is as prevalent, would guess an order of magnitude smaller.
+1 on
Historically, the `.values` attribute returned a numpy array (except for categoricals). When we added more ExtensionArrays, for certain dtypes (e.g. tz-aware timestamps, or periods, ...) the EA could more faithfully represent the underlying values instead of the lossy conversion to numpy (e.g. for tz-aware timestamps we decided to return a numpy object dtype array instead of "datetime64[ns]" to not lose the timezone information). At that point, instead of "breaking" the behaviour of `.values`, we decided to add an `.array` attribute that then always returns the EA.

But for generic ExtensionArrays (external, or non-default EAs like the masked ones or the Arrow ones), `.values` has always directly returned the EA as well. So in those cases, there is no difference between `.values` and `.array`.

Now to the point: with the new default `StringDtype`, the current behaviour is indeed to also always return the EA for both `.values` and `.array`.

This means this is one of the breaking changes for users when upgrading to pandas 3.0: for a column which is inferred as string data, `.values` no longer returns a numpy array.

Are we OK with this breaking change now?

Or, we could also decide to keep `.values` returning the numpy array, with `.array` returning the EA.

Of course, when we would move to use EAs for all dtypes (which is being considered in the logical dtypes and missing values PDEP discussions), then we would have this breaking change as well (or at least need to make a decision about it). But that could also be a reason to not yet do it for the string dtype now, if we would change it for all dtypes later.
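
To make the difference concrete, here is a minimal sketch comparing `.values`, `.array` and `.to_numpy()` for a default numeric dtype, a masked dtype and a string dtype. Two things here are assumptions rather than part of the proposal: the helper name `describe_accessors` is made up for illustration, and the explicitly requested `"string"` dtype is used only as a stand-in for the future default string dtype, so the snippet does not depend on any particular pandas version or option.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def describe_accessors(series):
    """Hypothetical helper: report which container each accessor returns."""
    return {
        "values_is_ndarray": isinstance(series.values, np.ndarray),
        "array_is_ea": isinstance(series.array, ExtensionArray),
        "to_numpy_is_ndarray": isinstance(series.to_numpy(), np.ndarray),
    }


# Default numpy-backed dtype: .values is a numpy array.
numeric = pd.Series([1, 2, 3])

# Masked (non-default) EA dtype: .values already returns the EA itself.
masked = pd.Series([1, 2, None], dtype="Int64")

# Explicit string EA dtype, used as a stand-in for the future default.
strings = pd.Series(["a", "b", None], dtype="string")

for name, ser in [("numeric", numeric), ("masked", masked), ("strings", strings)]:
    print(name, describe_accessors(ser))
```

Code that needs a numpy array regardless of dtype can call `.to_numpy()` explicitly, which does not depend on what `.values` returns.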
cc @pandas-dev/pandas-core