-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.values on ExtensionArray-backed containers #19954
Comments
Not to derail the discussion from the get-go, but there's a closely related topic of how many "nuisance" ndarray methods we throw on the ExtensionArray base class to make writing code that's ndarray vs. extensionarray agnostic easier. For example |
I was pretty happy with always returning an I think my preferred option would be to make
NumPy's datetime64 is now always timezone naive, not UTC. I guess you meant the times are converted into UTC before being passed into NumPy?
My preference would be to make the base class as abstract as possible, only implementing methods that are actually used by pandas. This will minimize the effort required to author new ExtensionArray classes, rather than requiring that authors override pre-existing methods. Extension arrays should be a low-level interface -- we already have the high-level interface for extension arrays in Series/DataFrame. |
+1 on this. I think the break in compatibility is less painful than the inconsistency and current situation of, e.g. xref to #15750 - was about changing the return type of tz-aware datetimes to boxed scalars. |
Yep I think I share the preference for Also +1 for a |
Are people OK with |
it has been this way for a long time |
What do you mean by that?
Why would we change tz-naive to return anything other than a datetime64[ns] ndarray? |
As stated above I'm OK with the inconsistency, but see where it could be surprising. Another option would be to deprecate |
👍 . I wanted to get some explicit thoughts on that one, since I didn't initially comprehend that two seemingly similar types could have different outcomes.
Yes. We'll want something like that. Internally, we've used |
Let's start with goals and work backwards to an API
I think getting 1 and 2 right are necessary. 3 would be a nice thing to have, but I wouldn't I also think that "fixing" So, I propose avoiding that fight. Let's use new names to achieve the behavior we want. Here's a concrete proposal. I still need to go through all our EAs to figure out an @property
def array(self) -> Union[ndarray, ExtensionArray]:
"""Return the array backing this Series or Index
Examples
--------
>>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
>>> ser = pd.Series(arr)
>>> ser.array
<PeriodArray>
['2000', '2001']
Length: 2, dtype: period[A-DEC]
"""
return self._values
@property
def numpy_array(self) -> ndarray:
"""Return a NumPy array of this object's values.
This may or may not require a copy or coercion of values.
For dtypes that can be represented by NumPy, this will be a view on
the actual values. For ExtensionArrays, this will likely be an object-dtype
ndarray that losslessly represents the values.
Examples
--------
>>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
>>> ser = pd.Series(arr)
>>> ser.numpy_array
array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC')], dtype=object)
"""
return np.asarray(self._values)
@property
def ndarray_values(self) -> ndarray:
"""Return a NumPy array representing the values in this object.
This should be faster to compute than ``self.numpy_array``, but will
require additional context to interpret.
Examples
--------
>>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
>>> ser = pd.Series(arr)
>>> ser.ndarray_values
array([30, 31])
"""
return self._ndarray_values |
@TomAugspurger I like this general idea, but let me suggest two refinements:
|
why would this not be on top of an arrow backed array? the point of EA is to completely hide this detail |
I was viewing this as a
Mmm, yes that's a good point... I would need to look at this. In several places, like in our indexing engines, I think we really need an ndarray right now.
I'm not sure what you mean. Were you responding to my proposal or Stehpan's comments? |
I was responding to stephan an arrow backed extension arrow is nothing special as far as pandas is concerned |
You can turn an arrow backed extension arrow into a NumPy array (possibly with a copy) or you can access it as an extension array, but there isn't necessarily any equivalent of a "raw numpy array" underlying the values that you can directly mutate to change the values in a |
@shoyer of course but how does this actually matter - again the actual implementation of the EA is irrelevant, except when we ask it give us a raw set of values and even that is transparent |
I think whether or not this has to be an ndarray depends on how it's (intended) to be used. Is the primary motivation
I was proposing the first, hence the name Right now, we seem to use |
This is definitely the use-case I had in mind. This can be convenient, e.g., for cases where you want to modify the NumPy array in place. Note that this is definitely possible for some but not all arrow based arrays. For example, if you have integers or floats without any missing values.
I think this is covered by |
Not the "fast" requirement though. For e.g. PeriodIndex, My last argument in favor of The usage for As a final note, I think this |
I totally agree here, I was just proposing a different name for the same use-case:
|
One last question here... I've implicitly assumed that I don't think that allowing this to be a 2-D will be that useful in practice. I suspect that places using these arrays are likely to expect 1-D arrays. But I figured I'd throw this out there so we can at least reject it, and document that they should be 1-D. |
Personally I would leave the Of course, as @shoyer mentions, we can raise an error for this attribute if there is no clear no-copy option available. But then a user would need to be aware of the specific data type/EA that is being used, so then why not use EA-specific API to do this?
and if you have a PeriodArray. Then you know how a PeriodArray works: it consistent of integers and a freq, so you can use the public PeriodArray.asi8 to access those integers, do something with it, and then reconstruct a PeriodArray from the integers and the freq. I would say that each ExtensionArray that wants to enable such manipulation, can provide it's own access to the underlying data, because this will differ for each array anyhow? (eg for PeriodArray, you already need to know you need to keep track of freq to reconstruct it, so then you can also use a specific attribute to access the data?) |
To update the table at
which is pleasingly symmetric. OK by everyone? And if we allow kwargs for I suppose we should reconsider what the default is... Should |
It's a private attribute, so in principle there should be no downstream use case right now, but you are right, there could of course be use cases, which is good to think about to decide to make it public or not. |
+1 on #19954 (comment) a I think the discussion for a bit -0 on |
We indeed need to decide on the default behaviour of Having the default
To be clear this is not a numpy dtype (I would make that clearer in the formatting), so in that sense the table is not fully symmetric. |
That's part of why I want to ignore it for now :)
People seem to like the property. And if |
Changed to DatetimeTZDtype |
Again, this is discussion is for sure very relevant, and one we need to have. But I think it is orthogonal enough from the discussion about public attributes to keep it separate and not further complicate this already complex issue thread.
The idea is that |
I added |
Opened #23565 for the discussion on |
a property just boxes you in as you can never add arguments. |
why is |
But why would you want an argument for it? Can you give a use case? There is only one array stored in the series/index container, and it simply returns that one (without making a copy, without any modification)
If we do the last change that Tom mentioned above (#19954 (comment)), there might be no difference in the default behaviour, but Do we have other use cases where multiple ndarrays are possible (except for the datetime tz case) ? |
this is exactly the point! I don't know now. |
Can you explain what you mean with "boxed us in" ? |
I think part of the idea is to not have arguments.
I think we're moving to it matching by default.
Not saying this is a good idea, but potentially a 2-D array from an IntervalArray. |
Maybe we can pause this conversation until this afternoon :) |
This again offers too many options. |
great |
why do you think we have a problem converting Datetime w/tz now? because we can't touch A method offers possibilities, a property precludes them. |
Suppose we were to implement RangeArray to back RangeIndex (or even range-like DatetimeArray/Index). Then there is nothing being boxed. I don't have a real opinion here, just sayin' |
Then it is the RangeArray that is being boxed, and thus what would be returned? |
This adds two new methods for working with EA-backed Series / Index. - `.array -> Union[ExtensionArray, ndarray]`: the actual backing array - `.to_numpy() -> ndarray`: A NumPy representation of the data `.array` is always a reference to the actual data stored in the container. Updating it inplace (not recommended) will be reflected in the Series (or Index for that matter, so really not recommended). `to_numpy()` may (or may not) require data copying / coercion. Closes pandas-dev#19954
Discussed briefly on the call today, but we should go through things formally.
What should the return type of
Series[extension_array].values
andIndex[extension_array].values
be? I believe the two options areCurrent State
Not sure how much weight we should put on the current behavior, but for reference:
If we decide to have the return values be ExtensionArrays, we'll need to discuss
to what extent they're part of the public API.
Regardless of the choice for
.values
, we'll probably want to support the otheruse case (maybe just by documenting "call
np.asarray
on it). Internally, wehave
._values
("best" array, ndarray or EA) and._ndarray_values
(always anndarray).
cc @jreback @jorisvandenbossche @jschendel @jbrockmendel @shoyer @chris-b1
The text was updated successfully, but these errors were encountered: