-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: back IntervalArray by a single ndarray #37047
Conversation
pandas/core/arrays/interval.py
Outdated
@@ -1399,3 +1344,78 @@ def maybe_convert_platform_interval(values): | |||
values = np.asarray(values) | |||
|
|||
return maybe_convert_platform(values) | |||
|
|||
|
|||
def _maybe_cast_inputs(left, right, copy, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type if you can (esp the return)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
pandas/core/arrays/interval.py
Outdated
return left, right | ||
|
||
|
||
def _get_combined_data(left, right): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type if you can esp the return
|
||
|
||
def _get_combined_data( | ||
left: Union["Index", ArrayLike], right: Union["Index", ArrayLike] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this method be more strict? e.g. only accept Union[np.ndarray, "DatetimeArray", "TimedeltaArray"]
e.g. things that have already been casted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can do quite a bit less back-and-forth casting eventually, yes
cc @jschendel @TomAugspurger if you want to look |
I think things looked fine last I checked.
…On Mon, Oct 12, 2020 at 10:19 AM Jeff Reback ***@***.***> wrote:
cc @jschendel <https://github.com/jschendel> @TomAugspurger
<https://github.com/TomAugspurger> if you want to look
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#37047 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIXWO4HOX6ZZUHVTPELSKMNAHANCNFSM4SLO3GZQ>
.
|
thanks @jbrockmendel |
The consequence of this is that we can never use another ExtensionArray as the backer of an IntervalArray, right? (except for the DatetimeArray for which this is an internal special case already) Also, @jbrockmendel you mention this change is mainly for perf reasons. What is the amount of speed-up that this gives? Can you give a few examples? (are there benchmarks for this?) |
huh why make things more complicated ? what reason do you have for not dojng this? |
I think the model we have had for years (storing a left and right array) is actually quite simple, and maps to the mental model of an IntervalArray as a struct array. (and as a sidenote @jreback, some feedback on how we are discussing things: words like "huh" (which give the impression to me as if what I said is complete nonsense) is not really helpfull for a good discussion) |
i think huh is pretty telling actually - you are making a late comment because on a PR because so i don't this is helpful at all and the huh is confusion because of the commen |
(then you can also say: "I am confused" instead of "huh?". Maybe for you it is the same, but so I am giving you the feedback that this phrasing is giving me a very negative feeling. Even if you are confused why I am commenting on a closed PR, I think you can say that in a different way) Yes, this PR is already merged, but since when does that mean I cannot have comments anymore about the change? (the PR was also open for not even 2 days, without a prior issue discussing this. That's not leaving much time to comment on it while it was open) But so about the actual topic:
To repeat what I said in my original comment: this means "we can never use another ExtensionArray as the backer of an IntervalArray". Of course, right now, we only use numpy arrays as the underlying values for IntervalArray (also because we don't yet have an EA-backed Index). So there is not a direct issue at this moment with this change. But I personally think it is reasonable to consider allowing IntervalArray to be backed by both ndarray or EA. And why would we want to have IntervalArray backed by EAs? |
Ah that's a fair point about other extension types I hadn't considered. IIUC, the main reason for this PR was to clean up the (hacky) code to support |
I think this is a better representation actually of the primary usecases of IntervalArrays, until / unless we have another one this is one that keeps consistency of dtypes which has plagued since the beginning. This advocates for 2E EA (@jbrockmendel will be happy). THe main point though is the mental model is much simpler. This is a collection of Interval points that are joined exactly; 2 independent arrays are not, you have lots of checking are they they same dtype, are they the same length and so on. this is simpler. Let's not build more than we need now. |
Not fully sure what you mean here. Can you give an example of a past issue releted to this?
Well, the current code to deal with that was mostly focused in a single function, i.e. Whether a single 2D array is "much simpler" is clearly subjective then, but anyway, my main worry this related to EA support:
We actually need it now. For sure, there is not an open PR or issue about supporting EAs in IntervalArray right now. But we are still actively working on making the nullable dtypes full replacements for their numpy counterparts. Support in IntervalArray is for me part of that "full support". |
Not on hand, no.
full replacements for numpy counterparts would support 2D. |
Not necessarily for pandas functionality, though But that's not the actual discussion here. To repeat again: I think we should be able to have an IntervalArray using EA (eg to support IntervalArray with nullable int/float). How do we do that? (on the short term) And yes, 2D EAs would be one way. But that's a much bigger discussion, though. So if someone wants to have that discussion, I would suggest to open a separate issue for that. |
What the status of this discussion? Since I haven't seen further attempts to use this for performance improvements, shall we revert it? |
thanks for the reminder; ill move this up the priority list |
Any update here? Otherwise I would propose to revert this change. |
I've got a branch going on the set ops, but it has devolved into yak shaving. There is code simplification available unrelated to perf, but on the call I agreed not to move forward with any of that in the short-term. |
Another ping here. If there are no plans for actual improvements based on this change, are you OK with reverting it? |
Do code simplifications count as improvements? The stated reason for wanting to revert is to allow IntegerArray/FloatArray to back it. Are there plans to do that in the near future? |
In principle certainly yes, but in this case I think it is quite subjective. As I said above, I personally don't find this much of a simplification. But it is clear that there is disagreement about this ;)
As mentioned above: there is no concrete open issue or PR for this, but we are actively working towards full support for the nullable dtypes so they can become the default in a future release. While working on this the coming months, supporting them in IntervalArray will be one of the issues we need to tackle. So yes, there are IMO plans to do this in the near future. |
was wavering in this but i agree with @jorisvandenbossche logic here for reverting |
OK. easy to un-revert once either a) i finish the yak shaving or b) nullable dtypes support 2D. |
This reverts commit 9cb3723.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The main motivation here is for perf for methods that current cast to object dtype, including value_counts, set ops, values_for_(argsort|factorize)
Also we get an actually-simple simple_new