-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP/POC: Allow (N, 1) and (1,N) EAs #26914
Conversation
Minimal follow-on edits to keep tests passing
Codecov Report
@@ Coverage Diff @@
## master #26914 +/- ##
==========================================
- Coverage 91.87% 91.85% -0.02%
==========================================
Files 180 180
Lines 50712 50792 +80
==========================================
+ Hits 46590 46657 +67
- Misses 4122 4135 +13
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26914 +/- ##
==========================================
- Coverage 91.97% 91.97% -0.01%
==========================================
Files 180 180
Lines 50756 50833 +77
==========================================
+ Hits 46685 46753 +68
- Misses 4071 4080 +9
Continue to review full report at Codecov.
|
What have you learned from starting the proof of concept? Do you think this is worth pursing? |
Good question. In no particular order from my notes:
Yes. As described above, I think the complexity of internals is hiding a good amount of unintended behavior that this would help alleviate. I'd like to find a nice way to break this into smaller pieces. Something like:
|
Thanks, I'll digest all that later. One more high-level question I forgot: How do 3rd party EAs complicate things? Do we require that they can be reshaped to a (N, 1) and (1, N) 2D arrays as well? Can we hide that complexity from them, or is inevitable that they worry about it? If the goals are
then it seems inevitable that 3rd party EAs need to be able to handle this (limited) reshaping. |
I expect that most of these will be able to use |
So in the "more than minimum" case, the EA methods need to work for 2D? I am just trying to get an understanding what the implications are for EAs.
Can you give an example of such operations? |
I read that as
though I may be projecting my own thoughts onto this. |
Which of our test EAs would make for a good test-case for the wrapper / mixin / metaclass for EAs that can't be reshaped 2D? I think either arrow or json would be best. edit: Sorry for the extraneous comment. Not saying that the wrapper for un-reshapable EAs should necessarily be done here. |
But if we would keep all the 1D special casing for external EAs, doesn't this defeat the purpose of the simplication? |
Tom communicates better than Joris and I put together. We could apply the bare-minimum requirements for general EAs and go past that for our own EAs.
I like this idea, will take a look at it.
In my ideal case, EAs are a drop-in replacement for ndarray. In my preferred compromise case, the EA would patch
This elides the fact that we already do handle them differently, otherwise we wouldn't need DatetimeTZBlock, CategoricalBlock, or ObjectValuesExtensionBlock. It strikes me as very unlikely that 3rd party EAs will Just Work without analogous patching (I hypothesize some of the many xfailed/skipped tests in tests/extensions/ are related to this) |
Can you detail what those bare-minimum requirements are? Is that the
I don't think that's a good comparison in all cases. The ObjectValuesExtensionBlock is a very minimal tweak to have a different public
What do you mean with this? As far as I know, everything indeed Just Works for the things that makes sense to work, without having the need to patch something from the ExtensionBlock. Would it make sense to summarize it very shortly that it would move some of the complexity dealing with a 1D/2D mixture from the BlockManager to the ExtensionBlock or ExtensionArray ? |
Right, but you're choosing to focus on the lease relevant of the three examples given.
Not sure how much clearer I can make it: if ExtensionBlock contained everything needed to make EAs work, then DatetimeTZBlock and CategoricalBlock would be unnecessary.
Close. I'd say a) the complexity is moved from ExtensionBlock (and subclasses) to the EAs, b) that there is a net decrease in complexity to be gained, and c) that the cost of complexity is convex, so getting it out of the blocks would be a win even if there were no net gains. |
... unless you want to argue that |
Checking the skipped test, you can see it has a clear "TODO". It's missing a required classmethod of the interface -> #26957
It's used by period, interval, and sparse is also using ExtensionBlock. For sure those are the lesser used types, not not irrelevant. They still have much more testing internally than our base extension tests provide. And I have always assumed that the long term goal of CategoricalBlock would be to become an actual ExtensionBlock, if not for all historical reasons that it has some deviating behaviour we need to preserve (datetime block is something different as there you have the mix of consolidatable tz naive and non-consolidatable tz aware).
What I meant with what I said is: for me, from my experience in GeoPandas, everything seems to work for now. And if something does not work out (which probably will come up once it is released), then I consider that a bug or missing feature of the interface and report / fix it in pandas. I once started with a custom GeometryBlock, but the EA interface was exactly introduced to not need meddling with the blocks. Long story short: to me, it is not "unlikely that 3rd party EAs will just work without analogous patching [of the blocks]", but rather that should be our goal. Idea: we could also keep our internal EAs (or part of them) 1D. If we want to give this possibility to external EAs, we need the 1D/2D compatibility code in ExtensionBlock anyway, so why not use it ourselves? |
I retract that example.
Thank you for clarifying, this makes much more sense to me. |
Updated to make changes less invasive where possible. If my guess in #26976 is correct, then we can have |
Did we reach anything resembling consensus on how to move forward here on the call? cc @jreback |
@jbrockmendel my expectations should be that
this IMHO is much less magical than a decorator, its conceptually like the mixin you have proposed but less invasive. I think that all of your aims can be achieved with this design (simplification and unification of block handling code for 1d/2d) It maybe that we need to augement for example Categorical and/ro Datetime w/tz with methods / overrides in order to lose the actual Categorical / DatetimeTZ blocks; this IMHO should show us some gaps (possibly in the public part of the EA interface, but more likely in the 'private' part, meaning _take2d and friends). |
One issue with the private methods like In terms of progressing, I'm having trouble reconciling the changes here with #26954. Personally, I'd like to see something like #26954 (with a class decorator being my desired approach) go in first, and then see what remains to be done here. But again, I'm having trouble seeing how these two should come together. |
Would it be helpful to do this on a branch so see how things play out? certainly amenable to trying approaches to see what works |
Right now I'm working on extending the proof of concept in #26954 to have Categorical use it to become 2D, will see what that breaks. I'll also update it to use the suggested non-
@jreback Tom is right, having
I'll adapt #26954 into that. |
Just FYI, I was playing with a class decorator the last 20 minutes and it's not 100% straightforward. I forgot when class decorators are run, so they only apply to the base class and not to subclasses. I think we'd need a metaclass. |
and why can't we add a base class attribute (to EA base) so that we have (also renaming) |
Discussed on the mailing list last week, cc @TomAugspurger @jreback @jorisvandenbossche
The central claim: if we allow EAs to take on shapes (N, 1) and (1, N), then we can get some major simplifications in core.internals.
For the proof of concept I've made the Datetimelike arrays accept these shapes and updated
DatetimeTZBlock
to see what simplifications are available. The_wrap_data
implementation would also work forCategorical
andPandasArray
.Three categories of simplifications are not in here yet and you'll have to use your imaginations
ndim
arg to Block constructorsshift
,__iter__
, ortake
aware of 2DnessAdded bonus: this fleshes out bugs (see #26864)