REF: implement NDFrame._from_mgr #52132

Merged
merged 19 commits into from
Jun 25, 2023

jbrockmendel
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

cc @phofl @jorisvandenbossche we've discussed this recently-ish but I've lost track of where.

There is a pretty nice performance improvement in ops with small sizes:

ser = pd.Series(range(3))

%timeit ser.copy()
17.5 µs ± 580 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  # <- main
11 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  # <- PR

Making axes required despite the fact that we don't use them ATM bc that opens up the option of refactoring axes out of the Managers xref #48126.
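To illustrate where a saving like the one timed above could come from, here is a toy sketch rather than pandas code. `ToyManager`, `ToyFrame`, and the input check in `__init__` are hypothetical stand-ins; the only point is that an alternate constructor built on `cls.__new__` skips whatever work `__init__` does, and that it can accept an `axes` argument without using it yet.

```python
class ToyManager:
    """Hypothetical stand-in for a pandas internal Manager."""

    def __init__(self, values):
        self.values = list(values)
        self.axes = [list(range(len(self.values)))]


class ToyFrame:
    def __init__(self, data):
        # The public constructor has to inspect its input on every call.
        if isinstance(data, ToyManager):
            mgr = data
        else:
            mgr = ToyManager(data)
        self._mgr = mgr

    @classmethod
    def _from_mgr(cls, mgr, axes):
        # Fast path: allocate the object and attach the manager without
        # running __init__. `axes` is accepted but unused in this sketch.
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj

    def copy(self):
        new_mgr = ToyManager(self._mgr.values)
        return type(self)._from_mgr(new_mgr, axes=new_mgr.axes)


frame = ToyFrame(range(3))
copied = frame.copy()
```

In this toy the saving is a single `isinstance` check; how much is saved in pandas itself is what the timings above measure.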

@mroeschke mroeschke added the Internals Related to non-user accessible pandas implementation label Mar 23, 2023
@jorisvandenbossche
Member

Compatibility wise, this can break existing subclass implementations since _from_mgr is a new method that they don't yet override. So if we want to do this, should we first introduce the method for some time but while setting it equal to _constructor, and then later change the implementation internally?

@jbrockmendel
Member Author

should we first introduce the method for some time but while setting it equal to _constructor, and then later change the implementation internally?

Sounds good!

@jbrockmendel
Member Author

Hmm, this is straightforward to do for _sliced_from_mgr and _expanddim_from_mgr, but doing it for _from_mgr itself breaks as long as _from_mgr is a classmethod instead of a regular method. Abstraction-wise it makes more sense as a classmethod, but I can try making it a regular method and see if that breaks anything else.

@jbrockmendel
Member Author

How long would you want to keep that for before making a "real" _from_mgr method? I'm hoping the answer isn't "until 3.0", since we need a real version to exist before we can deprecate passing Manager to the constructors.

@jorisvandenbossche
Member

How long would you want to keep that for before making a "real" _from_mgr method? I'm hoping the answer isn't "until 3.0"

Maybe yes? The problem here is that this is not easy to communicate via a deprecation (how do subclasses get notified that they need to change this, apart from expecting them to read release notes? Is there a way we can somehow signal this?)

since we need a real version to exist before we can deprecate passing Manager to the constructors

Would it be possible to somehow do this in parallel? If we have a way to signal to the constructor whether to raise the warning or not (e.g., when called via _constructor, don't raise a warning; when called via DataFrame(..) directly, do raise the warning)
One possible way would be to suppress that warning in _constructor, but that probably has too high an overhead / impact on performance?

@jbrockmendel
Member Author

how do subclasses get notified that they need to change this, apart from expecting them to read release notes? Is there a way we can somehow signal this?

Good question. My working assumption is that informing geopandas gets us 90+% of the way there. That last few % may need to be the "release notes" approach.

I guess we could implement a __init_subclass__ method and do some checking/warning there.
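A minimal sketch of what such an `__init_subclass__` check might look like, using toy classes rather than pandas ones. The rule it applies (warn when a subclass defines `_constructor` but not `_from_mgr`), the warning category, and the message are all assumptions made for illustration.

```python
import warnings


class Base:
    @property
    def _constructor(self):
        return type(self)

    @classmethod
    def _from_mgr(cls, mgr):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once, when the subclass body has been executed.
        overrides_constructor = "_constructor" in cls.__dict__
        overrides_from_mgr = "_from_mgr" in cls.__dict__
        if overrides_constructor and not overrides_from_mgr:
            warnings.warn(
                f"{cls.__name__} overrides _constructor but not _from_mgr",
                FutureWarning,
                stacklevel=2,
            )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class Sub(Base):
        @property
        def _constructor(self):
            return Sub


messages = [str(w.message) for w in caught]
```

Because the check looks only at `cls.__dict__`, it sees what the subclass itself defines; an override inherited from an intermediate parent class would not be noticed by this version.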

Would it be possible to somehow do this in parallel? If we have a way to signal to the constructor [...]

The natural way to signal the constructor would be a new keyword. I guess if we make it super-duper explicit that the keyword is not there to stay and users/libraries should never ever use it...

Any thoughts on the classmethod bit? Making _from_mgr a regular method mostly works, but makes it so we can't call it from other classmethods, in particular from_records.

@jorisvandenbossche
Member

Would something like this work, if we define our own DataFrame._constructor as:

@property
def _constructor(self):
    def constructor(*args, **kwargs):
        # first arg is a Manager: bypass DataFrame.__init__
        if args and isinstance(args[0], Manager):
            return self._from_mgr(*args, **kwargs)
        else:
            return DataFrame(*args, **kwargs)
    return constructor

That ensures that internally we don't call DataFrame(..) with a manager, and thus can deprecate that. Meanwhile, subclasses that override _constructor with their subclassed DataFrame will still typically call super init, and thus actually see the warning, which can signal to them that they need to do something (change _constructor and override _from_mgr).
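The effect described here can be tried out with a toy model. Everything below is a hypothetical stand-in (`ToyManager`, `ToyFrame`, `SubFrame`, the `head` method, the warning text); it is only meant to show that the base class stays warning-free while a subclass returning its own class from `_constructor` still hits the deprecated path.

```python
import warnings


class ToyManager:
    """Hypothetical stand-in for a pandas internal Manager."""

    def __init__(self, values):
        self.values = values


class ToyFrame:
    def __init__(self, data):
        if isinstance(data, ToyManager):
            # Stand-in for the proposed deprecation of DataFrame(mgr).
            warnings.warn(
                "passing a Manager is deprecated", DeprecationWarning, stacklevel=2
            )
            self._mgr = data
        else:
            self._mgr = ToyManager(list(data))

    @classmethod
    def _from_mgr(cls, mgr):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj

    @property
    def _constructor(self):
        def constructor(*args, **kwargs):
            if args and isinstance(args[0], ToyManager):
                return self._from_mgr(*args, **kwargs)
            return ToyFrame(*args, **kwargs)

        return constructor

    def head(self):
        # Internal code keeps going through _constructor during the transition.
        return self._constructor(ToyManager(self._mgr.values[:1]))


class SubFrame(ToyFrame):
    @property
    def _constructor(self):
        return SubFrame  # still reaches ToyFrame.__init__ with a Manager


def count_warnings(frame):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = frame.head()
    return result, len(caught)


base_result, base_warnings = count_warnings(ToyFrame([1, 2, 3]))
sub_result, sub_warnings = count_warnings(SubFrame([1, 2, 3]))
```

The closure is rebuilt on every access to `_constructor`, which is one place the extra per-call cost in this approach would come from.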

My working assumption is that informing geopandas gets us 90+% of the way there. That last few % may need to be the "release notes" approach.

Are you sure that subclasses that return a class from _constructor are not affected / wouldn't need to change anything to automatically get the correct behaviour?
I assume the future implementation of _from_mgr will "work" for such subclasses in the sense that it will return the correct class. But it still doesn't go through their own custom init, so it might not be fully setting up the subclass instance correctly?

Any thoughts on the classmethod bit? Making _from_mgr a regular method mostly works, but makes it so we can't call it from other classmethods, in particular from_records.

I didn't fully follow your explanation above. What is the reason that it breaks things if it is a class method?
Now, currently _constructor is a property, which means that the logic inside that can use self. So changing it to a class method would limit that.

@jbrockmendel
Member Author

Would something like this work, if we define our own DataFrame._constructor as:

That looks plausible. Will need to check how it affects performance.

I didn't fully follow your explanation above. What is the reason that it breaks things if it is a class method?
Now, currently _constructor is a property, which means that the logic inside that can use self. So changing it to a class method would limit that.

The most important/relevant place is in DataFrame.from_records (a classmethod) which currently ends with return cls(mgr). Unless _from_mgr is a classmethod, this cannot be changed to cls._from_mgr(mgr)
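The constraint is plain Python behaviour and can be reproduced without pandas. In this sketch the class and method names are hypothetical; it contrasts a classmethod-style alternate constructor with a regular-method one when both are called from another classmethod.

```python
class ToyFrame:
    def __init__(self, mgr):
        self._mgr = mgr

    def _from_mgr_method(self, mgr):
        # Regular method: needs an existing instance to be called on.
        obj = type(self).__new__(type(self))
        obj._mgr = mgr
        return obj

    @classmethod
    def _from_mgr_classmethod(cls, mgr):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj

    @classmethod
    def from_records_via_method(cls, records):
        mgr = list(records)
        # No instance exists yet, so `mgr` is bound to `self` and the
        # real `mgr` parameter is left unfilled.
        return cls._from_mgr_method(mgr)

    @classmethod
    def from_records_via_classmethod(cls, records):
        return cls._from_mgr_classmethod(list(records))


ok = ToyFrame.from_records_via_classmethod([(1, "a")])

try:
    ToyFrame.from_records_via_method([(1, "a")])
    failed = False
except TypeError:
    failed = True
```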

But it still doesn't go through their own custom init, so it might not be fully setting up the subclass instance correctly?

Good point.

@jorisvandenbossche
Member

The most important/relevant place is in DataFrame.from_records (a classmethod) which currently ends with return cls(mgr). Unless _from_mgr is a classmethod, this cannot be changed to cls._from_mgr(mgr)

Given that DataFrame.from_records already has this "problem" for subclasses (going through cls(..) and not _constructor, and so subclasses that have to customize it would have to override those class methods, which is actually exactly what geopandas does for from_dict) , I think it is fine that those class methods continue to work that way. So those could use the manual obj = cls.__new__(cls) and then call the non-classmethod mgr init?
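A rough sketch of that suggestion, with toy classes: the classmethod allocates the object with `cls.__new__(cls)` and then calls an instance-level initializer. The name `_init_from_mgr` and the `crs` attribute are hypothetical, chosen only to make the consequence for subclasses visible.

```python
class ToyFrame:
    def __init__(self, data):
        self._mgr = list(data)

    def _init_from_mgr(self, mgr):
        # Hypothetical name for the "non-classmethod mgr init".
        self._mgr = mgr

    @classmethod
    def from_records(cls, records):
        mgr = [tuple(r) for r in records]
        obj = cls.__new__(cls)  # allocate without running cls.__init__
        obj._init_from_mgr(mgr)  # an instance now exists to call this on
        return obj


class SubFrame(ToyFrame):
    def __init__(self, data, crs):
        super().__init__(data)
        self.crs = crs


sub = SubFrame.from_records([[1, 2]])
```

The result has the subclass type, but `SubFrame.__init__` never ran, so `crs` is not set. A subclass that needs such state would still have to override the classmethod, as described above.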

@jbrockmendel
Member Author

Given that DataFrame.from_records already has this "problem" for subclasses [...]

Fair enough.

Looking at the suggested _constructor implementation I want to double-check. This PR currently changes a bunch of uses of self._constructor(...) to self._from_mgr(...). Is your suggestion to revert those changes to continue using self._constructor for the interim?

@jorisvandenbossche
Member

Is your suggestion to revert those changes to continue using self._constructor for the interim?

Yes, otherwise that would be a breaking change for subclasses that rely on _constructor being called in those cases and that don't yet have a custom _from_mgr

@jbrockmendel
Member Author

@jorisvandenbossche thanks for the reminder to take another look at #52132 (comment) (BTW that reminder was in the form of a comment on the dev meeting notes. I hit a check-mark there thinking that was akin to a thumbs-up but it made the note disappear. Just in case there was any ambiguity: the intent was to convey the thumbs-up)

[Defining _constructor to dispatch to _from_mgr when a Manager is passed] ensures that internally we don't call DataFrame(..) with a manager, and thus can deprecate that [...]

This is clever, and I think would work. I'm wary bc 1) part of the point of this is performance, which this would hurt, even if just a little, and 2) that conflicts with deprecating non-class _constructor.

Are you sure that subclasses that return a class from _constructor are not affected / wouldn't need to change anything to automatically get the correct behaviour? [...] But it still doesn't go through their own custom init, so it might not be fully setting up the subclass instance correctly?

This is definitely possible. Such subclasses would need to override _from_mgr.

I didn't fully follow your explanation [of trouble regarding making _from_mgr a classmethod] above. What is the reason that it breaks things if it is a class method?

I think that topic got successfully cleared up, pls let me know if I'm wrong on this point?

@jreback
Contributor

jreback commented Apr 13, 2023

+1 let's just do this already

tbh we are making mountains of molehills for subclasses - yea we have some support but so what

let's just do this already

we need velocity not endless discussions about every point

@jorisvandenbossche
Member

@jbrockmendel thanks for giving this another look! The google doc comment resolving was correctly interpreted

This is clever, and I think would work. I'm wary bc 1) part of the point of this is performance, which this would hurt, even if just a little, and 2) that conflicts with deprecating non-class _constructor.

Regarding performance: it's true it requires an extra check of the input type, but that's something that is currently needed in __init__ as well, so at least it wouldn't regress (it just blocks improving it for some time).
Regarding the second point: I assume you can understand that I don't mind that conflict ;)

Are you sure that subclasses that return a class from _constructor are not affected ...

This is definitely possible. Such subclasses would need to override _from_mgr.

Yes, I assume so as well. I think the nice thing about my proposal above is that this gives subclasses the time to notice that they need to do this, without directly breaking them.
Assuming that a subclass overrides _constructor, they don't get our version that dispatches to _from_mgr (and so don't directly have the issue that our _from_mgr wouldn't work correctly for subclasses), but they still use their own version which will typically at some point call the parent init, i.e. calling DataFrame(*args, **kwargs). If we deprecate passing a manager object in this init, they will see those warnings, which can point them to the need to implement _from_mgr (the deprecation message can include some details specific towards subclass implementors).

One annoyance with this approach is that we would still be using _constructor internally (to allow for the adaptation period), which means that subclasses also need to do this if first arg is Manager: .. check in their own _constructor, or still have some custom handling for Managers in their init (to avoid passing it to DataFrame init).
(now, if they support multiple versions of pandas, they will need to keep some support for accepting managers in their init anyway, so that's maybe not too bad).

[on being a normal method vs classmethod] I think that topic got successfully cleared up, pls let me know if I'm wrong on this point?

Just to be sure: the current idea is to make those from_mgr methods normal methods then, and not class methods? (as in the current diff)

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 14, 2023
@jorisvandenbossche
Member

Another option, instead of short-term having a customized _constructor to decide whether to call the main constructor or the from_mgr one (#52132 (comment)), would be to do that in an already introduced _from_mgr. Something like:

    @classmethod
    def _from_mgr(cls, mgr: Manager, axes: list[Index]) -> Self:
        obj = cls.__new__(cls)
        NDFrame.__init__(obj, mgr)
        return obj

    def _constructor_from_mgr(self, mgr):
        # _constructor is a property, so compare what it returns with the class itself
        if self._constructor is DataFrame:
            # we are pandas.DataFrame (or a subclass that doesn't override _constructor)
            return self._from_mgr(mgr, axes=mgr.axes)
        else:
            return self._constructor(mgr)

This would also allow us to start actually deprecating DataFrame(mgr) already, and give subclasses a way to avoid the deprecation warning (either by overriding _constructor_from_mgr, or by not passing a manager object to super().__init__ in their constructor).

The two methods could be merged into one (just _from_mgr), but having one version that is a class method might be nice, since that also works for subclasses in case they don't need to customize the constructor in the future.
Not sure how useful that would be, though.
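To see how the two-method split would behave for different kinds of subclasses, here is a toy version of the dispatch. It is a sketch under assumptions: the classes are hypothetical, the `built_via` attribute exists only so the chosen path can be inspected, and the identity check compares against the class that the default `_constructor` returns.

```python
class ToyFrame:
    def __init__(self, data):
        self._mgr = data
        self.built_via = "__init__"

    @property
    def _constructor(self):
        return ToyFrame

    @classmethod
    def _from_mgr(cls, mgr):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        obj.built_via = "_from_mgr"
        return obj

    def _constructor_from_mgr(self, mgr):
        if self._constructor is ToyFrame:
            # Base class, or a subclass that kept the default _constructor.
            return self._from_mgr(mgr)
        # Subclass customised _constructor: keep honouring it.
        return self._constructor(mgr)


class PlainSub(ToyFrame):
    pass


class CustomSub(ToyFrame):
    @property
    def _constructor(self):
        return CustomSub


plain = PlainSub([1])._constructor_from_mgr([2])
custom = CustomSub([1])._constructor_from_mgr([2])
```

In this toy, the subclass that leaves `_constructor` alone takes the fast path, while the one that overrides it keeps going through its own `__init__` and so would keep seeing any deprecation warning raised there.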

@jorisvandenbossche
Member

jorisvandenbossche commented May 15, 2023

The question is also, for all possible options, what would a typical future implementation of a subclass' _from_mgr look like?

Maybe in many cases, you might not actually want to customize something (if you rely on _metadata/__finalize__ to pass through information), in which case you might just want to inherit the pandas one (assuming this sets the class correctly).
In that case, you could override _from_mgr to also do what the upstream pandas version does (using the naming of methods from my last post):

class MySubclassedDataFrame(pd.DataFrame):
    ...

    def _constructor_from_mgr(self, mgr):
        # this will call the class method defined in the parent pandas.DataFrame
        return self._from_mgr(mgr)

Does that look correct? (in the future, overriding it that way would be redundant I assume, but it would enable the subclass to make the transition / support multiple pandas versions without getting warnings)

Or you could add a check in your MySubclassedDataFrame.__init__ for a manager as first argument, and in that case call this _from_mgr.

@jbrockmendel
Member Author

I'm leaning towards a path similar to what @jorisvandenbossche suggested here.

@jorisvandenbossche
Member

Thanks for the update!

Making axes required despite the fact that we don't use them ATM bc that opens up the option of refactoring axes out of the Managers

Coming back to this: would it also be sufficient to just have **kwargs in _constructor_from_mgr and pass them through to _from_mgr (and document that subclasses should also do this when overriding that method)?
That would avoid already having to add axes=mgr.axes everywhere (slightly complicating this call everywhere while it is not yet needed), while still giving us the future-proofing to add this keyword later on.

(this assumes that _constructor_from_mgr is only used internally, and not by subclasses in methods they override, but I think that should be a correct assumption)
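A small sketch of the `**kwargs` pass-through idea, using a toy class. The optional `axes` parameter on `_from_mgr` and the `_axes_seen` attribute are assumptions added so the forwarding can be observed; they are not taken from the PR.

```python
class ToyFrame:
    @classmethod
    def _from_mgr(cls, mgr, axes=None):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        obj._axes_seen = axes  # recorded only so the sketch can be inspected
        return obj

    def _constructor_from_mgr(self, mgr, **kwargs):
        # Forward whatever keywords the caller supplied; today that is none.
        return self._from_mgr(mgr, **kwargs)


frame = ToyFrame._from_mgr([1, 2])
today = frame._constructor_from_mgr([3])
later = frame._constructor_from_mgr([3], axes=["index"])
```

Call sites stay short for now, and a keyword introduced later reaches `_from_mgr` without changing the signature of `_constructor_from_mgr`. The cost is that a misspelled keyword only fails once it reaches `_from_mgr`.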

    # we are pandas.DataFrame (or a subclass that doesn't override _constructor)
    return self._from_mgr(mgr, axes=axes)
else:
    assert axes is mgr.axes
Member

Is this assert needed here? (it's also not done in _from_mgr)

Member Author

Not strictly. ATM _from_mgr is documented as requiring them to match, so this assertion seemed like an easy way of making it required along this path too. could remove the assertion and document the requirement in the docstring

# with self._constructor_sliced._from_mgr(...)
# once downstream packages (geopandas) have had a chance to implement
# their own overrides.
return self._constructor_sliced(mgr)
Member

Shouldn't this be something like Series._from_mgr(mgr) or self._constructor_sliced._from_mgr(mgr) ?
Because now this is calling the same as the fall-back in _constructor_sliced_from_mgr, and thus the if block there is not doing anything different.

Member Author

Good catch. i think this is leftover from a previous round of editing. will update.

in the event that axes are refactored out of the Manager objects.
"""
obj = cls.__new__(cls)
NDFrame.__init__(obj, mgr)
Member

We will need to discuss if this should make a shallow copy of the input mgr for CoW (cc @phofl). If we keep this consistent with the current behaviour of DataFrame(mgr), we should add this:

if using_copy_on_write():
    mgr = mgr.copy(deep=False)

We had some discussion on the PR that introduced this to what extent this is actually needed: #51239 (comment)

Member Author

Since this should only be used internally I think we should be safe without the extra copy.

Member

Since this should only be used internally I think we should be safe without the extra copy.

We could also internally have cases that require it. Essentially, whenever we would do something that boils down to return self._constructor_from_mgr(self.mgr) (i.e. a code path that simplifies to that) in some method, this is required.

That should probably be considered a bug in our implementation, but in #51239 (comment) we went for the safe route for now.

If we want to change that, maybe better to do that in a separate PR?
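The concern can be illustrated with a deliberately simplified model of reference tracking. This is a toy under stated assumptions, not pandas' actual Copy-on-Write bookkeeping: `ToyBlock`, `ToyManager`, `copy_shallow`, and the shared `refs` list are all hypothetical. A block copies its values before a write only when another block is registered as viewing the same values.

```python
class ToyBlock:
    def __init__(self, values, refs=None):
        self.values = values
        # All blocks viewing the same values share one `refs` list.
        self.refs = refs if refs is not None else []
        self.refs.append(self)

    def setitem(self, pos, value):
        if len(self.refs) > 1:
            # Another block views these values: copy before writing.
            self.refs.remove(self)
            self.values = list(self.values)
            self.refs = [self]
        self.values[pos] = value


class ToyManager:
    def __init__(self, block):
        self.block = block

    def copy_shallow(self):
        # New block object, same values, registered in the shared refs.
        return ToyManager(ToyBlock(self.block.values, refs=self.block.refs))


class ToyFrame:
    @classmethod
    def _from_mgr(cls, mgr):
        obj = cls.__new__(cls)
        obj._mgr = mgr
        return obj


parent = ToyFrame._from_mgr(ToyManager(ToyBlock([1, 2, 3])))

# Same manager reused: nothing records the second owner, so the write leaks.
shared = ToyFrame._from_mgr(parent._mgr)
shared._mgr.block.setitem(0, 99)
leaked = parent._mgr.block.values[0]

# Shallow copy first: the write triggers a copy and the parent is untouched.
parent2 = ToyFrame._from_mgr(ToyManager(ToyBlock([1, 2, 3])))
child = ToyFrame._from_mgr(parent2._mgr.copy_shallow())
child._mgr.block.setitem(0, 99)
kept = parent2._mgr.block.values[0]
```

In this model, handing the very same manager to a second object is the case that goes wrong, which corresponds to a code path that boils down to `self._constructor_from_mgr(self.mgr)`.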

Member Author

If we want to change that, maybe better to do that in a separate PR?

Yah I'd like to punt on this for the time being.

Member

I think we are in a state now that we should be able to rely on doing this on the manager level and not here. I think I am ok with not doing a shallow copy here

@jbrockmendel
Member Author

Coming back to this: would it also be sufficient to just have **kwargs in _constructor_from_mgr and pass them through to _from_mgr (and document that subclasses should also do this when overriding that method)?

I think that would work. In general kwargs is a pattern I try to avoid, but I don't care enough to make it a sticking point if you feel strongly about it.

@jorisvandenbossche
Member

I don't feel super strong about it, but I mostly don't like seeing axes=mgr.axes being passed everywhere while that is essentially never used. It can easily give the impression to a contributor that it is used, leading to trying to pass something else intentionally / being confused about this not doing anything.

@jbrockmendel
Member Author

I don't feel super strong about it, but I mostly don't like seeing axes=mgr.axes being passed everywhere while that is essentially never used. It can easily give the impression to a contributor that it is used, leading to trying to pass something else intentionally / being confused about this not doing anything.

That's reasonable. So both the axes=mgr.axes and the **kwargs options share the bad silently-unused characteristic. How about I remove the keyword from this PR and then we can add it later when it is actually needed? As long as there isn't a released version where there is a risk of downstream authors using it, that should be safe.

@jbrockmendel
Member Author

How about I remove the keyword from this PR and then we can add it later when it is actually needed?

This turns out to be something of a hassle. Any chance we can all be OK with the current implementation?

Member

@phofl phofl left a comment

lgtm.

Personally I also prefer not using kwargs, I'd rather have one specific keyword, even if it's not doing anything.

I'd be ok with merging this

@phofl phofl added this to the 2.1 milestone Jun 25, 2023
@phofl phofl removed the Stale label Jun 25, 2023
@phofl phofl merged commit 21ff2fb into pandas-dev:main Jun 25, 2023
@phofl
Member

phofl commented Jun 25, 2023

thx @jbrockmendel

Labels
Internals Related to non-user accessible pandas implementation

5 participants