Skip to content

CLN/DOC: _selected_obj vs _obj_with_exclusions #46944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jbrockmendel opened this issue May 4, 2022 · 10 comments
Open

CLN/DOC: _selected_obj vs _obj_with_exclusions #46944

jbrockmendel opened this issue May 4, 2022 · 10 comments

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented May 4, 2022

In the groupby code, when do/should we use self._obj_with_exclusions vs self._selected_obj?

@jbrockmendel jbrockmendel added Docs Needs Triage Issue that has not been reviewed by a pandas team member Groupby labels May 4, 2022
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 28, 2022
@mroeschke
Copy link
Member

cc @rhshadrach @jreback

@rhshadrach
Copy link
Member

Current behavior

self._selected_obj:

  • If self.obj is a Series, self._selected_obj is always equal to self.obj.
  • Otherwise, if self._selection is not None, self._selected_obj is self._obj subset to self._selection.
  • Otherwise, if self._group_selection is not None, self._selected_obj is self.obj subset to self._group_selection. self._group_selection is only set within the self._group_selection_context (and even then, only when certain conditions hold).
  • Otherwise, self._selected_obj is self.obj

self._obj_with_exclusions:

  • If self.obj is a Series, self._obj_with_exclusions is always equal to self.obj.
  • Otherwise, if self._selection is not None, self._obj_with_exclusions is self.obj subset to self._selection.
  • Otherwise if self.exclusions is non-empty, self._obj_with_exclusions is self.obj with exclusions columns dropped
  • Otherwise self._obj_with_exclusions is self.obj

I don't have a good reason why both of these objects should exist, so I don't think I can say when to use one over the other. I have similar thoughts about _group_selection_context. I'd like to see one of them (and the context) removed. Looking at usage in the groupby code, I see 58 instances of _selected_obj and 27 instances of _obj_with_exclusions. As such, I would try to remove uses of _obj_with exclusions and add the self._group_selection_context wherever _selected_obj is used without one.

@rhshadrach
Copy link
Member

To the _obj_with_exclusions method, I added

from pandas.core.groupby.groupby import BaseGroupBy
if isinstance(self, BaseGroupBy):
    from pandas._testing import assert_equal
    with self._group_selection_context():
        assert_equal(result, self._selected_obj)

where result is the result of this method and only 106 of the 8414 groupby tests failed. There were 32 distinct tests (ignoring parameterizations).

@jbrockmendel
Copy link
Member Author

As such, I would try to remove uses of _obj_with exclusions and add the self._group_selection_context wherever _selected_obj is used without one.

My intuition is to go the other direction. The statefulness of _group_selection_context makes things way more difficult to reason about. I'd want to move away from that. Without that context, the two are almost identical.

AFAICT we can eliminate the with self._group_selection_context in the following places without breaking any tests: _agg_general, ngroup, cumcount.

For the other usages of _group_selection_context, using obj_with_exclusions instead of selected_obj lets us remove the context without breaking any tests.

@rhshadrach
Copy link
Member

Once #50846 is merged, any place in the tests where a groupby object is created, obj_with_exclusions is the same as selected_obj with the context - this is shown in #50878.

Until some behavior is deprecated (especially in apply), we won't be able to remove selected_obj. I agree we should be also removing the context - my thought is to remove obj_with_exclusions and then invert the _group_selection_context (with perhaps renaming it). That is, when it is not activated we exclude the groupings and when it is activated we include them. This way we:

  • Get to remove one of selected_obj or obj_with_exclusions
  • Remove the context manager where possible, and then start deprecating where it's used

@jbrockmendel
Copy link
Member Author

It looks like with group_selection_context enabled, _selected_obj (with the fix in #50846) is equivalent to _obj_with_exclusions except for cases where self._selection is a scalar. We have many tests where _selection is not a scalar, but only one that accesses _selected_obj: test_resample_groupby_agg_listlike. In that test, returning _obj_with_exclusions results in a RecursionError. Patching Apply.agg_list_like to check for scalar _selection fixes that test.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 31, 2023

Rethinking this, I'd prefer to have both obj_with_exclusions and selected_obj with no context manager than having selected_obj with a context manager - it is much easier to tell what's going on. I agree with replacing selected_obj when used in the context manager with obj_with_exclusions.

@jbrockmendel
Copy link
Member Author

_selected_obj (with the fix in #50846) is equivalent to _obj_with_exclusions except for cases where self._selection is a scalar

After some more sleuthing, in order for the two to not-match: In addition to self._selection being a scalar, self.obj needs to not be a Series, and self._selection needs to only appear once in self.obj.columns. I have convinced myself that a) this is impossible for SeriesGroupBy/DataFrameGroupBy, b) that impossibility can be better documented, c) Resample's _gotitem should be refactored to make this easier to reason about.

For SeriesGroupBy and DataFrameGroupBy, the only way _selection is non-None is when it is passed to the constructor, which only happens when it is called by _gotitem. DataFrameGroupBy._gotitem should be getting subsets that are produced either by self.obj[selection] (i.e. via __getitem__) or by iloc[:, i] (from apply-like column-by-column operation). Either way, obj_with_exclusions should always be equivalent to selected_obj-with-group_selection_context.

@jbrockmendel jbrockmendel mentioned this issue Mar 17, 2023
5 tasks
@jbrockmendel
Copy link
Member Author

@rhshadrach are we at a point where we can remove one of these?

@rhshadrach
Copy link
Member

rhshadrach commented Oct 4, 2023

Here is an inventory of usages of _selected_obj that still exist:

  • SeriesGroupBy: this is due to CLN: Replace _obj_with_exclusions with _select_obj for Series #49136 when I was trying to remove obj_with_exclusions before being convinced to go the other way. This will be trivial to revert.
  • filters: In a filter, we do not want to exclude the grouping columns, so we want to use _selected_obj in perpetuity.
  • plotting: This needs to be investigated more.
  • iteration: It's not clear to me that it's right to include grouping columns when iterating over a groupby.
  • apply: Deprecated, will be removed in 3.0.
  • window: This needs to be investigated more.
  • transforms: transforms that use _python_apply_general seem to use _selected_obj. I suspect that either this isn't necessary or there are even bugs lurking here due to its use, but I need to investigate more.

For filters, plotting, and iteration (and others), I wonder if we can instead subset self.obj when getitem is used (e.g. groupby('a')[['b', 'c']]) and use self.obj instead of self._selected_obj.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants