-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix duplicates in intersection of multiindexes #36927
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirming that #31312 caused the regression
Looking at CI possibly related so actually not sure this is good as is |
� Conflicts: � doc/source/whatsnew/v1.1.4.rst � pandas/core/indexes/multi.py
You are right, this definitely is related. The problem is in I think this should be |
� Conflicts: � doc/source/whatsnew/v1.1.4.rst
# Conflicts: # pandas/core/indexes/base.py
With the change introduced by f4dc9f9 we have to handle both issues at once to avoid bugs at other places |
pandas/core/ops/__init__.py
Outdated
if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)): | ||
if len(cols) and not ( | ||
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique()) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test that fails without this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the builds are gone. I will run the tests in the evening to find the relevant tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_column_dups_operations(self): |
This one is crashing in line 255 because of memory issues. The condition is never fullfilled, so it blows up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the condition in master is never fulfilled or the condition in the PR?
btw, usually let the person who asked the question hit the "resolve conversation" button
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, did not know that. Will do that in the future.
The input df
has columns [A,A]
. When intersection is unique, then cols=[A]
, while left.columns=[A,A]
and right.columns=[A,A]
. Without the change adding the unique, cols.equals(left.columns) and cols.equals(right.columns)
will always be False. So we always return True
, which results in duplicating the columns for every passthrough, hence the memory explosion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed?
we now guaranteee that intersection returns unique columns, right so this should no longer be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is the problem. Is intersection is unique, cols.equals(left.columns) won't be True. This leads to a recursion in the test mentioned which blows up the memory, because we can not exit this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok pls add some comments to this effect then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prob calculate left_uniques and right_uniques and make them variables as a bit simpler to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, changed it and added a comment
� Conflicts: � doc/source/whatsnew/v1.2.0.rst � pandas/core/indexes/base.py
Had to change a check in merge, which relied on the wrong behavior of |
After taking another look, im thinking we might want to just disallow set ops when there are duplicates |
� Conflicts: � doc/source/whatsnew/v1.2.0.rst � pandas/tests/indexes/test_setops.py
@jbrockmendel You mean always returning |
I mean that set operations only make sense when dealing with something set-like, i.e. unique. So we could just raise |
If we do this, we should probably raise |
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -774,6 +774,7 @@ Other | |||
- Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError`` rather than a bare ``Exception`` (:issue:`35744`) | |||
- Bug in ``dir`` where ``dir(obj)`` wouldn't show attributes defined on the instance for pandas objects (:issue:`37173`) | |||
- Bug in :meth:`RangeIndex.difference` returning :class:`Int64Index` in some cases where it should return :class:`RangeIndex` (:issue:`38028`) | |||
- Bug in :meth:`Index.intersection` returning duplicates when at least one of the indexes had duplicates (:issue:`31326`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to remove this, if we are backporting there should be nothing added to doc/source/whatsnew/v1.2.0.rst (doesn't exist on 1.1.x branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, probably messed up the merge. Is gone now
this might be a regression but going to be work to backport. @phofl see if you can get all passing on master. |
Failing tests relied on the wrong behavior, adjusted them |
pandas/core/ops/__init__.py
Outdated
if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)): | ||
if len(cols) and not ( | ||
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique()) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed?
we now guaranteee that intersection returns unique columns, right so this should no longer be the case.
@@ -2858,7 +2859,7 @@ def _intersection(self, other, sort=False): | |||
indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0]) | |||
indexer = indexer[indexer != -1] | |||
|
|||
result = other.take(indexer)._values | |||
result = other.take(indexer).unique()._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an assert that L2867 is always unique (e.g. we are guaranteed uniqu here L2862 (and I think safe_sort guaranteess this), but let's be explicit (and add a comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should check if this is_unique first. it will be computed but is likely faster than always doing this (and copies yet again).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have a numpy array here, so we can not do result.is_unique. Is there a better way than using algos.unique on result and comparing shapes then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure you can do (measure perf)
result = Index(other.take(indexer), copy=False)
if not result.is_unique:
.....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, did not get that. Did it now, is in #38154
One open point is what we should do with the unique check, because is_unique may not exist. This is only relevant to performance |
lgtm. small point about perf above (but i don't think it matters to backport, can just do on master). |
thanks @phofl |
@meeseeksdev backport 1.1.x |
This comment has been minimized.
This comment has been minimized.
…es (#38155) Co-authored-by: patrick <61934744+phofl@users.noreply.github.com>
@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False): | |||
other, result_names = self._convert_can_do_setop(other) | |||
|
|||
if self.equals(other): | |||
if self.has_duplicates: | |||
return self.unique().rename(result_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl it looks like we do something slightly different in many of the FooIndex.intersection
methods in the self.equals(other)
case:
# Index (and IntervalIndex pending #38190)
if self.equals(other) and not self.has_duplicates:
return self._get_reconciled_name_object(other)
# datetimelike
if self.equals(other):
return self._get_reconciled_name_object(other)
# MultiIndex
if self.equals(other):
if self.has_duplicates:
return self.unique().rename(result_names)
return self._get_reconciled_name_object(other)
# PeriodIndex
if self.equals(other):
return self._get_reconciled_name_object(other)
# RangeIndex
if self.equals(other):
return self._get_reconciled_name_object(other)
The RangeIndex one is equivalent to the Index/Intervalindex bc it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there still an open case where I can be of help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like several of them are still either
if self.equals(other):
return self._get_reconciled_name_object(other)
which i think is wrong if it has duplicates, or
if self.equals(other) and not self.has_duplicates:
return self._get_reconciled_name_object(other)
which isnt handling the has-duplicates case like the others. can these all be identical?
Also if you really get on a roll, DatetimeTimedeltaMixin._intersection
has cases for len(self) == 0
and len(other)==0
that would be nice to standardize+test. RangeIndex._intersection has a similar check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look through them and try to standardize them as much as possible
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Seems like this was not introduced on purpose. Probably introduced in #31312