BUG: Intersection of multiindex returns duplicates #36915

Closed
egorvavilov opened this issue Oct 6, 2020 · 2 comments · Fixed by #36927
Labels
MultiIndex, Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@egorvavilov

import pandas as pd

arraysA = [['val1', 'val1', 'val1', 'val1'], ['val2', 'val2', 'val2', 'val2']]
arraysB = [['val1'], ['val2']]

# MultiIndex([('val1', 'val2'),
#             ('val1', 'val2'),
#             ('val1', 'val2'),
#             ('val1', 'val2')],
#            names=['idx1', 'idx2'])
indexA = pd.MultiIndex.from_arrays(arraysA, names=('idx1', 'idx2'))
# MultiIndex([('val1', 'val2')],
#            names=['idx1', 'idx2'])
indexB = pd.MultiIndex.from_arrays(arraysB, names=('idx1', 'idx2'))

res = indexA.intersection(indexB)

Problem description

The intersection of two MultiIndexes should not contain duplicates. By definition, the intersection of two sets A and B, denoted A ∩ B, is the set of all elements of A that also belong to B (or, equivalently, all elements of B that also belong to A).
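To make the set definition concrete, here is a minimal sketch that computes the intersection with Python's built-in `set` instead of `MultiIndex.intersection`. The names `index_a`, `index_b`, and `set_result` are only illustrative; the data is the same as in the example above.

```python
import pandas as pd

# Same data as in the report: four identical rows versus a single row.
index_a = pd.MultiIndex.from_arrays(
    [["val1"] * 4, ["val2"] * 4], names=("idx1", "idx2")
)
index_b = pd.MultiIndex.from_arrays([["val1"], ["val2"]], names=("idx1", "idx2"))

# Iterating a MultiIndex yields tuples, so built-in sets give the textbook
# A ∩ B: every element appears at most once.
set_result = set(index_a) & set(index_b)
print(set_result)  # {('val1', 'val2')}
```

This only illustrates the expected semantics; a plain `set` drops the level names and ordering that a MultiIndex carries.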

Version 1.0.3 gave the correct output; versions 1.1.2 and 1.1.3 do not.

Output: Pandas 1.1.3

MultiIndex([('val1', 'val2'),
            ('val1', 'val2'),
            ('val1', 'val2'),
            ('val1', 'val2')],
           names=['idx1', 'idx2'])

Output: Pandas 1.0.3

MultiIndex([('val1', 'val2')],
           names=['idx1', 'idx2'])

Expected Output

I assume the correct output should match what pandas 1.0.3 returns:
MultiIndex([('val1', 'val2')], names=['idx1', 'idx2'])
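One possible workaround on affected versions is to deduplicate the result explicitly with `.unique()`. This is a sketch and not something proposed in the thread; it assumes that dropping repeated entries is all that is needed, and the names `expected` and `deduped` are illustrative.

```python
import pandas as pd

index_a = pd.MultiIndex.from_arrays(
    [["val1"] * 4, ["val2"] * 4], names=("idx1", "idx2")
)
index_b = pd.MultiIndex.from_arrays([["val1"], ["val2"]], names=("idx1", "idx2"))

# The output the report expects (what pandas 1.0.3 returned).
expected = pd.MultiIndex.from_tuples([("val1", "val2")], names=("idx1", "idx2"))

# .unique() removes repeated entries; it is a no-op on versions where
# intersection already returns a duplicate-free result.
deduped = index_a.intersection(index_b).unique()

assert deduped.equals(expected)
assert list(deduped.names) == ["idx1", "idx2"]
```

Because `.unique()` leaves an already unique index unchanged, the same call should be safe to keep after upgrading.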

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.3
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 46.0.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.0
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None
None

egorvavilov added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Oct 6, 2020
@phofl
Member

phofl commented Oct 6, 2020

Hi, thanks for your report. This is related to #31326. Interestingly, the MultiIndex behavior is now broken too.

phofl added the MultiIndex and Regression (functionality that used to work in a prior pandas version) labels and removed the Needs Triage and Bug labels on Oct 6, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Nov 27, 2020
@simonjayhawkins
Member

Version 1.0.3 gave correct output, versions 1.1.2, 1.1.3 - not.

first bad commit: [c2f3ce3] BUG: MultiIndex intersection with sort=False does not preserve order (#31312) cc @jeffzi
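For illustration, an intersection can both preserve the order of the left operand (the goal named in that commit's title) and stay free of duplicates. The sketch below is plain Python over tuples; `ordered_unique_intersection` is a hypothetical helper, not the code from #31312 or from the fix in #36927.

```python
import pandas as pd


def ordered_unique_intersection(left, right):
    """Hypothetical helper: keep entries of `left` that occur in `right`,
    in `left`'s order, emitting each distinct entry only once."""
    right_values = set(right)  # tuples are hashable, so membership is O(1)
    seen = set()
    kept = []
    for item in left:
        if item in right_values and item not in seen:
            seen.add(item)
            kept.append(item)
    return pd.MultiIndex.from_tuples(kept, names=left.names)


left = pd.MultiIndex.from_tuples(
    [("b", 2), ("a", 1), ("b", 2), ("c", 3)], names=("idx1", "idx2")
)
right = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=("idx1", "idx2"))

result = ordered_unique_intersection(left, right)
print(list(result))  # [('b', 2), ('a', 1)]
```

The `seen` set is what prevents a repeated left-hand entry from being emitted more than once while the loop order keeps the left-hand ordering.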

3 participants