BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates #31326

Closed
jeffzi opened this issue Jan 26, 2020 · 5 comments · Fixed by #36299
Labels
Bug Index Related to the Index class or subclasses
@jeffzi
Contributor

jeffzi commented Jan 26, 2020

While working on #31312, I noticed that the behavior of Index.union() and Index.intersection() is inconsistent when one of the indexes contains duplicates.

import pandas as pd
import traceback

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

def test_setops(left, right):
    for op in ["intersection", "union"]:
        for sort in [None, False]:
            result = getattr(left, op)(right, sort=sort)
            print(f"sort = {sort}, {op}: {result} -> has duplicates: {result.has_duplicates}")

test_setops(a, b)
#> sort = None, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = False, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = None, union: Int64Index([1, 2, 2, 3, 3, 4], dtype='int64') -> has duplicates: True
#> sort = False, union: Int64Index([1, 2, 2, 3, 4], dtype='int64') -> has duplicates: True

arrays = [['a', 'b', 'b', 'c'],
          ['1', '2', '2', '1']]
a_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
arrays = [['c', 'c', 'd'],
          ['1', '1', '2']]
b_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

test_setops(a_mi, b_mi)
#> sort = None, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = None, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False

Created on 2020-01-26 by the reprexpy package

Problem description

  1. The behavior of intersection() and union() when duplicates are present is not consistent between Index and MultiIndex. Those operations return duplicates with Index but not with MultiIndex. The documentation doesn't clearly state what to expect.

  2. When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.

  3. If duplicates are present on only one side, Index.intersection() always returns duplicates.

Here are more succinct examples for points 2 and 3.

import pandas as pd

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

# expected [1, 2, 2, 3, 3, 3, 4]
a.union(b, sort=None) 
#> Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
a.union(b, sort=False) 
#> Int64Index([1, 2, 2, 3, 4], dtype='int64')

# expected [3]
a.intersection(b, sort=None)
#> Int64Index([3, 3], dtype='int64')
a.intersection(b, sort=False)
#> Int64Index([3, 3], dtype='int64')

# expected [3, 3]
b.intersection(a, sort=None)
#> Int64Index([3, 3], dtype='int64')
b.intersection(a, sort=False)
#> Int64Index([3, 3], dtype='int64')

Created on 2020-01-26 by the reprexpy package

Expected Output

For consistency and clarity, I think it would be better to enforce uniqueness in the index returned by set operations. Index.union() and Index.intersection() are the only ones allowing duplicates.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : ca3bfcc
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+212.gca3bfcc54
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0.post20200119
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0

@charlesdong1991
Member

interesting, looks like a bug

@charlesdong1991 charlesdong1991 added Bug Index Related to the Index class or subclasses labels Jan 26, 2020
@dsaxton
Member

dsaxton commented Jan 28, 2020

I think an argument could be made for following the definitions for union and intersection of multisets: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations, in which case the "expected" output would be

>>> a
Int64Index([1, 2, 2, 3], dtype='int64')
>>> b
Int64Index([3, 3, 4], dtype='int64')
>>> a.union(b)
Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
>>> a.intersection(b)
Int64Index([3], dtype='int64')
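
For reference, the multiset definitions can be sketched in plain Python with collections.Counter, whose `|` and `&` operators take the per-value maximum and minimum of the counts. This is only an illustration of the definitions, not how pandas implements its set operations; the helper names below are hypothetical.

```python
from collections import Counter

def multiset_union(left, right):
    # Per-value maximum of the two counts, returned in sorted order.
    counts = Counter(left) | Counter(right)
    return sorted(counts.elements())

def multiset_intersection(left, right):
    # Per-value minimum of the two counts, returned in sorted order.
    counts = Counter(left) & Counter(right)
    return sorted(counts.elements())

a = [1, 2, 2, 3]
b = [3, 3, 4]

union_result = multiset_union(a, b)
intersection_result = multiset_intersection(a, b)

print(union_result)         # [1, 2, 2, 3, 3, 4]
print(intersection_result)  # [3]
```

Under these definitions both operations are symmetric, so swapping a and b gives the same values.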

@jorisvandenbossche
Member

@dsaxton I think I agree with the example you show (and the general rules it shows)

However, with sort=False, there are some questions about how the result should be ordered. In principle it should not sort, but what about the duplicate values that might be present in the right index?

>>> pd.Index([1, 2, 3]).union(pd.Index([1, 2, 2, 4]), sort=False)
## hypothetical results
Int64Index([1, 2, 2, 3, 4], dtype='int64')
# or
Int64Index([1, 2, 3, 2, 4], dtype='int64')

(it seems the second is the "correct" output: first taking everything from the left index, and then appending additional elements from the right index. But, the first might actually be more useful in practice)
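
The two candidate orderings can be reproduced with a small pure-Python sketch. Both functions are hypothetical illustrations of the strategies described above, assuming multiset semantics for the counts; neither is taken from the pandas code base.

```python
from collections import Counter

def union_append(left, right):
    # Keep left as-is, then append only the occurrences in right
    # that left does not already account for.
    remaining = Counter(left)
    result = list(left)
    for value in right:
        if remaining[value] > 0:
            remaining[value] -= 1  # already covered by an element of left
        else:
            result.append(value)
    return result

def union_grouped(left, right):
    # Place each value at its first appearance and repeat it
    # max(count in left, count in right) times.
    counts = Counter(left) | Counter(right)
    result = []
    for value in dict.fromkeys([*left, *right]):
        result.extend([value] * counts[value])
    return result

left = [1, 2, 3]
right = [1, 2, 2, 4]

appended = union_append(left, right)
grouped = union_grouped(left, right)

print(grouped)   # [1, 2, 2, 3, 4]  (first hypothetical result)
print(appended)  # [1, 2, 3, 2, 4]  (second hypothetical result)
```

Both results contain the same values with the same counts; only the position of the extra 2 differs.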

@dsaxton
Member

dsaxton commented Jul 15, 2020

@jorisvandenbossche Yeah I think the first looks more useful generally, although that output can also be obtained by setting sort=True. It feels like with sort=False there's no guarantee on the order so whatever is most convenient to construct is what you get (in which case there's really no correct answer).

@jorisvandenbossche
Member

For this case, the first potential output is indeed the same as with sort=True, but you could easily construct another example where the initial indices are not sorted, for which that would no longer be the case.
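
A minimal sketch of such an example, using plain lists and a hypothetical helper that groups duplicates at the position where each value first appears (an assumed strategy for illustration, not pandas behavior):

```python
from collections import Counter

def grouped_union(left, right):
    # Each value keeps its first-appearance position and is repeated
    # max(count in left, count in right) times.
    counts = Counter(left) | Counter(right)
    ordered_values = dict.fromkeys([*left, *right])
    return [v for v in ordered_values for _ in range(counts[v])]

left = [3, 1, 2]
right = [2, 2, 1, 4]

unsorted_result = grouped_union(left, right)
sorted_result = sorted(unsorted_result)

print(unsorted_result)  # [3, 1, 2, 2, 4]
print(sorted_result)    # [1, 2, 2, 3, 4]
```

With unsorted inputs, the grouped output preserves the input order and so no longer coincides with the sorted output.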
