BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates #31326
Comments
Interesting, looks like a bug.
I think an argument could be made for following the definitions for union and intersection of multisets: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations, in which case the "expected" output would be:
>>> a
Int64Index([1, 2, 2, 3], dtype='int64')
>>> b
Int64Index([3, 3, 4], dtype='int64')
>>> a.union(b)
Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
>>> a.intersection(b)
Int64Index([3], dtype='int64')
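For reference, here is a minimal sketch of the multiset definitions using only the standard library's `collections.Counter`, whose `|` and `&` operators take the maximum and minimum of the per-value counts. The helper names `multiset_union` and `multiset_intersection` are hypothetical and this is not how pandas implements anything; it only shows what the rules would give for the `a` and `b` above.

```python
from collections import Counter


def multiset_union(left, right):
    """Each value appears as often as its larger count on either side."""
    return sorted((Counter(left) | Counter(right)).elements())


def multiset_intersection(left, right):
    """Each value appears as often as its smaller count on either side."""
    return sorted((Counter(left) & Counter(right)).elements())


a = [1, 2, 2, 3]
b = [3, 3, 4]

print(multiset_union(a, b))         # [1, 2, 2, 3, 3, 4]
print(multiset_intersection(a, b))  # [3]
```

The results are sorted here only to make them easy to compare; the multiset definitions themselves say nothing about order.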
@dsaxton I think I agree with the example you show (and the general rules it shows). However, consider a case like this:
>>> pd.Index([1, 2, 3]).union(pd.Index([1, 2, 2, 4]), sort=False)
## hypothetical results
Int64Index([1, 2, 2, 3, 4], dtype='int64')
# or
Int64Index([1, 2, 3, 2, 4], dtype='int64')
(It seems the second is the "correct" output: first taking everything from the left index, and then appending additional elements from the right index. But the first might actually be more useful in practice.)
@jorisvandenbossche Yeah, I think the first looks more useful generally, although that output can also be obtained by setting sort=True. It feels like with sort=False there's no guarantee on the order so whatever is most convenient to construct is what you get (in which case there's really no correct answer).
For this case, the first potential output is indeed the same as with sort=True, but you could easily construct another example where the initial indices are not sorted, for which that would no longer be the case.
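To make the two candidate orderings concrete, here is a rough pure-Python sketch. The function names `union_append` and `union_grouped` are made up for illustration, both assume the multiset rule (each value kept as often as its larger count), and neither is meant to describe what pandas actually does with `sort=False`.

```python
from collections import Counter


def union_append(left, right):
    """Keep left as-is, then append right values that left does not already cover."""
    remaining = Counter(left)
    result = list(left)
    for value in right:
        if remaining[value] > 0:
            remaining[value] -= 1  # this occurrence is already present in left
        else:
            result.append(value)
    return result


def union_grouped(left, right):
    """Keep first-appearance order, but place repeated values next to each other."""
    counts = Counter(left) | Counter(right)
    order = dict.fromkeys(list(left) + list(right))  # ordered, de-duplicated values
    return [value for value in order for _ in range(counts[value])]


print(union_append([1, 2, 3], [1, 2, 2, 4]))   # [1, 2, 3, 2, 4]
print(union_grouped([1, 2, 3], [1, 2, 2, 4]))  # [1, 2, 2, 3, 4]

# With unsorted inputs, the grouped result no longer matches a sorted one.
print(union_append([3, 1, 2], [1, 1, 4]))      # [3, 1, 2, 1, 4]
print(union_grouped([3, 1, 2], [1, 1, 4]))     # [3, 1, 1, 2, 4]
```

For the sorted inputs the grouped variant happens to coincide with the sorted result, while for `[3, 1, 2]` and `[1, 1, 4]` a sorted union would be `[1, 1, 2, 3, 4]`, which differs from both variants.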
While working on #31312, I noticed that the behavior of Index.union() and Index.intersection() is inconsistent when there are duplicates in one of the indexes.
Created on 2020-01-26 by the reprexpy package
Problem description
1. The behavior of intersection() and union() when duplicates are present is not consistent between Index and MultiIndex. Those operations return duplicates with Index but not with MultiIndex. The documentation doesn't clearly state what to expect.
2. When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.
3. If duplicates are present on only one side, Index.intersection() always returns duplicates.
Here are more succinct examples for 2. and 3.
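A minimal sketch of how the three points could be checked, assuming pandas is installed. The index values and the helper `result_sizes` are illustrative assumptions, not the original reprex, and no expected output is shown because the results may differ between pandas versions.

```python
import pandas as pd


def result_sizes(left, right):
    """Return the length of each set-operation result for a pair of indexes."""
    return {
        "union(sort=None)": len(left.union(right, sort=None)),
        "union(sort=False)": len(left.union(right, sort=False)),
        "intersection": len(left.intersection(right)),
        "intersection (swapped)": len(right.intersection(left)),
    }


# Duplicates (the value 2) are present on the left side only.
flat_left = pd.Index([1, 2, 2, 3])
flat_right = pd.Index([2, 3, 4])

# The same pattern expressed as a two-level MultiIndex.
multi_left = pd.MultiIndex.from_arrays([[1, 2, 2, 3], ["a", "b", "b", "c"]])
multi_right = pd.MultiIndex.from_arrays([[2, 3, 4], ["b", "c", "d"]])

flat_sizes = result_sizes(flat_left, flat_right)
multi_sizes = result_sizes(multi_left, multi_right)

for name in flat_sizes:
    print(f"{name:24} Index={flat_sizes[name]} MultiIndex={multi_sizes[name]}")
```

If the sizes differ between the two columns, or between the two union rows, that would correspond to points 1 and 2; a swapped intersection with a different size would correspond to point 3.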
Expected Output
For consistency and clarity, I think it would be better to enforce uniqueness in the index returned by logical operations. Index.union() and Index.intersection() are the only ones allowing duplicates.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : ca3bfcc
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.0rc0+212.gca3bfcc54
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0.post20200119
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0