Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Index.equals is not commutative for string and category dtypes #60194

Closed
3 tasks done
DhruvBShetty opened this issue Nov 5, 2024 · 7 comments
Closed
3 tasks done
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@DhruvBShetty
Copy link
Contributor

DhruvBShetty commented Nov 5, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

index1=pd.Index(["apple","mango","orange","pear"],dtype="string")
index2=pd.Index(["apple","mango","orange","pear"],dtype="category")

assert index1.equals(index2)==index2.equals(index1)
Traceback (most recent call last):
  File "/home/pandas/Draft/rough1.py", line 63, in <module>
    assert index1.equals(index2)==index2.equals(index1)
AssertionError

Issue Description

pandas.Index.equals method is giving different results for string and category dtypes based on order

assert index2.equals(index1) gives True whereas
assert index1.equals(index2) gives False which makes these operations non-commutative

The first statement returns True because index 2 is pandas.core.indexes.category.CategoricalIndex which calls the .equals method in the sub class that overrides the Base Index class and implements the correct logic however

The second statement returns False because it enters this block

if isinstance(self._values, ExtensionArray):
# Dispatch to the ExtensionArray's .equals method.
if not isinstance(other, type(self)):
return False
earr = cast(ExtensionArray, self._data)
return earr.equals(other._data)

which converts the the string and categorical indexes into an ExtensionArray subclass and calls .equals again on those inputs but now it calls .equals method from the ExtensionArray class which does not ignore data types.

I can make a pull request and supplement with a few tests to resolve this issue.

Expected Behavior

assert index1.equals(index2)==index2.equals(index1) should pass
assert index1.equals(index2) should pass

Installed Versions

commit : 0691c5c
python : 3.10.8
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : 3.0.11
sphinx : 8.0.2
IPython : 8.27.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.9.0
html5lib : 1.1
hypothesis : 6.112.1
gcsfs : 2024.9.0post1
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.9.0
scipy : 1.14.1
sqlalchemy : 2.0.35
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@DhruvBShetty DhruvBShetty added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 5, 2024
@DhruvBShetty
Copy link
Contributor Author

take

@rhshadrach
Copy link
Member

Thanks for the report. It seems in more generality we need to adjust not checking the dtype for all EAs:

int64_idx = pd.Index([1, 2, 3], dtype='Int64')
uint64_idx = pd.Index([1, 2, 3], dtype='UInt64')
print(int64_idx.equals(uint64_idx))
# False

Replacing Int64 and UInt64 with int64 and uint64 gives True, as is documented in the docstring of Index.equals.

However, when doing so, we should be careful of any internal uses that could result in a change of behavior. Since this is a long-standing behavior, it's not clear to me if this should be deprecated.

cc @pandas-dev/pandas-core

@rhshadrach rhshadrach added ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 5, 2024
@WillAyd
Copy link
Member

WillAyd commented Nov 5, 2024

Definitely tricky - do we have any documented rules for type promotion in this method today?

@jorisvandenbossche
Copy link
Member

A general issue about this from a while ago about how strict the comparison should be:

As mentioned in that issue, there also seems to be a difference between Series and Index ...

import pandas as pd

index1 = pd.Index(["apple","mango","orange","pear"],dtype="string")
index2 = pd.Index(["apple","mango","orange","pear"],dtype="category")

# with Index, as reported in the top post
>>> index1.equals(index2)
False
>>> index2.equals(index1)
True

# but with Series we are strict about the dtype and so always return False
>>> pd.Series(index1).equals(pd.Series(index2))
False
>>> pd.Series(index2).equals(pd.Series(index1))
False

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 5, 2024

IMHO, the behavior of Index should match Series, i.e., the 2 objects should not be equals if the dtype is different. Here's another one to consider:

>>> import pandas as pd
>>> si = pd.Series([1,2,3], dtype="int")
>>> sf = pd.Series([1,2,3], dtype="float")
>>> si.equals(sf)
False
>>> sf.equals(si)
False
>>> ii = pd.Index(si)
>>> ii
Index([1, 2, 3], dtype='int64')
>>> iif = pd.Index(sf)
>>> iif
Index([1.0, 2.0, 3.0], dtype='float64')
>>> ii.equals(iif)
True
>>> iif.equals(ii)
np.True_

So the actual type of the result is different in this case, aside from the fact that using Index and Series returns different results.

@jbrockmendel
Copy link
Member

FWIW the main place we use this internally is to fastpath indexing lookups. That faspath ceases to be fast if casting is involved.

@rhshadrach
Copy link
Member

Thanks @jorisvandenbossche for pointing out that other issue. I think we should consolidate discussion there. Closing.

@DhruvBShetty DhruvBShetty removed their assignment Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants