-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Index.equals is not commutative for string and category dtypes #60194
Comments
take |
Thanks for the report. It seems in more generality we need to adjust not checking the dtype for all EAs: int64_idx = pd.Index([1, 2, 3], dtype='Int64')
uint64_idx = pd.Index([1, 2, 3], dtype='UInt64')
print(int64_idx.equals(uint64_idx))
# False Replacing However, when doing so, we should be careful of any internal uses that could result in a change of behavior. Since this is a long-standing behavior, it's not clear to me if this should be deprecated. cc @pandas-dev/pandas-core |
Definitely tricky - do we have any documented rules for type promotion in this method today? |
A general issue about this from a while ago about how strict the comparison should be: As mentioned in that issue, there also seems to be a difference between Series and Index ... import pandas as pd
index1 = pd.Index(["apple","mango","orange","pear"],dtype="string")
index2 = pd.Index(["apple","mango","orange","pear"],dtype="category")
# with Index, as reported in the top post
>>> index1.equals(index2)
False
>>> index2.equals(index1)
True
# but with Series we are strict about the dtype and so always return False
>>> pd.Series(index1).equals(pd.Series(index2))
False
>>> pd.Series(index2).equals(pd.Series(index1))
False |
IMHO, the behavior of >>> import pandas as pd
>>> si = pd.Series([1,2,3], dtype="int")
>>> sf = pd.Series([1,2,3], dtype="float")
>>> si.equals(sf)
False
>>> sf.equals(si)
False
>>> ii = pd.Index(si)
>>> ii
Index([1, 2, 3], dtype='int64')
>>> iif = pd.Index(sf)
>>> iif
Index([1.0, 2.0, 3.0], dtype='float64')
>>> ii.equals(iif)
True
>>> iif.equals(ii)
np.True_ So the actual type of the result is different in this case, aside from the fact that using |
FWIW the main place we use this internally is to fastpath indexing lookups. That faspath ceases to be fast if casting is involved. |
Thanks @jorisvandenbossche for pointing out that other issue. I think we should consolidate discussion there. Closing. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
pandas.Index.equals method is giving different results for string and category dtypes based on order
assert index2.equals(index1) gives True whereas
assert index1.equals(index2) gives False which makes these operations non-commutative
The first statement returns True because index 2 is pandas.core.indexes.category.CategoricalIndex which calls the .equals method in the sub class that overrides the Base Index class and implements the correct logic however
The second statement returns False because it enters this block
pandas/pandas/core/indexes/base.py
Lines 5637 to 5643 in bdc79c1
which converts the the string and categorical indexes into an ExtensionArray subclass and calls .equals again on those inputs but now it calls .equals method from the ExtensionArray class which does not ignore data types.
I can make a pull request and supplement with a few tests to resolve this issue.
Expected Behavior
assert index1.equals(index2)==index2.equals(index1) should pass
assert index1.equals(index2) should pass
Installed Versions
commit : 0691c5c
python : 3.10.8
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : 3.0.11
sphinx : 8.0.2
IPython : 8.27.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.9.0
html5lib : 1.1
hypothesis : 6.112.1
gcsfs : 2024.9.0post1
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.9.0
scipy : 1.14.1
sqlalchemy : 2.0.35
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: