Skip to content

DataFrame.equals false negatives with dtype=object #8437

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ischwabacher opened this issue Oct 1, 2014 · 9 comments
Closed

DataFrame.equals false negatives with dtype=object #8437

ischwabacher opened this issue Oct 1, 2014 · 9 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@ischwabacher
Copy link
Contributor

I still don't have a good enough grasp on the internals to go diving for this one quickly, but here's the observed behavior:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: temp = [pd.Series([False, np.nan]), pd.Series([False, np.nan]), pd.Series(index=range(2)), pd.Series(index=range(2)), pd.Series(index=range(2)), pd.Series(index=range(2))]

In [4]: temp[2][:-1] = temp[3][:-1] = temp[4][0] = temp[5][0] = False

In [5]: [[x.equals(y) for y in temp] for x in temp]
Out[5]: 
[[True, True, False, False, False, False],
 [True, True, False, False, False, False],
 [False, False, True, False, False, False],
 [False, False, False, True, False, False],
 [False, False, False, False, True, True],
 [False, False, False, False, True, True]]

Here the 4x4 square in the upper left should be True. (temp[4:6] are different because inserting False into a float64 array or vice-versa coerces False into 0.0; this is inconvenient when attempting to use the array for boolean indexing.)

@ischwabacher
Copy link
Contributor Author

I have an example that isn't fixed by replacing x.equals(y) with x.isnull().equals(y.isnull()) and x[~x.isnull()].equals(y[~y.isnull()]) and needs to be taken all the way to x.isnull().equals(y.isnull()) and (x[~x.isnull()] == y[~y.isnull()]).all(), but I haven't gotten it down to a test-case-sized example yet.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2014

see #5283

I think that here: https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1058

should use pandas.core.common.array_equivalent rather np.array_equal which is wrong (note that is called here ONLY for non-numeric blocks). array_equivalent handles the nan location comparisons correctly.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2014

cc @unutbu

@jreback
Copy link
Contributor

jreback commented Oct 1, 2014

actually their is a comment here that I don't recall!

I just want to record this here, since my thoughts on this matter have been so muddled:

np.array_equal treats NaNs in object arrays as equal:

>>> np.array_equal(np.array([np.nan], dtype='O'), np.array([np.nan], dtype='O'))
True
so object arrays do not need to be special-cased, ObjectBlock.equals does not need to be defined, and ObjectBlock.equals can defer to Block.equals. Only FloatBlocks and ComplexBlocks need to define their own equals method, since, for example,

>>> np.array_equal(np.array([np.nan], dtype='<f4'), np.array([np.nan], dtype='<f4'))
False
>>> np.array_equal(np.array([np.nan], dtype='complex128'), np.array([np.nan], dtype='complex128'))
False
Thankfully, np.array_equal also treats NaTs as equal:

>>> np.array_equal(np.array([np.datetime64('nat')]), np.array([np.datetime64('nat')]))
True
NaNs can not exist in bool arrays:

>>> np.array([np.nan], dtype='bool')
array([ True], dtype=bool)
or can NaNs exist in timedeltas:

>>> np.timedelta64(np.nan,'D')
ValueError: Could not convert object to NumPy timedelta
so np.array_equal should work as usual for these kinds of values.

@ischwabacher
Copy link
Contributor Author

Figured out the problem with the other (unposted) cases-- one of them was chained assignment (arrrg!) and the other was comparing an object series with a bool series. But the test in the OP still fails.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Oct 2, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 2, 2014
@jreback
Copy link
Contributor

jreback commented Oct 2, 2014

@ischwabacher see my comment above, I think this is pretty trivial to fix, just change the np.array_equal to array_equaivalent in core/internals/Block/equals (and obviously add a test for this)

@unutbu
Copy link
Contributor

unutbu commented Oct 2, 2014

Mea culpa. I've made a PR: #8443

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Oct 2, 2014
@ischwabacher
Copy link
Contributor Author

Awesome, thanks!

@jreback
Copy link
Contributor

jreback commented Oct 2, 2014

closd by #8443

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

3 participants