API: Add equals method to NDFrames. #5283
pls run a perf check on this (test_perf.sh) these comparisons are used everywhere do u need the shape check? |
@jreback - seems like it doesn't work for this example, but we could be missing something:
left = pd.Float64Index([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan], dtype='object')
right = pd.Float64Index([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan], dtype='object')
# OR
left = np.array([1.0, 2.0, nan], dtype=object)
right = np.array([1.0, 2.0, nan], dtype=object)
(fully enumerated here - https://gist.github.com/unutbu/7070565) |
to be explicit:
left = np.array([1.0, 2.0, nan], dtype=object)
right = np.array([1.0, 2.0, nan], dtype=object)
left != right
Out[16]: array([False, False, False], dtype=bool)
left != left
Out[17]: array([False, False, False], dtype=bool)
right != right
Out[18]: array([False, False, False], dtype=bool)
nan != nan
Out[19]: True |
Though I guess they compare true with |
you have to astype to float before you can do the comparison (not sure exactly why); it only works if they are all float values (so you need to do it in a try/except) |
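A minimal sketch of the coerce-then-compare idea described above, assuming plain NumPy arrays rather than pandas indexes. The helper name `nan_mask_via_float` is hypothetical and not part of pandas or NumPy; it returns `None` when the values cannot all be converted to float.

```python
import numpy as np


def nan_mask_via_float(values):
    """Return a boolean NaN mask, or None if the values are not all float-like.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(values)
    try:
        as_float = arr.astype(float)
    except (ValueError, TypeError):
        # For example, strings that cannot be parsed as floats.
        return None
    # For float dtype, NaN is the only value that is not equal to itself.
    return as_float != as_float


obj = np.array([1.0, 2.0, np.nan], dtype=object)
mask = nan_mask_via_float(obj)
mixed = nan_mask_via_float(np.array(["a", 1.0], dtype=object))
```

Here `mask` flags only the last position, while `mixed` is `None` because `"a"` cannot be coerced, which is the case the try/except is meant to catch.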
@jreback: I'm working on installing vbench and figuring out how to run test_perf.sh... |
@jreback: When I run
I get
I see I can limit
which yielded
Clearly I don't know what I'm doing. What is the right |
b should be the commit before 1st of yours and t should be the last commit of yours generally I rebase to master before this |
With array-equivalent rebased to master,
yields vb_suite.log |
so look at these in master and in your PR using %prun...and see if you can figure out what's up... |
    null_right = np.isnan(right)
except TypeError:
    return np.array_equal(left, right)
else:
I think you can just coerce to float (if it fails then your fallback is fine, though that itself takes some time; it might be better just to check the index type first). You don't need the isnull/isnan checking at all, just do (left != left) & (right != right)
@jreback I tried
def array_equivalent(left, right):
    left, right = np.asarray(left), np.asarray(right)
    try:
        left = left.astype(float)
        right = right.astype(float)
    except (ValueError, TypeError):
        return np.array_equal(left, right)
    else:
        return (left.shape == right.shape
                and ((left == right) | (left != left) & (right != right)).all())
time ./test_perf.sh -b master -t coerce-to-float
yields (using Python2.7, Numpy 1.7)
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
series_align_irregular_string | 97.3604 | 68.7210 | 1.4167 |
series_align_left_monotonic | 32.0517 | 22.5259 | 1.4229 |
concat_series_axis1 | 430.0770 | 82.5344 | 5.2109 |
reindex_frame_level_align | 23.5590 | 1.2616 | 18.6734 |
Also, coercing to float drops the imaginary part of complex arrays:
>>> np.array([nan, 1+1j], dtype='complex').astype(float)
array([ nan, 1.])
So np.isnan will (I think) handle more dtypes than (x != x), and has comparable, maybe even favorable speed, when applied to float arrays:
In [6]: x = np.array([1, 2, nan])
In [7]: %timeit x != x
1000000 loops, best of 3: 1.23 µs per loop
In [5]: %timeit np.isnan(x)
1000000 loops, best of 3: 1.1 µs per loop
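The complex-array concern above can be illustrated with a short sketch. It assumes only NumPy; the array values are made up for illustration. Two complex arrays that differ only in an imaginary part look identical once coerced to float, whereas `np.isnan` can be applied to the complex array directly.

```python
import warnings

import numpy as np

a = np.array([np.nan, 1 + 1j], dtype=complex)
b = np.array([np.nan, 1 + 2j], dtype=complex)  # differs only in the imaginary part

with warnings.catch_warnings():
    # Casting complex to float discards the imaginary part and warns about it.
    warnings.simplefilter("ignore")
    a_f = a.astype(float)
    b_f = b.astype(float)

# After coercion the NaN-aware comparison wrongly reports the arrays as equal.
coerced_equal = bool(((a_f == b_f) | ((a_f != a_f) & (b_f != b_f))).all())

# np.isnan works on the complex array without any coercion.
nan_mask = np.isnan(a)
```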
hah....so my suggestion made it worse!
I think you need to detect if you need to do this in the first place (maybe by only checking on Index/Float64Index) types (as Int64Index cannot hold nan)....so you avoid the try: except: overhead
With the current commit,
I'm going to try adding a check for Int64Index arrays next... |
also try doing test_perf again....these could be 'random'.....(e.g. if they are not similar on subsequent runs) then its just an artifact of the data....you can also try with a bigger n (numcalls) |
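As a rough illustration of why repeated runs matter for micro-benchmarks like these, the following sketch times the two NaN checks several times and keeps the best run. It uses the standard-library `timeit` module rather than vbench or `test_perf.sh`; the helper name `best_time` and the `number`/`repeat` values are arbitrary choices, and the measured numbers will vary by machine.

```python
import timeit

import numpy as np

x = np.array([1.0, 2.0, np.nan])


def best_time(func, number=2000, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    # Taking the minimum reduces the influence of noise from other processes.
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


t_neq = best_time(lambda: x != x)
t_isnan = best_time(lambda: np.isnan(x))
print("x != x      : %.3g s per call" % t_neq)
print("np.isnan(x) : %.3g s per call" % t_isnan)
```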
question here - why do you need to cast it to float first? I thought it worked with just ==? I'm sure I'm missing something but just wanted to make sure we had an example that fails using ==. (or maybe it's just float dtype that fails) |
I think object dtype that has floats in it (iow float64index) fails ; not sure why though |
@jreback Regarding #5219, yes, I am striving to make The tests in |
@jtratner: I did try coercing to float (#5283 (diff)), but found there were problems. (See the link for more details.) (Fixed incorrect link.) Currently, |
Again, can we take a quick step back here: what's an example where, if you pass an array of floats with dtype object and some are nan, it compares |
So if you're thinking of Float64Index - just do '.view(ndarray)' so you're Once we get it to work for ndarray, then can consider what to do for
|
@jtratner: I don't quite understand. What is the "it" in the phrase "where it doesn't work..."? Currently the test
pass.s |
Finally have a computer - just need to look at something for myself. I |
I just used this:
def array_equiv(n1, n2):
    return n1.shape == n2.shape and n1.dtype == n2.dtype and ((n1 == n2) | ((n1 != n1) & (n2 != n2))).all()
And it worked for all of these - am I missing why this is complicated? Is there a numpy version issue?
Then callers should be responsible for checking anything at pandas-level. |
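For reference, here is a runnable version of the simpler check quoted above, exercised on float arrays only (the case where `x != x` reliably marks NaNs). The test arrays are made up for illustration, and the result is wrapped in `bool` so the function always returns a plain Python boolean.

```python
import numpy as np


def array_equiv(n1, n2):
    """NaN-aware equality for arrays of identical shape and dtype."""
    return bool(
        n1.shape == n2.shape
        and n1.dtype == n2.dtype
        and ((n1 == n2) | ((n1 != n1) & (n2 != n2))).all()
    )


a = np.array([1.0, np.nan, 2.0])
b = np.array([1.0, np.nan, 2.0])
c = np.array([1.0, 2.0, np.nan])

same = array_equiv(a, b)            # NaNs in matching positions
shifted = array_equiv(a, c)         # NaNs in different positions
other_dtype = array_equiv(np.array([1, 2]), np.array([1.0, 2.0]))
other_shape = array_equiv(a, np.array([1.0, np.nan]))
```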
How about:
However, my |
okay, thanks - just wanted to make sure we had something that explicitly didn't work for the simpler version. |
actually...why don't we do both... use the simpler version...if its True (then we are done as we don't have false positives), however a False can fall back to the slower version |
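The two-stage idea suggested above might look roughly like the sketch below. This is not the code that was merged; the names `fast_equiv`, `slow_equiv` and `array_equivalent_two_stage` are hypothetical, and the slow path is a plain Python loop chosen for clarity rather than speed.

```python
import numpy as np


def fast_equiv(left, right):
    """Vectorized check; a True result is taken as final."""
    if left.shape != right.shape:
        return False
    return bool(((left == right) | ((left != left) & (right != right))).all())


def slow_equiv(left, right):
    """Element-by-element fallback that treats paired NaNs as equal."""
    if left.shape != right.shape:
        return False
    for x, y in zip(left.ravel(), right.ravel()):
        both_nan = x != x and y != y
        if not (both_nan or x == y):
            return False
    return True


def array_equivalent_two_stage(left, right):
    left, right = np.asarray(left), np.asarray(right)
    # Only pay for the slow path when the fast path says "not equal".
    return fast_equiv(left, right) or slow_equiv(left, right)


obj_a = np.array([1.0, 2.0, np.nan], dtype=object)
obj_b = np.array([1.0, 2.0, np.nan], dtype=object)
```

The design relies on the assumption stated in the comment above: the fast check may miss some equal pairs, but should not report unequal arrays as equal.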
>>> array_equivalent([1, nan, 2], [1, 2, nan])
False
"""
if isinstance(left, pd.Int64Index):
you can change this to something like if not issubclass(left.dtype.type, (np.object_, np.floating)): return np.array_equal(left, right), right? Given that only object and floating can hold nan?
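The dtype test proposed above can be tried in isolation. In this sketch the function name `may_hold_nan` is hypothetical; the `issubclass` expression is the one from the comment. Note that, as written, the check treats complex arrays as unable to hold NaN, since NumPy's complex scalar types do not derive from `np.floating`.

```python
import numpy as np


def may_hold_nan(arr):
    """True if the array's dtype is object or a real floating type."""
    return issubclass(arr.dtype.type, (np.object_, np.floating))


int_arr = np.array([1, 2, 3])
float_arr = np.array([1.0, np.nan])
obj_arr = np.array([1.0, "a"], dtype=object)
complex_arr = np.array([1 + 1j])
```

A caller could use this to go straight to `np.array_equal` for integer or boolean data and reserve the NaN-aware comparison for the remaining cases.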
can you put in some tests for datetime/timedeltas? (including with NaT) ? and bools too you might need to change the comparisons to something like this:
(it might work w/o this ...not sure exactly what np.array_equal does) |
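One possible way to test datetime and timedelta arrays containing NaT is sketched below. It is an independent illustration, not the comparison from the thread: it assumes `np.isnat` is available (NumPy 1.13 or later), and the function name `datetimelike_equivalent` is hypothetical. It only applies to datetime64 and timedelta64 arrays, because `np.isnat` raises `TypeError` for other dtypes.

```python
import numpy as np


def datetimelike_equivalent(left, right):
    """Equality for datetime64/timedelta64 arrays, treating paired NaT as equal."""
    left, right = np.asarray(left), np.asarray(right)
    if left.shape != right.shape or left.dtype != right.dtype:
        return False
    both_nat = np.isnat(left) & np.isnat(right)
    return bool(((left == right) | both_nat).all())


a = np.array(["2013-01-01", "NaT"], dtype="datetime64[ns]")
b = np.array(["2013-01-01", "NaT"], dtype="datetime64[ns]")
c = np.array(["NaT", "2013-01-01"], dtype="datetime64[ns]")
td = np.array([np.timedelta64(1, "s"), np.timedelta64("NaT")])
```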
While writing a test for timedeltas and bools, I've come upon an interesting problem: Suppose
Then the underlying blocks look like this:
Is there a way to massage the BlockManager into a canonical form? (or put more generally, how would you go about comparing these two BlockManagers for equality?) |
before comparing, blocks are created in various operations (e.g. insertion, changing a block dtype, etc)...the consolidate merges them (if it can) |
another slight complication, block order is not guaranteed, in that you could have so you should prob sort in some kind of order before you iterate (actually many ways to handle this). |
In
causes the blocks to be sorted by However, it is also possible that the blockmanagers might have multiple blocks of the Do you know if the call to |
Yes the blocks CAN be in different orders; but since there are only a small number of block types, you could either order by the block types in a specific way (prob easiest), or iterate over one and find in the other separately. You might be able to guarantee that consolidate_inplace puts them in the same order (e.g. it would insert into a specific order rather than always appending at the end); I think this would be pretty straightforward to do |
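The "sort into a canonical order, then compare pairwise" approach can be sketched without touching pandas internals. The example below does not use the real BlockManager or Block classes: each block is modeled as a hypothetical `(dtype_name, column_positions, values)` tuple, and `canonicalize` and `blocks_equal` are made-up names.

```python
import numpy as np


def canonicalize(blocks):
    """Sort toy blocks by dtype name, then by the column positions they cover."""
    return sorted(blocks, key=lambda block: (block[0], block[1]))


def blocks_equal(left, right):
    """Compare two lists of toy blocks regardless of their original order."""
    left, right = canonicalize(left), canonicalize(right)
    if len(left) != len(right):
        return False
    return all(
        l_dtype == r_dtype and l_pos == r_pos and np.array_equal(l_vals, r_vals)
        for (l_dtype, l_pos, l_vals), (r_dtype, r_pos, r_vals) in zip(left, right)
    )


float_block = ("float64", (0, 2), np.array([[1.0, 2.0], [3.0, 4.0]]))
int_block = ("int64", (1,), np.array([[5, 6]]))
changed_int_block = ("int64", (1,), np.array([[5, 7]]))
```

Including the column positions in the sort key matters when several blocks share a dtype, since dtype alone would then not give a unique order.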
I think I need some help. I've been trying to create a test where the current code fails, but haven't been able to find one. I'm pushing my
makes the returned value the same for both blockmanagers because I wonder if there might be a problem if Can you help me find an example which breaks the current code? |
here's a non-unique example; essentially the placement is a set index to locations (as opposed to This may not answer your question about the
output
|
@@ -4004,6 +4024,9 @@ def _merge_blocks(blocks, items, dtype=None, _can_consolidate=True):
raise AssertionError("_merge_blocks are invalid!")
dtype = blocks[0].dtype

if not items.is_unique:
The example you gave did indeed break the code. I've added your example to test_internals.py and am handling this case by sorting the blocks according to their ref_locs.
Problem:
I re-ran these Benchmarks and found the ratio is consistently large. |
are you rebased on master? I just added these |
Oops, thanks for the reminder. Now, much better:
|
yep...that looks fine |
this looks ok to me.....@y-p ? @unutbu rebase maybe just to fix the release notes if you have a chance |
Can't review, up to you. |
I think the Travis test failed for a reason unrelated to my commits. Is there a way to restart Travis on the same build, or should I push an innocuous change to try it again? |
there is a little button on the rhs of the screen where you can restart an individual job or can always
|
…nt`, which is similar to `np.array_equal` except that it handles object arrays and treats NaNs in corresponding locations as equal. TST: Add tests for NDFrame.equals and BlockManager.equals DOC: Mention the equals method in basics, release and v.0.13.1
API: Add equals method to NDFrames.
@@ -215,6 +215,14 @@ These operations produce a pandas object the same type as the left-hand-side inp
that if of dtype ``bool``. These ``boolean`` objects can be used in indexing operations,
see :ref:`here<indexing.boolean>`

As of v0.13.1, Series, DataFrames and Panels have an equals method to compare if
I merged this thanks! maybe as a small followup....can you explain in the docs why one would need to do this, maybe a small example is in order?
yep already merged one thing on the doc update can u put a link from v0.13.1 back to your new section thanks |
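A small example of why an `equals` method is useful, in the spirit of the request above. It is a sketch with made-up data and is not the text that went into the docs: element-wise `==` reports NaNs as unequal even when they sit in the same positions, while `equals` treats them as equal.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
df2 = df1.copy()

# Element-wise comparison: the NaN cell compares False, so .all().all() is False.
elementwise_all = bool((df1 == df2).all().all())

# equals() treats NaNs in corresponding locations as equal.
frames_equal = df1.equals(df2)
```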
Also adds `array_equivalent`, which is similar to `np.array_equal` except that it handles object arrays and treats NaNs in corresponding locations as equal.
closes #5183