TST/PERF: Re-write assert_almost_equal() in cython #4398 #5219

danbirken · 2013-10-14T19:34:42Z

Add a testing.pyx cython file, and port assert_almost_equal() from
python to cython. This also fixes a few minor bugs that were in the
python version of assert_almost_equal() and adds more test cases to
test_tests.py

On my machine this brings a modest gain to the suite of "not slow" tests
(160s -> 140s), but on assert_almost_equal() heavy tests, like
test_expressions.py, it shows a large improvement (14s -> 4s).

jreback · 2013-10-14T19:40:45Z

you might be able to use the null machinery (for scalars) thats already in cython, e.g. util._checknull.....(good except for dates and such)

danbirken · 2013-10-14T19:41:10Z

As noted in the commit message, this brings a measurable but only modest gain to the "not slow" test suite, 10-15% on my machine. It also beefs up the test for assert_almost_equal(), which did unearth a couple of edge cases which I fixed.

While the entire test suite doesn't improve that much, this does significantly speed up assert_almost_equal(). The main performance gain comes from replacing np.testing.assert_almost_equal(float_value1, float_value2, decimal=decimal) into the simplified very fast C cython function. As noted in the commit message, a test like test_expressions.py which heavily uses this function runs over 3x faster with the cython version.

There are other ways to squeak out more performance (moving isnull(obj) into cython for example), but given that assert_almost_equal() isnt' that much of a bottleneck I err-ed on the side of maintaining readability instead of converting everything into cython.

danbirken · 2013-10-14T20:33:11Z

Just tried naively replacing isnull with some combination of util._checknull and util._checknan and it didn't work. Had that worked I think it would be fine, but otherwise I don't think the additional complexity is worth it for the really minor speed gain.

jtratner · 2013-10-14T21:08:22Z

I'm glad you're starting on this (and that you've fixed weird cases in assert_almost_equal) - I have a few points I'd like to consider for this:

It's incredibly important to actually print what's wrong (i.e., isnull should show repr of other, etc.) and with asserts, part after the comma isn't executed unless the assert fails so it's not a big deal.
My preference is not to merge this until 0.13, but I'm okay either way.
Is your check for int and friends sufficient? I thought some numpy types (like maybe smaller ints) were separate (i.e., np.integer_, np.floating, np.complexfloating) - test suite passes so I'm probably wrong.

Finally, could we change all the uses of np's assert_equal in the test suite to use this instead?

danbirken · 2013-10-14T21:39:53Z

Good point, I'll update all the messages
My preference is to commit it as soon as possible, because a) I want to keep exploring speeding up the tests which I imagine is going to including cython-ing more common test functions and b) It is a pain to keep a rewrite like this in sync with the main branch since merging won't work.
I actually started down this path a little bit (as you can see with the minor change I made), but then stopped because the less I change the easier it is to evaluate this change as a simple python --> cython change that isn't going to break anything. The current code does need to be improved, it fails to cases like this:

assert_almost_equal(2.00000001, np.int32(2))  # fine
assert_almost_equal(np.int32(2), 2.00000001)  # error

My plan was to fix this in a separate change, but I can fix it here. My main reason for keeping it simple was for the sake of the reviewer and my general development philosophy (described below).

No problem updating all of assert_equals, but I think that should be in a follow-up change.

This may come down to a development philosophy issue (where I am happy to align my style to the project), but my preferred way of development is making changes as small as possible, then getting them merged as quickly as possible and then continuing to iterate in that way to add complexity. So for an example like this, instead of going from "python function --> fixed and updated cython function", I prefer "python function --> cython function --> fixed and updated cython function", with a commit and merge at each step of the way. I find this is easier for other people's readability of the changes and makes it easier to locate where bugs were added (IE did I add the bug when going from python --> cython or did I add the bug trying to update the function?).

However, the overhead of reviewing the changes falls with you guys, so in that way I am happy to develop in whatever way you prefer.

jtratner · 2013-10-14T21:47:20Z

@danbirken you're right, I'm okay with merging sooner. I've been on the other side and it is a heck of a pain to maintain (though I've also been responsible for some huge PRs recently).

That said, we need to fix the type checks and effective failure messages before this can be merged. How about you have one commit that moves to Cython, then add a second commit that fixes type checks and other elements like that (so we can see changes you've made on top of the Cython code)?

Also, there have been a number of bugs in the error messages in this module, mostly stemming from the fact that the argument names are a and b rather than actual, expected. After this PR, we should fix that.

danbirken · 2013-10-14T23:56:21Z

I think these commits should handle all the above (other than the a/b naming thing -- which is pretty straightforward to fix afterwards).

jreback · 2013-10-15T17:56:02Z

ok with with merging this...

@wesm ?

jreback · 2013-10-15T17:57:27Z

@danbirken can you add a 1-liner in release notes as well...thanks

danbirken · 2013-10-15T18:15:26Z

Added release notes into first commit.

Also I spent a few hours playing with test performance yesterday and I didn't really find any other places where it seems you could get a huge performance gain just by refactoring stuff (other than this change). I did find a couple areas where you could shave off a few percent in the test speed (mostly by making test cases smaller and other assorted minor performance tweaks), but I didn't see any other low hanging fruit. It seems the tests take a while to run because there are a lot of them and because matplotlib is slow, neither of which is really solvable.

jreback · 2013-10-15T18:23:36Z

ok...perfect...

jtratner · 2013-10-15T20:09:56Z

It's somewhat important to have longer frames in some places (especially
those involving sorting), because there's a break between <10ish items and

10ish items where numpy chooses different sort algorithms, it certainly
doesn't matter everywhere, but in places where sorting could be involved
(i.e., anything involving alignment, indexing, etc) might be useful to keep
in mind.

jtratner · 2013-10-17T12:24:58Z

This looks good to me - any comments or critiques?

jreback · 2013-10-17T12:44:48Z

ok by me

jtratner · 2013-10-21T12:15:35Z

Can you rebase this on top of master one more time to make sure everything passes? then I'll look to merge it tonight.

jtratner · 2013-10-21T12:15:35Z

Can you rebase this on top of master one more time to make sure everything passes? then I'll look to merge it tonight.

Add a testing.pyx cython file, and port assert_almost_equal() from python to cython. On my machine this brings a modest gain to the suite of "not slow" tests (160s -> 140s), but on assert_almost_equal() heavy tests, like test_expressions.py, it shows a large improvement (14s -> 4s).

Many of the edge cases were related to ordering of the items, but in some cases there were also issues with type checking. This fixes both of those issues and massively expands the testing for this function.

danbirken · 2013-10-21T19:29:19Z

Should be good. Luckily no rebase issues.

jreback · 2013-10-21T19:34:28Z

can you take a look at #5283 believe you are just going to call these routines, yes?

danbirken · 2013-10-21T20:30:44Z

The "np.array_equal" part can be replaced by this function, yes. Might provide some speed up in the average case.

jtratner · 2013-10-21T21:02:55Z

Maybe we merge this first, then consider whether we reuse the code for equals/array_equivalent/whatever (and then move the code, obviously)

danbirken · 2013-10-21T21:28:37Z

Works for me. This change is pretty independent from that one, in that the two main things assert_almost_equal does are provide useful error messages about why 2 things are not equal and compare floats while allowing some small margin of difference.

If there is a function which is better at determining if two arrays are exactly equal quickly (like np.array_equal or array_equivalent), that is a useful optimization for this function, but it doesn't actually replace any of the functionality here.

jreback · 2013-10-21T21:30:08Z

ok....just going to quick build on windows and then merge

TST/PERF: Re-write assert_almost_equal() in cython #4398

jreback · 2013-10-21T21:36:11Z

thank you sir!

jtratner · 2013-10-21T21:48:46Z

I was confused - nvm

danbirken added 3 commits October 21, 2013 12:02

TST: Fix edge cases in assert_almost_equal() + tests pandas-dev#4398

81ef690

Many of the edge cases were related to ordering of the items, but in some cases there were also issues with type checking. This fixes both of those issues and massively expands the testing for this function.

CLN: Make the name of the test file more accurate

850220f

jreback mentioned this pull request Oct 21, 2013

API: Add equals method to NDFrames. #5283

Merged

jreback added a commit that referenced this pull request Oct 21, 2013

Merge pull request #5219 from danbirken/c_assert_almost_equal_simple

1023b79

TST/PERF: Re-write assert_almost_equal() in cython #4398

jreback merged commit 1023b79 into pandas-dev:master Oct 21, 2013

Uh oh!

TST/PERF: Re-write assert_almost_equal() in cython #4398 #5219

TST/PERF: Re-write assert_almost_equal() in cython #4398 #5219

Uh oh!

Conversation

danbirken commented Oct 14, 2013

Uh oh!

jreback commented Oct 14, 2013

Uh oh!

danbirken commented Oct 14, 2013

Uh oh!

danbirken commented Oct 14, 2013

Uh oh!

jtratner commented Oct 14, 2013

Uh oh!

danbirken commented Oct 14, 2013

Uh oh!

jtratner commented Oct 14, 2013

Uh oh!

danbirken commented Oct 14, 2013

Uh oh!

jreback commented Oct 15, 2013

Uh oh!

jreback commented Oct 15, 2013

Uh oh!

danbirken commented Oct 15, 2013

Uh oh!

jreback commented Oct 15, 2013

Uh oh!

jtratner commented Oct 15, 2013

Uh oh!

jtratner commented Oct 17, 2013

Uh oh!

jreback commented Oct 17, 2013

Uh oh!

jtratner commented Oct 21, 2013

Uh oh!

jtratner commented Oct 21, 2013

Uh oh!

danbirken commented Oct 21, 2013

Uh oh!

jreback commented Oct 21, 2013

Uh oh!

danbirken commented Oct 21, 2013

Uh oh!

jtratner commented Oct 21, 2013

Uh oh!

danbirken commented Oct 21, 2013

Uh oh!

jreback commented Oct 21, 2013

Uh oh!

jreback commented Oct 21, 2013

Uh oh!

jtratner commented Oct 21, 2013

Uh oh!

Uh oh!