Skip to content

TST/PERF: Re-write assert_almost_equal() in cython #4398 #5219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 21, 2013

Conversation

danbirken
Copy link
Contributor

closes #4398

Add a testing.pyx cython file, and port assert_almost_equal() from
python to cython. This also fixes a few minor bugs that were in the
python version of assert_almost_equal() and adds more test cases to
test_tests.py

On my machine this brings a modest gain to the suite of "not slow" tests
(160s -> 140s), but on assert_almost_equal() heavy tests, like
test_expressions.py, it shows a large improvement (14s -> 4s).

@jreback
Copy link
Contributor

jreback commented Oct 14, 2013

you might be able to use the null machinery (for scalars) thats already in cython, e.g. util._checknull.....(good except for dates and such)

@danbirken
Copy link
Contributor Author

As noted in the commit message, this brings a measurable but only modest gain to the "not slow" test suite, 10-15% on my machine. It also beefs up the test for assert_almost_equal(), which did unearth a couple of edge cases which I fixed.

While the entire test suite doesn't improve that much, this does significantly speed up assert_almost_equal(). The main performance gain comes from replacing np.testing.assert_almost_equal(float_value1, float_value2, decimal=decimal) into the simplified very fast C cython function. As noted in the commit message, a test like test_expressions.py which heavily uses this function runs over 3x faster with the cython version.

There are other ways to squeak out more performance (moving isnull(obj) into cython for example), but given that assert_almost_equal() isnt' that much of a bottleneck I err-ed on the side of maintaining readability instead of converting everything into cython.

@danbirken
Copy link
Contributor Author

Just tried naively replacing isnull with some combination of util._checknull and util._checknan and it didn't work. Had that worked I think it would be fine, but otherwise I don't think the additional complexity is worth it for the really minor speed gain.

@jtratner
Copy link
Contributor

I'm glad you're starting on this (and that you've fixed weird cases in assert_almost_equal) - I have a few points I'd like to consider for this:

  1. It's incredibly important to actually print what's wrong (i.e., isnull should show repr of other, etc.) and with asserts, part after the comma isn't executed unless the assert fails so it's not a big deal.
  2. My preference is not to merge this until 0.13, but I'm okay either way.
  3. Is your check for int and friends sufficient? I thought some numpy types (like maybe smaller ints) were separate (i.e., np.integer_, np.floating, np.complexfloating) - test suite passes so I'm probably wrong.

Finally, could we change all the uses of np's assert_equal in the test suite to use this instead?

@danbirken
Copy link
Contributor Author

  1. Good point, I'll update all the messages

  2. My preference is to commit it as soon as possible, because a) I want to keep exploring speeding up the tests which I imagine is going to including cython-ing more common test functions and b) It is a pain to keep a rewrite like this in sync with the main branch since merging won't work.

  3. I actually started down this path a little bit (as you can see with the minor change I made), but then stopped because the less I change the easier it is to evaluate this change as a simple python --> cython change that isn't going to break anything. The current code does need to be improved, it fails to cases like this:

assert_almost_equal(2.00000001, np.int32(2))  # fine
assert_almost_equal(np.int32(2), 2.00000001)  # error

My plan was to fix this in a separate change, but I can fix it here. My main reason for keeping it simple was for the sake of the reviewer and my general development philosophy (described below).

  1. No problem updating all of assert_equals, but I think that should be in a follow-up change.

This may come down to a development philosophy issue (where I am happy to align my style to the project), but my preferred way of development is making changes as small as possible, then getting them merged as quickly as possible and then continuing to iterate in that way to add complexity. So for an example like this, instead of going from "python function --> fixed and updated cython function", I prefer "python function --> cython function --> fixed and updated cython function", with a commit and merge at each step of the way. I find this is easier for other people's readability of the changes and makes it easier to locate where bugs were added (IE did I add the bug when going from python --> cython or did I add the bug trying to update the function?).

However, the overhead of reviewing the changes falls with you guys, so in that way I am happy to develop in whatever way you prefer.

@jtratner
Copy link
Contributor

@danbirken you're right, I'm okay with merging sooner. I've been on the other side and it is a heck of a pain to maintain (though I've also been responsible for some huge PRs recently).

That said, we need to fix the type checks and effective failure messages before this can be merged. How about you have one commit that moves to Cython, then add a second commit that fixes type checks and other elements like that (so we can see changes you've made on top of the Cython code)?

Also, there have been a number of bugs in the error messages in this module, mostly stemming from the fact that the argument names are a and b rather than actual, expected. After this PR, we should fix that.

@danbirken
Copy link
Contributor Author

I think these commits should handle all the above (other than the a/b naming thing -- which is pretty straightforward to fix afterwards).

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

ok with with merging this...

@wesm ?

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

@danbirken can you add a 1-liner in release notes as well...thanks

@danbirken
Copy link
Contributor Author

Added release notes into first commit.

Also I spent a few hours playing with test performance yesterday and I didn't really find any other places where it seems you could get a huge performance gain just by refactoring stuff (other than this change). I did find a couple areas where you could shave off a few percent in the test speed (mostly by making test cases smaller and other assorted minor performance tweaks), but I didn't see any other low hanging fruit. It seems the tests take a while to run because there are a lot of them and because matplotlib is slow, neither of which is really solvable.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

ok...perfect...

@jtratner
Copy link
Contributor

It's somewhat important to have longer frames in some places (especially
those involving sorting), because there's a break between <10ish items and

10ish items where numpy chooses different sort algorithms, it certainly
doesn't matter everywhere, but in places where sorting could be involved
(i.e., anything involving alignment, indexing, etc) might be useful to keep
in mind.

@jtratner
Copy link
Contributor

This looks good to me - any comments or critiques?

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

ok by me

@jtratner
Copy link
Contributor

Can you rebase this on top of master one more time to make sure everything passes? then I'll look to merge it tonight.

1 similar comment
@jtratner
Copy link
Contributor

Can you rebase this on top of master one more time to make sure everything passes? then I'll look to merge it tonight.

Add a testing.pyx cython file, and port assert_almost_equal() from
python to cython.

On my machine this brings a modest gain to the suite of "not slow" tests
(160s -> 140s), but on assert_almost_equal() heavy tests, like
test_expressions.py, it shows a large improvement (14s -> 4s).
Many of the edge cases were related to ordering of the items, but in
some cases there were also issues with type checking.  This fixes both
of those issues and massively expands the testing for this function.
@danbirken
Copy link
Contributor Author

Should be good. Luckily no rebase issues.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2013

can you take a look at #5283 believe you are just going to call these routines, yes?

@danbirken
Copy link
Contributor Author

The "np.array_equal" part can be replaced by this function, yes. Might provide some speed up in the average case.

@jtratner
Copy link
Contributor

Maybe we merge this first, then consider whether we reuse the code for equals/array_equivalent/whatever (and then move the code, obviously)

@danbirken
Copy link
Contributor Author

Works for me. This change is pretty independent from that one, in that the two main things assert_almost_equal does are provide useful error messages about why 2 things are not equal and compare floats while allowing some small margin of difference.

If there is a function which is better at determining if two arrays are exactly equal quickly (like np.array_equal or array_equivalent), that is a useful optimization for this function, but it doesn't actually replace any of the functionality here.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2013

ok....just going to quick build on windows and then merge

jreback added a commit that referenced this pull request Oct 21, 2013
TST/PERF: Re-write assert_almost_equal() in cython #4398
@jreback jreback merged commit 1023b79 into pandas-dev:master Oct 21, 2013
@jreback
Copy link
Contributor

jreback commented Oct 21, 2013

thank you sir!

@jtratner
Copy link
Contributor

I was confused - nvm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite pandas.util.testing.assert_almost_equal for performance
3 participants