Track performance regressions in CI #25262

Open

saraedum opened this issue Apr 29, 2018 · 72 comments

@saraedum
Member

I am currently playing with airspeed velocity to track speed regressions in Sage. I would like to benchmark every doctest that has a # long time or # benchmark marker in it and also benchmark every method that has a time_ prefix (probably only in some benchmark module).

We have something similar set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks, but it works nicely.
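
For context, asv collects benchmarks from a benchmarks/ directory by importing its modules and timing every method whose name starts with time_. A minimal sketch of such a module (the computation here is purely illustrative, not one of the actual MCLF benchmarks):

    # benchmarks/arithmetic.py -- hypothetical example module for asv;
    # asv times every method whose name starts with ``time_``.
    class TimePolynomialArithmetic:
        def setup(self):
            # run before the timed methods; excluded from the timing
            from sage.all import QQ, PolynomialRing
            self.R = PolynomialRing(QQ, 'x')
            self.f = self.R([1] * 50)

        def time_multiplication(self):
            self.f * self.f

        def time_factorization(self):
            self.f.factor()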

I ran the above proposal for all the tags from 8.3.beta0 to 8.3. There's a lot of noise (because there was other activity on the machine) but you get the idea: https://saraedum.github.io/sage/

Another interesting demo of airspeedvelocity that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions

Depends on #24655

CC: @roed314 @embray @nthiery @koffie @videlec

Component: doctest framework

Keywords: ContinuousIntegration

Work Issues: documentation, doctests, CI

Author: Julian Rüth

Branch/Commit: public/airspeed_velo @ 68869ae

Issue created by migration from https://trac.sagemath.org/ticket/25262

@saraedum
Member Author

comment:1

I think we have to work with the Advanced API (https://docs.python.org/2/library/doctest.html#advanced-api) and hook into DocTestRunner.run() to track timings and export them into an artificial benchmark/ directory that just prints these timings for asv.
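
A minimal sketch of that idea, using a plain subclass of the standard-library runner (the real code would hook into Sage's own runner in sage/doctest/forker.py); the recorded timings are dumped to a file that a generated benchmarks/ module could report to asv as track_* values:

    import doctest
    import json
    import time

    class TimingDocTestRunner(doctest.DocTestRunner):
        """DocTestRunner that records the wall-clock time of each doctest."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.timings = {}

        def run(self, test, compileflags=None, out=None, clear_globs=True):
            start = time.perf_counter()
            result = super().run(test, compileflags=compileflags, out=out,
                                 clear_globs=clear_globs)
            self.timings[test.name] = time.perf_counter() - start
            return result

    def export_timings(runner, path="benchmarks/doctest_timings.json"):
        # Assumes the benchmarks/ directory already exists; a generated module
        # there could expose the stored numbers to asv as ``track_*`` benchmarks.
        with open(path, "w") as f:
            json.dump(runner.timings, f, indent=2)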

@roed314
Contributor

roed314 commented Apr 29, 2018

comment:2

This is great, and I'm happy to help!

We're already using the advanced API. See sage/doctest/forker.py, lines 425 to 786 (maybe it would make sense to do the exporting in summarize).

@jdemeyer

comment:3

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

@jdemeyer

comment:4

Duplicate of #12720.

@jdemeyer jdemeyer removed this from the sage-8.3 milestone Apr 29, 2018
@saraedum
Member Author

comment:5

I don't think this is a duplicate. This is about integrating speed regression checks into CI (GitLab CI, CircleCI). Please reopen.

@saraedum saraedum added this to the sage-8.3 milestone Apr 29, 2018
@saraedum
Member Author

comment:7

Replying to @jdemeyer:

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

That's what airspeed velocity is good for.

@embray
Contributor

embray commented Apr 29, 2018

comment:8

I am currently playing with airspeed velocity to track speed regressions in Sage

Great! It's an excellent tool and I've wanted to see it used for Sage for a long time, but wasn't sure where to begin. In case it helps, I know and have worked with its creator personally.

@embray
Contributor

embray commented Apr 29, 2018

comment:9

Even if #12720 was addressing a similar problem, it's an orthogonal approach, and if Julian can get ASV working this could supersede #12720.

@embray embray reopened this Apr 29, 2018
@jdemeyer

comment:10

Replying to @saraedum:

Replying to @jdemeyer:

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

That's what airspeed velocity is good for.

Well, I'd love to be proven wrong. I thought it was just a tool to benchmark a given set of commands across versions and display fancy graphs.

@embray
Contributor

embray commented Apr 30, 2018

comment:12

Not just across versions but across commits, even (though I think you can change the granularity). Here are Astropy's ASV benchmarks: http://www.astropy.org/astropy-benchmarks/

There are numerous benchmark tests for various common and/or time-critical operations. For example, we can track how coordinate transformations perform over time (which is one example of complex code whose performance can fairly easily be degraded by just a few small changes somewhere).

@videlec
Contributor

videlec commented Aug 3, 2018

comment:14

update milestone 8.3 -> 8.4

@videlec videlec modified the milestones: sage-8.3, sage-8.4 Aug 3, 2018
@saraedum
Member Author

saraedum commented Aug 9, 2018

Author: Julian Rüth

@saraedum
Member Author

saraedum commented Aug 9, 2018

comment:15

Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to set up a prototype within Sage. Any area that you would like to have benchmarked from the start?

@jdemeyer

jdemeyer commented Aug 9, 2018

comment:16

Replying to @saraedum:

Any area that you would like to have benchmarked from the start?

This is the "hard part" that I mentioned in comment:3. Ideally, we shouldn't have to guess where regressions might occur; the tool would do that for us. I believe that the intention of #12720 was to integrate this in the doctest framework such that all(?) doctests would also be regression tests.

But that's probably not feasible, so here is a more productive answer:

  1. All # long time tests should definitely be regression tests.

  2. For each Parent (more precisely: every time that a TestSuite appears in a doctest): test creating the parent, test creating elements, test some basic arithmetic (also with elements of different parents, so that we check the coercion model too); see the sketch below.
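
A hypothetical asv benchmark class along the lines of item 2, using a univariate polynomial ring as the example parent (the parent and the operations are only illustrative, not a decided list):

    from sage.all import QQ, PolynomialRing

    class TimeParentBasics:
        def setup(self):
            self.R = PolynomialRing(QQ, 'x')
            self.x = self.R.gen()

        def time_parent_creation(self):
            # note: parents are cached, so repeated calls mostly measure the lookup
            PolynomialRing(QQ, 'x')

        def time_element_creation(self):
            self.R([1, 2, 3])

        def time_basic_arithmetic(self):
            (self.x + 1) * (self.x - 1)

        def time_coercion(self):
            # mixing a rational with a polynomial exercises the coercion model
            self.x + QQ(1) / 2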

@jdemeyer

jdemeyer commented Aug 9, 2018

Replying to @saraedum:

We could have specially named methods, say starting in _benchmark_time_…

Adding a new method for each regression test sounds quite heavy. Could it be possible to integrate this in doctests instead? I would love to do

EXAMPLES::

    sage: some_sage_code()  # airspeed

@embray
Contributor

embray commented Aug 9, 2018

comment:18

Replying to @saraedum:

Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to set up a prototype within Sage. Any area that you would like to have benchmarked from the start?

I didn't realize you were trying to do that. And yeah, I think benchmarking every test would be overkill and would produce too much noise to be useful. Better to write specific benchmark tests, and also add new ones as regression tests whenever some major performance regression is noticed.

@saraedum
Member Author

comment:45

Replying to @jdemeyer:

Replying to @saraedum:

So, what do you think? Should we try to run time_* methods

I don't like this part because it doesn't mix well with doctests. I would really want to write a doctest like

sage: a = something
sage: b = otherthing
sage: c = computation(a, b)  # benchmark this

and being forced to wrap this in a time_ method is just ugly.

I see. I think it would be easy to track lines that say, e.g., # benchmark time separately. I am not sure if it's a good idea to add more magic comments to our doctesting. I've nothing against them in general; I am just worried that these features are relatively obscure, so not many people are going to use them?

Let me try to start with the benchmarking of blocks that say # long time and add more features later.

@nthiery
Contributor

nthiery commented Aug 20, 2018

comment:46

Just two cents without having thought too much about it.

I like the # benchmark approach too. It mixes well with how we write doctests and makes it trivial to create new benchmarks / annotate things as useful to benchmark.

I'd rather have a different annotation than # long time; otherwise devs will have to choose between benchmarking a test and always running it, not just with --long.

Of course, at this stage using # long time is a good way for experimenting. And it may be reasonable to keep benchmarking # long time lines later on.

Thanks!

@embray
Contributor

embray commented Aug 20, 2018

comment:47

Replying to @jdemeyer:

Replying to @saraedum:

So, what do you think? Should we try to run time_* methods

I don't like this part because it doesn't mix well with doctests. I would really want to write a doctest like

sage: a = something
sage: b = otherthing
sage: c = computation(a, b)  # benchmark this

and being forced to wrap this in a time_ method is just ugly.

Yes, something like that could be done. Again, it all comes down to providing a different benchmark discovery plugin for ASV. For discovering benchmarks in our doctests, all lines leading up to a # benchmark line could be considered setup code, with the # benchmark line being the one actually benchmarked (obviously).

Multiple # benchmark tests in the same file would work fine too, with every line prior to each (including other previously benchmarked lines) considered its setup.

It might be trickier to do this in such a way that avoids duplication, but I'll think about that. I think it could still be done.
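
A rough sketch of that discovery scheme in plain Python (not part of asv or Sage), splitting a doctest's source lines into setup plus a benchmarked line:

    def split_doctest_for_benchmark(lines):
        """
        Return a list of (setup, line) pairs: every line carrying a
        ``# benchmark`` marker is benchmarked, and all lines preceding it
        (including earlier benchmarked lines) form its setup code.
        """
        pairs = []
        seen = []
        for line in lines:
            if '# benchmark' in line:
                pairs.append(("\n".join(seen), line))
            seen.append(line)
        return pairs

    # e.g. split_doctest_for_benchmark([
    #     "a = something()",
    #     "b = otherthing()",
    #     "c = computation(a, b)  # benchmark",
    # ])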

@mantepse
Collaborator

comment:48

I think that this is wonderful.

Since I tried to improve performance of certain things recently, and will likely continue to do so, I would like to add doctests for speed regressions already now. Should I use long, benchmark, or something else?

@saraedum
Member Author

comment:49

Thanks for the feedback.

Replying to @mantepse:

Since I tried to improve performance of certain things recently, and will likely continue to do so, I would like to add doctests for speed regressions already now. Should I use long, benchmark, or something else?

Nothing has been decided upon yet. I could imagine something like # benchmark time or # benchmark runtime so that we can add # benchmark memory later. What do you think?

@nthiery
Contributor

nthiery commented Aug 24, 2018

comment:50

Presumably time benchmarking is more common than memory benchmarking, so I would tend to have "benchmark" be a shorthand for "benchmark time", but that may be just me.

For memory usage, do you foresee using fine grained tools that instrument the code and actually slow down the execution? Otherwise, could "benchmark" just do both always?

@embray
Contributor

embray commented Aug 24, 2018

comment:51

I would actually like # benchmark - time, # benchmark - memory, etc. (syntax similar to # optional -), because this would fit very nicely with the existing model for ASV, which implements different benchmark types as subclasses of Benchmark. These are currently selected by a string match on function and class names, but the same string match could also be performed on a parameter to # benchmark. This would be the most extensible choice: the parameters allowed following # benchmark need not be hard-coded.

Of course, I agree time benchmarks are going to be the most common, so we could still have # benchmark without a parameter default to "time".
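
A small sketch of how such a marker could be parsed, assuming the # optional -like syntax above (illustrative only, not existing Sage code):

    import re

    # Parse a ``# benchmark - <type>`` doctest marker, defaulting to "time"
    # when no type is given.
    BENCHMARK_MARKER = re.compile(r"#\s*benchmark(?:\s*-\s*(\w+))?")

    def benchmark_kind(line):
        m = BENCHMARK_MARKER.search(line)
        if m is None:
            return None
        return m.group(1) or "time"

    assert benchmark_kind("sage: f(x)  # benchmark") == "time"
    assert benchmark_kind("sage: f(x)  # benchmark - memory") == "memory"
    assert benchmark_kind("sage: f(x)") is None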

@embray
Contributor

embray commented Aug 24, 2018

comment:52

Once we're past this deliverable due date I'll spend some more time poking at ASV to get the features we would need in it to make it easier to extend how benchmark collection is performed, and also to integrate it more directly into our existing test runner.

@nthiery
Contributor

nthiery commented Aug 24, 2018

comment:53

Replying to @embray:

I would actually like # benchmark - time, # benchmark - memory, ...

I very much like this (well informed!) proposal.

@simon-king-jena
Member

comment:54

What is the status of this ticket? There is a branch attached. So, is it really new? Are people working on it?

For the record, I too think that having # benchmark - time and # benchmark - memory would be nice and very useful.

@embray
Contributor

embray commented Jan 14, 2019

comment:55

Right now we need to get the GitLab CI pipeline going again. I need to look into getting some more build runners up and running; it's been on my task list for ages. That, or we could get more time from GCE (if anyone knows anyone at Google or other cloud computing providers who could help get CPU time donated to the project, that would be very helpful).

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Sep 12, 2019

Changed commit from d7ff532 to 89d4afb

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Sep 12, 2019

Branch pushed to git repo; I updated commit sha1. New commits:

89d4afb  Merge remote-tracking branch 'trac/develop' into HEAD

@saraedum
Member Author

comment:58

Now that the CI seems to be mostly stable (except for the docbuild timing out for test-dev), we should probably look into this again?

I would like to get a minimal version of this working somehow. We should probably not attempt to get the perfect solution in the first run. The outputs this created are actually quite useful already, imho. If our contributors actually end up looking at the results, we can add more features (more keywords, more iterations, memory benchmarking, comparisons to other CAS, …).

So, my proposal would be to go with this version (modulo cleanup & documentation & CI integration). If somebody wants to improve/reimplement this in a good way, I am very happy to review that later.

I am not sure how much time I will have to work on this so if anybody wants to get more involved, please let me know :)

@saraedum
Member Author

Work Issues: documentation, doctests, CI

@saraedum
Member Author

Changed keywords from none to ContinuousIntegration

@fchapoton
Contributor

comment:61

rebased


New commits:

68869ae  Merge branch 'u/saraedum/25262' in 9.5.b1

@fchapoton
Contributor

Changed branch from u/saraedum/25262 to public/airspeed_velo

@fchapoton
Contributor

Changed commit from 89d4afb to 68869ae

@fchapoton
Contributor

comment:62

This needs adaptation to Python 3, apparently.

@saraedum
Member Author

saraedum commented Feb 2, 2022

comment:63

I am thinking about reviving this issue with a different application in mind that is a bit easier than regression testing: getting a better understanding of how different values of the algorithm keyword affect runtime.

I find that we rarely update the default algorithms. However, doing so could be quite beneficial, say, when we upgrade a dependency such as PARI or FLINT. It would be very nice to easily see how the different algorithms perform after an upgrade, and also to have a way to document the instances that have been used to determine the cutoffs that we are using.

Currently, we are using some homegrown solutions for this, e.g., matrix/benchmark.py or misc/benchmark.py.
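
asv can parameterize a benchmark over a list of values via params/param_names, which would fit this use case directly. A hypothetical sketch (the computation, matrix size, and algorithm names are made up for illustration):

    from copy import copy

    from sage.all import ZZ, random_matrix

    class TimeCharpolyAlgorithms:
        params = ['linbox', 'generic']   # assumed algorithm keywords
        param_names = ['algorithm']

        def setup(self, algorithm):
            self.m = random_matrix(ZZ, 200, x=-10, y=10)

        def time_charpoly(self, algorithm):
            # work on a copy so that Sage's per-matrix caching of charpoly
            # results does not skew repeated runs
            copy(self.m).charpoly(algorithm=algorithm)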

@mantepse
Collaborator

mantepse commented Feb 2, 2022

comment:64

What is actually the problem with the original goal?

@saraedum
Member Author

saraedum commented Feb 2, 2022

comment:65

Replying to @mantepse:

What is actually the problem with the original goal?

There's no fundamental problem. But doing the CI setup is quite a bit of work.

@saraedum
Member Author

saraedum commented Feb 9, 2023

cc @roed314 @seblabbe @alexjbest @mezzarobba.

@roed314 and I started to work on this again at Sage Days 117.

@mezzarobba
Member

Sorry that I missed the discussion. I'm happy to help too, but will have very little time for that after the end of the Sage Days.
