Revive performance regression testing? #11709
Here are some things about my dream benchmark suite (maybe too much work, but one can hope):
To summarize, we need something very robust that we can trust. I don't want to end up where we are right now, with everyone handwaving "unrelated" CI crashes, which then lets us introduce actual bugs that should have been caught. It is starting to irritate me a bit :-)
@carnaval What about the Phoronix Test Suite? I've never used it myself, but it seems to be pretty good at comparing two things (commits), and they are pretty serious about it (so they are probably doing the statistics right). A timeline like what codespeed provides would still be nice, though.
Couldn't a first step be to simply revive http://speed.julialang.org? Was the problem a lack of finances for hardware, or was there significant manual work needed to keep the site up? I think it is easier to get people to write performance tests if they have some pretty graphs/numbers to look at. Getting something decent going is good enough while being on the way to the "performance testing nirvana" @carnaval mentioned.
All great points. My take on each issue:
+1 for porting Criterion; I have heard nothing but praise for that library. Based on @KristofferC's comments in #11683 (comment), I will retract my suggestion of looking into CDash. It looked like it could save some work by just providing a place to put large amounts of historical test data, but if it's a Jenkins-style quagmire of configuration files to set up, then never mind. It also looks like they only store 30 days' worth of data on free plans. This jogged my memory: there was a past issue on this at #9456, which was closed without any particular resolution.
I've got about half of the work done to get something inspired by Criterion finished. Should have final results tomorrow night.
Speed.julialang.org was @staticfloat's baby. At the time we didn't have the build infrastructure we have now, so I think that because it was trying to run on every commit and was also dealing with build issues, there were a lot of problems. Many of the perf tests didn't seem to be set up to run long enough either, creating some spurious results. It'd probably be easier to get something working now (using e.g. generic nightly Linux binaries, like PackageEvaluator does).
Would love to write some perf tests too, especially for natural-but-not-highest-possible-perf code (the kind you see on Stack Overflow). Only worth it if we're tracking them.
On the infrastructure side of plumbing up the GitHub webhooks, we should probably borrow some tools from Rust and use https://github.com/barosl/homu to help address JuliaCI/julia-buildbot#1. The buildbots we have at http://buildbot.e.ip.saba.us:8010/builders run surprisingly often on master. To start with we should work out a method of pinging a
And I was messing around with this: https://github.com/MechaJulia/MechaJulia for a while
I believe I have @julia-bot with a dash.
MechaJulia's not a bad name, but dash vs. no dash would get misspelled way too often. Realistically, buildbot and other projects have already done a lot of work here; there are plenty of off-the-shelf solutions we can start from. The inspired-by-Criterion benchmarking harness should probably be written in Julia, but most of the other infrastructure shouldn't, IMO.
Totally agree with that perspective.
Didn't manage to finish this work tonight, but the core ideas are in place at https://github.com/johnmyleswhite/Benchmarks.jl. I'd like to make it run a lot faster, but it currently does satisfy my core design constraints:
There's still a lot more work to make it nice to use and, more importantly, to produce correct confidence intervals, but I'm hopeful this design should improve upon both the Benchmark.jl and BenchmarkLite.jl packages.
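For readers who haven't used Criterion: the core idea is to time batches of increasing size and regress elapsed time on batch size, so the per-evaluation cost comes from the slope of a fit rather than a single noisy measurement. Below is a minimal sketch of that idea in Julia; it is illustrative only and is not the actual Benchmarks.jl API.

```julia
# Sketch of the Criterion-style approach (hypothetical, not Benchmarks.jl's API):
# time batches of increasing size and regress elapsed time on batch size, so the
# per-evaluation cost is the slope and the intercept absorbs fixed timer overhead.
function estimate_ns_per_eval(f; batch_sizes = [2^k for k in 0:12])
    elapsed = Float64[]
    for n in batch_sizes
        t0 = time_ns()
        for _ in 1:n
            f()
        end
        push!(elapsed, Float64(time_ns() - t0))
    end
    X = hcat(ones(length(batch_sizes)), float(batch_sizes))
    intercept, slope = X \ elapsed   # ordinary least squares fit
    return slope                     # estimated nanoseconds per evaluation
end

estimate_ns_per_eval(() -> sum(rand(1000)))
```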
I'm late to the party! Here were/are the challenges with speed.julialang.org:
Yeah, if you're willing to dedicate a machine, that would be awesome. My personal feeling is that doing testing and benchmarking right is one of the best ways that Julia could spend the money that's donated via NumFocus, so I'd like to think we could acquire at least one dedicated machine for Julia as an institution. I think using codespeed is better than nothing. I guess my question is: do you think their data schema is bad or just their interface? I would be just as happy throwing up a few trivial static graphs that would be easy to generate. The interactive interface doesn't interest me that much.
Let's settle on the metrics we want to report, and then evaluate whether codespeed in its current incarnation will serve the purpose properly.
@johnmyleswhite, I'd definitely be in favor of static graphs for keeping the maintenance load on any one person as small as possible. If someone wants to help out with Codespeed, there is going to be a fixed cost to get oriented that might not be worth it. Just a thought.
Perhaps we can make life easy for ourselves by having a simple webpage with, for example, Gadfly plots embedded that show the general trends, and then just have the full data for the last N days available for download. The reason I mention Gadfly is that it is really, really nice to be able to hover over the first datapoint that shows a regression and see what commit hash that was.
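As a rough illustration of what such a nightly job could look like (the column names, file layout, and script here are assumptions for the sketch, not an existing setup; readtable was the DataFrames CSV reader at the time):

```julia
# Hypothetical nightly step: read the accumulated timing data and render a static
# trend plot that the status page embeds; the raw CSV is published alongside it.
# Column names (:date, :benchmark, :median_ns) are made up for this sketch.
using DataFrames, Gadfly

results = readtable("benchmark_history.csv")

p = plot(results, x = :date, y = :median_ns, color = :benchmark,
         Geom.line, Guide.ylabel("median time (ns)"))

draw(SVG("trends.svg", 10inch, 5inch), p)
```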
I'm on board with all of that.
That's the http://pkg.julialang.org/pulse.html philosophy! 👍
I just took a look at how pkg.julialang.org is run. That is kind of hilarious.
Indeed :D For those who don't know, the process is:
#devops
Can I get more details on step 2, please?
Essentially, runs this script: That last bit is a lot of the madness, and it was from before
I like the simplicity of Chapel's performance tracking website.
Somewhat on-topic: the choice of mean for measuring relative performance is a pretty interesting topic (to me at least). Here is one paper: http://mprc.pku.cn/courses/organization/spring2008/PaperReading/Performance%20and%20Evaluation/Characterizing%20computer%20performance%20with%20a%20single%20number.pdf I see people comparing relative performance with arithmetic means worryingly often...
I would argue that the arithmetic mean is generally what you want since it's the only number that trivially extrapolates to the long-term cost of evaluating something repeatedly in production. Cf. Radford Neal's disapproval of the use of medians in one R microbenchmarking paper: https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/ In an ideal world, you'd want to compare the full probability distributions of timings for stochastic dominance, but that's rarely viable. FWIW, I think the main problem with relative performance via arithmetic means isn't a problem with arithmetic means per se, but with the lack of application of a proper ratio estimator: https://en.wikipedia.org/wiki/Ratio_estimator
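To make the ratio-estimator point concrete, the distinction is between averaging the per-benchmark ratios and taking the ratio of the totals. A tiny sketch with made-up timings:

```julia
# Made-up timings (seconds) for the same three benchmarks under two Julia versions.
t_old = [0.10, 2.00, 0.50]
t_new = [0.05, 2.20, 0.25]

# Mean of per-benchmark ratios: weights every benchmark equally, regardless of cost.
mean_of_ratios = sum(t_new ./ t_old) / length(t_old)

# Ratio estimator (ratio of totals): extrapolates to the long-run cost of running
# the whole suite repeatedly, so cheap benchmarks can't dominate the summary.
ratio_of_totals = sum(t_new) / sum(t_old)
```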
Arithmetic mean over samples on a single benchmark, sure. I meant comparing Julia 0.4 vs Julia 0.3 on a suite of metrics. Taking the arithmetic mean for the average speedup of Julia 0.4 over 0.3 on a suite of benchmarks wouldn't be right. That ratio estimator thing is neat; I need to look at that.
I think I'm confused. You're taking the arithmetic mean over a suite of benchmarks where speedups are measured in absolute units or percent change?
Like I have 10 benchmarks, and let's just say total time for Julia 0.3 and Julia 0.4 on those benchmarks, totaled over 100 samples. For each benchmark I have the ratio J4 / J3, which (hopefully) measures how much faster Julia 0.4 is on that benchmark. To get a single number for the performance improvement, the arithmetic mean of those ratios would not be a good idea (from my understanding), and can give some non-intuitive results that don't really capture what was intended. There seems to be debate over whether the geometric mean or harmonic mean is more appropriate.
So percentage change, not absolute.
Isn't the arithmetic mean over all of the ratios just some scalarized attempt to solve a multi-objective optimization problem -- and hence one that at least guarantees that your induced objective function would always lead to Pareto improvements? I need to read that paper in depth, but I find it hard to imagine a priori that geometric means or harmonic means are a big improvement when the core difficulty is that you can't know what the relative importance of each benchmark is.
That article from Radford Neal is interesting, but it seems to me that the reason he prefers the mean is that the system has a lot of hidden costs (garbage collection) that cause large performance drops on some of the samples, and that needs to be taken into account.
I agree with @johnmyleswhite (at least about the parts I understand!): you can't just take a mean of different benchmark percentages or ratios; the number you get is meaningless unless you have different sets of weightings for the results, to match how much those operations are used in different use cases.
@johnmyleswhite that's an interesting perspective... I guess one thing that is clear is that the arithmetic mean isn't inherently wrong without some definition of what wrong (or right) is. I've basically seen and done some playing around myself and found that the arithmetic mean did some fairly unintuitive things when the ratios are not uniformly >= or <= 1, and that the geometric mean was a better way to capture it in a single number - I think that's what these papers are getting at. Would be interesting to formalize it, though.
Still more work to be done, but I think the basic design of the Criterion-style benchmarking library is close to finished: https://github.com/johnmyleswhite/Benchmarks.jl/
@IainNZ I think you really want to use (possibly weighted) geometric means when you're working with ratios. If on one machine 0.4 ran twice as fast as 0.3, and on another machine twice as slow, the mean should be 1, i.e. no change overall.
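Spelled out for that two-machine example (nothing here is specific to any package, just the arithmetic):

```julia
r = [2.0, 0.5]                        # speedup ratios: 2x faster on one machine, 2x slower on the other

arithmetic = sum(r) / length(r)       # 1.25: suggests a 25% average speedup
geometric  = prod(r)^(1 / length(r))  # 1.0:  no overall change, matching intuition
harmonic   = length(r) / sum(1 ./ r)  # 0.8:  suggests a 20% slowdown
```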
To back up what @nalimilan said, search for the paper "How not to lie with statistics: the correct way to summarize benchmark results".
This being the internet, we can also just link to things: Fleming1986 is the paper in question.
Just to drive you all a bit crazy 😀, have you thought of multiprocess benchmarking?
Ahh @GlenHertz @pao, that was the paper I was thinking of! @nalimilan, yes, that was what I was saying.
FWIW, here's how I think one should handle these kinds of problems: http://eytan.github.io/benchmarking.pdf
At least in operations research (surprised Iain hasn't mentioned this yet), the standard presentation format for benchmarks over large test sets with a variety of large and small problem instances is Dolan-Moré performance profiles (http://arxiv.org/pdf/cs/0102001.pdf). That format of presenting results is much more valuable when you're comparing more than 2 alternatives, though.
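For anyone who hasn't run into them, a performance profile is straightforward to compute. A hedged sketch (the matrix layout and threshold grid here are just illustrative, not from any existing package):

```julia
# times[p, s] = time of version/solver s on problem p (layout assumed for this sketch).
# The profile value for s at threshold tau is the fraction of problems on which s is
# within a factor tau of the best performer on that problem (Dolan & Moré, 2002).
function performance_profile(times, taus)
    nprob, nsolver = size(times)
    best = [minimum(times[p, :]) for p in 1:nprob]   # best time for each problem
    ratios = times ./ best                           # every entry is >= 1
    return [sum(ratios[:, s] .<= tau) / nprob for tau in taus, s in 1:nsolver]
end

# e.g. performance_profile(rand(20, 3) .+ 1.0, 1.0:0.25:3.0)
```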
Oh man, how did I forget that! Seriously, performance profiles are some interesting stuff once you wrap your head around them.
Just to add to the confusion: Citron et al., in their "The harmonic or geometric mean: does it really matter?", have an interesting historical timeline of "The War of the Means," which seems to have been going on for a while in the hardware community. I also kinda like Bast and Weber's "Don't Compare Averages". I think Eugster, Hothorn and Leisch have some interesting solutions in their "Exploratory and Inferential Analysis of Benchmark Experiments", which they have implemented in the R package benchmark. Many of these solutions avoid the use of averages by summarizing all the data in other ways, e.g., histograms of the occurrence of each of the compared methods at each ranking, from first to last, or by constructing a consensus ranking across the various benchmarks (although that raises the question of how to weigh them in constructing the consensus, I guess). BTW: Kudos & thanks to @johnmyleswhite for
see #13893 |
To repeat what others have said, you've done a really impressive job putting this together @jrevels. :applause:
Thanks!
As discussed in #11683, it would be great to revive some kind of performance regression testing so that we don't slowly accumulate perf problems.
In an ideal world, we'd standardize on a system that non-Base packages could use as well. But for now it would be good to just get Base being tested for perf rigorously again.