Benchmark noise #551
While not as noisy as the example above, the benchmarks do seem to be noisier than they used to be.
This example is a little noisier than usual, but not much. Here is the comparison of the two commits. This is around +/- 3% for the 5th and 95th percentiles. When I did the original A/A test, I saw +/- 2.5% variance. (The better 2% variance there holds only if you use the exact same binary -- PGO and recompilation in general always seem to introduce some variance.) It probably wouldn't hurt to automate the A/A test so we are always aware of these limits and include this information in every plot. In short, I don't think this is new. The work we can do is to improve the variance of individual benchmarks where possible (as I've done recently for mypy, for example). I've been tackling them one by one, starting with the most variable.
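For concreteness, here is a minimal sketch of what an automated A/A check could look like, assuming per-benchmark mean timings from the two runs are available as plain name-to-seconds mappings. The data layout and the `aa_spread` name are illustrative assumptions, not the actual pyperformance output format:

```python
import numpy as np


def aa_spread(run_a: dict[str, float], run_b: dict[str, float]) -> tuple[float, float]:
    """Return the 5th and 95th percentiles of per-benchmark timing ratios."""
    common = sorted(run_a.keys() & run_b.keys())
    ratios = np.array([run_b[name] / run_a[name] for name in common])
    low, high = np.percentile(ratios, [5, 95])
    # A spread of roughly (0.97, 1.03) corresponds to the +/- 3% noise
    # floor described above.
    return float(low), float(high)
```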
Short of improving the benchmarks themselves, we could flag known-problematic ones, or reduce the weight of ones that perform poorly in A/A testing.
Given that the typical time to run the benchmarks is about an hour and only 20 minutes or so is actually spent running benchmarks (IIRC), we could run more iterations. I don't know how much that would help.
The majority of that 50 minutes is installing dependencies, not running benchmarks.
I think running the benchmarks may only be 15 minutes or so.
That reminds me -- I think there may be a bug in pyperformance: it creates a separate virtual environment for each benchmark (fine), but I think each has the union of all of the dependencies for all of the benchmarks. I'm going to put together a proper bug report and see if anyone can confirm whether that's a bug. If so, it seems like a lot of extra work.
Unfortunately, no. The log is helpfully annotated with timestamps. It looks from this like 13:56 to 14:01 (5 minutes) is spent installing dependencies (dependencies are cached, so it's basically just copying files within the same SSD). The rest, until 14:52, is spent running the benchmarks.
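As an illustration of this kind of log analysis, here is a rough sketch that attributes elapsed time to an "install" or "run" phase from timestamped lines. The `HH:MM` prefix and the phase-marker strings are assumptions made for the example, not the actual pyperformance log format:

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{2}:\d{2})\b")  # assumed HH:MM prefix on each line


def phase_durations(lines: list[str]) -> dict[str, float]:
    """Return minutes spent in the 'install' and 'run' phases (ignores midnight rollover)."""
    durations = {"install": 0.0, "run": 0.0}
    phase, last = None, None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%H:%M")
        if phase is not None and last is not None:
            durations[phase] += (stamp - last).total_seconds() / 60
        # Hypothetical phase markers; the real log wording will differ.
        if "Installing" in line:
            phase = "install"
        elif "Running benchmark" in line:
            phase = "run"
        last = stamp
    return durations
```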
I don't think it's a bug. It tries to use the same venv whenever it can, and only creates a unique venv if there are dependency conflicts. Seems like an optimization on purpose, actually.
Some observations from playing with the data:
I also played around with some different heuristics for creating the "master distribution" from all of the benchmarks.
The results of these heuristics for Guido's two runs, and the diffs between them, are shown in the attached plots.
The bottom line is that even when the individual benchmarks leap all over the place, the "one big number" is remarkably stable, and the exact heuristic doesn't matter too much. As a side effort, I think automating the collection of A/A test data and surfacing it in these plots would be helpful. For example, we can add error bars (in red) to each benchmark and change the color depending on whether the mean is inside or outside of them. This would be particularly useful if a change is targeting a specific benchmark. [1] https://github.com/faster-cpython/benchmarking#analysis-changes
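As a sketch of the aggregation being discussed, the "one big number" can be computed as a geometric mean of per-benchmark timing ratios, with an optional exclusion set. The `EXCLUDED` placeholder below stands in for whatever list of high-variance benchmarks the A/A data suggests; it is not the real list, and the function name is just for illustration:

```python
import math

# Placeholder for benchmarks flagged as too noisy by A/A testing;
# the real set would come from the analysis above.
EXCLUDED = frozenset({"hypothetical_noisy_benchmark"})


def overall_change(base: dict[str, float], head: dict[str, float],
                   excluded: frozenset[str] = EXCLUDED) -> float:
    """Geometric mean of head/base timing ratios over shared, non-excluded benchmarks."""
    names = (base.keys() & head.keys()) - excluded
    log_sum = sum(math.log(head[n] / base[n]) for n in names)
    return math.exp(log_sum / len(names))
```

Under this kind of aggregation, a result of, say, 1.004 would correspond to the roughly 0.4% slowdown mentioned in the next comment.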
Wow, thanks for doing those experiments! Indeed, I didn't realize that the x scales were different in the two plots. It looks like my change actually makes things 0.4% slower (not the end of the world, but I had hoped there wouldn't be a difference). It also looks like the first run was just a lot noisier than the second. I guess we'll never know why. If we're agreeing that EXCLUDED is the best heuristic to get more stable numbers, should we perhaps just delete those seven benchmarks? Or give them a separate tag that excludes them from the "geometric mean" calculation? (We'd have to indicate that somehow in how we display the data.)
Yes, I think the best solution is to either delete or ignore high-variability benchmarks. I suppose ignoring is better than not running them at all, because they are still a signal of something, just a very imprecise one. With everything above, I was just comparing the two commits (with identical source) you listed in the OP. It occurred to me over the weekend that we actually have a bunch of results from CPython main, over which the overall change is very close to 1.0, that would serve as a much larger and less anecdotal A/A test. I took the 57 commits from main that we've run since October 22, did a pairwise comparison of each, and then summed the results. When you have this much data, each benchmark has a mean extremely close to 1.0, which is nice. Sorting by standard deviation (largest at the top), you can see which benchmarks tend to have more variation. It's a slightly different set than what I got from Guido's two commits, but it is probably a much better way to determine which benchmarks are just adding noise.
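A sketch of that pairwise analysis might look like the following, assuming the per-commit results are available as a chronological list of name-to-seconds mappings. For simplicity this compares adjacent runs rather than every pair, and `rank_by_noise` is a hypothetical name:

```python
import statistics


def rank_by_noise(runs: list[dict[str, float]]) -> list[tuple[str, float, float]]:
    """Return (benchmark, mean ratio, stdev of ratio), noisiest first."""
    ratios: dict[str, list[float]] = {}
    for prev, curr in zip(runs, runs[1:]):
        for name in prev.keys() & curr.keys():
            ratios.setdefault(name, []).append(curr[name] / prev[name])
    rows = [(name, statistics.fmean(r), statistics.stdev(r))
            for name, r in ratios.items() if len(r) > 1]
    # Largest standard deviation first: these are the benchmarks that
    # contribute the most noise to commit-to-commit comparisons.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```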
Anecdotally, we have observed "unexplained" variability in Instagram perf experiments in A/A' tests (with A and A' being independent builds from the same commit).
Thanks. I certainly haven't experimented with BOLT at all, but it would be worth trying to run the above experiment with BOLT and see how it compares.
IIUC @corona10 has BOLT experience.
I will try to help @mdboom if he needs it.
For CPython, we are currently using a subset of the standard unit tests (same as PGO), so if @itamaro means that we need to get the profile from pyperformance (LBR sampling), please refer to my experimentation notes from the first attempts. They are fully different scenarios with different outputs. Note: #224
Thanks for the help and pointers, @corona10! I'm fine with using the subset of unit tests for optimization for now. I'm mostly interested in seeing if the variability of the tests decreases when using BOLT, not in absolute performance improvements. I think using a different set of benchmarks for optimization training is a separate issue that can be looked at independently.
Right, the question of training workload is independent (and was discussed in #99) - here I'm just thinking about variability.
cc @aaupov (you may be interested in this topic)
Agree. If anything, BOLT could be a source of extra variability, which can be countered by using a deterministic profile (simply saving/checking in the profile that was used).
I am despairing at the noisiness of our benchmarks. For example, here are two different runs from identical code (I forgot to run pystats, so I re-ran the benchmark, and since yesterday's run one commit was added and then reverted):
The commit merge base is the same too.
How can we derive useful signal from this data?