
Benchmark run time baseline estimation #78

Closed
wants to merge 18 commits

Conversation

FObersteiner
Collaborator

@FObersteiner FObersteiner commented May 12, 2024

addressing #77

  • add example to illustrate the no-op baseline and the potential effect of a baseline correction
  • add baseline timing in benchmark definition and display baseline stats in prettyprint output (can be removed later)
  • add optional baseline correction
  • compare results from different build modes
  • test on Windows, with its low-granularity (100ns) clock
  • test on Mac
  • revise output (console, json)
  • revise documentation

@FObersteiner FObersteiner added the tracking Kepp track of an issue label May 12, 2024
@FObersteiner
Collaborator Author

FObersteiner commented May 12, 2024

Marking this as a draft for now, as it doesn't yet contain an attempt to subtract the baseline from the actual benchmark results:

  • a way to do so that seems reasonable to me is to record a no-op time before each op time, then subtract the average no-op time from the average (total) benchmark time.
  • we could report this as a separate quantity first ("baseline-corrected time/run") and see how this goes.
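The scheme described above could be sketched roughly as follows. This is Python pseudocode rather than Zig, and the `noop` and `bench` names are illustrative, not zBench API; the point is one baseline sample per op sample, with averages subtracted at the end:

```python
import time

def noop():
    pass

def bench(func, runs):
    # Record a no-op timing immediately before each measured run.
    op_times, baseline_times = [], []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        noop()
        t1 = time.perf_counter_ns()
        func()
        t2 = time.perf_counter_ns()
        baseline_times.append(t1 - t0)
        op_times.append(t2 - t1)
    avg_op = sum(op_times) / runs
    avg_baseline = sum(baseline_times) / runs
    # Report the corrected value as a separate quantity; clamp at
    # zero, since a run cannot take negative time.
    corrected = max(avg_op - avg_baseline, 0.0)
    return avg_op, avg_baseline, corrected
```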

Let me know what you think.

@hendriknielaender
Owner

Subtracting the average no-op time from the average benchmark time sounds reasonable.
We could also ship this via an experimental flag in the config; if it's not set, we just report this baseline metric. What do you think?

@FObersteiner
Collaborator Author

FObersteiner commented May 14, 2024

I've just added a demo implementation. Obtaining the baseline timing is very simple:

        const baseline: u64 = t.read();
        t.reset();

It is not subtracted from the results; there is no active baseline correction yet, since I wanted to check first whether this implementation works in principle.

In the demo, the baseline timings are appended to the results. If you run the baseline example, you can observe, for example:

benchmark              runs     total time     time/run (avg ± σ)     (min ... max)                p75        p99        p995       baseline              
------------------------------------------------------------------------------------------------------------------------------------------------------
Bench baseline, n=10   10       1.353us        135ns ± 37ns           (118ns ... 243ns)            125ns      243ns      243ns     150ns ± 43ns           
Bench baseline, n=1k   1000     120.049us      120ns ± 5ns            (112ns ... 251ns)            122ns      130ns      142ns     134ns ± 3ns            
Bench baseline, n=10k  10000    1.334ms        133ns ± 26ns           (97ns ... 2.333us)           140ns      147ns      157ns     150ns ± 83ns           
Bench baseline, n=10M  10000000 216.112ms      21ns ± 5ns             (19ns ... 4.02us)            21ns       37ns       39ns      24ns ± 7ns  

Some observations:

  • baseline vs. "normal" timings agree relatively well, if the standard deviation is taken into account
  • sometimes, time - baseline would be < 0; that of course has to be excluded
  • if you compile in debug mode, there seems to be considerable overhead from the zBench code itself
  • all in all, I think this is going in the right direction; it also indicates the accuracy towards which the benchmarks converge at the low end (a couple of nanoseconds at best), which is fine, I think, as I don't see us aiming to be a micro-benchmarking library.

@FObersteiner FObersteiner self-assigned this May 14, 2024
@FObersteiner
Collaborator Author

FObersteiner commented May 17, 2024

I've just added a simple implementation of the baseline correction; it still looks reasonable to me (every second run has the correction active):

Bench baseline, n=10   10       611ns          61ns ± 21ns            (45ns ... 121ns)             62ns       121ns      121ns     54ns ± 4ns
Bench baseline, n=10   10       19ns           1ns ± 6ns              (0ns ... 19ns)               0ns        19ns       19ns      70ns ± 40ns
Bench baseline, n=1k   1000     62.01us        62ns ± 17ns            (53ns ... 503ns)             63ns       105ns      117ns     66ns ± 14ns
Bench baseline, n=1k   1000     1.263us        1ns ± 16ns             (0ns ... 408ns)              0ns        32ns       34ns      60ns ± 13ns
Bench baseline, n=10k  10000    540.284us      54ns ± 8ns             (41ns ... 143ns)             59ns       85ns       89ns      58ns ± 8ns
Bench baseline, n=10k  10000    3.154us        0ns ± 8ns              (0ns ... 579ns)              0ns        12ns       15ns      28ns ± 8ns

To be fair, the 0 ± something might be a bit confusing, as this is not to say that it goes below zero. I wasn't able to find a hack for faster-than-light travel, sorry ;-)

@FObersteiner FObersteiner marked this pull request as ready for review May 17, 2024 20:41
@hendriknielaender
Owner

hendriknielaender commented May 18, 2024

Should we compare this feature with all build options? I think you mentioned that you tested this in debug mode.

@FObersteiner
Collaborator Author

FObersteiner commented May 18, 2024

> Should we compare this feature with all build options? I think you mentioned that you tested this in debug mode.

added a checkbox :) Also, I'll see how this all looks on Windows this evening (probably). This has been an interesting deep-dive into system clocks already!

@FObersteiner
Collaborator Author

Comparing different optimization modes, it all looks fine to me on Linux. I've put the results here, since posting them as a comment would be a bit clunky.

For Windows, results are here. The baseline correction doesn't look very effective, TBH.

All in all, I think on Linux this can actually help to improve accuracy when benchmarking functions that run very fast, in the nanosecond range. @hendriknielaender could you test this on Mac? As for Windows, the 100ns granularity of the "performance counter" causes trouble; subtracting the average baseline time from the results afterwards, instead of subtracting each no-op time individually in runImpl, might help. However, that would require some refactoring of other parts of zBench, since the config (correction on/off?) cannot easily be accessed from everywhere. My suggestion would be to postpone this, maybe opening an issue to keep track of it.

@hendriknielaender
Owner

Sure, I will test this on Mac 👍

For the Windows case, we could implement a generic experimental flag. This would allow us to enable the feature on Linux and macOS while excluding it on Windows until the necessary refactoring is complete. In the long run, I think we should have a stable baseline calculation that is subtracted by default.

@FObersteiner
Collaborator Author

Here's an idea for how we could do this in one go; it might not be that complicated. I would keep the baseline correction feature as a simple on/off setting in the benchmark config. It should be up to the user to experiment with it, independent of the platform.

The changes I imagine for this, based on the current state of this branch:

  • only record a baseline timing if the feature is enabled, otherwise zero.
  • add a data processor method to the Result type that uses Statistics (which I would keep generic, as it is now) plus baseline correction. If the feature is off, all baseline timings are zero, so subtracting them from the actual timings doesn't change anything; if it is on, the timings will be updated.
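A rough sketch of that data-processor idea, in Python for illustration (`apply_baseline_correction` is a hypothetical name, not zBench code). With the feature off, the baseline samples are all zero and the timings pass through unchanged:

```python
def apply_baseline_correction(timings, baselines):
    # Average no-op time over all baseline samples; zero if the
    # feature was disabled and only zeros were recorded.
    avg_baseline = sum(baselines) / len(baselines)
    # Subtract it from each timing, clamping at zero so a run
    # never shows a negative duration.
    return [max(t - avg_baseline, 0) for t in timings]
```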

@hendriknielaender
Owner

Here are the Mac baseline results; they were run on an old 2016 MacBook on macOS 12.7. I could also add another test with a newer MacBook Pro, but June 3 at the earliest.

@FObersteiner
Collaborator Author

FObersteiner commented May 25, 2024

By the way, the run implementation creates a new Timer every time a function gets benchmarked. This may sound stupid, but I just realized it now; it might be worth investigating whether it would be better, in terms of baseline stability, to instantiate one timer and then query it repeatedly. IIRC, that was how it was implemented in one of the early versions of zBench. For benchmarked functions with very short execution times (i.e. many benchmark runs), that would avoid a lot of system calls (assuming the Zig compiler does not already optimize in this direction).
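To illustrate the reuse pattern (Python for brevity; this `Timer` class only mimics the read/reset interface of Zig's `std.time.Timer` and is not zBench code):

```python
import time

class Timer:
    """Minimal monotonic timer mimicking Zig's std.time.Timer interface."""
    def __init__(self):
        self._start = time.monotonic_ns()

    def read(self):
        # Nanoseconds elapsed since construction or the last reset.
        return time.monotonic_ns() - self._start

    def reset(self):
        self._start = time.monotonic_ns()

# Create one timer and reuse it across all runs, instead of
# constructing a fresh Timer for every benchmarked call.
t = Timer()
samples = []
for _ in range(1000):
    t.reset()
    sum(range(100))          # stand-in for the benchmarked function
    samples.append(t.read())
```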

@hendriknielaender
Owner

You are right. I think we need to move the timer instantiation out of the runImpl function into a higher scope, where it is created once and reused, and then modify runImpl to use the pre-instantiated timer for each benchmark run.

@hendriknielaender
Owner

@FObersteiner how do we want to continue with this PR? I really like the feature of having a more precise benchmark.

@FObersteiner
Collaborator Author

> @FObersteiner how do we want to continue with this PR? I really like the feature of having a more precise benchmark.

To be honest, I didn't reach a point where I thought this was ready to ship. Determining the no-op time turned out to be very hard; results are highly variable. That is true even if you read the CPU's TSC directly via RDTSC. Here's some code that does this: https://codeberg.org/FObersteiner/benchmarks/src/branch/main/src/bench_noop_rdtsc.zig

While this was a nice deep-dive into computer clocks, I find that micro-benchmarking is really hard. Don't trust a benchmarking library that claims to do "micro-benchmarks" and just uses the standard means ^^ So for "normal" benchmarking, I think it is important to be aware that there is an uncertainty of, say, 15-20 nanoseconds. Maybe we can clarify that in the readme. At the moment, I don't see a reliable "baseline correction", one that actually improves precision (or accuracy).

@hendriknielaender
Owner

> To be honest, I didn't reach a point where I thought this was ready to ship. [...] At the moment, I don't see a reliable "baseline correction", one that actually improves precision (or accuracy).

Good point! Then I will close this, and let's add this info to the readme.
