
Benchmark run time baseline estimation #78

Closed
wants to merge 18 commits

Conversation

FObersteiner
Collaborator

@FObersteiner FObersteiner commented May 12, 2024

addressing #77

  • add example to illustrate the no-op baseline and the potential effect of a baseline correction
  • add baseline timing in benchmark definition and display baseline stats in prettyprint output (can be removed later)
  • add optional baseline correction
  • compare results from different build modes
  • test on Windows, with its low-granularity (100ns) clock
  • test on Mac
  • revise output (console, json)
  • revise documentation

@FObersteiner FObersteiner added the tracking Kepp track of an issue label May 12, 2024
@FObersteiner
Collaborator Author

FObersteiner commented May 12, 2024

Marking this as a draft for now, as it doesn't yet contain an attempt to subtract the baseline from the actual benchmark results:

  • a way to do so that seems reasonable to me is to record a no-op time before each op time, then subtract the average no-op time from the average (total) benchmark time.
  • we could report this as a separate quantity first ("baseline-corrected time/run") and see how this goes.
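The scheme described above could be sketched roughly as follows. This is Python pseudocode rather than Zig, and the `noop` and `bench` names are illustrative, not zBench API; the point is one baseline sample per op sample, with averages subtracted at the end:

```python
import time

def noop():
    pass

def bench(func, runs):
    # Record a no-op timing immediately before each measured run.
    op_times, baseline_times = [], []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        noop()
        t1 = time.perf_counter_ns()
        func()
        t2 = time.perf_counter_ns()
        baseline_times.append(t1 - t0)
        op_times.append(t2 - t1)
    avg_op = sum(op_times) / runs
    avg_baseline = sum(baseline_times) / runs
    # Report the corrected value as a separate quantity; clamp at
    # zero, since a run cannot take negative time.
    corrected = max(avg_op - avg_baseline, 0.0)
    return avg_op, avg_baseline, corrected
```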

Let me know what you think.

@hendriknielaender
Owner

Subtracting the average no-op time from the average benchmark time sounds reasonable.
We could also ship this via an experimental flag in the config; if it's not set, we just report this baseline metric. What do you think?

@FObersteiner
Collaborator Author

FObersteiner commented May 14, 2024

I've just added a demo implementation. Obtaining the baseline timing is very simple:

        const baseline: u64 = t.read();
        t.reset();

It is not subtracted from the results; there is no active baseline correction yet, since I wanted to check first whether this implementation works in principle.

In the demo, the baseline timings are appended to the results. If you run the baseline example, you can observe, for example:

benchmark              runs     total time     time/run (avg ± σ)     (min ... max)                p75        p99        p995       baseline              
------------------------------------------------------------------------------------------------------------------------------------------------------
Bench baseline, n=10   10       1.353us        135ns ± 37ns           (118ns ... 243ns)            125ns      243ns      243ns     150ns ± 43ns           
Bench baseline, n=1k   1000     120.049us      120ns ± 5ns            (112ns ... 251ns)            122ns      130ns      142ns     134ns ± 3ns            
Bench baseline, n=10k  10000    1.334ms        133ns ± 26ns           (97ns ... 2.333us)           140ns      147ns      157ns     150ns ± 83ns           
Bench baseline, n=10M  10000000 216.112ms      21ns ± 5ns             (19ns ... 4.02us)            21ns       37ns       39ns      24ns ± 7ns  

Some observations:

  • baseline vs. "normal" timings agree relatively well, if the standard deviation is taken into account
  • sometimes, time - baseline would be < 0; that of course has to be excluded
  • if you compile in debug mode, there seems to be considerable overhead from the zBench code itself
  • all in all, I think this is going in the right direction; it also indicates the accuracy towards which the benchmarks converge at the low end (a couple of nanoseconds at best), which is fine, I think, as I don't see us aiming to be a micro-benchmarking library.

@FObersteiner FObersteiner self-assigned this May 14, 2024
@FObersteiner
Collaborator Author

FObersteiner commented May 17, 2024

I've just added a simple implementation of the baseline correction; it still looks reasonable to me (every second run has the correction active):

Bench baseline, n=10   10       611ns          61ns ± 21ns            (45ns ... 121ns)             62ns       121ns      121ns     54ns ± 4ns
Bench baseline, n=10   10       19ns           1ns ± 6ns              (0ns ... 19ns)               0ns        19ns       19ns      70ns ± 40ns
Bench baseline, n=1k   1000     62.01us        62ns ± 17ns            (53ns ... 503ns)             63ns       105ns      117ns     66ns ± 14ns
Bench baseline, n=1k   1000     1.263us        1ns ± 16ns             (0ns ... 408ns)              0ns        32ns       34ns      60ns ± 13ns
Bench baseline, n=10k  10000    540.284us      54ns ± 8ns             (41ns ... 143ns)             59ns       85ns       89ns      58ns ± 8ns
Bench baseline, n=10k  10000    3.154us        0ns ± 8ns              (0ns ... 579ns)              0ns        12ns       15ns      28ns ± 8ns

To be fair, the 0 ± something might be a bit confusing, as this is not to say that it goes below zero. I wasn't able to find a hack for faster-than-light travel, sorry ;-)

@FObersteiner FObersteiner marked this pull request as ready for review May 17, 2024 20:41
@hendriknielaender
Owner

hendriknielaender commented May 18, 2024

Should we compare this feature with all build options? I think you mentioned that you tested this in debug mode.

@FObersteiner
Collaborator Author

FObersteiner commented May 18, 2024

> Should we compare this feature with all build options? I think you mentioned that you tested this in debug mode.

added a checkbox :) Also, I'll see how this all looks on Windows this evening (probably). This has been an interesting deep-dive into system clocks already!

@FObersteiner
Collaborator Author

Comparing different optimization modes, it all looks fine to me on Linux. I've put the results here, since posting them as a comment would be a bit clunky.

For Windows, results are here. The baseline correction doesn't look very effective, TBH.

All in all, I think on Linux this can actually help to improve accuracy when benchmarking functions that run very fast, in the nanosecond range. @hendriknielaender could you test this on Mac? As for Windows, the 100ns granularity of the "performance counter" causes trouble; subtracting the average baseline time from the results afterwards, instead of subtracting each no-op time individually in runImpl, might help. However, that would require some refactoring of other parts of zBench, since the config (correction on/off?) cannot easily be accessed from everywhere. My suggestion would be to postpone this, maybe opening an issue to keep track of it.

@hendriknielaender
Owner

Sure, I will test this on Mac 👍

For the Windows case, we could implement a generic experimental flag. This would allow us to enable the feature on Linux and macOS while excluding it on Windows until the necessary refactoring is complete. In the long run, I think we should have a stable baseline calculation that is subtracted by default.

@FObersteiner
Collaborator Author

Here's an idea for how we could do this in one go; it might not be that complicated. I would keep the baseline correction feature as a simple on/off setting in the benchmark config. It should be up to the user to experiment with it, independent of the platform.

The changes I imagine for this, based on the current state of this branch:

  • only record a baseline timing if the feature is enabled, otherwise zero.
  • add a data processor method to the Result type that uses Statistics (which I would keep generic, as it is now) plus baseline correction. If the feature is off, all baseline timings are zero, so subtracting them from the actual timings doesn't change anything; if it is on, the timings will be updated.
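A rough sketch of that data-processor idea, in Python for illustration (`apply_baseline_correction` is a hypothetical name, not zBench code). With the feature off, the baseline samples are all zero and the timings pass through unchanged:

```python
def apply_baseline_correction(timings, baselines):
    # Average no-op time over all baseline samples; zero if the
    # feature was disabled and only zeros were recorded.
    avg_baseline = sum(baselines) / len(baselines)
    # Subtract it from each timing, clamping at zero so a run
    # never shows a negative duration.
    return [max(t - avg_baseline, 0) for t in timings]
```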

@hendriknielaender
Owner

Here are the Mac baseline results; they were run on an old 2016 MacBook on macOS 12.7. I could also add another test with a newer MacBook Pro, but June 3 at the earliest.

@FObersteiner
Collaborator Author

FObersteiner commented May 25, 2024

By the way, the run implementation creates a new Timer every time a function gets benchmarked. This may sound stupid, but I just realized it now; it might be worth investigating whether it would be better, in terms of baseline stability, to instantiate one timer and then query it repeatedly. IIRC, that was how it was implemented in one of the early versions of zBench. For benchmarked functions with very short execution times (i.e. many benchmark runs), that would avoid a lot of system calls (assuming the Zig compiler does not already optimize in this direction).
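To illustrate the reuse pattern (Python for brevity; this `Timer` class only mimics the read/reset interface of Zig's `std.time.Timer` and is not zBench code):

```python
import time

class Timer:
    """Minimal monotonic timer mimicking Zig's std.time.Timer interface."""
    def __init__(self):
        self._start = time.monotonic_ns()

    def read(self):
        # Nanoseconds elapsed since construction or the last reset.
        return time.monotonic_ns() - self._start

    def reset(self):
        self._start = time.monotonic_ns()

# Create one timer and reuse it across all runs, instead of
# constructing a fresh Timer for every benchmarked call.
t = Timer()
samples = []
for _ in range(1000):
    t.reset()
    sum(range(100))          # stand-in for the benchmarked function
    samples.append(t.read())
```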

@hendriknielaender
Owner

You are right. I think we need to move the timer instantiation out of the runImpl function into a higher scope, where it is created once and reused, and then modify runImpl to use the pre-instantiated timer for each benchmark run.

@hendriknielaender
Owner

@FObersteiner how do we want to continue with this PR? I really like the feature of having a more precise benchmark.

@FObersteiner
Collaborator Author

> @FObersteiner how do we want to continue with this PR? I really like the feature of having a more precise benchmark.

To be honest, I didn't reach a point where I thought this was ready to ship. Determining the no-op time turned out to be very hard; results are highly variable. That is true even if you read the CPU's TSC directly via RDTSC. Here's some code that does this: https://codeberg.org/FObersteiner/benchmarks/src/branch/main/src/bench_noop_rdtsc.zig

While this was a nice deep-dive into computer clocks, I find that micro-benchmarking is really hard. Don't trust a benchmarking library that claims to do "micro-benchmarks" and just uses the standard means ^^ So for "normal" benchmarking, I think it is important to be aware that there is an uncertainty of, say, 15-20 nanoseconds. Maybe we can clarify that in the readme. At the moment, I don't see a reliable "baseline correction", one that actually improves precision (or accuracy).

@hendriknielaender
Owner

> To be honest, I didn't reach a point where I thought this was ready to ship. [...] At the moment, I don't see a reliable "baseline correction", one that actually improves precision (or accuracy).

Good point! Then I will close this, and let's add this info to the readme.
