Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate nonius to provide more advanced benchmarking #1616

Merged
merged 3 commits into from
Jun 7, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
249 changes: 249 additions & 0 deletions docs/benchmarks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,249 @@
# Authoring benchmarks

Writing benchmarks is not easy. Catch simplifies certain aspects but you'll
always need to take care about various aspects. Understanding a few things about
the way Catch runs your code will be very helpful when writing your benchmarks.

First off, let's go over some terminology that will be used throughout this
guide.

- *User code*: user code is the code that the user provides to be measured.
- *Run*: one run is one execution of the user code.
- *Sample*: one sample is one data point obtained by measuring the time it takes
to perform a certain number of runs. One sample can consist of more than one
run if the clock available does not have enough resolution to accurately
measure a single run. All samples for a given benchmark execution are obtained
with the same number of runs.

## Execution procedure

Now I can explain how a benchmark is executed in Catch. There are three main
steps, though the first does not need to be repeated for every benchmark.

1. *Environmental probe*: before any benchmarks can be executed, the clock's
resolution is estimated. A few other environmental artifacts are also estimated
at this point, like the cost of calling the clock function, but they almost
never have any impact in the results.

2. *Estimation*: the user code is executed a few times to obtain an estimate of
the amount of runs that should be in each sample. This also has the potential
effect of bringing relevant code and data into the caches before the actual
measurement starts.

3. *Measurement*: all the samples are collected sequentially by performing the
number of runs estimated in the previous step for each sample.

This already gives us one important rule for writing benchmarks for Catch: the
benchmarks must be repeatable. The user code will be executed several times, and
the number of times it will be executed during the estimation step cannot be
known beforehand since it depends on the time it takes to execute the code.
User code that cannot be executed repeatedly will lead to bogus results or
crashes.

## Benchmark specification

Benchmarks can be specified anywhere inside a Catch test case.
There is a simple and a slightly more advanced version of the `BENCHMARK` macro.

Let's have a look how a naive Fibonacci implementation could be benchmarked:
```c++
std::uint64_t Fibonacci(std::uint64_t number) {
return number < 2 ? 1 : Fibonacci(number - 1) + Fibonacci(number - 2);
}
```
Now the most straight forward way to benchmark this function, is just adding a `BENCHMARK` macro to our test case:
```c++
TEST_CASE("Fibonacci") {
CHECK(Fibonacci(0) == 1);
// some more asserts..
CHECK(Fibonacci(5) == 8);
// some more asserts..

// now let's benchmark:
BENCHMARK("Fibonacci 20") {
return Fibonacci(20);
};

BENCHMARK("Fibonacci 25") {
return Fibonacci(25);
};

BENCHMARK("Fibonacci 30") {
return Fibonacci(30);
};

BENCHMARK("Fibonacci 35") {
return Fibonacci(35);
};
}
```
There's a few things to note:
- As `BENCHMARK` expands to a lambda expression it is necessary to add a semicolon after
the closing brace (as opposed to the first experimental version).
- The `return` is a handy way to avoid the compiler optimizing away the benchmark code.

Running this already runs the benchmarks and outputs something similar to:
```
-------------------------------------------------------------------------------
Fibonacci
-------------------------------------------------------------------------------
C:\path\to\Catch2\Benchmark.tests.cpp(10)
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
Fibonacci 20 100 416439 83.2878 ms
2 ns 2 ns 2 ns
0 ns 0 ns 0 ns

Fibonacci 25 100 400776 80.1552 ms
3 ns 3 ns 3 ns
0 ns 0 ns 0 ns

Fibonacci 30 100 396873 79.3746 ms
17 ns 17 ns 17 ns
0 ns 0 ns 0 ns

Fibonacci 35 100 145169 87.1014 ms
468 ns 464 ns 473 ns
21 ns 15 ns 34 ns
```

### Advanced benchmarking
The simplest use case shown above, takes no arguments and just runs the user code that needs to be measured.
However, if using the `BENCHMARK_ADVANCED` macro and adding a `Catch::Benchmark::Chronometer` argument after
the macro, some advanced features are available. The contents of the simple benchmarks are invoked once per run,
while the blocks of the advanced benchmarks are invoked exactly twice:
once during the estimation phase, and another time during the execution phase.

```c++
BENCHMARK("simple"){ return long_computation(); };

BENCHMARK_ADVANCED("advanced")(Catch::Benchmark::Chronometer meter) {
set_up();
meter.measure([] { return long_computation(); });
};
```

These advanced benchmarks no longer consist entirely of user code to be measured.
In these cases, the code to be measured is provided via the
`Catch::Benchmark::Chronometer::measure` member function. This allows you to set up any
kind of state that might be required for the benchmark but is not to be included
in the measurements, like making a vector of random integers to feed to a
sorting algorithm.

A single call to `Catch::Benchmark::Chronometer::measure` performs the actual measurements
by invoking the callable object passed in as many times as necessary. Anything
that needs to be done outside the measurement can be done outside the call to
`measure`.

The callable object passed in to `measure` can optionally accept an `int`
parameter.

```c++
meter.measure([](int i) { return long_computation(i); });
```

If it accepts an `int` parameter, the sequence number of each run will be passed
in, starting with 0. This is useful if you want to measure some mutating code,
for example. The number of runs can be known beforehand by calling
`Catch::Benchmark::Chronometer::runs`; with this one can set up a different instance to be
mutated by each run.

```c++
std::vector<std::string> v(meter.runs());
std::fill(v.begin(), v.end(), test_string());
meter.measure([&v](int i) { in_place_escape(v[i]); });
```

Note that it is not possible to simply use the same instance for different runs
and resetting it between each run since that would pollute the measurements with
the resetting code.

It is also possible to just provide an argument name to the simple `BENCHMARK` macro to get
the same semantics as providing a callable to `meter.measure` with `int` argument:

```c++
BENCHMARK("indexed", i){ return long_computation(i); };
```

### Constructors and destructors

All of these tools give you a lot mileage, but there are two things that still
need special handling: constructors and destructors. The problem is that if you
use automatic objects they get destroyed by the end of the scope, so you end up
measuring the time for construction and destruction together. And if you use
dynamic allocation instead, you end up including the time to allocate memory in
the measurements.

To solve this conundrum, Catch provides class templates that let you manually
construct and destroy objects without dynamic allocation and in a way that lets
you measure construction and destruction separately.

```c++
BENCHMARK_ADVANCED("construct")(Catch::Benchmark::Chronometer meter)
{
std::vector<Catch::Benchmark::storage_for<std::string>> storage(meter.runs());
meter.measure([&](int i) { storage[i].construct("thing"); });
})

BENCHMARK_ADVANCED("destroy", [](Catch::Benchmark::Chronometer meter)
{
std::vector<Catch::Benchmark::destructable_object<std::string>> storage(meter.runs());
for(auto&& o : storage)
o.construct("thing");
meter.measure([&](int i) { storage[i].destruct(); });
})
```

`Catch::Benchmark::storage_for<T>` objects are just pieces of raw storage suitable for `T`
objects. You can use the `Catch::Benchmark::storage_for::construct` member function to call a constructor and
create an object in that storage. So if you want to measure the time it takes
for a certain constructor to run, you can just measure the time it takes to run
this function.

When the lifetime of a `Catch::Benchmark::storage_for<T>` object ends, if an actual object was
constructed there it will be automatically destroyed, so nothing leaks.

If you want to measure a destructor, though, we need to use
`Catch::Benchmark::destructable_object<T>`. These objects are similar to
`Catch::Benchmark::storage_for<T>` in that construction of the `T` object is manual, but
it does not destroy anything automatically. Instead, you are required to call
the `Catch::Benchmark::destructable_object::destruct` member function, which is what you
can use to measure the destruction time.

### The optimizer

Sometimes the optimizer will optimize away the very code that you want to
measure. There are several ways to use results that will prevent the optimiser
from removing them. You can use the `volatile` keyword, or you can output the
value to standard output or to a file, both of which force the program to
actually generate the value somehow.

Catch adds a third option. The values returned by any function provided as user
code are guaranteed to be evaluated and not optimised out. This means that if
your user code consists of computing a certain value, you don't need to bother
with using `volatile` or forcing output. Just `return` it from the function.
That helps with keeping the code in a natural fashion.

Here's an example:

```c++
// may measure nothing at all by skipping the long calculation since its
// result is not used
BENCHMARK("no return"){ long_calculation(); };

// the result of long_calculation() is guaranteed to be computed somehow
BENCHMARK("with return"){ return long_calculation(); };
```

However, there's no other form of control over the optimizer whatsoever. It is
up to you to write a benchmark that actually measures what you want and doesn't
just measure the time to do a whole bunch of nothing.

To sum up, there are two simple rules: whatever you would do in handwritten code
to control optimization still works in Catch; and Catch makes return values
from user code into observable effects that can't be optimized away.

<i>Adapted from nonius' documentation.</i>
49 changes: 41 additions & 8 deletions docs/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,10 @@
[Specify a seed for the Random Number Generator](#specify-a-seed-for-the-random-number-generator)<br>
[Identify framework and version according to the libIdentify standard](#identify-framework-and-version-according-to-the-libidentify-standard)<br>
[Wait for key before continuing](#wait-for-key-before-continuing)<br>
[Specify multiples of clock resolution to run benchmarks for](#specify-multiples-of-clock-resolution-to-run-benchmarks-for)<br>
[Specify the number of benchmark samples to collect](#specify-the-number-of-benchmark-samples-to-collect)<br>
[Specify the number of benchmark resamples for bootstrapping](#specify-the-number-of-resamples-for-bootstrapping)<br>
[Specify the confidence interval for bootstrapping](#specify-the-confidence-interval-for-bootstrapping)<br>
[Disable statistical analysis of collected benchmark samples](#disable-statistical-analysis-of-collected-benchmark-samples)<br>
[Usage](#usage)<br>
[Specify the section to run](#specify-the-section-to-run)<br>
[Filenames as tags](#filenames-as-tags)<br>
Expand Down Expand Up @@ -57,7 +60,10 @@ Click one of the following links to take you straight to that option - or scroll
<a href="#rng-seed"> ` --rng-seed`</a><br />
<a href="#libidentify"> ` --libidentify`</a><br />
<a href="#wait-for-keypress"> ` --wait-for-keypress`</a><br />
<a href="#benchmark-resolution-multiple"> ` --benchmark-resolution-multiple`</a><br />
<a href="#benchmark-samples"> ` --benchmark-samples`</a><br />
<a href="#benchmark-resamples"> ` --benchmark-resamples`</a><br />
<a href="#benchmark-confidence-interval"> ` --benchmark-confidence-interval`</a><br />
<a href="#benchmark-no-analysis"> ` --benchmark-no-analysis`</a><br />
<a href="#use-colour"> ` --use-colour`</a><br />

</br>
Expand Down Expand Up @@ -267,13 +273,40 @@ See [The LibIdentify repo for more information and examples](https://github.com/
Will cause the executable to print a message and wait until the return/ enter key is pressed before continuing -
either before running any tests, after running all tests - or both, depending on the argument.

<a id="benchmark-resolution-multiple"></a>
## Specify multiples of clock resolution to run benchmarks for
<pre>--benchmark-resolution-multiple &lt;multiplier&gt;</pre>
<a id="benchmark-samples"></a>
## Specify the number of benchmark samples to collect
<pre>--benchmark-samples &lt;# of samples&gt;</pre>

When running benchmarks the clock resolution is estimated. Benchmarks are then run for exponentially increasing
numbers of iterations until some multiple of the estimated resolution is exceed. By default that multiple is 100, but
it can be overridden here.
When running benchmarks a number of "samples" is collected. This is the base data for later statistical analysis.
Per sample a clock resolution dependent number of iterations of the user code is run, which is independent of the number of samples. Defaults to 100.

<a id="benchmark-resamples"></a>
## Specify the number of resamples for bootstrapping
<pre>--benchmark-resamples &lt;# of resamples&gt;</pre>

After the measurements are performed, statistical [bootstrapping] is performed
on the samples. The number of resamples for that bootstrapping is configurable
but defaults to 100000. Due to the bootstrapping it is possible to give
estimates for the mean and standard deviation. The estimates come with a lower
bound and an upper bound, and the confidence interval (which is configurable but
defaults to 95%).

[bootstrapping]: http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29

<a id="benchmark-confidence-interval"></a>
## Specify the confidence-interval for bootstrapping
<pre>--benchmark-confidence-interval &lt;confidence-interval&gt;</pre>

The confidence-interval is used for statistical bootstrapping on the samples to
calculate the upper and lower bounds of mean and standard deviation.
Must be between 0 and 1 and defaults to 0.95.

<a id="benchmark-no-analysis"></a>
## Disable statistical analysis of collected benchmark samples
<pre>--benchmark-no-analysis</pre>

When this flag is specified no bootstrapping or any other statistical analysis is performed.
Instead the user code is only measured and the plain mean from the samples is reported.

<a id="usage"></a>
## Usage
Expand Down
1 change: 1 addition & 0 deletions docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,7 @@ by using `_NO_` in the macro, e.g. `CATCH_CONFIG_NO_CPP17_UNCAUGHT_EXCEPTIONS`.
CATCH_CONFIG_DISABLE // Disables assertions and test case registration
CATCH_CONFIG_WCHAR // Enables use of wchart_t
CATCH_CONFIG_EXPERIMENTAL_REDIRECT // Enables the new (experimental) way of capturing stdout/stderr
CATCH_CONFIG_ENABLE_BENCHMARKING // Enables the integrated benchmarking features (has a significant effect on compilation speed)

Currently Catch enables `CATCH_CONFIG_WINDOWS_SEH` only when compiled with MSVC, because some versions of MinGW do not have the necessary Win32 API support.

Expand Down
20 changes: 19 additions & 1 deletion include/catch.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,6 @@
#include "internal/catch_test_registry.h"
#include "internal/catch_capture.hpp"
#include "internal/catch_section.h"
#include "internal/catch_benchmark.h"
#include "internal/catch_interfaces_exception.h"
#include "internal/catch_approx.h"
#include "internal/catch_compiler_capabilities.h"
Expand All @@ -79,6 +78,10 @@
#include "internal/catch_external_interfaces.h"
#endif

#if defined(CATCH_CONFIG_ENABLE_BENCHMARKING)
#include "internal/benchmark/catch_benchmark.hpp"
#endif

#endif // ! CATCH_CONFIG_IMPL_ONLY

#ifdef CATCH_IMPL
Expand All @@ -89,6 +92,7 @@
#include "internal/catch_default_main.hpp"
#endif


#if !defined(CATCH_CONFIG_IMPL_ONLY)

#ifdef CLARA_CONFIG_MAIN_NOT_DEFINED
Expand Down Expand Up @@ -188,6 +192,13 @@
#define CATCH_THEN( desc ) INTERNAL_CATCH_DYNAMIC_SECTION( " Then: " << desc )
#define CATCH_AND_THEN( desc ) INTERNAL_CATCH_DYNAMIC_SECTION( " And: " << desc )

#if defined(CATCH_CONFIG_ENABLE_BENCHMARKING)
#define CATCH_BENCHMARK(...) \
INTERNAL_CATCH_BENCHMARK(INTERNAL_CATCH_UNIQUE_NAME(____C_A_T_C_H____B_E_N_C_H____), INTERNAL_CATCH_GET_1_ARG(__VA_ARGS__,,), INTERNAL_CATCH_GET_2_ARG(__VA_ARGS__,,))
#define CATCH_BENCHMARK_ADVANCED(name) \
INTERNAL_CATCH_BENCHMARK_ADVANCED(INTERNAL_CATCH_UNIQUE_NAME(____C_A_T_C_H____B_E_N_C_H____), name)
#endif // CATCH_CONFIG_ENABLE_BENCHMARKING

// If CATCH_CONFIG_PREFIX_ALL is not defined then the CATCH_ prefix is not required
#else

Expand Down Expand Up @@ -283,6 +294,13 @@
#define THEN( desc ) INTERNAL_CATCH_DYNAMIC_SECTION( " Then: " << desc )
#define AND_THEN( desc ) INTERNAL_CATCH_DYNAMIC_SECTION( " And: " << desc )

#if defined(CATCH_CONFIG_ENABLE_BENCHMARKING)
#define BENCHMARK(...) \
INTERNAL_CATCH_BENCHMARK(INTERNAL_CATCH_UNIQUE_NAME(____C_A_T_C_H____B_E_N_C_H____), INTERNAL_CATCH_GET_1_ARG(__VA_ARGS__,,), INTERNAL_CATCH_GET_2_ARG(__VA_ARGS__,,))
#define BENCHMARK_ADVANCED(name) \
INTERNAL_CATCH_BENCHMARK_ADVANCED(INTERNAL_CATCH_UNIQUE_NAME(____C_A_T_C_H____B_E_N_C_H____), name)
#endif // CATCH_CONFIG_ENABLE_BENCHMARKING

using Catch::Detail::Approx;

#else // CATCH_CONFIG_DISABLE
Expand Down
Loading