You probably shouldn't (or, if you do, don't rely on the results). The virtualization used by cloud CI providers like Travis CI and GitHub Actions introduces a great deal of noise into the benchmarking process, and Criterion.rs' statistical analysis can only do so much to mitigate that. This can result in the appearance of large changes in the measured performance even if the actual performance of the code is not changing. A better alternative is to use Iai instead. Iai runs benchmarks inside Cachegrind to directly count the instructions and memory accesses. Iai's measurements won't be thrown off by the virtual machine slowing down or pausing for a time, so it should be more reliable in virtualized environments.
Whichever benchmarking tool you use, though, the process is basically the same. You'll need to:
- Check out the main branch of your code
- Build it and run the benchmarks once, to establish a baseline
- Then switch to the pull request branch
- Build it again and run the benchmarks a second time to compare against the baseline (a sketch of this workflow follows below)
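With Criterion.rs, that workflow might look roughly like the following shell sketch (the branch name `my-pr-branch` and the baseline name `main` are placeholders); `--save-baseline` and `--baseline` are the Criterion.rs options for storing and comparing against a named baseline:

```bash
# On the main branch: run the benchmarks and store the results as a named baseline.
git checkout main
cargo bench -- --save-baseline main

# On the pull request branch: run again and compare against the stored baseline.
git checkout my-pr-branch
cargo bench -- --baseline main
```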
By default, Cargo implicitly adds a libtest benchmark harness to your crate when benchmarking, to handle any `#[bench]` functions, even if you have none. It compiles and runs this executable first, before any of the other benchmarks. Normally, this is fine - it detects that there are no libtest benchmarks to execute and exits, allowing Cargo to move on to the real benchmarks. Unfortunately, it checks the command-line arguments first, and panics when it finds one it doesn't understand. This causes Cargo to stop benchmarking early, and it never executes the Criterion.rs benchmarks.

This will occur when running `cargo bench` with any argument that Criterion.rs supports but libtest does not. For example, `--verbose` and `--save-baseline` will cause this issue, while `--help` will not. There are two ways to work around this at present:
You could run only your Criterion benchmark, like so:
```bash
cargo bench --bench my_benchmark -- --verbose
```
Note that `my_benchmark` here corresponds to the name of your benchmark in your `Cargo.toml` file.
Another option is to disable benchmarks for your lib or app crate. For example, for library crates, you could add this to your `Cargo.toml` file:
```toml
[lib]
bench = false
```
If your crate produces one or more binaries as well as a library, you may need to add additional records to `Cargo.toml` like this:
```toml
[[bin]]
name = "my-binary"
path = "src/bin/my-binary.rs"
bench = false
```
This is because Cargo automatically discovers some kinds of binaries and it will enable the default benchmark harness for these as well.
Of course, this only works if you define all of your benchmarks in the `benches` directory.
See Rust Issue #47241 for more details.
Small functions can be benchmarked in exactly the same way as you would benchmark any other function.
It is sometimes suggested that benchmarks of small (nanosecond-scale) functions should iterate the function to be benchmarked many times internally to reduce the impact of measurement overhead. This is not required with Criterion.rs, and it is not recommended.
To see this, consider the following benchmark:
```rust
use criterion::Criterion;

fn compare_small(c: &mut Criterion) {
    let mut group = c.benchmark_group("small");
    group.bench_with_input("unlooped", &10, |b, i| b.iter(|| i + 10));
    group.bench_with_input("looped", &10, |b, i| b.iter(|| {
        for _ in 0..10000 {
            std::hint::black_box(i + 10);
        }
    }));
    group.finish();
}
```
This benchmark simply adds two numbers - just about the smallest operation that could be benchmarked. On my computer, this produces the following output:
```
small/unlooped          time:   [270.00 ps 270.78 ps 271.56 ps]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

small/looped            time:   [2.7051 us 2.7142 us 2.7238 us]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
```
2.714 microseconds/10000 gives 271.4 picoseconds, or pretty much the same result. Interestingly, this is slightly more than one cycle of my 4th-gen Core i7's maximum clock frequency of 4.4 GHz, which shows how good the pipelining is on modern CPUs. Regardless, Criterion.rs is able to accurately measure functions all the way down to single instructions. See the Analysis Process page for more details on how Criterion.rs performs its measurements, or see the Timing Loops page for details on choosing a timing loop to minimize measurement overhead.
`black_box` is a function which prevents certain compiler optimizations. Benchmarks are often slightly artificial in nature and the compiler can take advantage of that to generate faster code when compiling the benchmarks than it would in real usage. In particular, it is common for benchmarked functions to be called with constant parameters, and in some cases rustc can evaluate the function entirely at compile time and replace the function call with a constant. This can produce unnaturally fast benchmarks that don't represent how the code would perform when called normally. Therefore, it's useful to black-box the constant input to prevent this optimization.
However, you might have a function which you expect to be called with one or more constant parameters. In this case, you might want to write your benchmark to represent that scenario instead, and allow the compiler to optimize the constant parameters.
For the most part, Criterion.rs handles this for you - if you use parameterized benchmarks, the parameters are automatically black-boxed by Criterion.rs so you don't need to do anything. If you're writing an un-parameterized benchmark of a function that takes an argument, however, this may be worth considering.
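For example, here is a minimal sketch of an un-parameterized benchmark that black-boxes its constant argument (the `fibonacci` function and the value 20 are placeholders for your own function and input):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Placeholder function under test.
fn fibonacci(n: u64) -> u64 {
    (1..=n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn bench_fib(c: &mut Criterion) {
    // Without black_box, rustc may evaluate fibonacci(20) at compile time and
    // the benchmark would only measure the cost of returning a constant.
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```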
Currently, Cargo treats any `*.rs` file in the `benches` directory as a benchmark, unless there are one or more `[[bench]]` sections in the `Cargo.toml` file. In that case, the auto-discovery is disabled entirely.
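For reference, an explicit `[[bench]]` section (as shown in the Getting Started guide) looks something like this, where `my_benchmark` is the name of a file in the `benches` directory:

```toml
[[bench]]
name = "my_benchmark"
harness = false
```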
In the Rust 2018 edition, Cargo will be changed so that `[[bench]]` no longer disables the auto-discovery. If your `benches` directory contains source files that are not benchmarks, this could break your build when you update, as Cargo will attempt to compile them as benchmarks and fail.
There are two ways to prevent this breakage from happening. You can explicitly turn off the auto-discovery like so:

```toml
[package]
autobenches = false
```
The other option is to move those non-benchmark files to a subdirectory (e.g. `benches/benchmark_code`) where they will no longer be detected as benchmarks.
I would recommend the latter option.
Note that a file which contains a `criterion_main!` is a valid benchmark and can safely stay where it is.
If Criterion.rs reports a large change in performance after a seemingly trivial change to your code, don't worry: Criterion.rs isn't broken and you (probably) didn't do anything wrong. The most common reason for this is that the optimizer just happened to optimize your function differently after the change.
Optimizing compiler backends such as LLVM (which is used by rustc) are often complex beasts full of hand-rolled pattern-matching code that detects when a particular optimization is possible and tries to guess whether it would make the code faster. Unfortunately, despite all of the engineering work that goes into these compilers, it's pretty common for apparently-trivial changes to the source, like changing the order of lines, to be enough to cause these optimizers to act differently. On top of this, apparently-small changes like changing the type of a variable or calling a slightly different function (such as `unwrap` vs `expect`) actually have much larger impacts under the hood than the slight difference in source text might suggest.
If you want to learn more about this (and some proposals for improving this situation in the future), I like this paper by Regehr et al.
On a similar subject, it's important to remember that a benchmark is only ever an estimate of the true performance of your function. If the optimizer can have significant effects on performance in an artificial environment like a benchmark, what about when your function is inlined into a variety of different calling contexts? The optimizer will almost certainly make different decisions for each caller. One hopes that each specialized version will be faster, but that can't be guaranteed. In a world of optimizing compilers, the "true performance" of a function is a fuzzy thing indeed.
If you're still sure that Criterion.rs is doing something wrong, file an issue describing the problem.
Inconsistent benchmark results typically happen because the benchmark environments aren't quite the same. There are a lot of factors that can influence benchmarks. Other processes might be using the CPU or memory. Battery-powered devices often have power-saving modes that clock down the CPU (and these sometimes appear in desktops as well). If your benchmarks are run inside a VM, there might be other VMs on the same physical machine competing for resources.
However, sometimes this happens even with no change. It's important to remember that Criterion.rs detects regressions and improvements statistically. There is always a chance that you randomly get unusually fast or slow samples, enough that Criterion.rs detects it as a change even though no change has occurred. In very large benchmark suites you might expect to see several of these spurious detections each time you run the benchmarks.
Unfortunately, this is a fundamental trade-off in statistics. In order to decrease the rate of false detections, you must also decrease the sensitivity to small changes. Conversely, to increase the sensitivity to small changes, you must also increase the chance of false detections. Criterion.rs has default settings that strike a generally-good balance between the two, but you can adjust the settings to suit your needs.
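As a rough sketch, those trade-offs can be adjusted by building a custom `Criterion` configuration and passing it to `criterion_group!`; the specific values below are arbitrary examples, not recommendations:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn my_benchmark(c: &mut Criterion) {
    c.bench_function("example", |b| b.iter(|| std::hint::black_box(2 + 2)));
}

// Example values only: a smaller significance level reduces false detections
// at the cost of sensitivity, a larger noise threshold ignores smaller changes,
// and a larger sample size tightens the estimates but takes longer to run.
fn custom_criterion() -> Criterion {
    Criterion::default()
        .significance_level(0.01)
        .noise_threshold(0.05)
        .sample_size(200)
}

criterion_group! {
    name = benches;
    config = custom_criterion();
    targets = my_benchmark
}
criterion_main!(benches);
```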
When Cargo runs benchmarks, it passes the `--bench` or `--test` command-line arguments to the benchmark executables. Criterion.rs looks for these arguments and tries to either run benchmarks or run in test mode. In particular, when you run `cargo test --benches` (run tests, including testing benchmarks) Cargo does not pass either of these arguments. This is perhaps strange, since `cargo bench --test` passes both `--bench` and `--test`. In any case, Criterion.rs benchmarks run in test mode when `--bench` is not present, or when `--bench` and `--test` are both present.
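In practice, that means these two commands behave differently:

```bash
# Cargo passes --bench: Criterion.rs runs the full benchmarks and analysis.
cargo bench

# Cargo passes neither --bench nor --test: Criterion.rs runs in test mode,
# executing each benchmark once to verify that it works.
cargo test --benches
```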
First, check the Getting Started guide and ensure that the `[[bench]]` section of your `Cargo.toml` is set up correctly. If it's correct, read on.
This can be caused by two different things.
Most commonly, this problem happens when trying to benchmark a binary (as opposed to library) crate. Criterion.rs cannot be used to benchmark binary crates (see the Known Limitations page for more details on why). The usual workaround is to structure your application as a library crate that implements most of the functionality of the application and a binary crate which acts as a thin wrapper around the library crate to provide things like a CLI. Then, you can create Criterion.rs benchmarks that depend on the library crate.
Less often, the problem is that the library crate is configured to compile as a `cdylib`. In order to benchmark your crate with Criterion.rs, you will need to set your `Cargo.toml` to enable generating an `rlib` as well.
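A minimal sketch of that `Cargo.toml` change (keeping the existing `cdylib` and adding `rlib`) looks like this:

```toml
[lib]
crate-type = ["cdylib", "rlib"]
```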
Can you benchmark only part of a function? The short answer is - you can't, not accurately. The longer answer is below.
When people ask me this, my first response is always "extract that part of the function into a new function, give it a name, and then benchmark that". It's sort of unsatisfying, but that is also the only way to get really accurate measurements of that piece of your code. You can always tag it with `#[inline(always)]` to tell rustc to inline it back into the original callsite in the final executable.
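For instance, a purely hypothetical sketch, where `inner_step` stands in for the extracted piece of your function:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// The extracted piece of the original function, inlined back into its
// original callsite in release builds.
#[inline(always)]
fn inner_step(x: u64) -> u64 {
    x.wrapping_mul(31).rotate_left(7)
}

fn bench_inner_step(c: &mut Criterion) {
    c.bench_function("inner_step", |b| b.iter(|| inner_step(black_box(42))));
}

criterion_group!(benches, bench_inner_step);
criterion_main!(benches);
```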
The problem is that your system's clock is not infinitely precise; there is a certain (often surprisingly large) granularity to the clock time reported by `Instant::now`. That means that, if it were to measure each execution individually, Criterion.rs might see a sequence of times like "0ms, 0ms, 0ms, 0ms, 0ms, 5ms, 0ms..." for a function that takes 1ms. To mitigate this, Criterion.rs runs many iterations of your benchmark, to divide that jitter across each iteration. There would be no way to run such a timing loop on part of your code, unless that part were already easy to factor out and put in a separate function anyway. Instead, you'd have to time each iteration individually, resulting in the maximum possible timing jitter.
However, if you need to do this anyway, and you're OK with the reduced accuracy, you can use `Bencher::iter_custom` to measure your code however you want to. `iter_custom` exists to allow for complex cases like multi-threaded code or, yes, measuring part of a function. Just be aware that you're responsible for the accuracy of your measurements.
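For example, here is a minimal sketch of using `iter_custom` to time only part of each iteration; the `setup` and `interesting_part` functions are hypothetical stand-ins:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical work that should not be included in the measurement.
fn setup() -> Vec<u64> {
    (0..1024).collect()
}

// Hypothetical piece of a larger routine that we actually want to time.
fn interesting_part(data: &[u64]) -> u64 {
    data.iter().copied().sum()
}

fn bench_part(c: &mut Criterion) {
    c.bench_function("part_of_function", |b| {
        // iter_custom hands us the iteration count; we return the total
        // measured Duration, and Criterion.rs takes it at face value.
        b.iter_custom(|iters| {
            let mut total = Duration::ZERO;
            for _ in 0..iters {
                let data = setup(); // excluded from the measurement
                let start = Instant::now();
                black_box(interesting_part(&data)); // measured
                total += start.elapsed();
            }
            total
        })
    });
}

criterion_group!(benches, bench_part);
criterion_main!(benches);
```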