Improve benchmark.cc, adding -ryu and -invert options. #72

Merged (8 commits) on Aug 11, 2018

Conversation

StephanTLavavej (Contributor)

This contains several improvements for benchmark.cc, grouped into separate commits for ease of reviewing. There are two new features:

  • A -ryu option to run only Ryu without Grisu3. This is a time-saver when profiling Ryu changes.

  • An -invert option to switch the sample/iteration loops. This should help to understand the impact of branch (mis)prediction on Ryu.

Commit notes:


benchmark.cc: Drop `if (throwaway == 12345)`.

This code was present in the float benchmarking, but not double. I suspect that it was a relic of an earlier era when `throwaway` wasn't returned to `main()` for inspection.
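
For context, the surviving pattern accumulates each conversion's result into `throwaway` and returns it from `main()`, which is enough to keep the optimizer from discarding the benchmarked calls. A minimal sketch of that idea (the `convert()` function is a stand-in, not the real `fcv()`/`dcv()`):

```cpp
#include <cstdio>

// Stand-in for the conversion routine being benchmarked (NOT the real fcv()/dcv()).
static int convert(double value, char* buffer) {
  return std::snprintf(buffer, 64, "%g", value);
}

int main() {
  char buffer[64];
  int throwaway = 0;
  for (int i = 0; i < 1000; ++i) {
    // Accumulate a value derived from every call; because `throwaway` is
    // returned from main(), the optimizer can't delete the calls as dead code,
    // so no `if (throwaway == 12345)` check is needed.
    throwaway += convert(i * 0.1, buffer);
  }
  return throwaway & 0xFF;
}
```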


benchmark.cc: constexpr BUFFER_SIZE, avoid calloc().

`BUFFER_SIZE` should be a compile-time constant, not a modifiable global variable.

Instead of dynamically allocating memory (and never freeing it), it's simpler to make `buffer` a global array and `bufferown` a local array on the stack.
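
A rough before/after sketch of the shape this describes (the value 40 and the surrounding code are illustrative assumptions, not the exact benchmark.cc contents):

```cpp
#include <cstddef>

// Before (sketch): a mutable global size and calloc()'d buffers that were never freed.
//   int BUFFER_SIZE = 40;
//   char* buffer    = (char*) calloc(BUFFER_SIZE, sizeof(char));
//   char* bufferown = (char*) calloc(BUFFER_SIZE, sizeof(char));

// After (sketch): a compile-time constant, a global array, and a stack array.
constexpr std::size_t BUFFER_SIZE = 40;  // 40 is illustrative, not necessarily the real value.
static char buffer[BUFFER_SIZE];

int main() {
  char bufferown[BUFFER_SIZE];  // Local array on the stack; nothing to free.
  (void)buffer;
  (void)bufferown;
  return 0;
}
```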


benchmark.cc: Simplify fcv() and dcv().

Part of the code already assumes that `fcv()` and `dcv()` update `buffer`. Instead of calling them again, we can simply compare `bufferown` and `buffer` to verify that Ryu and Grisu3 produce identical output.
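
A hedged sketch of that verification step, assuming the comparison amounts to a plain string compare of the two buffers:

```cpp
#include <cstdio>
#include <cstring>

int main() {
  // Sketch: assume `buffer` already holds one library's output for a value
  // and `bufferown` holds the other's, as described above.
  char buffer[40]    = "1.2345E5";
  char bufferown[40] = "1.2345E5";
  if (std::strcmp(buffer, bufferown) != 0) {
    std::printf("Conversion mismatch: %s vs %s\n", buffer, bufferown);
    return 1;
  }
  return 0;
}
```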


benchmark.cc: Improve C++ style.

Directly name `struct mean_and_variance`.

Prefer preincrement.

Use `auto` for `steady_clock::time_point` (the type name is verbose, and it's type-safe, so there's little risk of getting confused about what it is); see the short sketch after this list.

Use `static_cast`.
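
A small illustration of the `auto` and `static_cast` points (assumed shapes, not lines taken from benchmark.cc):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  // auto spares us spelling out std::chrono::steady_clock::time_point; the
  // type is still strongly checked, so there's little risk of confusion.
  const auto t1 = std::chrono::steady_clock::now();
  const auto t2 = std::chrono::steady_clock::now();

  // static_cast makes the integer-to-double conversion explicit.
  const std::int64_t ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
  const double micros = static_cast<double>(ns) / 1000.0;
  std::printf("%f us\n", micros);
  return 0;
}
```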


benchmark.cc: Add "-ryu" for Ryu-only mode.

This mode still prints Grisu3 stats, as near-zero garbage. (Fixed by a later commit.)


benchmark.cc: mean_and_variance member functions.

This replaces `init()` with C++11 default member initializers (less verbose than a constructor), changes `update()` and `variance()` into member functions, and adds `stddev()` so `sqrt()` doesn't need to be repeated later.

One subtlety: Previously, update() said `int64_t n = ++mv.n;` and later read this local variable. Because the data member is preincremented before the local variable is stored, and nothing else happens in between, we can drop the local variable and simply read the data member.
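
An assumed sketch of what the struct could look like after this commit, using a Welford-style running mean/variance consistent with the `++mv.n` subtlety above (not the verbatim code):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

struct mean_and_variance {
  std::int64_t n = 0;   // C++11 default member initializers replace init().
  double mean = 0.0;
  double m2 = 0.0;

  void update(double x) {
    ++n;  // The old local `int64_t n = ++mv.n;` is no longer needed.
    const double d = x - mean;
    mean += d / static_cast<double>(n);
    const double d2 = x - mean;
    m2 += d * d2;  // Running sum of squared deviations (Welford).
  }

  double variance() const { return m2 / static_cast<double>(n - 1); }
  double stddev() const { return std::sqrt(variance()); }
};

int main() {
  mean_and_variance mv;
  const double xs[] = {19.5, 20.1, 19.8, 20.4};
  for (double x : xs) mv.update(x);
  std::printf("mean=%f stddev=%f\n", mv.mean, mv.stddev());
  return 0;
}
```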


benchmark.cc: Add "-invert" to switch loops.

Normally, the benchmark generates samples in an outer loop, then tests each sample repeatedly in an inner loop over the iterations. This allows mean and variance statistics to be collected for each sample, which is useful because different samples exercise different codepaths.

However, repeatedly testing the same sample makes Ryu's branches highly predictable, which isn't representative of real-world usage.

The new "-invert" option generates samples into a std::vector (to keep the Mersenne Twister out of the profiling). Then it has an outer loop for iterations, and its inner loop tests each sample once. This has realistic branch (mis)prediction characteristics.

Note that when inverting, we divide the t2 - t1 duration (the total time taken to convert all samples in the vector) by the number of samples, so the delta is the average time taken to convert one sample. This means that the printed Average values are comparable between normal mode and inverted mode (and the difference shows how costly branch misprediction is).

The printed Stddev values aren't comparable, though: in normal mode they measure the variation between samples (which is nonzero even for an ideal machine, because Ryu does different amounts of work for different inputs), while in inverted mode they measure the variation between passes over the vector (which would ideally be zero on a perfectly deterministic machine).
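
To make the inverted structure concrete, here is a hedged sketch; the timing, RNG setup, and `convert()` stand-in are simplified assumptions rather than the real benchmark code:

```cpp
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Stand-in for the conversion being benchmarked.
static int convert(double value, char* buffer) {
  return std::snprintf(buffer, 64, "%g", value);
}

int main() {
  constexpr int samples = 10000;
  constexpr int iterations = 100;
  char buffer[64];
  int throwaway = 0;

  // Generate all inputs up front so the Mersenne Twister stays out of the
  // timed region.
  std::mt19937 mt(12345);
  std::uniform_real_distribution<double> dist(0.0, 1.0e9);
  std::vector<double> inputs(samples);
  for (double& d : inputs) d = dist(mt);

  // Inverted mode: outer loop over iterations, inner loop touches every
  // sample once, so the branch predictor sees a realistic input stream.
  for (int it = 0; it < iterations; ++it) {
    const auto t1 = std::chrono::steady_clock::now();
    for (double d : inputs) throwaway += convert(d, buffer);
    const auto t2 = std::chrono::steady_clock::now();
    // Divide by the number of samples so the delta is the average time per
    // conversion, comparable with non-inverted mode.
    const double delta =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() /
        static_cast<double>(samples);
    (void)delta;  // The real benchmark feeds this into mean_and_variance.
  }
  return throwaway & 0xFF;
}
```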

Example output:

```
C:\Temp\TESTING_X64>benchmark_clang
    Average & Stddev Ryu  Average & Stddev Grisu3
32:   19.561    1.811       91.689   52.230
64:   29.868    1.882      106.720   89.150

C:\Temp\TESTING_X64>benchmark_clang -invert
    Average & Stddev Ryu  Average & Stddev Grisu3
32:   31.694    1.187      117.464    4.087
64:   42.507    0.963      131.933    1.514
```

benchmark.cc: Improve "-ryu" and "-invert" output.

This avoids printing unnecessary fields.

ulfjack (Owner) commented Aug 11, 2018

Nice. I've been thinking about that for a few days already, so thank you very much for that.

I wonder if we should switch to the inverse mode by default. Also, maybe we should add test coverage specifically for small integers - performance improvements in that area might also be worthwhile even if they don't show up across the entire range.
