Investigate faster performance than upstream implementation #35

Open
dtolnay opened this issue Oct 2, 2020 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@dtolnay
Owner

dtolnay commented Oct 2, 2020

As observed in #34 (comment), the 32-bit and 64-bit float-to-string benchmarks in this repo are something like 40% faster than the exact same benchmarks in https://github.com/ulfjack/ryu. We need to examine whether there is an inadvertent discrepancy between the two benchmarks, a difference in build configuration, or a toolchain difference that would account for this. The code is a line-by-line transliteration, so there should really be no performance difference, and certainly not one this big.

@dtolnay added the help wanted label on Oct 2, 2020
@pheki
Contributor

pheki commented Nov 30, 2020

I've bisected it and found the culprit (ulfjack/ryu@b94f5d0): apparently they changed the default benchmark mode from "classic" to "inverted". To use the classic mode, you need to add -classic to the options, such as

bazel run -c opt //ryu/benchmark:ryu_benchmark -- -classic
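
For comparison, the same target without the flag should now run in the inverted mode, since that commit made it the default:

bazel run -c opt //ryu/benchmark:ryu_benchmark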

I personally see 3 options:

  • Add both benchmark modes to this repo
  • Change from classic mode to inverted mode
  • Change the instructions to run the upstream benchmark with the -classic option

@dtolnay
Owner Author

dtolnay commented Nov 30, 2020

Quoting ulfjack/ryu#72 (comment):

Normally, the benchmark generates samples in an outer loop, and then it tests each sample in an inner loop for iterations. This allows mean and variance statistics to be collected for each sample, which is useful because some samples exercise different codepaths.

However, repeatedly testing each sample causes Ryu's branches to be highly predictable, which is unrealistic for the real world.

The new "-invert" option generates samples into a std::vector (to keep the Mersenne Twister out of the profiling). Then it has an outer loop for iterations, and its inner loop tests each sample once. This has realistic branch (mis)prediction characteristics.

Note that when inverting, we divide the t2 - t1 duration (which is the total time taken to convert all samples in the vector) by samples, so the delta is the average time taken to convert a sample. This means that the printed Average values are comparable between normal mode and inverted mode (and the difference shows how costly branch misprediction is). The printed Stddev values aren't comparable, though - in normal mode, they measure the variation between samples (which is nonzero, even for an ideal machine, because Ryu does different amounts of work for different inputs), while in inverted mode, they measure the variation between each vector loop (which would ideally be zero for a perfectly deterministic machine).

The justification for taking the formerly-"inverted" mode as the default is compelling -- it hits realistic branch misprediction characteristics.

I would accept a PR to implement both modes (matching upstream) and require opt-in for "classic" mode (matching upstream).
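
To make the difference concrete, here is a minimal Rust sketch of the two loop structures described in the quote, using this crate's ryu::Buffer API. The sample generation, counts, and timing details are placeholders for illustration, not the actual benchmark harness:

```rust
use std::hint::black_box;
use std::time::Instant;

// Placeholder sizes; the real benchmark's sample and iteration counts differ.
const SAMPLES: usize = 10_000;
const ITERATIONS: usize = 1_000;

fn main() {
    // Generate all inputs up front so generation cost stays outside the
    // timed region (upstream keeps its Mersenne Twister out of profiling
    // the same way). This deterministic generator is a stand-in for RNG.
    let inputs: Vec<f64> = (1..=SAMPLES).map(|i| i as f64 * 0.001 + 0.5).collect();

    let mut buffer = ryu::Buffer::new();

    // Classic mode: outer loop over samples, inner loop repeats each sample
    // ITERATIONS times. Per-sample mean/variance can be collected, but the
    // repeated identical input makes Ryu's branches highly predictable.
    let mut classic_ns = 0u128;
    for &x in &inputs {
        let t1 = Instant::now();
        for _ in 0..ITERATIONS {
            black_box(buffer.format(black_box(x)));
        }
        classic_ns += t1.elapsed().as_nanos();
    }

    // Inverted mode: outer loop over iterations, inner loop converts each
    // sample once, so branch prediction sees a realistic mix of inputs.
    let mut inverted_ns = 0u128;
    for _ in 0..ITERATIONS {
        let t1 = Instant::now();
        for &x in &inputs {
            black_box(buffer.format(black_box(x)));
        }
        inverted_ns += t1.elapsed().as_nanos();
    }

    // Dividing both totals by the same conversion count is what makes the
    // printed averages comparable between the two modes, per the quote.
    let conversions = (SAMPLES * ITERATIONS) as f64;
    println!("classic:  {:.2} ns/conversion", classic_ns as f64 / conversions);
    println!("inverted: {:.2} ns/conversion", inverted_ns as f64 / conversions);
}
```

In classic mode the same value is converted back-to-back, so the branch predictor learns its codepath; inverting the loops defeats that, which is consistent with the gap between the two benchmarks described above.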
