Investigate faster performance than upstream implementation #35

Open
dtolnay opened this issue Oct 2, 2020 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@dtolnay
Owner

dtolnay commented Oct 2, 2020

As observed in #34 (comment), the 32-bit and 64-bit float-to-string benchmarks in this repo are something like 40% faster than the exact same benchmarks in https://github.com/ulfjack/ryu. We need to examine whether there is an inadvertent discrepancy between the two benchmarks, a difference in build configuration, or a toolchain difference that would account for this. The code is a line-by-line transliteration, so there should really be no performance difference, and certainly not one this big.

@dtolnay added the help wanted label on Oct 2, 2020
@pheki
Contributor

pheki commented Nov 30, 2020

I've bisected it and found the culprit (ulfjack/ryu@b94f5d0): apparently they changed the default benchmark mode from "classic" to "inverted". To use the classic mode, you need to add -classic to the options, such as

bazel run -c opt //ryu/benchmark:ryu_benchmark -- -classic
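
For comparison, the same target without the flag should now run in the inverted mode, since that commit made it the default:

bazel run -c opt //ryu/benchmark:ryu_benchmark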

I personally see 3 options:

  • Add both benchmark modes to this repo
  • Change from classic mode to inverted mode
  • Change the instructions to run the upstream benchmark with the -classic option

@dtolnay
Owner Author

dtolnay commented Nov 30, 2020

Quoting ulfjack/ryu#72 (comment):

Normally, the benchmark generates samples in an outer loop, and then it tests each sample in an inner loop for iterations. This allows mean and variance statistics to be collected for each sample, which is useful because some samples exercise different codepaths.

However, repeatedly testing each sample causes Ryu's branches to be highly predictable, which is unrealistic for the real world.

The new "-invert" option generates samples into a std::vector (to keep the Mersenne Twister out of the profiling). Then it has an outer loop for iterations, and its inner loop tests each sample once. This has realistic branch (mis)prediction characteristics.

Note that when inverting, we divide the t2 - t1 duration (which is the total time taken to convert all samples in the vector) by samples, so the delta is the average time taken to convert a sample. This means that the printed Average values are comparable between normal mode and inverted mode (and the difference shows how costly branch misprediction is). The printed Stddev values aren't comparable, though - in normal mode, they measure the variation between samples (which is nonzero, even for an ideal machine, because Ryu does different amounts of work for different inputs), while in inverted mode, they measure the variation between each vector loop (which would ideally be zero for a perfectly deterministic machine).

The justification for taking the formerly-"inverted" mode as the default is compelling -- it hits realistic branch misprediction characteristics.

I would accept a PR to implement both modes (matching upstream) and require opt-in for "classic" mode (matching upstream).
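
To make the difference concrete, here is a minimal Rust sketch of the two loop structures described in the quote, using this crate's ryu::Buffer API. The sample generation, counts, and timing details are placeholders for illustration, not the actual benchmark harness:

```rust
use std::hint::black_box;
use std::time::Instant;

// Placeholder sizes; the real benchmark's sample and iteration counts differ.
const SAMPLES: usize = 10_000;
const ITERATIONS: usize = 1_000;

fn main() {
    // Generate all inputs up front so generation cost stays outside the
    // timed region (upstream keeps its Mersenne Twister out of profiling
    // the same way). This deterministic generator is a stand-in for RNG.
    let inputs: Vec<f64> = (1..=SAMPLES).map(|i| i as f64 * 0.001 + 0.5).collect();

    let mut buffer = ryu::Buffer::new();

    // Classic mode: outer loop over samples, inner loop repeats each sample
    // ITERATIONS times. Per-sample mean/variance can be collected, but the
    // repeated identical input makes Ryu's branches highly predictable.
    let mut classic_ns = 0u128;
    for &x in &inputs {
        let t1 = Instant::now();
        for _ in 0..ITERATIONS {
            black_box(buffer.format(black_box(x)));
        }
        classic_ns += t1.elapsed().as_nanos();
    }

    // Inverted mode: outer loop over iterations, inner loop converts each
    // sample once, so branch prediction sees a realistic mix of inputs.
    let mut inverted_ns = 0u128;
    for _ in 0..ITERATIONS {
        let t1 = Instant::now();
        for &x in &inputs {
            black_box(buffer.format(black_box(x)));
        }
        inverted_ns += t1.elapsed().as_nanos();
    }

    // Dividing both totals by the same conversion count is what makes the
    // printed averages comparable between the two modes, per the quote.
    let conversions = (SAMPLES * ITERATIONS) as f64;
    println!("classic:  {:.2} ns/conversion", classic_ns as f64 / conversions);
    println!("inverted: {:.2} ns/conversion", inverted_ns as f64 / conversions);
}
```

In classic mode the same value is converted back-to-back, so the branch predictor learns its codepath; inverting the loops defeats that, which is consistent with the gap between the two benchmarks described above.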
