Optimise align_offset for stride=1 further #75728
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
Force-pushed from 2f009ad to 9ebaf10
The `stride == 1` case can be computed more efficiently through `-p (mod a)`. That then translates to a nice and short sequence of LLVM instructions:

```llvm
%address = ptrtoint i8* %p to i64
%negptr = sub i64 0, %address
%offset = and i64 %negptr, %a_minus_one
```

and produces pretty much ideal code-gen when this function is used in isolation. Typical use of this function will, however, involve using the result to offset a pointer, i.e.

```llvm
%aligned = getelementptr inbounds i8, i8* %p, i64 %offset
```

This still looks very good, but LLVM does not really translate it to what would be considered ideal machine code (on any target). For example, this is the codegen we obtain for an unknown alignment:

```asm
; x86_64
dec rsi
mov rax, rdi
neg rax
and rax, rsi
add rax, rdi
```

In particular, negating a pointer is not something CISC architectures like x86_64 are designed to do quickly; they are much better at offsetting pointers. And so we’d love to utilize this ability and produce code more like this:

```asm
; x86_64
lea rax, [rsi + rdi - 1]
neg rsi
and rax, rsi
```

To achieve this we need to give LLVM an opportunity to apply the various peep-hole optimisations it does during DAG selection. In particular, the `and` instruction appears to be a major inhibitor here. We cannot, sadly, get rid of this load-bearing operation, but we can reorder operations so that LLVM has more to work with around this instruction.

One such ordering is proposed in rust-lang#75579 and results in LLVM IR that looks broadly like this:

```llvm
; using add enables `lea` and similar CISCisms
%offset_ptr = add i64 %address, %a_minus_one
%mask = sub i64 0, %a
%masked = and i64 %offset_ptr, %mask
; can be folded with `gepi` that may follow
%offset = sub i64 %masked, %address
```

…and generates the intended x86_64 machine code.

One might also wonder how the increased amount of code would impact a RISC target. Turns out not much:

```asm
; aarch64 previous       ; aarch64 new
sub x8, x1, #1           add x8, x1, x0
neg x9, x0               sub x8, x8, #1
and x8, x9, x8           neg x9, x1
add x0, x0, x8           and x0, x8, x9
```

(and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.

Onto actual measurements – the best way to evaluate snippets like these is to use llvm-mca. Much as the AArch64 assembly would lead one to suspect, there isn’t any performance difference to be found there: both snippets execute in the same number of cycles for the CPUs I tried. On x86_64, however, we get a throughput improvement of >50%!

Fixes #75579
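For readers following along in Rust rather than LLVM IR, here is a minimal, self-contained sketch of the two formulations being compared. The helper names `offset_old` and `offset_new` are illustrative stand-ins, not the real code in `core::ptr`'s `align_offset`, and both assume the requested alignment is a power of two:

```rust
// Hypothetical stand-ins for the stride == 1 expression inside
// `align_offset`; `p` is the pointer's address, `a` a power-of-two alignment.

/// Old formulation: `-p (mod a)`, i.e. negate the address, then mask by `a - 1`.
fn offset_old(p: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    p.wrapping_neg() & (a - 1)
}

/// Reordered formulation: round `p` up to the next multiple of `a`, then
/// subtract `p`. Doing the `add` before the `and` is what gives LLVM the
/// chance to fold the arithmetic into `lea`-style addressing on x86_64.
fn offset_new(p: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    (p.wrapping_add(a - 1) & a.wrapping_neg()).wrapping_sub(p)
}

fn main() {
    // Spot-check that the two formulations agree for a few alignments.
    for &a in &[1usize, 2, 4, 8, 64, 4096] {
        for p in 0..=512usize {
            assert_eq!(offset_old(p, a), offset_new(p, a), "p={} a={}", p, a);
        }
    }
    println!("both formulations agree");
}
```

Compiling either helper with optimisations (e.g. `rustc -O --emit asm`) and feeding the result to llvm-mca is one way to reproduce the kind of throughput comparison described above.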
Force-pushed from 9ebaf10 to 4bfacff
@joshtriplett this is ready for review
@bors try @rust-timer queue
Awaiting bors try build completion
⌛ Trying commit 4bfacff with merge 911995c7ac82feb8ff493eb4328f4387f5cf506e...
☀️ Try build successful - checks-actions, checks-azure
Queued 911995c7ac82feb8ff493eb4328f4387f5cf506e with parent ced813f, future comparison URL.
Finished benchmarking try commit (911995c7ac82feb8ff493eb4328f4387f5cf506e): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
Seems relatively reasonable to me. Could someone else double-check this, please?
The perf results here don't seem to really correspond nicely to the conclusions laid out in the PR description: wall times contain a bunch of regressions, and the instruction counts are slightly up as well. Maybe LLVM 11 (which I think landed since then) changed the calculus for the compiler, though -- let's generate another round of benchmarks.

@bors try @rust-timer queue

@nagisa Do the perf results match your expectations? @joshtriplett -- I think you didn't comment on them at all, perhaps you have thoughts? Generally speaking I'm fine with landing this -- the implementation looks correct -- but it seems like if it's a regression on rustc runtime then maybe we shouldn't?
Awaiting bors try build completion
⌛ Trying commit 4bfacff with merge 2b39dba3b2c0c5abd15893f173833c02e968a7f6...
@Mark-Simulacrum I didn't see the wall-times; those look really concerning. Given the nature of the optimization, I was looking heavily at cycles, which are substantially improved across the board for everything other than |
I don't believe this function is used within rustc enough to cause a 50% wall time regression doing anything, but much like @joshtriplett I wasn't even aware there was such a regression. (I feel like it's somewhat of a presentation problem in perf; I don't typically look at all measurement kinds.)
#75600 has a prior optimisation of this same function, and it did see comparatively large changes in wall-time, though in that case towards the positive side; but also, the way I interpret the results, the variance is huge too?
☀️ Try build successful - checks-actions
Queued 2b39dba3b2c0c5abd15893f173833c02e968a7f6 with parent 0dce3f6, future comparison URL.
I agree that a 50% wall time regression is likely to be mostly noise - unfortunately we don't really have a good way yet of determining noise on the presentation layer such that we could hide deltas within the noise bound and then (hopefully) show all the statistics we collect in some way. That said, I am inclined to land this as is, presuming this perf run comes back similarly (or better). I think most of the wall time regressions are on tiny crates - several milliseconds - and so aren't too interesting.
Finished benchmarking try commit (2b39dba3b2c0c5abd15893f173833c02e968a7f6): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
This time around wall times are down roughly 50%. So either this PR considerably increases the amount of noise we see (seems highly unlikely) or something has landed in the meantime that shifts things around to this being a net improvement now. Regardless, cycles and instructions seem neutral or down, and since llvm-mca points at throughput increases, I think we should merge this.

@bors r+
📌 Commit 4bfacff has been approved by |
☀️ Test successful - checks-actions |