Optimise align_offset for stride=1 further #75728
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
Force-pushed from 2f009ad to 9ebaf10
The `stride == 1` case can be computed more efficiently through `-p (mod a)`. That then translates to a nice and short sequence of LLVM instructions:

```llvm
%address = ptrtoint i8* %p to i64
%negptr = sub i64 0, %address
%offset = and i64 %negptr, %a_minus_one
```

and produces pretty much ideal code-gen when this function is used in isolation. Typical use of this function will, however, involve using the result to offset a pointer, i.e.

```llvm
%aligned = getelementptr inbounds i8, i8* %p, i64 %offset
```

This still looks very good, but LLVM does not really translate it to what would be considered ideal machine code (on any target). For example, this is the codegen we obtain for an unknown alignment:

```asm
; x86_64
dec rsi
mov rax, rdi
neg rax
and rax, rsi
add rax, rdi
```

In particular, negating a pointer is not something CISC architectures like x86_64 are designed to do quickly; they are much better at offsetting pointers. And so we’d love to utilize this ability and produce code more like this:

```asm
; x86_64
lea rax, [rsi + rdi - 1]
neg rsi
and rax, rsi
```

To achieve this we need to give LLVM an opportunity to apply the various peep-hole optimisations it does during DAG selection. In particular, the `and` instruction appears to be a major inhibitor here. We cannot, sadly, get rid of this load-bearing operation, but we can reorder operations so that LLVM has more to work with around this instruction.

One such ordering is proposed in rust-lang#75579 and results in LLVM IR that looks broadly like this:

```llvm
; using add enables `lea` and similar CISCisms
%offset_ptr = add i64 %address, %a_minus_one
%mask = sub i64 0, %a
%masked = and i64 %offset_ptr, %mask
; can be folded with `gepi` that may follow
%offset = sub i64 %masked, %address
```

…and generates the intended x86_64 machine code.

One might also wonder how the increased amount of code would impact a RISC target. Turns out not much:

```asm
; aarch64 previous       ; aarch64 new
sub x8, x1, #1           add x8, x1, x0
neg x9, x0               sub x8, x8, #1
and x8, x9, x8           neg x9, x1
add x0, x0, x8           and x0, x8, x9
```

(and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.

Onto actual measurements – the best way to evaluate snippets like these is to use llvm-mca. Much as the AArch64 assembly would lead one to suspect, there isn’t any performance difference to be found there: both snippets execute in the same number of cycles for the CPUs I tried. On x86_64, however, we get a throughput improvement of >50%!

Fixes #75579
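For readers following along in Rust rather than LLVM IR, here is a minimal, self-contained sketch of the two formulations being compared. The helper names `offset_old` and `offset_new` are illustrative stand-ins, not the real code in `core::ptr`'s `align_offset`, and both assume the requested alignment is a power of two:

```rust
// Hypothetical stand-ins for the stride == 1 expression inside
// `align_offset`; `p` is the pointer's address, `a` a power-of-two alignment.

/// Old formulation: `-p (mod a)`, i.e. negate the address, then mask by `a - 1`.
fn offset_old(p: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    p.wrapping_neg() & (a - 1)
}

/// Reordered formulation: round `p` up to the next multiple of `a`, then
/// subtract `p`. Doing the `add` before the `and` is what gives LLVM the
/// chance to fold the arithmetic into `lea`-style addressing on x86_64.
fn offset_new(p: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    (p.wrapping_add(a - 1) & a.wrapping_neg()).wrapping_sub(p)
}

fn main() {
    // Spot-check that the two formulations agree for a few alignments.
    for &a in &[1usize, 2, 4, 8, 64, 4096] {
        for p in 0..=512usize {
            assert_eq!(offset_old(p, a), offset_new(p, a), "p={} a={}", p, a);
        }
    }
    println!("both formulations agree");
}
```

Compiling either helper with optimisations (e.g. `rustc -O --emit asm`) and feeding the result to llvm-mca is one way to reproduce the kind of throughput comparison described above.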
Force-pushed from 9ebaf10 to 4bfacff
@joshtriplett this is ready for review
@bors try @rust-timer queue
Awaiting bors try build completion
⌛ Trying commit 4bfacff with merge 911995c7ac82feb8ff493eb4328f4387f5cf506e...
☀️ Try build successful - checks-actions, checks-azure
Queued 911995c7ac82feb8ff493eb4328f4387f5cf506e with parent ced813f, future comparison URL.
Finished benchmarking try commit (911995c7ac82feb8ff493eb4328f4387f5cf506e): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
Seems relatively reasonable to me. Could someone else double-check this, please?
The perf results here don't seem to really correspond nicely to the conclusions laid out in the PR description: wall times contain a bunch of regressions, and the instruction counts are slightly up as well. Maybe LLVM 11 (which I think landed since then) changed the calculus for the compiler, though -- let's generate another round of benchmarks.

@bors try @rust-timer queue

@nagisa Do the perf results match your expectations? @joshtriplett -- I think you didn't comment on them at all, perhaps you have thoughts? Generally speaking I'm fine with landing this -- the implementation looks correct -- but it seems like if it's a regression on rustc runtime then maybe we shouldn't?
Awaiting bors try build completion
⌛ Trying commit 4bfacff with merge 2b39dba3b2c0c5abd15893f173833c02e968a7f6...
@Mark-Simulacrum I didn't see the wall-times; those look really concerning. Given the nature of the optimization, I was looking heavily at cycles, which are substantially improved across the board for everything other than |
I don't believe this function is used within rustc enough to cause a 50% wall time regression doing anything, but much like @joshtriplett I wasn't even aware there was such a regression. (I feel like it's somewhat of a presentation problem in perf; I don't typically look at all measurement kinds.)
#75600 has a prior optimisation of this same function, and it did see comparatively large changes in wall-time, though in that case towards the positive side; but also, the way I interpret the results, the variance is huge too?
☀️ Try build successful - checks-actions
Queued 2b39dba3b2c0c5abd15893f173833c02e968a7f6 with parent 0dce3f6, future comparison URL.
I agree that a 50% wall time regression is likely to be mostly noise - unfortunately we don't really have a good way yet of determining noise on the presentation layer such that we could hide deltas within the noise bound and then (hopefully) show all the statistics we collect in some way. That said, I am inclined to land this as is, presuming this perf run comes back similarly (or better). I think most of the wall time regressions are on tiny crates - several milliseconds - and so aren't too interesting.
Finished benchmarking try commit (2b39dba3b2c0c5abd15893f173833c02e968a7f6): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
This time around wall times are down roughly 50%. So either this PR considerably increases the amount of noise we see (seems highly unlikely) or something has landed in the meantime that shifts things around to this being a net improvement now. Regardless, cycles and instructions seem neutral or down, and since llvm-mca points at throughput increases, I think we should merge this.

@bors r+
📌 Commit 4bfacff has been approved by |
☀️ Test successful - checks-actions |