-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP]Use unaligned read/writes for core::mem::swap
on x86_64
#98892
[WIP]Use unaligned read/writes for core::mem::swap
on x86_64
#98892
Conversation
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
|
This comment was marked as outdated.
This comment was marked as outdated.
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit 4c9609245af2160d5708ac0c07610e7fc8e562bf with merge 86189bb1747c007986340e760682fc32e092f136... |
☀️ Try build successful - checks-actions |
Queued 86189bb1747c007986340e760682fc32e092f136 with parent a3beeaa, future comparison URL. |
Finished benchmarking commit (86189bb1747c007986340e760682fc32e092f136): comparison url. Instruction count
Max RSS (memory usage)Results
CyclesResults
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Footnotes |
This now has conflicts, and still says "WIP" in the name, but let me flip it to a pair of team eyes for feedback first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good, but I'd like to see it rebased so we can run a current perf test.
library/core/src/mem/mod.rs
Outdated
} | ||
|
||
// Trigger static assertion. | ||
let _ = <T as ConstifyOffsets>::STATIC_TEST; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than playing games with overflow, this can just be a const { assert!(...); }
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is better. Updated.
@rustbot author |
4c96092
to
882407a
Compare
☀️ Try build successful - checks-actions |
1 similar comment
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (cd4875ef17561bd9077809b8d65fcf4e5b439b9e): comparison URL. Overall result: no relevant changes - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis benchmark run did not return any relevant results for this metric. Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 644.092s -> 643.294s (-0.12%) |
This generates better ASM: https://godbolt.org/z/Mr4rWfoad And misaligned accesses on modern x86_64 processors are fast (see docs for `core::mem::swap_chunked`). Main difference is that swapping `#[repr(packed)]` or aligned to 1 byte types now uses bigger chunks by utilizing `movq` and `movl` instructions. Also, bigger types (e.g. bigger or equal to XMM register) would use SIMD more effectively. Old code used them in not very effecient way, copying data to register and storing it in stack, then reading it back. It caused unneccessary memory reads and writes and completely removed benefits from SSE because number of instructions was similar to number of instructions for simple `usize` chunked swapping. New code instead stores temporary SIMD chunks entirely in registers by employing eiter 4 XMM registers, 2 YMM registers or 2 XMM registers depending on type size and compiler flags. Also, made size limit in condition for choosing chunked swap smaller to make types like `std::vec::Vec<T>` (especially `std::str::String`) use new optimizations.
18c41fa
to
4607601
Compare
@scottmcm Last perf run looks promising IMHO. What do you think about using |
I started implementing intrinsic: #111744 |
I think #97712 suggests that it's a good thing that MIRI runs on the block optimization version, so we wouldn't want to hide it, now that MIRI supports it. |
It's not clear to me -- is the swap intrinsic meant to fully replace this PR? Please |
…iler-errors Use `load`+`store` instead of `memcpy` for small integer arrays I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway. After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments. LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s. In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712). So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations. rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now. The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.) Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv> Before: ```llvm define void `@swap_rgb48_old(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 { %a.i = alloca [3 x i16], align 2 call void `@llvm.lifetime.start.p0(i64` 6, ptr nonnull %a.i) call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false) tail call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false) call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false) call void `@llvm.lifetime.end.p0(i64` 6, ptr nonnull %a.i) ret void } ``` Note it going to stack: ```nasm swap_rgb48_old: # `@swap_rgb48_old` movzx eax, word ptr [rdi + 4] mov word ptr [rsp - 4], ax mov eax, dword ptr [rdi] mov dword ptr [rsp - 8], eax movzx eax, word ptr [rsi + 4] mov word ptr [rdi + 4], ax mov eax, dword ptr [rsi] mov dword ptr [rdi], eax movzx eax, word ptr [rsp - 4] mov word ptr [rsi + 4], ax mov eax, dword ptr [rsp - 8] mov dword ptr [rsi], eax ret ``` Now: ```llvm define void `@swap_rgb48(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 { start: %0 = load <3 x i16>, ptr %x, align 2 %1 = load <3 x i16>, ptr %y, align 2 store <3 x i16> %1, ptr %x, align 2 store <3 x i16> %0, ptr %y, align 2 ret void } ``` still lowers to `dword`+`word` operations, but has no stack traffic: ```nasm swap_rgb48: # `@swap_rgb48` mov eax, dword ptr [rdi] movzx ecx, word ptr [rdi + 4] movzx edx, word ptr [rsi + 4] mov r8d, dword ptr [rsi] mov dword ptr [rdi], r8d mov word ptr [rdi + 4], dx mov word ptr [rsi + 4], cx mov dword ptr [rsi], eax ret ``` And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM changes in IR ```llvm define void `@replace_short_array(ptr` noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 { start: %1 = load <3 x i32>, ptr %r, align 4 store <3 x i32> %1, ptr %0, align 4 %2 = load <3 x i32>, ptr %v, align 4 store <3 x i32> %2, ptr %r, align 4 ret void } ``` but that lowers to reasonable `dword`+`qword` instructions still ```nasm replace_short_array: # `@replace_short_array` mov rax, rdi mov rcx, qword ptr [rsi] mov edi, dword ptr [rsi + 8] mov dword ptr [rax + 8], edi mov qword ptr [rax], rcx mov rcx, qword ptr [rdx] mov edx, dword ptr [rdx + 8] mov dword ptr [rsi + 8], edx mov qword ptr [rsi], rcx ret ```
@cuviper Yes, I think, it would replace it fully if it would be merged. And it provides more noticable benefits than this PR. I would make this PR a draft until then. |
@rustbot author |
…<try> Tweak the threshold for chunked swapping Thanks to `@AngelicosPhosphoros` for the tests here, which I copied from rust-lang#98892. This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.
…Mark-Simulacrum Tweak the threshold for chunked swapping Thanks to `@AngelicosPhosphoros` for the tests here, which I copied from rust-lang#98892. This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.
@AngelicosPhosphoros any updates on this? |
Closing this as inactive. Feel free to reöpen this pr or create a new pr if you get the time to work on this. Thanks |
This generates better ASM: https://godbolt.org/z/Mr4rWfoad
And misaligned accesses on modern x86_64 processors are fast (see docs for
core::mem::swap_chunked
).Main difference is that swapping
#[repr(packed)]
or aligned to 1 byte types now uses bigger chunks by utilizingmovq
andmovl
instructions.Also, bigger types (e.g. bigger or equal to XMM register) would use SIMD more effectively. Old code used them in not very effecient way, copying data to register and storing it in stack, then reading it back. It caused unneccessary memory reads and writes and completely removed benefits from SSE because number of instructions was similar to number of instructions for simple
usize
chunked swapping. New code instead stores temporary SIMD chunks entirely in registers by employing eiter 4 XMM registers, 2 YMM registers or 2 XMM registers depending on type size and compiler flags.Also, made size limit in condition for choosing chunked swap smaller to make types like
std::vec::Vec<T>
(especiallystd::str::String
) use new optimizations.