Optimize memcpy, memmove and memset #405
Conversation
Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB. For example, if we are trying to copy from 0x10007-0x10017, then the code will access the entire machine word containing the byte 0x10007 (thus also 0x10000-0x10006) and the machine word containing 0x10010-0x10016 (thus also 0x10017), which are outside the bounds supplied to memcpy/memmove. I don't think that UB is ever exploitable by the compiler though, and I am not aware of any architecture where such accesses could cause trouble.
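For concreteness, a small worked example of the address arithmetic being described; the 8-byte word size and the specific addresses are just this example's assumptions:

```rust
// Copying bytes 0x10007..0x10017 with whole-word loads touches the aligned
// words at 0x10000 and 0x10010, so bytes 0x10000..=0x10006 and 0x10017 are
// read even though they lie outside the range handed to memcpy/memmove.
fn main() {
    const WORD_SIZE: usize = 8;
    let (start, end) = (0x10007usize, 0x10017usize); // copied range, end exclusive
    let first_word = start & !(WORD_SIZE - 1); // 0x10000
    let last_word = (end - 1) & !(WORD_SIZE - 1); // 0x10010, word spans ..=0x10017
    println!("{first_word:#x} {last_word:#x}");
}
```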
Oh, big-endian ISAs 🤦, will fix soon.
On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.
Hmm, interesting. Platforms with this kind of protection feature that I know of usually track at allocation and deallocation, and those memory regions will be machine-word aligned. The OOB access here strictly just accesses the machine word that contains an accessible byte.
Anyway, I could force at least 7 bytes to be copied bytewise before and after the unaligned copy, and doing so would be UB-free. But that would hurt other platforms where this specific kind of OOB access is permitted.
Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.
src/mem/impls.rs (outdated)
```rust
while i < n {
    *dest.add(i) = *src.add(i);
    i += 1;
unsafe fn copy_forward_aligned_words(dest: *mut u8, src: *const u8, n: usize) {
```
Since this function is taking aligned words, could this perhaps take `usize` pointers? Additionally, should this take `u64` pointers perhaps? Or maybe even `u128`?
It could, but I use `u8` pointers to keep the signature identical to the other functions.
I thought about using `u64` or `u128` here, but it turns out to be bad for a few reasons:
- Many platforms, especially those which don't have vectorization, can do at most one `usize` load per instruction.
- It requires more alignment than `usize`, so more bytes have to be copied with the bytewise copy.
- It makes the misaligned scenario worse: shifting and reassembling two `usize`s actually generates worse code than one `usize`.
However, one optimisation that I haven't included in this PR is to unroll the loop for the aligned scenario. If n is a multiple of 8 words, for example, then we can have an 8x-unrolled loop followed by a loop for the remainder. On clang this could be done easily with `#pragma unroll`, but I don't think it is currently possible in Rust, so we might have to do it manually.
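A minimal sketch of what such a manual unroll could look like; the function name, the `usize`-typed pointers and the 8-word chunking are illustrative assumptions, not code from this PR:

```rust
/// Hypothetical 8x-unrolled aligned word copy. Both pointers are assumed to be
/// word-aligned and the ranges non-overlapping.
unsafe fn copy_aligned_words_unrolled(mut dest: *mut usize, mut src: *const usize, n_words: usize) {
    let mut remaining = n_words;
    // Main loop: 8 independent load/store pairs per iteration.
    while remaining >= 8 {
        for k in 0..8 {
            *dest.add(k) = *src.add(k);
        }
        dest = dest.add(8);
        src = src.add(8);
        remaining -= 8;
    }
    // Remainder loop for the final 0..=7 words.
    while remaining > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        remaining -= 1;
    }
}
```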
I feel like taking `usize` here would make the intent clearer, since it's copying word by word. Perhaps there could be a typedef in the module for the copy size? Some platforms may have better performance on some types than others, so I could see it being conditional as well.
For unrolling I would expect LLVM to do this automatically if supported.
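A sketch of the typedef idea floated above, with a hypothetical name and an illustrative (not benchmarked) per-target choice; the PR itself keeps `u8` pointers and copies `usize`-sized words:

```rust
/// Hypothetical per-target "copy unit" for the word-copy loops.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
type CopyWord = u128; // a wider unit, if it turns out to be profitable there
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
type CopyWord = usize;
```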
```rust
if likely(n >= WORD_COPY_THRESHOLD) {
    // Align dest
    // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
    let dest_misalignment = (dest as usize).wrapping_neg() & WORD_MASK;
```
Would it be possible to use the `align_offset` method here to avoid bit tricks?
From `align_offset`'s documentation:
If it is not possible to align the pointer, the implementation returns usize::MAX. It is permissible for the implementation to always return usize::MAX. Only your algorithm's performance can depend on getting a usable offset here, not its correctness.
So that rules out the use of `align_offset`. Also note that this is a very hot path, and I would prefer simple bit tricks rather than relying on LLVM to optimise away a very complex function.
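For reference, a minimal sketch of the bit trick in question, spelling out the constants it assumes (`WORD_SIZE`/`WORD_MASK` as used in the diff above):

```rust
const WORD_SIZE: usize = core::mem::size_of::<usize>();
const WORD_MASK: usize = WORD_SIZE - 1;

/// Number of bytes that must be copied bytewise before `dest` becomes
/// word-aligned. Branch-free equivalent of
/// `(WORD_SIZE - (dest as usize % WORD_SIZE)) % WORD_SIZE`.
fn dest_misalignment(dest: *mut u8) -> usize {
    (dest as usize).wrapping_neg() & WORD_MASK
}
```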
Could you try benchmarking to see if there is overhead?
It's not related to performance; it simply cannot be used, for correctness reasons:
It is permissible for the implementation to always return usize::MAX.
```rust
if likely(src_misalignment == 0) {
    copy_forward_aligned_words(dest, src, n_words);
} else {
    copy_forward_misaligned_words(dest, src, n_words);
```
Out of curiosity, have you tested simply copying using `ptr::read_unaligned` and `ptr::write_unaligned`? That way alignment wouldn't be an issue, but I don't know the performance impact that would have on various platforms.
Well, first of all `ptr::read_unaligned` and `ptr::write_unaligned` call `ptr::copy_nonoverlapping`, which translates to a memcpy, so I would like to avoid the possibility of a recursion if LLVM doesn't optimise them away.
Secondly, `ptr::read_unaligned` has really poor performance. On platforms without misaligned memory access support it will translate to `size_of::<usize>()` byte loads.
This branch is necessary because we don't want to bear the burden of all the shifts and checks needed for misaligned loads if dest and src are perfectly co-aligned.
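To make the trade-off concrete, a rough sketch of the `read_unaligned`/`write_unaligned` alternative being discussed (not what this PR does); the function name and tail handling are assumptions for illustration:

```rust
use core::{mem, ptr};

/// Sketch of the rejected alternative: copy word-by-word with unaligned
/// loads/stores and ignore alignment entirely. These intrinsics are themselves
/// defined in terms of `ptr::copy_nonoverlapping`, hence the recursion concern
/// above; on targets without hardware misaligned access they lower to
/// per-byte loads/stores.
unsafe fn copy_forward_unaligned(dest: *mut u8, src: *const u8, n: usize) {
    let word = mem::size_of::<usize>();
    let mut i = 0;
    while i + word <= n {
        let v = ptr::read_unaligned(src.add(i) as *const usize);
        ptr::write_unaligned(dest.add(i) as *mut usize, v);
        i += word;
    }
    // Tail bytes that don't fill a whole word.
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
}
```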
Have you tried this out and seen infinite recursion? Have you tried it out and seen if it's slower?
There'll be an infinite recursion if compiled in debug mode.
On architectures that do not support misaligned loads (so most ISAs other than armv8 and x86/x64), the performance is much worse because it generates 8 byte loads and 16 bit ops rather than 1 word load and 3 bit ops.
src/mem/impls.rs (outdated)
```rust
while dest < dest_end {
    *dest = *src;
    dest = dest.add(1);
    src = src.add(1);
```
In general LLVM has a bunch of heuristics for auto-vectorizing loops like these, and the auto-vectorized code is probably faster as well.
LLVM doesn't auto-vectorize on all platforms, though. What I'm worried about is that this hurts performance on the platforms that do auto-vectorize. Would it be possible to survey some platforms to see whether the simple function implementations are enough once LLVM auto-vectorizes them?
You can check the benchmarks on x86-64. The code submitted in this PR is actually my second attempt; my first attempt ended up with poor performance because it did not vectorize very well.
The current design, which computes dest_end inside these helper functions, is a result of learning from that failed attempt. On platforms that vectorize, the code is simple enough to be detected by LLVM, so it is effectively treated like a `while i < n` style loop and there is no regression, while on platforms that do not vectorize, LLVM won't bookkeep the `i` variable (yes, LLVM couldn't optimise away `i`), which is roughly a 20% performance improvement.
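The pattern being described, sketched out; it mirrors the bytewise helper shown in the diff above, though the exact naming may differ from the final code:

```rust
/// Advance pointers toward a precomputed `dest_end` instead of keeping an
/// index `i`: on targets where LLVM does not vectorize, this avoids carrying
/// an extra loop counter, which is where the ~20% win mentioned above comes from.
unsafe fn copy_forward_bytes(mut dest: *mut u8, mut src: *const u8, n: usize) {
    let dest_end = dest.add(n);
    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
    }
}
```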
I don't really understand the purpose of benchmarking on x86_64? This file isn't even used on that platform and it seems like we wouldn't want to use it anyway because LLVM otherwise auto-vectorizes very well and probably outperforms what's in this file.
I'm interested in getting data on this for platforms other than x86_64 for those reasons.
That's why I turned off auto-vectorization, to mimic platforms without it. I did observe a very significant improvement on RISC-V, but I couldn't just run `cargo bench` on a no_std platform. Even on x86-64 the code still provides a significant performance improvement for backward copy, though.
I have tested it on RISC-V. The RISC-V platform that I tested on is no_std, so there isn't a bencher, but it speeds up my entire program by 25%. It could hurt platforms that have both vectorization and hardware misaligned load support, like x86-64 and ARM, but I am not sure what is best to do there, e.g. how to detect whether vectorization is on and how to make use of unaligned load support. BTW, I just realized the benches do not test the misaligned scenario; I'll add a benchmark soon.
Misaligned benchmarks: Rust Builtin; New Implementation; Old Implementation; New, with no-vectorize-loops; Old, with no-vectorize-loops. (Result tables collapsed.)
I expected the misaligned code path to hurt x86-64: it is slower with vectorization on but faster with vectorization off. It is always faster for memmove, presumably because LLVM couldn't vectorize it.
Indeed, yeah, I'm specifically worried about cases where this makes existing platforms worse because LLVM otherwise does something well today. This whole PR should arguably be replaced with an improved LLVM backend for the relevant targets, but that's naturally a very large undertaking, so it seems fine to have this in the meantime while we wait for LLVM to improve. I would prefer not to have an overly complicated implementation, since it seems likely to only get used on some platforms. Having a complicated implementation is also a risk because I don't think anything here is tested on CI, since CI mostly does x86_64. Basically, I think that for this there should at least be performance numbers for ARM/AArch64 and an investigation into what LLVM does today in terms of vectorization and performance of unaligned loads/stores. Additionally, it would be great to ensure that everything here is tested on CI regardless of the platform, if possible.
I would argue this isn't overly complicated. The concept is very simple and is what C libraries already do. Benchmarking on ARM/AArch64 might not be very helpful because they have both vectorization and misaligned load/store support, but I would still expect a memmove performance improvement (I couldn't test it myself because I don't have an ARM dev environment set up). This PR is mainly intended to be a much better default for platforms without those features (a quick test on godbolt suggests this includes armv7, powerpc, mips, riscv and more). As for CI, it should be noted that a few platforms are already being tested (including big-endian ones like PowerPC).
Hi, I've carried out some benchmarks on an aarch64-unknown-linux-gnu platform (Graviton 2, Neoverse-N1) to alleviate some concerns. I rebased the branch onto something more recent before carrying these out. (All numbers are in ns/iter; +/- ranges were never more than 1% of the result.)
The short story is that the results look similar to the ones observed above on x86_64.
I suppose I should conditionally select different versions of the code for the misaligned case. For architectures (e.g. x86/ARM) that have hardware misaligned access support, it might be better to just do misaligned loads instead of trying to do bit shifts.
@adamgemmell I have added a different code path for the misaligned case on x86/x64/aarch64. Can you do another bench run? Thanks!
Same machine, same branch base, same compiler (which I forgot to mention was 1.56.0-nightly (2faabf579 2021-07-27)). Master runs looked close to what I was getting previously.
Looks like a good change; it really brings the misaligned numbers in line with the other tests. There's a weird doubling of performance for the memset no-vec case on this machine. Not complaining, but I kinda want to look into it more tomorrow. I couldn't replicate anything similar on other machines I tried.
You can get even better performance with unaligned memory accesses by overlapping the accesses:
I recommend reading through these optimized implementations of
For platforms such as RISC-V without support for unaligned accesses, I'm surprised that
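A hedged sketch of the "overlapping unaligned accesses" idea for small copies; this is illustrative only, not the implementations referenced above, and it only makes sense on targets with cheap misaligned loads/stores. The size window and function name are assumptions:

```rust
use core::ptr;

/// Copy `n` bytes where `8 <= n <= 16`, using two possibly-overlapping
/// unaligned 8-byte accesses instead of a byte loop: one covering the start
/// of the range and one covering the end.
unsafe fn copy_small_overlapping(dest: *mut u8, src: *const u8, n: usize) {
    debug_assert!((8..=16).contains(&n));
    let head = ptr::read_unaligned(src as *const u64);
    let tail = ptr::read_unaligned(src.add(n - 8) as *const u64);
    ptr::write_unaligned(dest as *mut u64, head);
    ptr::write_unaligned(dest.add(n - 8) as *mut u64, tail);
}
```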
Small
I don't understand how overlapping is an issue for
The tricks applied can slow down some processors while speeding up others. For example, the use of branches in small memset and memcpy will slow down chips with simple branch predictors because there are many data-dependent branches. On the other hand, loops are much easier to predict.
The number of instructions drops from 8 byte-sized loads + 8 byte-sized stores per word copied to 1 word-sized load + 1 word-sized store + 3 arithmetic ops per word copied. LLVM actually wouldn't even unroll the loop, so loop overhead would hurt the bytewise copy even further. While some of the tricks you described are interesting, my main focus is on RISC-V, which wouldn't benefit from those optimisations and might not even exercise those code paths. I am afraid it's outside my capacity to optimise for AArch64.
Failed at |
Tried to run the workflow with the current HEAD of master on my fork. Confirmed that it failed: https://github.com/nbdd0121/compiler-builtins/actions/runs/1184124369.
I had a look at the powerpc failure and it seems to be an LLVM bug (I'm still working on a minimal reproduction). In the meantime I think it's fine to merge this PR.
Published in compiler-builtins 0.1.50.
Addresses two classes of icache thrash present in the interrupt service path, e.g.:

```asm
let mut prios = [0u128; 16];
40380d44: ec840513  addi  a0,s0,-312
40380d48: 10000613  li    a2,256
40380d4c: ec840b93  addi  s7,s0,-312
40380d50: 4581      li    a1,0
40380d52: 01c85097  auipc ra,0x1c85
40380d56: 11e080e7  jalr  286(ra) # 42005e70 <memset>
```

and

```asm
prios
40380f9c: dc840513  addi  a0,s0,-568
40380fa0: ec840593  addi  a1,s0,-312
40380fa4: 10000613  li    a2,256
40380fa8: dc840493  addi  s1,s0,-568
40380fac: 01c85097  auipc ra,0x1c85
40380fb0: eae080e7  jalr  -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves dramatically with these routines 1) reimplemented for the esp32 RISC-V µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit calls to these functions, and the versions that ship with compiler-builtins are [highly tuned] for other platforms. It seems like the expectation is that the compiler-builtins versions are "reasonable defaults," and they are [weakly linked] specifically to allow the kind of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large implementation that adds a lot of frequent cache pressure for minimal wins:

```readelf
  Num:    Value  Size Type  Bind  Vis    Ndx Name
27071: 42005f72    22 FUNC  LOCAL HIDDEN   3 memcpy
27072: 42005f88    22 FUNC  LOCAL HIDDEN   3 memset
28853: 42005f9e   186 FUNC  LOCAL HIDDEN   3 compiler_builtins::mem::memcpy
28854: 42006058   110 FUNC  LOCAL HIDDEN   3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned loads/stores across the instruction bus; at least in my testing this hasn't been a problem, because they are simply never invoked in that context.

Additionally, these are just about the simplest possible implementations, with word-sized copies being the only concession made to runtime performance. Even a small amount of additional effort would probably yield fairly massive wins, as three- or four-instruction hot loops like these are basically pathological for the 'c3's pipeline implementation that seems to predict all branches as "never taken." However: there is a real danger in overtraining on the microbenchmarks here, too, as I would expect almost no one has a program whose runtime is dominated by these functions. Making these functions larger and more complex to eke out wins from architectural niches makes LLVM much less willing to inline them, costing additional function calls and preventing e.g. dead code elimination for always-aligned addresses or automatic loop unrolling, etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
They are definitely UB, and IMO this needs to be fixed. Having UB in our own runtime while we tell people "never ever have UB anywhere" is not a good look.
Partially addresses #339. (memcmp is not implemented in this PR)
This PR replaces the current simple implementations of memcpy, memmove and memset with more sophisticated ones.
The code first aligns dest to a machine-word boundary using a bytewise copy.
If dest and src are co-aligned (i.e. the dest and src addresses are congruent modulo the machine-word size), it proceeds to copy as many bytes as possible with machine-word-wide copies. Remaining bytes are copied bytewise.
If dest and src are not co-aligned, the misaligned copy is performed by reading two adjacent machine words and assembling them with logical shifts before writing to dest.
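A simplified sketch of that misaligned forward copy for the little-endian case; the prologue that aligns dest and the handling of tail bytes are elided, so this illustrates the technique rather than the PR's exact code:

```rust
/// `dest` is word-aligned, `src` is not (so the byte shift is 1..=WORD_SIZE-1),
/// and `n_words` full words are to be written. Each output word is assembled
/// from two adjacent aligned source words with shifts. Note: the first/last
/// aligned reads touch a few bytes outside the source range, which is the
/// potential-UB point discussed in the review comments.
unsafe fn copy_forward_misaligned_words(dest: *mut u8, src: *const u8, n_words: usize) {
    const WORD_SIZE: usize = core::mem::size_of::<usize>();
    let shift_bytes = (src as usize) & (WORD_SIZE - 1); // nonzero on this path
    let mut src_words = ((src as usize) & !(WORD_SIZE - 1)) as *const usize;
    let mut dest_words = dest as *mut usize;
    let dest_end = dest_words.add(n_words);

    // First aligned word: holds the first `WORD_SIZE - shift_bytes` source bytes.
    let mut prev = *src_words;
    while dest_words < dest_end {
        src_words = src_words.add(1);
        let next = *src_words;
        // Little-endian: low bytes come from `prev`, high bytes from `next`.
        *dest_words = (prev >> (8 * shift_bytes)) | (next << (8 * (WORD_SIZE - shift_bytes)));
        prev = next;
        dest_words = dest_words.add(1);
    }
}
```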
The existing implementation can be pretty fast on platforms with SIMD and vectorization, but falls short on those without. The code in this PR is carefully written so that it is fast on such platforms while still being vectorizable.
Here are some performance numbers:
Rust Builtin; New Implementation; Old Implementation; New, with no-vectorize-loops; Old, with no-vectorize-loops. (Result tables collapsed.)
All of the above numbers were taken on my Coffee Lake laptop with the no-asm feature on.
The numbers show that if vectorization is enabled and supported on the platform, the memcpy and memset performance of the old and new implementations is similar (memmove is much faster; it seems the compiler has a hard time vectorizing backward copy). If there is no vectorization, or the platform does not support misaligned access, the new implementation gives a very significant performance gain (>4x with the numbers above; on in-order scalar cores you'd expect a 3-8x improvement depending on whether the access is aligned and on the width of usize).
I inspected the generated assembly for RISC-V and it seems pretty close in quality to my hand-optimised assembly.