Optimize memcpy, memmove and memset #405

Merged
merged 5 commits into from Aug 31, 2021

Conversation

@nbdd0121 (Contributor) commented Feb 9, 2021

Partially addresses #339. (memcmp is not implemented in this PR)

This PR replaces the current simple implementations of memcpy, memmove and memset with more sophisticated ones.
It first aligns dest to a machine-word boundary using a bytewise copy.
If dest and src are co-aligned (i.e. the dest and src addresses are congruent modulo the machine-word size), it then uses machine-word-wide copies for as many bytes as possible; the remaining bytes are copied bytewise.
If dest and src are not co-aligned, a misaligned copy is performed by reading two adjacent machine words and assembling them with logical shifts before writing to dest.
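
For reference, a condensed sketch of the word-wise strategy in the forward direction (a simplified illustration, not the exact code in src/mem/impls.rs; the constant names mirror the PR):

```rust
use core::mem::size_of;

const WORD_SIZE: usize = size_of::<usize>();
const WORD_MASK: usize = WORD_SIZE - 1;

/// Simplified forward copy: byte-copy until `dest` is word-aligned, then copy
/// whole words while `dest` and `src` are co-aligned, then finish bytewise.
/// (The misaligned-word path is omitted from this sketch.)
unsafe fn copy_forward_sketch(mut dest: *mut u8, mut src: *const u8, mut n: usize) {
    // 1. Bytewise copy until `dest` sits on a word boundary.
    while n > 0 && (dest as usize) & WORD_MASK != 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }
    // 2. Co-aligned case: `src` landed on a word boundary too, so copy words.
    if (src as usize) & WORD_MASK == 0 {
        while n >= WORD_SIZE {
            *(dest as *mut usize) = *(src as *const usize);
            dest = dest.add(WORD_SIZE);
            src = src.add(WORD_SIZE);
            n -= WORD_SIZE;
        }
    }
    // 3. Tail (and, in this sketch, the whole misaligned case) is bytewise.
    while n > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }
}
```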

The existing implementation can be pretty fast on platforms with SIMD and vectorization, but falls short on those without. The code in this PR is carefully written so that it is fast on such platforms while still being vectorizable.

Here are some performance numbers:

Rust Builtin
test memcpy_builtin_1048576        ... bench:      33,388 ns/iter (+/- 2,619) = 31405 MB/s
test memcpy_builtin_1048576_offset ... bench:      33,988 ns/iter (+/- 2,371) = 30851 MB/s
test memcpy_builtin_4096           ... bench:          37 ns/iter (+/- 2) = 110702 MB/s
test memcpy_builtin_4096_offset    ... bench:          37 ns/iter (+/- 1) = 110702 MB/s
test memmove_builtin_1048576       ... bench:      28,862 ns/iter (+/- 1,013) = 36330 MB/s
test memmove_builtin_4096          ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
test memset_builtin_1048576        ... bench:      21,171 ns/iter (+/- 1,369) = 49528 MB/s
test memset_builtin_1048576_offset ... bench:      21,516 ns/iter (+/- 906) = 48734 MB/s
test memset_builtin_4096           ... bench:          62 ns/iter (+/- 0) = 66064 MB/s
test memset_builtin_4096_offset    ... bench:          68 ns/iter (+/- 2) = 60235 MB/s
New Implementation
test memcpy_rust_1048576           ... bench:      40,900 ns/iter (+/- 2,602) = 25637 MB/s
test memcpy_rust_1048576_offset    ... bench:      46,575 ns/iter (+/- 2,994) = 22513 MB/s
test memcpy_rust_4096              ... bench:          76 ns/iter (+/- 4) = 53894 MB/s
test memcpy_rust_4096_offset       ... bench:          95 ns/iter (+/- 3) = 43115 MB/s
test memmove_rust_1048576          ... bench:      50,363 ns/iter (+/- 2,829) = 20820 MB/s
test memmove_rust_4096             ... bench:         176 ns/iter (+/- 6) = 23272 MB/s
test memset_rust_1048576           ... bench:      27,171 ns/iter (+/- 1,168) = 38591 MB/s
test memset_rust_1048576_offset    ... bench:      36,623 ns/iter (+/- 1,418) = 28631 MB/s
test memset_rust_4096              ... bench:          74 ns/iter (+/- 5) = 55351 MB/s
test memset_rust_4096_offset       ... bench:          95 ns/iter (+/- 2) = 43115 MB/s
Old Implementation
test memcpy_rust_1048576           ... bench:      43,626 ns/iter (+/- 5,963) = 24035 MB/s
test memcpy_rust_1048576_offset    ... bench:      47,138 ns/iter (+/- 4,029) = 22244 MB/s
test memcpy_rust_4096              ... bench:          73 ns/iter (+/- 3) = 56109 MB/s
test memcpy_rust_4096_offset       ... bench:          90 ns/iter (+/- 3) = 45511 MB/s
test memmove_rust_1048576          ... bench:     305,743 ns/iter (+/- 12,242) = 3429 MB/s
test memmove_rust_4096             ... bench:       1,071 ns/iter (+/- 60) = 3824 MB/s
test memset_rust_1048576           ... bench:      28,485 ns/iter (+/- 2,085) = 36811 MB/s
test memset_rust_1048576_offset    ... bench:      36,999 ns/iter (+/- 316) = 28340 MB/s
test memset_rust_4096              ... bench:          69 ns/iter (+/- 0) = 59362 MB/s
test memset_rust_4096_offset       ... bench:          90 ns/iter (+/- 9) = 45511 MB/s
New, with no-vectorize-loops
test memcpy_rust_1048576           ... bench:      70,713 ns/iter (+/- 24,368) = 14828 MB/s
test memcpy_rust_1048576_offset    ... bench:      68,340 ns/iter (+/- 5,622) = 15343 MB/s
test memcpy_rust_4096              ... bench:         272 ns/iter (+/- 15) = 15058 MB/s
test memcpy_rust_4096_offset       ... bench:         279 ns/iter (+/- 7) = 14681 MB/s
test memmove_rust_1048576          ... bench:      68,726 ns/iter (+/- 3,034) = 15257 MB/s
test memmove_rust_4096             ... bench:         275 ns/iter (+/- 10) = 14894 MB/s
test memset_rust_1048576           ... bench:      42,382 ns/iter (+/- 1,542) = 24741 MB/s
test memset_rust_1048576_offset    ... bench:      42,585 ns/iter (+/- 1,722) = 24623 MB/s
test memset_rust_4096              ... bench:         145 ns/iter (+/- 5) = 28248 MB/s
test memset_rust_4096_offset       ... bench:         146 ns/iter (+/- 7) = 28054 MB/s
Old, with no-vectorize-loops
test memcpy_rust_1048576           ... bench:     306,552 ns/iter (+/- 19,395) = 3420 MB/s
test memcpy_rust_1048576_offset    ... bench:     311,564 ns/iter (+/- 14,247) = 3365 MB/s
test memcpy_rust_4096              ... bench:       1,076 ns/iter (+/- 63) = 3806 MB/s
test memcpy_rust_4096_offset       ... bench:       1,074 ns/iter (+/- 30) = 3813 MB/s
test memmove_rust_1048576          ... bench:     315,208 ns/iter (+/- 158,448) = 3326 MB/s
test memmove_rust_4096             ... bench:       1,077 ns/iter (+/- 33) = 3803 MB/s
test memset_rust_1048576           ... bench:     303,456 ns/iter (+/- 7,937) = 3455 MB/s
test memset_rust_1048576_offset    ... bench:     304,886 ns/iter (+/- 21,495) = 3439 MB/s
test memset_rust_4096              ... bench:       1,070 ns/iter (+/- 30) = 3828 MB/s
test memset_rust_4096_offset       ... bench:       1,064 ns/iter (+/- 99) = 3849 MB/s

All of the above numbers were measured on my Coffee Lake laptop, with the no-asm feature on.

The numbers show that when vectorization is enabled and supported on the platform, memcpy and memset performance of the old and new implementations is similar (memmove is much faster, though; it seems the compiler had a hard time vectorizing the backward copy). If there is no vectorization, or the platform does not support misaligned access, the new implementation gives a very significant performance gain (>4x with the numbers above; on in-order scalar cores you'd expect a 3-8x improvement, depending on whether the access is aligned and on the width of usize).

I inspected the generated assembly for RISC-V and it is pretty close in quality to my hand-optimised assembly.

@nbdd0121 nbdd0121 changed the title Add test cases for memcpy, memmove and memset for different alignment Optimize memcpy, memmove and memset Feb 9, 2021
@nbdd0121 (Contributor, author) commented Feb 9, 2021

Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB. For example, if we are copying from 0x10007-0x10017, the code will access the entire machine word containing the byte 0x10007 (thus also 0x10000-0x10006) and the machine word containing 0x10010-0x10016 (thus also 0x10017), both of which reach outside the bounds supplied to memcpy/memmove.

I don't think that UB is ever exploitable by the compiler though, and I am not aware of any architecture where such accesses could cause trouble.
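
To make the rounding concrete, a small illustration (hypothetical helper, not code from the PR):

```rust
const WORD_SIZE: usize = core::mem::size_of::<usize>();

/// Half-open address range touched by the aligned word load covering `addr`.
/// For addr = 0x10007 and 8-byte words this is 0x10000..0x10008, i.e. the load
/// also reads 0x10000-0x10006, which lie outside the range given to memcpy.
fn word_covering(addr: usize) -> core::ops::Range<usize> {
    let start = addr & !(WORD_SIZE - 1);
    start..start + WORD_SIZE
}
```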

@nbdd0121 (Contributor, author) commented Feb 9, 2021

Oh, big-endian ISAs 🤦, will fix soon.

@bjorn3 (Member) commented Feb 9, 2021

I don't think that UB is ever exploitable by the compiler though, and I am not aware of any architecture where such accesses could cause trouble.

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

@nbdd0121 (Contributor, author) commented Feb 9, 2021

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

Hmm, interesting. The platforms with this kind of protection feature that I know of usually track allocations and deallocations, and those memory regions will be machine-word aligned. The OOB access here is strictly limited to accessing a machine word that contains an accessible byte.

@nbdd0121 (Contributor, author) commented Feb 9, 2021

Anyway, I could force at least 7 bytes to be copied bytewise before and after the unaligned copy, which would be UB-free. But that would hurt other platforms where this specific kind of OOB access is permitted.

@alexcrichton (Member) left a comment

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

src/mem/impls.rs
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
unsafe fn copy_forward_aligned_words(dest: *mut u8, src: *const u8, n: usize) {
Member

Since this function is taking aligned words, could this perhaps take usize pointers? Additionally should this take u64 pointers perhaps? Or maybe even u128?

Contributor Author

It could, but I use u8 pointers to keep the signature identical to the other functions.

I thought about using u64 or u128 here, but it turns out to be bad for a few reasons:

  • Many platforms, especially those without vectorization, can load at most one usize per instruction.
  • It requires more alignment than usize, so more bytes have to be copied bytewise.
  • It makes the misaligned scenario worse: shifting and reassembling 2×usize actually generates worse code than 1×usize.

However, one optimisation that I haven't included in this PR is unrolling the loop for the aligned scenario. If n is a multiple of 8× the word size, for example, we can have an 8×-unrolled loop followed by a loop for the remainder. In Clang this could be done easily with #pragma unroll, but I don't think an equivalent is currently possible in Rust, so we might have to do it manually.
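
A hypothetical shape for the manual unroll mentioned above (illustration only; not part of this PR):

```rust
/// 8x manually unrolled aligned word copy; the trailing loop handles whatever
/// is left when `n_words` is not a multiple of 8.
unsafe fn copy_words_unrolled(dest: *mut usize, src: *const usize, n_words: usize) {
    let mut i = 0;
    while i + 8 <= n_words {
        *dest.add(i) = *src.add(i);
        *dest.add(i + 1) = *src.add(i + 1);
        *dest.add(i + 2) = *src.add(i + 2);
        *dest.add(i + 3) = *src.add(i + 3);
        *dest.add(i + 4) = *src.add(i + 4);
        *dest.add(i + 5) = *src.add(i + 5);
        *dest.add(i + 6) = *src.add(i + 6);
        *dest.add(i + 7) = *src.add(i + 7);
        i += 8;
    }
    while i < n_words {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
}
```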

Member

I feel like taking usize here would make the intent clearer, since it's copying word by word. Perhaps there could be a typedef in the module for the copy type? Some platforms may perform better with some types than others, so I could see it being conditional as well.

For unrolling I would expect LLVM to do this automatically if supported.
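
A sketch of what the usize-pointer signature with a module-level alias could look like (names are hypothetical; this is not what the PR settled on):

```rust
/// Hypothetical module-level alias so the copy unit is spelled in one place and
/// could later be made conditional per target.
type CopyWord = usize;

/// Word-by-word copy with the intent visible in the signature.
unsafe fn copy_aligned_words(dest: *mut CopyWord, src: *const CopyWord, n_words: usize) {
    for i in 0..n_words {
        *dest.add(i) = *src.add(i);
    }
}
```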

src/mem/impls.rs
    if likely(n >= WORD_COPY_THRESHOLD) {
        // Align dest
        // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
        let dest_misalignment = (dest as usize).wrapping_neg() & WORD_MASK;
Member

Would it be possible to use the align_offset method here to avoid bit-tricks?

Contributor Author

From align_offset's doc:

If it is not possible to align the pointer, the implementation returns usize::MAX. It is permissible for the implementation to always return usize::MAX. Only your algorithm's performance can depend on getting a usable offset here, not its correctness.

So that rules out the use of align_offset. Also, note that this is the very hot path, and I would prefer simple bit tricks rather than relying on LLVM to optimise away a very complex function.
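
For readers unfamiliar with the bit trick in the snippet above, a worked example of the computation (illustrative):

```rust
const WORD_SIZE: usize = core::mem::size_of::<usize>();
const WORD_MASK: usize = WORD_SIZE - 1;

fn main() {
    // Bytes needed to bring `dest` up to the next word boundary.
    // With 8-byte words, dest = 0x1003 gives (-0x1003) & 7 = 5.
    let dest = 0x1003usize;
    let dest_misalignment = dest.wrapping_neg() & WORD_MASK;
    assert_eq!(dest_misalignment, (WORD_SIZE - (dest & WORD_MASK)) & WORD_MASK);
    println!("{}", dest_misalignment); // 5 on a 64-bit target
}
```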

Member

Could you try benchmarking to see if there is overhead?

Contributor Author

It's not related to performance; it simply cannot be used at all, for correctness:

It is permissible for the implementation to always return usize::MAX.

        if likely(src_misalignment == 0) {
            copy_forward_aligned_words(dest, src, n_words);
        } else {
            copy_forward_misaligned_words(dest, src, n_words);
Member

Out of curiosity, have you tested simply copying using ptr::read_unaligned and ptr::write_unaligned? That way alignment wouldn't be an issue, but I don't know the performance impact that would have on various platforms.

Contributor Author

Well, first of all, ptr::read_unaligned and ptr::write_unaligned call ptr::copy_nonoverlapping, which translates to memcpy, so I would like to avoid the possibility of recursion if LLVM doesn't optimise them away.

Secondly, ptr::read_unaligned has really poor performance. On platforms without misaligned memory access support it translates to size_of::<usize>() separate byte loads.

This branch is necessary because we don't want to bear the burden of all the shifts and checks necessary for misaligned loads if dest and src are perfectly co-aligned.

Member

Have you tried this out and seen infinite recursion? Have you tried it out and seen if it's slower?

Contributor Author

There'll be an infinite recursion if compiled in debug mode.

On architectures that do not support misaligned loads (so most ISAs other than ARMv8 and x86-64), performance is much worse, because it generates 8 byte loads and 16 bit-ops rather than 1 word load and 3 bit-ops per word.
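
For context, a sketch of the shift-and-combine loop being described (little-endian byte order assumed; names and bounds handling are illustrative, not the PR's code):

```rust
/// Copy `n_words` words to a word-aligned `dest` from a source that sits
/// `shift_bytes` (1 to word size minus 1) past the word boundary `src_aligned`.
/// Each iteration is one aligned word load plus a shift/or pair, reusing the
/// previously loaded word, instead of byte-by-byte loads.
unsafe fn copy_forward_misaligned_sketch(
    dest: *mut usize,
    src_aligned: *const usize,
    shift_bytes: usize,
    n_words: usize,
) {
    let shift = shift_bytes * 8;
    let mut prev = *src_aligned; // first load partially covers bytes before `src`
    for i in 0..n_words {
        let next = *src_aligned.add(i + 1);
        // Little-endian: low bytes come from `prev`, high bytes from `next`.
        *dest.add(i) = (prev >> shift) | (next << (usize::BITS as usize - shift));
        prev = next;
    }
}
```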

src/mem/impls.rs
    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
Member

In general LLVM has a bunch of heuristics for auto-vectorizing loops like these, and the auto-vectorized code is probably faster as well.

LLVM doesn't auto-vectorize on all platforms, though. What I'm worried about is that this hurts performance on platforms that do auto-vectorize. Would it be possible to survey some platforms to see whether the simple implementations are enough once LLVM auto-vectorizes them?

Contributor Author

You can check the benchmarks on x86-64. The code submitted in this PR is actually my second attempt; my first attempt ended up with poor performance because it did not vectorize very well.

The current design, which computes dest_end inside these helper functions, is a result of learning from that failed attempt. On platforms that vectorize, the code is simple enough for LLVM to recognise and effectively treat as a `while i < n` style loop, so there is no regression; on platforms that don't vectorize, LLVM no longer has to bookkeep the i variable (yes, LLVM couldn't optimise i away), which gives roughly a 20% performance improvement.
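
Roughly, the two loop shapes being contrasted (illustration only):

```rust
// Index form: vectorizes well, but on non-vectorizing targets the induction
// variable `i` is kept alive alongside the pointers.
unsafe fn copy_indexed(dest: *mut u8, src: *const u8, n: usize) {
    let mut i = 0;
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
}

// Pointer-bump form with a precomputed end pointer: still recognizable to the
// vectorizer, but needs no separate counter when vectorization is unavailable.
unsafe fn copy_ptr_bump(mut dest: *mut u8, mut src: *const u8, n: usize) {
    let dest_end = dest.add(n);
    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
    }
}
```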

Member

I don't really understand the purpose of benchmarking on x86_64? This file isn't even used on that platform and it seems like we wouldn't want to use it anyway because LLVM otherwise auto-vectorizes very well and probably outperforms what's in this file.

I'm interested in getting data on this for platforms other than x86_64 for those reasons.

Contributor Author

That's why I turned off auto-vectorization, to mimic platforms without it. I did observe a very significant improvement on RISC-V, but I couldn't just run cargo bench on a no_std platform. Even on x86-64 the code still provides a significant performance improvement for backward copies, though.

@nbdd0121 (Contributor, author)

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

I have tested it on RISC-V. The RISC-V platform I tested it on is no_std, so there isn't a bencher, but the change speeds up my entire program by 25%.

It could hurt platforms that have both vectorization and hardware misaligned load support, like x86-64 and ARM, but I am not sure what is best to do there, e.g. how to detect whether vectorization is on and how to make use of unaligned load support.

BTW, I just realized the benches do not test the misaligned scenario. I'll add a benchmark soon.

@nbdd0121 (Contributor, author)

Misaligned benchmarks:

Rust Builtin
test memcpy_builtin_1048576_misalign  ... bench:      33,722 ns/iter (+/- 1,310) = 31094 MB/s
test memcpy_builtin_4096_misalign     ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
test memmove_builtin_1048576_misalign ... bench:      34,791 ns/iter (+/- 1,177) = 30139 MB/s
test memmove_builtin_4096_misalign    ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
New Implementation
test memcpy_rust_1048576_misalign     ... bench:      56,567 ns/iter (+/- 1,812) = 18536 MB/s
test memcpy_rust_4096_misalign        ... bench:         215 ns/iter (+/- 5) = 19051 MB/s
test memmove_rust_1048576_misalign    ... bench:     136,350 ns/iter (+/- 5,338) = 7690 MB/s
test memmove_rust_4096_misalign       ... bench:         550 ns/iter (+/- 27) = 7447 MB/s
Old Implementation
test memcpy_rust_1048576_misalign     ... bench:      49,171 ns/iter (+/- 3,514) = 21325 MB/s
test memcpy_rust_4096_misalign        ... bench:         108 ns/iter (+/- 5) = 37925 MB/s
test memmove_rust_1048576_misalign    ... bench:     306,153 ns/iter (+/- 10,100) = 3425 MB/s
test memmove_rust_4096_misalign       ... bench:       1,070 ns/iter (+/- 31) = 3828 MB/s
New, with no-vectorize-loops
test memcpy_rust_1048576_misalign     ... bench:     138,407 ns/iter (+/- 5,799) = 7576 MB/s
test memcpy_rust_4096_misalign        ... bench:         544 ns/iter (+/- 41) = 7529 MB/s
test memmove_rust_1048576_misalign    ... bench:     139,043 ns/iter (+/- 5,207) = 7541 MB/s
test memmove_rust_4096_misalign       ... bench:         551 ns/iter (+/- 8) = 7433 MB/s
Old, with no-vectorize-loops
test memcpy_rust_1048576_misalign     ... bench:     308,055 ns/iter (+/- 5,330) = 3403 MB/s
test memcpy_rust_4096_misalign        ... bench:       1,073 ns/iter (+/- 13) = 3817 MB/s
test memmove_rust_1048576_misalign    ... bench:     308,875 ns/iter (+/- 27,410) = 3394 MB/s
test memmove_rust_4096_misalign       ... bench:       1,074 ns/iter (+/- 36) = 3813 MB/s

As I expected, the misaligned code path hurts x86-64: it is slower with vectorization on but faster with vectorization off. It is always faster for memmove, presumably because LLVM couldn't vectorize it.

@alexcrichton (Member)

Indeed, yeah, I'm specifically worried about cases where this makes existing platforms worse because LLVM otherwise does something well today. This whole PR should arguably be replaced with an improved LLVM backend for the relevant targets, but that's naturally a very large undertaking, so it seems fine to have this in the meantime while we wait for LLVM to improve.

I would prefer not to have an overly complicated implementation, since it seems likely to only get used on some platforms. A complicated implementation is also a risk because I don't think anything here is tested on CI, since CI mostly covers x86_64.

Basically, I think there should at least be performance numbers for ARM/AArch64 and an investigation into what LLVM does today in terms of vectorization and performance of unaligned loads/stores. Additionally, it would be great to ensure that everything here is tested on CI regardless of platform, if possible.

@nbdd0121 (Contributor, author)

I would argue this isn't overly complicated. The concept is very simple and is what C libraries do already.

Benchmarking on ARM/AArch64 might not be very helpful because it has both vectorization and misaligned load/store support, but I would still expect a memmove performance improvement (I couldn't test it myself because I don't have an ARM dev environment set up). This PR is mainly intended to be a much better default for platforms without those features (a quick test on godbolt suggests this includes armv7, powerpc, mips, riscv and more).

As for CI, it should be noted that a few platforms are already being tested (including big-endian ones like PowerPC).

@adamgemmell commented Aug 20, 2021

Hi, I've carried out some benchmarks on an aarch64-unknown-linux-gnu platform (Graviton 2, Neoverse-N1) to alleviate some concerns. I rebased the branch to something more recent before carrying these out.

(All numbers are ns/iter; the +/- ranges were never more than 1% of the result.)

| Test Name | Result |
| --- | --- |
| memcpy_builtin_1048576 | 45562 |
| memcpy_builtin_1048576_misalign | 47097 |
| memcpy_builtin_1048576_offset | 47052 |
| memcpy_builtin_4096 | 108 |
| memcpy_builtin_4096_misalign | 111 |
| memcpy_builtin_4096_offset | 111 |
| memmove_builtin_1048576 | 42840 |
| memmove_builtin_1048576_misalign | 62671 |
| memmove_builtin_4096 | 109 |
| memmove_builtin_4096_misalign | 161 |
| memset_builtin_1048576 | 26245 |
| memset_builtin_1048576_offset | 26242 |
| memset_builtin_4096 | 105 |
| memset_builtin_4096_offset | 106 |

| Test Name | New | New (no-vec) | Old | Old (no-vec) |
| --- | --- | --- | --- | --- |
| memcpy_rust_1048576 | 47317 | 108806 | 46978 | 709112 |
| memcpy_rust_1048576_misalign | 78310 | 148440 | 56356 | 708612 |
| memcpy_rust_1048576_offset | 50758 | 109122 | 57212 | 708791 |
| memcpy_rust_4096 | 113 | 421 | 109 | 2766 |
| memcpy_rust_4096_misalign | 287 | 573 | 136 | 2768 |
| memcpy_rust_4096_offset | 121 | 426 | 136 | 2768 |
| memmove_rust_1048576 | 105688 | 109360 | 840238 | 630534 |
| memmove_rust_1048576_misalign | 151630 | 151128 | 840360 | 630725 |
| memmove_rust_4096 | 392 | 421 | 3288 | 2469 |
| memmove_rust_4096_misalign | 573 | 568 | 3288 | 2469 |
| memset_rust_1048576 | 26252 | 104967 | 26246 | 419879 |
| memset_rust_1048576_offset | 26257 | 104960 | 26247 | 419857 |
| memset_rust_4096 | 109 | 420 | 106 | 1658 |
| memset_rust_4096_offset | 114 | 422 | 106 | 1654 |

The short story is the results look similar to the ones observed above on x86_64.

  • memcpy and memset are largely the same, with nice improvements for non-vectorised code. Performance reaches that of the builtins in at least some cases.
  • memmove shows large improvements. Non-vectorised performance seems comparable, at least for these tests.
  • Misaligned accesses have a performance regression, though it's smaller than the improvement in memmove.

@nbdd0121 (Contributor, author)

I suppose I should conditionally select different versions of the code for the misaligned case. For architectures (e.g. x86/ARM) that have hardware misaligned access support, it might be better to just do misaligned loads instead of the bit shifts.
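
One possible shape for that selection (a hedged sketch; the gating and helper actually added to the PR may differ, and `read_unaligned` is used here purely for illustration, keeping in mind the debug-mode recursion caveat discussed earlier):

```rust
use core::mem::size_of;

/// On targets with fast hardware unaligned loads, read the source unaligned
/// and store word-aligned; other targets would keep the shift-and-combine loop.
/// `n` is assumed to be a multiple of the word size and `dest` word-aligned.
#[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
unsafe fn copy_forward_unaligned_src_sketch(dest: *mut u8, src: *const u8, n: usize) {
    let mut i = 0;
    while i < n {
        let word = (src.add(i) as *const usize).read_unaligned();
        (dest.add(i) as *mut usize).write(word);
        i += size_of::<usize>();
    }
}
```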

@nbdd0121 (Contributor, author)

@adamgemmell I have added a different code path for the misaligned case on x86/x64/aarch64. Can you do another bench run? Thanks!

@adamgemmell

Same machine, same branch base, same compiler (which I forgot to mention was 1.56.0-nightly (2faabf579 2021-07-27)). Master runs looked close to what I was getting previously.

| Test | With misaligned case | With MC (no-vec) |
| --- | --- | --- |
| memcpy_rust_1048576 | 46949 | 109190 |
| memcpy_rust_1048576_misalign | 55425 | 104213 |
| memcpy_rust_1048576_offset | 50890 | 109193 |
| memcpy_rust_4096 | 114 | 421 |
| memcpy_rust_4096_misalign | 144 | 398 |
| memcpy_rust_4096_offset | 123 | 427 |
| memmove_rust_1048576 | 104361 | 109690 |
| memmove_rust_1048576_misalign | 106661 | 105638 |
| memmove_rust_4096 | 393 | 421 |
| memmove_rust_4096_misalign | 411 | 415 |
| memset_rust_1048576 | 26254 | 52526 |
| memset_rust_1048576_offset | 26262 | 52510 |
| memset_rust_4096 | 110 | 220 |
| memset_rust_4096_offset | 114 | 222 |

Looks like a good change, really brings the misaligned numbers in line with the other tests.

There's a weird doubling of performance for the memset no-vec case on this machine. Not complaining but I kinda want to look more into it tomorrow. Couldn't replicate something similar on other machines I tried.

@Amanieu (Member) commented Aug 30, 2021

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores at [start] and [end - 4].
  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • For larger memcpy you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.
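
As an aside, a minimal sketch of the first overlapping-store trick described above (for a small memset); purely illustrative:

```rust
/// Set `n` bytes (4 <= n <= 8) with two possibly overlapping 4-byte stores:
/// one at the start of the range and one ending exactly at its end.
unsafe fn memset_small_sketch(dest: *mut u8, c: u8, n: usize) {
    debug_assert!((4..=8).contains(&n));
    let v = u32::from_ne_bytes([c; 4]);
    (dest as *mut u32).write_unaligned(v);
    (dest.add(n - 4) as *mut u32).write_unaligned(v);
}
```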

@nbdd0121 (Contributor, author)

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores at [start] and [end - 4].

Small memset/memcpy/memmove isn't particularly a concern. Sure, some improvements might be possible, but they are not as significant as improvements to large memset/memcpy. Plus, if the size is known, the compiler will inline these calls anyway.

  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.

I don't understand how overlapping is an issue for memcpy. Do you mean memmove? For the trick to work I believe the gap between src and dst must be small.

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

The tricks applied can slow down some processors while speeding up others. For example, the use of branches in small memset and memcpy will slow down cores with simple branch predictors, because there are many data-dependent branches; loops, on the other hand, are much easier to predict.

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.

The number of instructions drops from 8 byte-sized loads + 8 byte-sized stores per word copied to 1 word-sized load + 1 word-sized store + 3 arithmetic ops per word copied. LLVM actually wouldn't even unroll the loop, so loop overhead hurts the bytewise copy even further.

While some of the tricks you describe are interesting, my main focus is on RISC-V, which wouldn't benefit from those optimisations and might not even exercise those code paths. I'm afraid optimising for AArch64 is outside my capacity.

@nbdd0121 (Contributor, author)

Failed at testcrate/tests/cmp.rs:34:9. I am not sure I touched anything related to that between the two recent pushes. Might it be a change in a recent nightly?

@nbdd0121 (Contributor, author)

Tried to run the workflow with the current HEAD of master on my fork. Confirmed that it failed: https://github.com/nbdd0121/compiler-builtins/actions/runs/1184124369.

@Amanieu (Member) commented Aug 31, 2021

I had a look at the powerpc failure and it seems to be an LLVM bug (I'm still working on a minimal reproduction). In the meantime I think it's fine to merge this PR.

@Amanieu merged commit 461b5f8 into rust-lang:master on Aug 31, 2021
@Amanieu (Member) commented Aug 31, 2021

Published in compiler-builtins 0.1.50.

sethp added a commit to rustbox/esp-hal that referenced this pull request May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds frequent cache pressure for minimal wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
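
For readers coming from that commit: the override mechanism it relies on looks roughly like the following (hedged sketch; the esp-hal code and any SRAM `#[link_section]` placement are not reproduced here):

```rust
/// Because compiler-builtins' memcpy/memset are weak symbols, a crate can
/// provide its own strong, C-ABI definitions and the linker will prefer them.
/// Placement in SRAM would additionally need a target-specific #[link_section].
#[no_mangle]
pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
    dest
}
```
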
@RalfJung (Member) commented Jan 5, 2024

Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB.

They are definitely UB, and IMO this needs to be fixed. Having UB in our own runtime while we tell people "never ever have UB anywhere" is not a good look.
Tracked at #559.
