memcpy and co. are really unoptimized #339
Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs
I would caution against using Redox's (current) implementations. If you need an optimized implementation, you should probably just link against a library that provides one.
May I ask whether there is a way to specify other […] implementations? I've tried to modify the target triple to not include the name […] (see lines 143 to 148 in f4c7940).
Related: #253
@pca006132 I think that, similar to #365, we could also add optimized variants of […]. EDIT: It would seem that LLVM on AArch64 emits calls to […].
It is good to have optimized variants for some more targets. However, embedded systems may have different requirements than normal applications, and the best way would be to allow them to provide their own implementation if they have to. For example, for armv7a processors we could use a NEON-optimized memcpy when the FPU is enabled, but there may be bare-metal code running before the FPU is enabled, so those programs would want an unoptimized memcpy that does not use NEON. I think the easiest way is to use weak linkage, so users can replace the symbols with their own implementation if they want, such as the implementation from newlib. #378
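As a rough illustration of that approach: once the compiler-builtins symbols are weakly linked, a crate can supply its own strong definition, which wins at link time. This is only a minimal sketch; the byte-wise body is a placeholder, not a tuned implementation.

```rust
// Hypothetical user-provided override. Because the compiler-builtins
// symbol is weak, this strong definition replaces it at link time.
// The byte-wise loop is only a placeholder body.
#[no_mangle]
pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
    dest
}
```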
I've experimented a bit with faster handling of small copies:
/// Copy `count` bytes using at most two (possibly overlapping) unaligned
/// loads and stores. Intended for 1 <= count <= 32: a count of 0 would
/// still copy one byte here, and counts above 32 are left to the bulk path.
#[inline]
unsafe fn small_copy(dest: *mut u8, src: *const u8, count: usize) {
    if count < 2 {
        // 1 byte: plain byte copy.
        *dest = *src;
        return
    }
    if count <= 4 {
        // 2..=4 bytes: two overlapping u16 copies cover the whole range.
        let a = src.cast::<u16>().read_unaligned();
        let b = src.add(count - 2).cast::<u16>().read_unaligned();
        dest.cast::<u16>().write_unaligned(a);
        dest.add(count - 2).cast::<u16>().write_unaligned(b);
        return
    }
    if count <= 8 {
        // 5..=8 bytes: two overlapping u32 copies.
        let a = src.cast::<u32>().read_unaligned();
        let b = src.add(count - 4).cast::<u32>().read_unaligned();
        dest.cast::<u32>().write_unaligned(a);
        dest.add(count - 4).cast::<u32>().write_unaligned(b);
        return
    }
    if count <= 16 {
        // 9..=16 bytes: two overlapping u64 copies.
        let a = src.cast::<u64>().read_unaligned();
        let b = src.add(count - 8).cast::<u64>().read_unaligned();
        dest.cast::<u64>().write_unaligned(a);
        dest.add(count - 8).cast::<u64>().write_unaligned(b);
        return
    }
    if count <= 32 {
        // 17..=32 bytes: two overlapping u128 copies.
        let a = src.cast::<u128>().read_unaligned();
        let b = src.add(count - 16).cast::<u128>().read_unaligned();
        dest.cast::<u128>().write_unaligned(a);
        dest.add(count - 16).cast::<u128>().write_unaligned(b);
        return
    }
}
My main concern is that these optimizations can significantly bloat the code size.
iirc LLVM already emits small copy operations as a set of loads/stores instead of calling memcpy. Not sure what the threshold is, though.
AFAICT it only does so if it knows the size beforehand: https://rust.godbolt.org/z/KK7xj6Goh
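A small illustration of that behaviour (not from the thread): the constant-length copy below is lowered to a couple of wide loads/stores, while the runtime-length one becomes a call to memcpy, matching what the Godbolt link shows.

```rust
// Constant length: LLVM can lower this to a few wide load/store pairs.
pub unsafe fn copy_16(dest: *mut u8, src: *const u8) {
    core::ptr::copy_nonoverlapping(src, dest, 16);
}

// Runtime length: this is emitted as a call to memcpy.
pub unsafe fn copy_n(dest: *mut u8, src: *const u8, n: usize) {
    core::ptr::copy_nonoverlapping(src, dest, n);
}
```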
I think using the implementations from glibc is probably a better choice here; they are well optimized and already use things like vectorization.
Non-temporal stores bypass the cache, so it is expected that they are faster. Unfortunately this also means that unless you take care to ensure the target of the write wasn't in the cache in the first place, you can get inconsistent results: future reads may be served from the cache again, and if the cache line is marked as modified, I think flushing it would overwrite the non-temporal store.
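For reference, this is roughly what such a store looks like on x86_64. It is a sketch illustrating the mechanism being discussed, not code from the thread; the function name is made up, and `dst` must be 16-byte aligned.

```rust
// Minimal sketch of a non-temporal (cache-bypassing) 16-byte store.
#[cfg(target_arch = "x86_64")]
unsafe fn stream_fill_16(dst: *mut u8, byte: u8) {
    use core::arch::x86_64::{_mm_set1_epi8, _mm_sfence, _mm_stream_si128};
    let v = _mm_set1_epi8(byte as i8);
    _mm_stream_si128(dst.cast(), v); // store that bypasses the cache hierarchy
    _mm_sfence(); // order the streaming store with respect to later stores
}
```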
We could also add an optimized version using RISC-V's Vector Extension (aka the V extension). This idea was explored in this Reddit post and this writeup. It seems fairly fast and simple. If we do:
asm!(
    "vsetvli {n}, {count}, e8, m8, ta, ma",
    "vle8.v v0, ({src})",
    "vse8.v v0, ({dest})",
    ...
);
in a loop, we get assembly (Godbolt link) that is nearly identical to the handwritten memcpy example in the spec. I'm not super familiar with the V extension, but I imagine we could have similar implementations for the other memory functions.
It would be nice if the loop itself could be inside the asm block. Codegen backends which don't have native inline asm support (like my cg_clif) turn asm blocks into calls to a function defined in assembly compiled using an external assembler. Doing such calls in a loop has a decent amount of overhead compared to doing it once and having the loop inside the asm block.
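Putting those two points together, here is a rough sketch with the whole loop inside the assembly, so a backend without native inline-asm support only pays for a single call. The body follows the memcpy example in the V-extension spec; the `rvv_memcpy` symbol name is made up for illustration, and the target must have the V extension enabled for this to assemble.

```rust
core::arch::global_asm!(
    ".global rvv_memcpy",
    "rvv_memcpy:",
    "    mv      a3, a0",                  // keep dest (a0) for the return value
    "1:",
    "    vsetvli t0, a2, e8, m8, ta, ma",  // vl = min(remaining, VLMAX), byte elements
    "    vle8.v  v0, (a1)",                // load vl bytes from src
    "    add     a1, a1, t0",              // bump src
    "    sub     a2, a2, t0",              // decrement remaining count
    "    vse8.v  v0, (a3)",                // store vl bytes to dest
    "    add     a3, a3, t0",              // bump dest
    "    bnez    a2, 1b",                  // loop until nothing is left
    "    ret",
);

extern "C" {
    // Hypothetical Rust-side declaration of the symbol defined above.
    fn rvv_memcpy(dest: *mut u8, src: *const u8, count: usize) -> *mut u8;
}
```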
I've recently written some ARM memcpy implementations. Hopefully they can be PR'd into this repo at some point. The only catch is that they're handwritten […]; if anyone wants to get started, the code is over here […].
Addresses two classes of icache thrash present in the interrupt service path, e.g.:

```asm
    let mut prios = [0u128; 16];
40380d44: ec840513   addi  a0,s0,-312
40380d48: 10000613   li    a2,256
40380d4c: ec840b93   addi  s7,s0,-312
40380d50: 4581       li    a1,0
40380d52: 01c85097   auipc ra,0x1c85
40380d56: 11e080e7   jalr  286(ra) # 42005e70 <memset>
```

and

```asm
    prios
40380f9c: dc840513   addi  a0,s0,-568
40380fa0: ec840593   addi  a1,s0,-312
40380fa4: 10000613   li    a2,256
40380fa8: dc840493   addi  s1,s0,-568
40380fac: 01c85097   auipc ra,0x1c85
40380fb0: eae080e7   jalr  -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves dramatically with these routines 1) reimplemented for the esp32 RISC-V µarch and 2) placed in SRAM: `rustc` is quite happy to emit lots of implicit calls to these functions, and the versions that ship with compiler-builtins are [highly tuned] for other platforms.

It seems like the expectation is that the compiler-builtins versions are "reasonable defaults," and they are [weakly linked] specifically to allow the kind of domain-specific overrides used here. In the context of the 'c3, this ends up producing a fairly large implementation that adds a lot of frequent cache pressure for minimal wins:

```readelf
  Num:    Value  Size Type   Bind   Vis      Ndx Name
27071: 42005f72    22 FUNC   LOCAL  HIDDEN     3 memcpy
27072: 42005f88    22 FUNC   LOCAL  HIDDEN     3 memset
28853: 42005f9e   186 FUNC   LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
28854: 42006058   110 FUNC   LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned loads/stores across the instruction bus; at least in my testing this hasn't been a problem, because they are simply never invoked in that context.

Additionally, these are just about the simplest possible implementations, with word-sized copies being the only concession made to runtime performance. Even a small amount of additional effort would probably yield fairly massive wins, as three- or four-instruction hot loops like these are basically pathological for the 'c3's pipeline implementation, which seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too, as I would expect almost no one has a program whose runtime is dominated by these functions. Making these functions larger and more complex to eke out wins from architectural niches makes LLVM much less willing to inline them, costing additional function calls and preventing e.g. dead code elimination for always-aligned addresses or automatic loop unrolling, etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
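For concreteness, a hedged sketch of the kind of domain-specific override described here, with the routine placed in RAM via a `link_section` attribute. The `.rwtext` section name and the byte-wise body are illustrative assumptions, not the actual esp32-c3 implementation.

```rust
// Sketch of an override placed in SRAM rather than flash: the weak
// compiler-builtins symbol is replaced by this strong definition, and the
// linker script is assumed to map ".rwtext" to internal RAM.
#[no_mangle]
#[link_section = ".rwtext"]
pub unsafe extern "C" fn memset(dest: *mut u8, c: i32, n: usize) -> *mut u8 {
    let byte = c as u8;
    let mut i = 0;
    // Word-sized stores would be the obvious next step; a byte loop keeps
    // the sketch minimal.
    while i < n {
        *dest.add(i) = byte;
        i += 1;
    }
    dest
}
```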
For very small memcpys, could we not just use a couple of registers like LLVM does? That shouldn't affect binary size much, no?
I've seen the assembly of the memcpy here for both powerpc and wasm (where the one here seems to be used unless you target WASI), and in both cases really simplistic byte-wise loops are emitted. I haven't really done any benchmarks, but these likely perform worse than more optimized loops that work on 32 bits or so at a time.
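A hedged sketch of the kind of word-at-a-time loop suggested here: copy 32 bits per iteration and fall back to bytes for the tail. Unaligned accesses are handled via read_unaligned/write_unaligned, which may or may not be cheap on a given target; the function name is illustrative.

```rust
// Copy forward four bytes at a time, then finish the tail byte-wise.
unsafe fn copy_forward_u32(mut dest: *mut u8, mut src: *const u8, mut n: usize) {
    while n >= 4 {
        let word = src.cast::<u32>().read_unaligned();
        dest.cast::<u32>().write_unaligned(word);
        src = src.add(4);
        dest = dest.add(4);
        n -= 4;
    }
    while n > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }
}
```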