Align destination in x86_64's mem* instructions. #474
Conversation
The PowerPC test failure is odd; I don't think it should be affected by any changes in this PR?
Force-pushed from 7b1d90f to db0ca0c.
I can't reproduce the PowerPC64 failure. I think it's a bug in LLVM that may have been fixed by now. I run the tests like so:
The versions of rustc, gcc and QEMU are:
EDIT: updated to
As far as I understand it, the current implementation is actually faster for memcpy and memset on Intel. For backwards memmove, using unaligned bytes is indeed slower, so we already use 8-byte-aligned operations. See https://docs.google.com/spreadsheets/d/1H-ubR-xCJWomUYDI9D2JH19BNUD7R9kfkl_OHSv6vMk/edit, which is linked from #365.
copy_backwards does not align to 8 bytes before executing the qword copy. From what I understand, recent Intel and Zen 3 processors have the ERMSB feature, which makes using only rep movsb competitive. I'll see if I can do a benchmark on an Intel processor which does not have the ERMSB feature.
I don't seem to have any Intel CPU without ERMSB, but I figured I'd do a benchmark on one anyway. Benchmark results for an i5-5287U (MacBook running macOS):
Overall LGTM. However, should we add some unit tests for rep_param and rep_param_rev? I would just want sanity checks that:
pre_byte_count + 8*qword_count + byte_count == count
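Such a check might look something like this (a sketch, not code from the PR; it assumes rep_param and rep_param_rev are visible to a test module, and the fabricated pointers are never dereferenced):

#[test]
fn rep_param_counts_add_up() {
    for addr in 0usize..16 {
        for count in 0usize..64 {
            // The pointer is only used for its address, never dereferenced.
            let dest = addr as *mut u8;

            let (pre, qwords, post) = rep_param(dest, count);
            assert_eq!(pre + 8 * qwords + post, count);
            // If any qwords are copied, the qword phase starts on an 8-byte boundary.
            assert!(qwords == 0 || (addr + pre) % 8 == 0);

            let (pre, qwords, post) = rep_param_rev(dest, count);
            assert_eq!(pre + 8 * qwords + post, count);
            // Going backwards, the qword phase ends on an 8-byte boundary.
            assert!(qwords == 0 || (addr + count - pre) % 8 == 0);
        }
    }
}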
src/mem/x86_64.rs (Outdated)
"rep stosb", | ||
inout("ecx") pre_byte_count => _, | ||
inout("rdi") dest => dest, | ||
in("al") c, |
Instead of passing in different values for al and rax, we can just pull the multiplication to the top and pass the same rax value for each asm block. See: https://rust.godbolt.org/z/9hrv8eq1G. This seems to make it easier for the compiler to combine these blocks.
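A minimal sketch of what that could look like for the first block, reusing the operand names from this PR (only al matters for rep stosb, so handing it the full broadcast rax value is harmless):

// Sketch of the suggestion above; pre_byte_count, dest and c come from set_bytes.
let c = c as u64 * 0x0101_0101_0101_0101; // broadcast the byte into every lane of rax
asm!(
    "rep stosb",
    inout("ecx") pre_byte_count => _,
    inout("rdi") dest => dest,
    in("rax") c, // the same rax value is reused by the following rep stosq block
    options(nostack, preserves_flags)
);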
src/mem/x86_64.rs (Outdated)
pub unsafe fn copy_forward(mut dest: *mut u8, mut src: *const u8, count: usize) {
    let (pre_byte_count, qword_count, byte_count) = rep_param(dest, count);
    // Separating the blocks gives the compiler more freedom to reorder instructions.
    // It also allows us to trivially skip the rep movsb, which is faster when memcpying
    // aligned data.
    if pre_byte_count > 0 {
        asm!(
            "rep movsb",
            inout("ecx") pre_byte_count => _,
            inout("rdi") dest => dest,
            inout("rsi") src => src,
            options(nostack, preserves_flags)
        );
    }
    asm!(
        "rep movsq",
        inout("rcx") qword_count => _,
        inout("rdi") dest => dest,
        inout("rsi") src => src,
        options(nostack, preserves_flags)
    );
    if byte_count > 0 {
        asm!(
            "rep movsb",
            inout("ecx") byte_count => _,
            inout("rdi") dest => _,
            inout("rsi") src => _,
            options(nostack, preserves_flags)
        );
    }
Assembly looks reasonable here: https://rust.godbolt.org/z/Ejd8Kv6rb
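If it helps, a quick way to exercise the snippet above from a test (hypothetical test, assuming copy_forward is reachable; the buffers do not overlap, so a forward copy is valid):

#[test]
fn copy_forward_copies_every_byte() {
    let src: Vec<u8> = (0u8..=255).cycle().take(1000).collect();
    let mut dst = vec![0u8; 1000];
    // Safety: both buffers are valid for 1000 bytes and do not overlap.
    unsafe { copy_forward(dst.as_mut_ptr(), src.as_ptr(), src.len()) };
    assert_eq!(src, dst);
}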
src/mem/x86_64.rs (Outdated)
/// Determine optimal parameters for a `rep` instruction.
fn rep_param(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
    // Unaligned writes are still slow on modern processors, so align the destination address.
    let pre_byte_count = ((8 - (dest as usize & 0b111)) & 0b111).min(count);
    count -= pre_byte_count;
    let qword_count = count >> 3;
    let byte_count = count & 0b111;
    (pre_byte_count, qword_count, byte_count)
}

/// Determine optimal parameters for a reverse `rep` instruction (i.e. direction bit is set).
fn rep_param_rev(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
    // Unaligned writes are still slow on modern processors, so align the destination address.
    let pre_byte_count = ((dest as usize + count) & 0b111).min(count);
    count -= pre_byte_count;
    let qword_count = count >> 3;
    let byte_count = count & 0b111;
    (pre_byte_count, qword_count, byte_count)
}
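As a concrete (made-up) example of the split: with a destination address ending in 0x5 and count == 100, the head is 3 bytes, then 12 qwords, then 1 tail byte, and rep_param_rev sees the same 100 bytes from the other end.

// Hypothetical values, assuming the helpers above; the pointer is never dereferenced.
let dest = 0x1005 as *mut u8; // dest % 8 == 5
assert_eq!(rep_param(dest, 100), (3, 12, 1));     // 3 + 8*12 + 1 == 100
assert_eq!(rep_param_rev(dest, 100), (1, 12, 3)); // the same split, seen from the end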
I'm confused why we need two of these functions; can't we just have one rep_param function and use the output in reverse order for copy_backward? If we did that, it might be worth calling these: before_byte_count, qword_count, after_byte_count.
I actually just didn't think of that 😅
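For what it's worth, the swap the reviewer describes could be as small as this (a sketch with made-up names, not code from the PR): walking backwards you consume the tail bytes first, then the qwords, then the head bytes, so the forward split can simply be reversed.

// Sketch: reuse rep_param for the backward direction by swapping the two byte phases.
fn rep_param_backward(dest: *mut u8, count: usize) -> (usize, usize, usize) {
    let (before_byte_count, qword_count, after_byte_count) = rep_param(dest, count);
    (after_byte_count, qword_count, before_byte_count)
}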
src/mem/x86_64.rs (Outdated)
let (pre_byte_count, qword_count, byte_count) = rep_param_rev(dest, count);
// We can't separate this block due to std/cld
asm!(
    "std",
    "rep movsb",
    "sub rsi, 7",
    "sub rdi, 7",
    "mov rcx, {qword_count}",
    "rep movsq",
    "add rsi, 7",
    "add rdi, 7",
    "mov ecx, {byte_count:e}",
    "rep movsb",
    "cld",
    byte_count = in(reg) byte_count,
    qword_count = in(reg) qword_count,
    inout("ecx") pre_byte_count => _,
    inout("rdi") dest.add(count - 1) => _,
    inout("rsi") src.add(count - 1) => _,
    // We modify flags, but we restore it afterwards
    options(nostack, preserves_flags)
ASM looks reasonable: https://rust.godbolt.org/z/EaGe1vM5b
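An overlap-focused smoke test for the backward copy might look like this (hypothetical, assuming copy_backward is reachable; dest sits above src inside the same buffer, which is exactly the case this function exists for):

#[test]
fn copy_backward_handles_overlap() {
    let mut buf: Vec<u8> = (0u8..100).collect();
    let expected: Vec<u8> = buf[..90].to_vec();
    unsafe {
        let base = buf.as_mut_ptr();
        // dest = buf[10..], src = buf[..90]: overlapping, with dest above src.
        copy_backward(base.add(10), base as *const u8, 90);
    }
    assert_eq!(&buf[10..], &expected[..]);
}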
"addq $7, %rdi", | ||
"addq $7, %rsi", | ||
"repe movsb (%rsi), (%rdi)", | ||
"rep movsb", |
Do we want to skip the rep movsb if the count is zero (like we do for the other functions)?
rust-lang/rust#83387 makes the minimum LLVM version 10, so I think this is fine. Can we have the switch to Intel syntax be in a standalone commit/PR? We should switch all the asm to Intel syntax, rather than just the functions in this PR.
Where should I put them? Putting them directly in
src/mem/x86_64.rs (Outdated)
// Separating the blocks gives the compiler more freedom to reorder instructions.
// It also allows us to trivially skip the rep stosb, which is faster when memcpying
// aligned data.
if pre_byte_count > 0 {
I would expect a rep stosb with an ecx value of 0 to act like a no-op anyway. Is there a perf benefit to keeping the if here?
There is a measurable benefit for memcpy_rust_4096 on my machine at least:
with branch:
test memcpy_rust_1048576 ... bench: 53,173 ns/iter (+/- 644) = 19720 MB/s
test memcpy_rust_1048576_misalign ... bench: 58,352 ns/iter (+/- 5,939) = 17969 MB/s
test memcpy_rust_1048576_offset ... bench: 52,561 ns/iter (+/- 1,950) = 19949 MB/s
test memcpy_rust_4096 ... bench: 84 ns/iter (+/- 20) = 48761 MB/s
test memcpy_rust_4096_misalign ... bench: 96 ns/iter (+/- 2) = 42666 MB/s
test memcpy_rust_4096_offset ... bench: 97 ns/iter (+/- 0) = 42226 MB/s
without branch:
test memcpy_rust_1048576 ... bench: 55,051 ns/iter (+/- 4,696) = 19047 MB/s
test memcpy_rust_1048576_misalign ... bench: 57,791 ns/iter (+/- 545) = 18144 MB/s
test memcpy_rust_1048576_offset ... bench: 53,902 ns/iter (+/- 1,893) = 19453 MB/s
test memcpy_rust_4096 ... bench: 89 ns/iter (+/- 0) = 46022 MB/s
test memcpy_rust_4096_misalign ... bench: 97 ns/iter (+/- 1) = 42226 MB/s
test memcpy_rust_4096_offset ... bench: 97 ns/iter (+/- 0) = 42226 MB/s
(Ditto for memset)
It probably makes more sense to leave it out though.
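Put together with the earlier rax suggestion, a branch-free set_bytes would then look roughly like this (a sketch of the shape being discussed, not necessarily the exact code that landed; it assumes rep_param from this PR and core::arch::asm):

use core::arch::asm;

#[inline(always)]
pub unsafe fn set_bytes(mut dest: *mut u8, c: u8, count: usize) {
    // Broadcast the byte so the same rax value feeds stosb and stosq alike.
    let c = c as u64 * 0x0101_0101_0101_0101;
    let (pre_byte_count, qword_count, byte_count) = rep_param(dest, count);
    // No branches: rep stosb/stosq with a zero count is simply a no-op.
    asm!(
        "rep stosb",
        inout("ecx") pre_byte_count => _,
        inout("rdi") dest => dest,
        in("rax") c,
        options(nostack, preserves_flags)
    );
    asm!(
        "rep stosq",
        inout("rcx") qword_count => _,
        inout("rdi") dest => dest,
        in("rax") c,
        options(nostack, preserves_flags)
    );
    asm!(
        "rep stosb",
        inout("ecx") byte_count => _,
        inout("rdi") dest => _,
        in("rax") c,
        options(nostack, preserves_flags)
    );
}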
LGTM!
The PowerPC failure is fixed in CI; you can rebase onto the latest master.
While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.
There is currently no measurable performance difference in benchmarks, but it will likely make a difference in real workloads.
Force-pushed from f03ed0e to ef37a23.
While the branches are measurably faster for older CPUs, removing them keeps the code smaller and is likely more beneficial for newer CPUs.
I don't know if LLVM 9 is still supported. I've used Intel syntax anyway since it's more readable IMO.
Benchmark results on a Ryzen 2700X, comparing master, x86_64-mem-align-dest and builtin: