Add small-copy optimization for copy_from_slice #37573
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override) |
Thanks for the PR! To get a handle on the perf impact here, can you also run some benchmarks for memcpys of 1 to maybe 10 bytes in size? Presumably this punishes other small memcpys with an extra branch, but I'd be curious to see by how much |
I created a microbenchmark with the following results: Before:
After:
These were performed with the scaling governor set to The standard deviation estimates cannot be trusted; I get a difference of as much as 325 ns/iter between runs, where a deviation of +/- 4 is reported. To get some more real-world data, I ran Before:
After:
There are wins as big as 15% on some benchmarks, but regressions of a few percent as well (although that might be noise, given that the deviation estimates are too low). Also, keep in mind that this is just for x86_64 on Linux. |
This kind of issue is interesting, but ideally this memcpy should be so easily available to the optimizer that it can make this decision by itself. Might be worth investigating if there's anything standing in the way of that. |
I ran the microbenchmarks again 32 times and took the minimum: Before:
After:
@bluss: for the case of |
I don't know, I don't really like black box; with it, you get the most pessimistic optimization, and without it, the benchmark program might be compiled too optimistically (for exactly how a function is used in that executable only). Benchmarks from real programs such as your own are more interesting. I wanted to look at how gcc and clang optimize memcpy for short sizes (I think they do insert some inline code conditionally, for example when the inputs have constant sizes), but I didn't find any documentation or examples of that yet. |
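The measurement approach discussed above (repeat a microbenchmark and keep the minimum, with a black box hiding the copy size from the optimizer) might look roughly like the sketch below. The names `copy_n`, `min_time_of`, and `bench_copy` are hypothetical, and `std::hint::black_box` is used here as the stable counterpart of the black box mentioned in the thread. This is an illustration under those assumptions, not the benchmark that was actually run.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Copy the first `n` bytes of `src` into `dst` (hypothetical helper).
fn copy_n(dst: &mut [u8], src: &[u8], n: usize) {
    dst[..n].copy_from_slice(&src[..n]);
}

/// Run `f` `runs` times and return the fastest observed wall-clock time.
fn min_time_of<F: FnMut()>(runs: u32, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

/// Time `iters` copies of `n` bytes (n <= 16), keeping the minimum over 32 runs.
fn bench_copy(n: usize, iters: u32) -> Duration {
    let src = [1u8; 16];
    let mut dst = [0u8; 16];
    min_time_of(32, || {
        for _ in 0..iters {
            // `black_box` keeps the length opaque, so the compiler cannot
            // specialize the copy for one particular constant size.
            copy_n(black_box(&mut dst), black_box(&src), black_box(n));
        }
    })
}
```

Calling `bench_copy(1, 1_000_000)` and `bench_copy(8, 1_000_000)` gives one minimum per size. As noted above, the black box yields the most pessimistic code, so such numbers are best read next to timings from real programs.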
Yeah LLVM will certainly optimize a memcpy of length 1 (for small sizes) but my guess is that LLVM has lost track of the size so the variable length falls back to the memcpy symbol. Once that happens I'd definitely believe that a branch here is much more performant. Thanks for gathering the numbers @ruuda! The data looks somewhat inconclusive to me. Certainly faster (as expected) for one element, but sometimes noticeably slower for multiple elements? |
Yes, the overhead is much more than I expected. Calling memcpy involves doing two jumps to a dynamic address and all calling convention bookkeeping, I did not expect the extra branch to be so bad on top of that. But I’m not sure these microbenchmarks are realistic either. In my program the length of the slice was known at compile time, so at least in theory the compiler would be able to omit the memcpy. Do you have any idea what could cause LLVM to lose track of the length? |
Unfortunately there's not particularly anything specific that'd cause that. Just a general lack of inlining or otherwise interference would perhaps cause that over time. @rust-lang/libs thoughts about landing this given the benchmarks numbers? I'm slightly leaning towards no as this seems very much like "microbenchmark" territory, personally. |
I've definitely seen major wins in my own code by optimizing usage of One thing that might be interesting is to expand this to slices of length 8 or perhaps even 16, which should permit a copy using an unaligned load/store. i.e., Instead of |
To copy a fixed number of bytes (e.g. up to 16), we could check something like The only issue then is when Alternatively, we could sidestep the tradeoff for |
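The idea of handling an 8-byte slice with a single unaligned load and store could be sketched as follows. The function name `copy_with_word_fast_path` is hypothetical, and only the exact-8-byte case is special-cased here; the length checks proposed in the comments above are not recoverable from the page, so this is an assumption-laden illustration rather than the suggested patch.

```rust
use std::ptr;

/// Copy `src` into `dst`, using one unaligned 8-byte load and store when
/// both slices are exactly 8 bytes long (hypothetical helper).
fn copy_with_word_fast_path(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len(), "slice lengths must match");
    if src.len() == 8 {
        // SAFETY: both slices are exactly 8 bytes long, so reading and
        // writing one `u64` stays in bounds. The unaligned variants have no
        // alignment requirement, and a `&mut` slice cannot overlap `src`.
        unsafe {
            let word = ptr::read_unaligned(src.as_ptr() as *const u64);
            ptr::write_unaligned(dst.as_mut_ptr() as *mut u64, word);
        }
    } else {
        // Every other length takes the general path.
        dst.copy_from_slice(src);
    }
}
```

Whether the extra branch pays off depends on how often 8-byte copies occur; the measurements in this thread suggest verifying that per workload.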
The cursor / read impls do a bunch of indexing operations, so it may well be that which is an impediment to optimization, for example if bounds checks are not elided. It might be far fetched, but maybe something can be done there first? Then it seems good to either completely restrict this optimization to the I/O code, or at least start it there as a "prototype". ruuda, do you have a small example program that exhibits this problem? Then I could see if Cursor/Read can be tuned in any way. |
@bluss: I do not have a small test case at the moment, but there are the The program in which I originally found this is the |
Here is a minimal yet vaguely realistic example that suffers from this issue. It counts the number of bytes in a file excluding C-style comments.

    use std::fs::File;
    use std::io::{BufReader, Read};

    #[derive(Copy, Clone)]
    enum State {
        Outside,
        BeginSlash,
        Inside,
        EndAsterisk,
    }

    fn count_non_comment_bytes(fname: &str) -> std::io::Result<u32> {
        let mut input = BufReader::new(File::open(fname)?);
        let mut state = State::Outside;
        let mut buffer = [0u8];
        let mut non_comment_count = 0;
        while input.read(&mut buffer)? > 0 {
            match (state, buffer[0]) {
                (State::Outside, b'/') => state = State::BeginSlash,
                (State::Outside, _) => non_comment_count += 1,
                (State::BeginSlash, b'*') => state = State::Inside,
                (State::BeginSlash, _) => {
                    state = State::Outside;
                    non_comment_count += 2;
                }
                (State::Inside, b'*') => state = State::EndAsterisk,
                (State::Inside, _) => {},
                (State::EndAsterisk, b'/') => state = State::Outside,
                (State::EndAsterisk, _) => state = State::Inside,
            }
        }
        Ok(non_comment_count)
    }

    fn main() {
        for fname in std::env::args().skip(1) {
            match count_non_comment_bytes(&fname) {
                Ok(n) => println!("{}: {} non-comment bytes", fname, n),
                Err(err) => println!("error while processing {}: {:?}", fname, err),
            }
        }
    }

I compiled this with
Interestingly, if I change the example to use the |
Great. There is one bounds check from fill_buf that's not optimized out there, let's see what happens if it is removed. |
@ruuda hm I've always been under the assumption that if you're searching byte-by-byte through a file then |
@BurntSushi yeah it definitely makes sense to me that small memcpys are slower than inlining it. That comes at a cost of code size, though, and for me at least it seems odd where we'd put this optimization in the stack. For example many calling |
Eliminating the bounds check had no effect on this, at least it didn't change the way memcpy was called. |
This is not always possible or convenient. To give a concrete example, in the program that I am working on, there is a CRC-16 stored after a certain block of data (the size of which may only become known while reading it). It was very convenient to create a Also, yes, it would be possible to manually read into a buffer and keep an index into that, and on every read check whether the buffer needs to be refilled, and deal with the edge cases of reads straddling the buffer boundary. But that is all boilerplate, and it is exactly what
I share this concern. Would you be more comfortable with moving the optimization to the
I am not familiar with the LLVM optimizer details, but this does make sense to me: if the bounds check is there, then inside the branch bounds on the value are known, which is information that could be abused by the optimizer. |
LLVM should be seeing this if it inlined all the way to copy_from_slice. So maybe an inline is being refused somewhere? |
@arthurprs, no, it is inlining everything properly but still generating a call to
|
Interesting. The assembly shows a couple of bounds checks (that shouldn't really be necessary), so I guess the optimizer is getting confused somewhere. Edit: On the other hand, it uses a very specific instruction (setne) to set the memcpy length to 1, so the buffer length is correctly propagated. Why it didn't use a movb though, I have no idea. |
    // significant. If the element is big then the assignment is a memcopy
    // anyway.
    if self.len() == 1 {
        self[0] = src[0];
I bet you could even copy the `ptr::copy_nonoverlapping` call here and replace `self.len()` with `1`. If that works, then this is specializing for values of `self.len()` without it being known at compile-time.
That the overhead of calling `memcpy` is higher than what it does is scary. What has dynamic linking done now...
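To make the two variants concrete, here is a minimal sketch of both as free functions: the indexed assignment from the diff, and the reviewer's suggestion of keeping the `ptr::copy_nonoverlapping` call but passing a literal count of 1. The function names are hypothetical; the actual change sits inside the standard library's slice implementation.

```rust
use std::ptr;

/// Variant from the diff: a plain assignment when there is exactly one element.
fn copy_with_index_fast_path<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len(), "slice lengths must match");
    if dst.len() == 1 {
        dst[0] = src[0];
    } else {
        dst.copy_from_slice(src);
    }
}

/// Variant from the review comment: the same copy call, but with a literal
/// count so the compiler sees a constant length in this branch.
fn copy_with_const_len_fast_path<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len(), "slice lengths must match");
    if dst.len() == 1 {
        // SAFETY: both slices hold exactly one element, and a `&mut` slice
        // cannot overlap a shared slice.
        unsafe { ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), 1) };
    } else {
        dst.copy_from_slice(src);
    }
}
```

Both functions behave identically from the caller's point of view; which one produces better machine code would need to be checked in the generated assembly.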
@ruuda yeah I assumed that this sort of case would be handled by reading a bunch of memory (through the stack of read combinators) and then you'd deal with the chunk of memory at the end, processing it in an optimized fashion at that point. Although complicated to implement, I'd assume that it's going to be faster than byte-by-byte iteration no matter what level of optimizations we implement here (even assuming this lands). Do you have an example program perhaps though that I could test out this assumption on? I'd be more comfortable, yeah, putting this higher in the stack. I'm still a little skeptical that it's what you'd want to end up with in the long run. That is, it seems like the fastest solution would still not rely on this optimization at all. |
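For comparison, the "read a chunk, then process it in memory" approach described in this comment could look like the sketch below, applied to the comment-counting example from earlier in the thread. The function name `count_non_comment_bytes_chunked` and the 4096-byte chunk size are assumptions made for illustration; the state machine is copied unchanged from the earlier example.

```rust
use std::io::{self, Read};

#[derive(Copy, Clone)]
enum State {
    Outside,
    BeginSlash,
    Inside,
    EndAsterisk,
}

/// Count bytes outside C-style comments, reading the input in large chunks
/// instead of issuing one `read` call per byte.
fn count_non_comment_bytes_chunked<R: Read>(mut input: R) -> io::Result<u32> {
    let mut state = State::Outside;
    let mut chunk = [0u8; 4096]; // chunk size is an arbitrary choice
    let mut non_comment_count = 0u32;
    loop {
        let n = input.read(&mut chunk)?;
        if n == 0 {
            break; // end of input
        }
        // The state carries over between chunks, so a comment delimiter
        // split across a chunk boundary is still recognized.
        for &byte in &chunk[..n] {
            match (state, byte) {
                (State::Outside, b'/') => state = State::BeginSlash,
                (State::Outside, _) => non_comment_count += 1,
                (State::BeginSlash, b'*') => state = State::Inside,
                (State::BeginSlash, _) => {
                    state = State::Outside;
                    non_comment_count += 2;
                }
                (State::Inside, b'*') => state = State::EndAsterisk,
                (State::Inside, _) => {}
                (State::EndAsterisk, b'/') => state = State::Outside,
                (State::EndAsterisk, _) => state = State::Inside,
            }
        }
    }
    Ok(non_comment_count)
}
```

This works here because the state machine has no lookahead. Formats where a read may straddle the chunk boundary need extra bookkeeping, which is the boilerplate @ruuda refers to.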
The wins @ruuda posted in #37573 (comment) look pretty compelling to me, if it can be done without making other cases significantly worse. Better performance is good, even if it's ugly and shouldn't have to be written like this. If this lands the doc comment should explain more clearly that this is working around LLVM's inability to see the constant. We could maybe solve this problem generally by having LLVM emit the memcpy intrinsic by a custom symbol name that we intercept, and do this there. (Edit: rather we can sink this optimization into |
@brson Doing it any lower would make a lot of things worse, or at least harder to optimize. |
@brson note that from the previous numbers collected it appears that non-1-length copies are regressing due to this change. |
That'd be expected, there's strictly more conditional branches and icache waste. |
Remove one bounds check from BufReader Very minor thing. Otherwise the optimizer can't be sure that pos <= cap. Added a paranoid debug_assert to ensure correctness instead. CC rust-lang#37573
I moved the optimization into
Benchmark results for
These are very noisy, but there is still a clear win in a number of cases. See below for more benchmark results.
Of course, code specialized and optimized exactly for its use case is going to be faster. But I am hoping we can get 80% of the win with 20% (or ideally, 0%) of the effort on behalf of the user. Also, with “processing it in an optimized fashion”, keep in mind that even if you have a buffer it is not always possible to do better than reading byte by byte. In the byte count example I posted before you could do something like
What kind of example do you want? There is the toy example I posted earlier in this thread. For a more real-world example, take a look at the benchmarks of html5ever. I also did exactly this (replace most of Here are the timing results on my machine (lower is better):
So here indeed this patch (which requires no change to the code at all) is responsible for 80% of the win. A fairly invasive change to the entire codebase was slightly faster, but not by a lot.
That is true for programs that heavily use |
This looks like an LLVM bug to me: You see this in optimized output:
A memcpy where the length is either a constant or 0 should be optimized to a conditional load, methinks. |
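Written by hand, the transformation described here would turn a copy whose length is known to be 0 or 1 into a conditional load and store. The two hypothetical functions below show the general form and the hand-optimized form; they only illustrate the idea and make no claim about what LLVM actually emits.

```rust
/// General form: a slice copy whose length `n` is 0 or 1 at run time.
/// Panics if `n > 1`, because the slicing goes out of bounds.
fn copy_zero_or_one_memcpy(dst: &mut [u8; 1], src: &[u8; 1], n: usize) {
    dst[..n].copy_from_slice(&src[..n]);
}

/// Hand-written equivalent for `n` in {0, 1}: a conditional load and store.
/// The caller is assumed to uphold `n <= 1`; this is only checked in debug builds.
fn copy_zero_or_one_branch(dst: &mut [u8; 1], src: &[u8; 1], n: usize) {
    debug_assert!(n <= 1);
    if n != 0 {
        dst[0] = src[0];
    }
}
```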
@arielb1 Introducing conditionals is not exactly something LLVM does lightly - or do you mean in the target code generation? Still seems like a non-trivial optimization, even if a useful one. |
Filed bug https://llvm.org/bugs/show_bug.cgi?id=31001 - let's see how it goes. |
@ruuda What if you run the compare script from rustc-benchmarks? Is that an inappropriate test? |
The rustc-benchmarks compare script may show some results, but it only tests compile times, which are the runtimes of the compiler, so it is sort of a valid test. I wouldn't base anything of this scale on the results, though. |
@mrhota I don’t know, I’ll see if I can run that and get some results. Note that now this PR should only affect code that reads single bytes, which is a very specific kind of IO. Perhaps the parser does that (I don’t know), but then again only a tiny amount of time is spent parsing, so I do not expect to see a significant difference in overall timings. |
During benchmarking, I found that one of my programs spent between 5 and 10 percent of the time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcopy in io::Cursor::read(). Doing a manual copy if only one byte is requested can speed things up significantly. For my program, this reduced the running time by 20%. Why special-case only a single byte, and not a "small" slice in general? I tried doing this for slices of at most 64 bytes and of at most 8 bytes. In both cases my test program was significantly slower.
Ultimately copy_from_slice is the bottleneck, not io::Cursor::read. It might be worthwhile to move the check here, so more places can benefit from it.
Based on the discussion in rust-lang#37573, it is likely better to keep this limited to std::io, instead of modifying a function which users expect to be a memcpy.
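Restricted to `std::io`, the special case could be expressed in a `Read` implementation for a byte-slice reader, roughly as sketched below. `SliceReader` is a hypothetical stand-in type written for illustration (the real change targets the standard library's own `Read` implementation for byte slices), so treat the details as assumptions.

```rust
use std::cmp;
use std::io::{self, Read};

/// Hypothetical reader over an in-memory byte slice.
struct SliceReader<'a> {
    remaining: &'a [u8],
}

impl<'a> Read for SliceReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let amt = cmp::min(buf.len(), self.remaining.len());
        let (head, tail) = self.remaining.split_at(amt);

        // Single-byte reads use a plain assignment, which avoids the call
        // into `memcpy`; all other sizes use the general slice copy.
        if amt == 1 {
            buf[0] = head[0];
        } else {
            buf[..amt].copy_from_slice(head);
        }

        self.remaining = tail;
        Ok(amt)
    }
}
```

Keeping the branch here leaves `copy_from_slice` itself a plain memcpy, which addresses the concern about changing a function that users expect to have exactly that behavior.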
I rebased this on top of 8e373b4 and ran rustc-benchmarks (scaling governor set to powersave to minimize variance):
It looks like compiling a few crates got a percent faster, and for the others there is no significant effect. That is expected, because I don’t think rustc heavily reads single bytes. With no activity on the LLVM bug for three weeks, what do you all think about merging this in the meantime? |
Oh sorry I think I missed the update to push this into @bors: r+ |
📌 Commit 3be2c3b has been approved by |
Add small-copy optimization for copy_from_slice

## Summary

During benchmarking, I found that one of my programs spent between 5 and 10 percent of the time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcopy. Doing a manual copy if the slice contains only one element can speed things up significantly. For my program, this reduced the running time by 20%.

## Background

I am optimizing a program that relies heavily on reading a single byte at a time. To avoid IO overhead, I read all data into a vector once, and then I use a `Cursor` around that vector to read from. During profiling, I noticed that `__memmove_avx_unaligned_erms` was hot, taking up 7.3% of the running time. It turns out that these were caused by calls to `Cursor::read()`, which calls `<&[u8] as Read>::read()`, which calls `&[T]::copy_from_slice()`, which calls `ptr::copy_nonoverlapping()`. This one is implemented as a memcopy. Copying a single byte with a memcopy is very wasteful, because (at least on my platform) it involves calling `memcpy` in libc. This is an indirect call when libc is linked dynamically, and furthermore `memcpy` is optimized for copying large amounts of data at the cost of a bit of overhead for small copies.

## Benchmarks

Before I made this change, `perf` reported the following for my program. I only included the relevant functions, and how they rank. (This is on a different machine than where I ran the original benchmarks. It has an older CPU, so `__memmove_sse2_unaligned_erms` is called instead of `__memmove_avx_unaligned_erms`.)

```
#3   5.47%  bench_decode  libc-2.24.so  [.] __memmove_sse2_unaligned_erms
#5   1.67%  bench_decode  libc-2.24.so  [.] memcpy@GLIBC_2.2.5
#6   1.51%  bench_decode  bench_decode  [.] memcpy@plt
```

`memcpy` is eating up 8.65% of the total running time, and the overhead of dispatching to a specialized fast copy function (`memcpy@GLIBC` showing up) is clearly visible. The price of dynamic linking (`memcpy@plt` showing up) is visible too.

After this change, this is what `perf` reports:

```
#5   0.33%  bench_decode  libc-2.24.so  [.] __memmove_sse2_unaligned_erms
#14  0.01%  bench_decode  libc-2.24.so  [.] memcpy@GLIBC_2.2.5
```

Now only 0.34% of the running time is spent on memcopies. The dynamic linking overhead is not significant at all any more.

To add some more data, my program generates timing results for the operation in its main loop. These are the timings before and after the change:

| Time before   | Time after    | After/Before |
|---------------|---------------|--------------|
| 29.8 ± 0.8 ns | 23.6 ± 0.5 ns | 0.79 ± 0.03  |

The time is basically the total running time divided by a constant; the actual numbers are not important. This change reduced the total running time by 21% (much more than the original 9% spent on memmoves, likely because the CPU is stalling a lot less because data dependencies are more transparent). Of course YMMV and for most programs this will not matter at all. But when it does, the gains can be significant!

## Alternatives

* At first I implemented this in `io::Cursor`. I moved it to `&[T]::copy_from_slice()` instead, but this might be too intrusive, especially because it applies to all `T`, not just `u8`. To restrict this to `io::Read`, `<&[u8] as Read>::read()` is probably the best place.
* I tried copying bytes in a loop up to 64 or 8 bytes before calling `Read::read`, but both resulted in about a 20% slowdown instead of speedup.
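The access pattern from the Background section (all data in a vector, wrapped in a `Cursor`, read one byte at a time) can be reproduced with a small sketch like the one below. The function `sum_bytes_one_at_a_time` is a hypothetical stand-in for the real decoder, which is not shown in this thread.

```rust
use std::io::{self, Cursor, Read};

/// Read `data` through a `Cursor` one byte at a time and sum the bytes.
/// Each `read` call targets a one-element slice, which is the case the
/// optimization in this pull request is aimed at.
fn sum_bytes_one_at_a_time(data: Vec<u8>) -> io::Result<u64> {
    let mut cursor = Cursor::new(data);
    let mut byte = [0u8; 1];
    let mut sum = 0u64;
    // `read` returns 0 once the cursor has reached the end of the vector.
    while cursor.read(&mut byte)? == 1 {
        sum += u64::from(byte[0]);
    }
    Ok(sum)
}
```

Profiling a loop like this over a large vector is one way to check whether a given toolchain still emits a `memcpy` call for the single-byte read.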