Improve performance of constructing ByteViews for small strings #6034
Comments
For anyone of the compiler persuasion, I think the code above is basically a form of loop unrolling (the loop is in …).
Another pattern that works well for small copies, and generates much smaller code than full specialization, is divide and conquer: copy as large a chunk as possible, followed by consecutively smaller chunks. It would look something like this on a simplified example:

pub fn make_view2(mut data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut tmp = [0_u8; 16];
    tmp[..4].copy_from_slice(&len.to_le_bytes());
    let mut i = 4_usize;
    if data.len() >= 8 {
        tmp[i..i + 8].copy_from_slice(&data[..8]);
        i += 8;
        data = &data[8..];
    }
    if data.len() >= 4 {
        // unsafe { std::intrinsics::assume(i + 4 <= tmp.len()); }
        tmp[i..i + 4].copy_from_slice(&data[..4]);
        i += 4;
        data = &data[4..];
    }
    if data.len() >= 2 {
        // unsafe { std::intrinsics::assume(i + 2 <= tmp.len()); }
        tmp[i..i + 2].copy_from_slice(&data[..2]);
        i += 2;
        data = &data[2..];
    }
    if data.len() >= 1 {
        // unsafe { std::intrinsics::assume(i < tmp.len()); }
        tmp[i] = data[0];
        i += 1;
        data = &data[1..];
    }
    u128::from_le_bytes(tmp)
}

This also avoids bounds checks on …
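An illustrative usage (not part of the original comment), showing that the low 4 bytes of the returned u128 hold the length and the string bytes follow:

// Hypothetical usage example for make_view2 above.
fn main() {
    let view = make_view2(b"hello", 0, 0);
    let bytes = view.to_le_bytes();
    // The first 4 bytes encode the length, the following bytes hold the data.
    assert_eq!(u32::from_le_bytes(bytes[..4].try_into().unwrap()), 5);
    assert_eq!(&bytes[4..9], b"hello");
}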
An update here: I benchmarked @jhorstmann's implementation through #6101, but the performance did not improve much. I dug slightly deeper into this and convinced myself that the overhead is the function call to … So I filed #6102 with the original implementation (with minor tweaks).
Just wondering whether this problem is due to some optimization or another reason, I tested some similar code in C++: https://godbolt.org/z/czePsTofc, and found the difference between …
Can you elaborate a bit more on this, @mapleFU? My understanding is that some string lengths are fast, e.g., 4 and 8, because they align with register widths, while other lengths, e.g., 7 and 11, need more instructions because they have to load more data and then shift to clear out the unused bits.
From https://godbolt.org/z/czePsTofc, the non-specialized version ends up calling into libc:

call memset@PLT
mov rdi, r14
mov rsi, rbx
mov rdx, r15
call memcpy@PLT

On the other side, makeInline2 generates a different code path depending on the bit width. For example:

.LBB1_2:
movzx esi, byte ptr [rdi]
xor edx, edx
shl rsi, 32
or rax, rsi
ret
.LBB1_3:
movzx esi, word ptr [rdi]
xor edx, edx
shl rsi, 32
or rax, rsi
ret

If branch prediction works well, the same function would be called again and again.
AFAIK, if the same instruction is used and no memory access crosses a cache line, the copy instruction should not suffer from unaligned access? (Maybe I'm wrong.) Updated: with the help of a compiler assume, the short makeInline might generate efficient code without memcpy, but clang doesn't apply that optimization: https://godbolt.org/z/YM58jhE5T
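For reference, a minimal sketch (an assumption, not code from this thread) of how that hint could be expressed on the Rust side using the stable std::hint::assert_unchecked, which plays the same role as the commented-out std::intrinsics::assume in the snippet above; whether the backend actually drops the memcpy call depends on the compiler version, as the godbolt links show:

use std::hint::assert_unchecked;

// Hypothetical helper: build the inline portion of a view for a short string,
// telling the optimizer that the length is bounded by the 12-byte inline area.
pub fn make_inline_hinted(data: &[u8]) -> u128 {
    let len = data.len();
    // SAFETY: the caller must only pass strings of at most 12 bytes.
    unsafe { assert_unchecked(len <= 12) };
    let mut view = [0_u8; 16];
    view[..4].copy_from_slice(&(len as u32).to_le_bytes());
    view[4..4 + len].copy_from_slice(data);
    u128::from_le_bytes(view)
}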
Oh, you're right here, since some bit widths will get shorter instruction sequences. Thanks for the idea!
Did you observe that branch prediction suffers on strings with varying lengths? That's very interesting! I would love to look at the benchmark code! I think the Rust implementation only has one conditional branch, i.e., …
That's a very cool idea! It would lead to cleaner and more efficient code, but it seems that Rust doesn't support that yet.
Yeah, I'm interested in this optimization, so I tried it on C++ code but found it may not do so well if the string length changes rapidly. Anyway, it's a bit faster than calling memcpy on Clang. And I just realized that maybe the performance for different string lengths would not differ so much because of store-buffer forwarding 🤔 Some analysis might be required here.

Maybe I'd try …
I've tried it in this (I'm not familiar with Rust, so maybe I made a mistake): https://godbolt.org/z/ejWG65bGf
The compiler needs …
Oh thanks, so it generates nearly the same code as the C++ switch: https://godbolt.org/z/c9Exx3cbd
I opened an issue in LLVM: llvm/llvm-project#106895
Part of #5374
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Note: This is copied from the wonderful description on #6031 from @XiangpengHao and @aoli-al
In our benchmark, every string is larger than 12 bytes, which means that when making the views we always fall into this branch. What is special about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline loading the values (i.e., it avoids calling into copy_non_overlapping).

Is that a big deal?

Unfortunately, yes. If you change the benchmark strings to something very small, e.g., "test", you should be able to see (using the same benchmark script above) that the performance of loading ViewArray drops significantly, and that doesn't make sense -- how could building StringViewArrays with shorter strings be so much slower than building them from longer strings?

The root cause is that when the string is short, we need to memcpy the bytes into the view. Since the compiler has no idea how long the string is, it cannot inline the copy and has to call memcpy, which is slow.
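To illustrate the two cases, here is a minimal sketch (an approximation, not the exact arrow-rs make_view code): the long-string branch copies a fixed 4-byte prefix that the compiler can inline, while the short-string branch copies a runtime-length slice that is lowered to a call to memcpy.

pub fn make_view_sketch(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut view = [0_u8; 16];
    view[..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        // Runtime-length copy: for short strings this becomes a memcpy call.
        view[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Fixed-size copies: the compiler inlines these as plain loads/stores.
        view[4..8].copy_from_slice(&data[..4]);
        view[8..12].copy_from_slice(&block_id.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(view)
}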
Describe the solution you'd like

I would like to improve the performance of StringView creation more.
Describe alternatives you've considered
Here's a special variant of make_view: you can replace this function with the following new implementation, and the performance improves. This new implementation makes 13 copies of make_view_inner; each copy has a specialized length known to the compiler (LEN is const), so the compiler can optimize each of the implementations as needed. Looking at the assembly code, there is indeed no call to memcpy.

(Special thanks to @aoli-al for triangulating the root cause and prototyping the fix.)
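The full implementation is not reproduced here; a minimal sketch of the specialization idea, with an assumed signature for make_view_inner (the actual helper in the proposal may differ), could look like:

// Dispatch on the runtime length to a const-generic helper, so every
// instantiation copies a compile-time-known number of bytes.
#[inline(always)]
fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view = [0_u8; 16];
    view[..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    // LEN is a constant here, so this copy unrolls into plain loads/stores
    // instead of a call to memcpy.
    view[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view)
}

// One match arm per possible inline length (0..=12), i.e. 13 specialized copies.
pub fn make_view_small(data: &[u8]) -> u128 {
    match data.len() {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => unreachable!("inline views only hold up to 12 bytes"),
    }
}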
While the above is somewhat non-intuitive, if it improves performance and is sufficiently commented, it would be a good thing to add.
Additional context