Faster memchr, memchr2 and memchr3 in generic version #151
Conversation
Force-pushed from 3f70ca9 to 08f82c9.
You'll want to use rebar. From the root of this repo, on the current master, build the benchmark engine and measure a baseline. Then check out your PR, rebuild the benchmark engine, and measure your change. Then you can diff the two sets of results.
Add a
So I see a big improvement in some cases, but it looks like there's a smaller regression in others.
I think this generally LGTM with one little nit. I do think the regressions warrant some investigation. I don't know when I'll have time to do it, but if you wanted to do an analysis, I'd be happy to read it. I'm inclined to accept the regression given the throughput gains, but it would be nice to spend at least some time trying to eliminate the regression (or reduce it).
Here are my guesses. I ran the benchmarks several times on my Apple M1 laptop, with binaries properly built for arm64. They show a consistent regression on one benchmark, which makes sense: it searches for a space, and in that haystack the match tends to appear within the first few bytes, so the new code often processes eight bytes at once with shifts and such instead of finding the character on roughly the fourth iteration of the old byte-by-byte loop.
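For intuition, here is a toy illustration of that trade-off (the example text is assumed here, not the benchmark's actual haystack):

```rust
fn main() {
    // When the needle tends to appear within the first few bytes, a
    // byte-at-a-time loop exits early...
    let haystack = b"the quick brown fox";
    let comparisons = haystack.iter().take_while(|&&b| b != b' ').count() + 1;
    assert_eq!(comparisons, 4); // the space is found on the 4th comparison
    // ...while the chunk-at-a-time path still loads and post-processes all
    // eight bytes of "the quic" (XOR, subtract, AND, test, trailing_zeros)
    // before it can report that same space at index 3.
}
```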
Force-pushed from 9d0ab64 to 885bbc0.
Updated the comment above. Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte by byte.
The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short. Currently the SIMD-optimized `memchr` processes bytes one by one when the string length is shorter than a SIMD register. I suspect it can be made faster if we take the bytes that do not fit into a SIMD register in `usize`-sized chunks and process them with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked at the generic implementation hoping to reuse it in the SIMD-optimized version, but there was nothing to reuse. So here it is.
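For readers skimming this thread, here is a minimal, self-contained sketch of the idea (names and structure are illustrative, not the crate's actual code; only `lowest_zero_byte` matches a name that appears in the real source): find the matching byte inside a `usize` chunk with a handful of bit operations instead of a per-byte loop, and handle the final partial chunk with one overlapping unaligned read, as described in the comment above.

```rust
const W: usize = core::mem::size_of::<usize>();
const LO: usize = usize::MAX / 0xFF; // 0x0101...01 (a 1 in every byte)
const HI: usize = LO << 7;           // 0x8080...80 (high bit of every byte)

/// Index of the lowest zero byte of `x`, if any. Classic SWAR trick:
/// `(x - LO) & !x & HI` is nonzero exactly when `x` contains a zero byte,
/// and its least significant set bit falls inside the lowest zero byte
/// (higher bits may be spurious, but `trailing_zeros` never sees them).
fn lowest_zero_byte(x: usize) -> Option<usize> {
    let mask = x.wrapping_sub(LO) & !x & HI;
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros() as usize / 8)
    }
}

/// Load `W` bytes as a little-endian word, so byte 0 of memory becomes the
/// least significant byte regardless of the host's endianness.
fn load_le(bytes: &[u8]) -> usize {
    let mut buf = [0u8; W];
    buf.copy_from_slice(&bytes[..W]);
    usize::from_le_bytes(buf)
}

/// Forward search in the spirit of the generic ("all") implementation.
fn find(haystack: &[u8], needle: u8) -> Option<usize> {
    if haystack.len() < W {
        // The real code has a dedicated short-input path; a plain loop here.
        return haystack.iter().position(|&b| b == needle);
    }
    let repeated = LO * needle as usize; // the needle copied into every byte
    let mut i = 0;
    while i + W <= haystack.len() {
        // XOR turns bytes equal to `needle` into zero bytes.
        if let Some(j) = lowest_zero_byte(load_le(&haystack[i..]) ^ repeated) {
            return Some(i + j);
        }
        i += W;
    }
    // Tail: one unaligned, overlapping read of the last W bytes instead of a
    // byte-by-byte loop. Bytes in the overlap were already rejected above, so
    // any zero byte found here is a genuinely new match.
    let start = haystack.len() - W;
    lowest_zero_byte(load_le(&haystack[start..]) ^ repeated).map(|j| start + j)
}

fn main() {
    assert_eq!(find(b"hello, world!", b'w'), Some(7));
    assert_eq!(find(b"hello, world!", b'z'), None);
    assert_eq!(find(b"abc", b'b'), Some(1));
}
```

Using `from_le_bytes` in this sketch keeps byte indices aligned with memory order on any target; the real implementation reads native-endian words for speed and converts the index instead, which is the detail that later broke big endian (see below).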
Force-pushed from 885bbc0 to ccb1327.
With the updated PR, I re-ran the benchmarks on my usual machine and on my M2 mac mini. I think this is on balance a good change. I'm still a little worried about the regressions, but I generally side toward throughput gains.
Nice, thank you!
This PR is now on crates.io.
Looks like this broke all big endian targets. See #152.
Unbelievably, some of the steps in the main CI configuration were using the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter will interpolate to `cross` for targets requiring cross compilation. And since all big endian targets are only tested via cross compilation, our CI was not testing big endian at all (beyond testing that compilation succeeded). This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152
This PR was reverted in #153, and the revert has been released on crates.io.
Resubmit of PR BurntSushi#151. That PR was reverted because it broke the big endian implementation and CI did not catch it (see the revert PR BurntSushi#153 for details). Andrew, thank you for the new test cases, which made it easy to fix the issue. The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }
@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short. Currently the SIMD-optimized `memchr` processes bytes one by one when the string length is shorter than a SIMD register. I suspect it can be made faster if we take the bytes that do not fit into a SIMD register in `usize`-sized chunks and process them with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked at the generic implementation hoping to reuse it in the SIMD-optimized version, but there was nothing to reuse. So here it is.
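To see why the fix mirrors the index on big endian, here is a small self-contained illustration (not crate code): a native-endian word read puts the first byte in memory at the most significant end on big-endian targets, so a byte index counted from the word's least significant byte has to be flipped to recover a memory offset.

```rust
fn main() {
    const USIZE_BYTES: usize = core::mem::size_of::<usize>();
    // A needle byte sitting at memory offset 2 of one chunk.
    let mut mem = [b'a'; USIZE_BYTES];
    mem[2] = b'x';
    // The generic implementation reads the chunk as a native-endian word.
    let word = usize::from_ne_bytes(mem);
    // Index of b'x' counted from the word's least significant byte, which is
    // what the zero-byte helpers effectively report.
    let idx_from_lsb = (0..USIZE_BYTES)
        .position(|i| (word >> (8 * i)) as u8 == b'x')
        .unwrap();
    if cfg!(target_endian = "little") {
        // Little endian: the LSB is also the first byte in memory, so the
        // index can be used directly.
        assert_eq!(idx_from_lsb, 2);
    } else {
        // Big endian: the LSB is the *last* byte in memory, so the index must
        // be mirrored, exactly like the `USIZE_BYTES - 1 - ...` adjustment in
        // the fix quoted above.
        assert_eq!(USIZE_BYTES - 1 - idx_from_lsb, 2);
    }
}
```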
Resubmitted as #154.
The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short. Currently the SIMD-optimized `memchr` processes bytes one by one when the string length is shorter than a SIMD register. I suspect it can be made faster if we take the bytes that do not fit into a SIMD register in `usize`-sized chunks and process them with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked at the generic implementation hoping to reuse it in the SIMD-optimized version, but there was nothing to reuse. So here it is.

Any suggestions on how to check whether this PR makes it faster? In any case, it should not be slower.