-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use stable SIMD intrinsics with runtime detection #8
Conversation
Switch to compile-time cfg for just x86/x86-64 and runtime detection for SSE 4.2. Also fix no_std fallbacks, including for detection (without `use_std` CPU features can't be detected at runtime, but it will be still possible to enable pcmp with explicit `-C target-feature=+sse4.2`).
Hmm, I didn't observe these CI errors locally, I'll look into it. In the meanwhile, I noticed that intrinsic versions gets quite a bit slower even on the same compiler/machine, and found what I believe to be a reason in the generated assembly and raised upstream to Rust: rust-lang/rust#54353 |
It should be compilation attribute to avoid that block altogether.
This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)
@bluss Okay, given some information given in that thread, had to change the code more significantly to take advantage of inlining, but at least performance is now on par (sometimes faster / sometimes slower, probably just within noise margin) with previous asm version and not significantly slower as before. |
Nice work, thanks a lot for working on this. Feel free to ping me, since I can be slow |
@bluss I was thinking of pinging you but saw that you didn't have any activity on Github in September, assumed you have taken some time off and decided to just wait and see. Good to see you back, let me know if any changes required (although now I'll be away for a week myself). |
Just to clarify - did you mean you're going to find time to review this or that you want me to do some more changes and ping afterwards? Aside from (suspiciously) failing Travis, this PR is ready as-is. |
/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU. | ||
pub fn is_supported() -> bool { | ||
if cfg!(feature = "use_std") { | ||
is_x86_feature_detected!("sse4.2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, what's the overhead of such a check and how is it cached?
We should put in some kind of compile-time override here so that we can test both implementations in travis and locally easily. A compile time environment variable like RUST_TWOWAY_DISABLE_SIMD? Something for testing mainly, but I guess it can be used for debugging in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to compile as is with no std. This does:
/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU.
pub fn is_supported() -> bool {
#[cfg(feature = "use_std")]
return is_x86_feature_detected!("sse4.2");
#[cfg(not(feature = "use_std"))]
return cfg!(target_feature = "sse4.2");
}
@@ -11,7 +10,7 @@ use std::usize; | |||
extern crate memchr; | |||
|
|||
mod tw; | |||
#[cfg(feature = "pcmp")] | |||
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | |||
pub mod pcmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mod should now preferably not be public. We need some way to benchmark it still, or maybe not? Just use the simd enable/disable override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so too, but noticed you're exposing other algorithms as public modules and thought it's intentional and has to be preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to think over it again now that it's going stable. It's not 1.0 though, so we can live. For example using doc(hidden).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO now that this is stable, ideally SSE4 helers should become just part of normal two-way code since the algortithm is the same between both and only substring searches should be done in different ways based on runtime feature detection.
Thanks for doing all the hard work with the simd stuff, that's really nice. The rest should just be maintainership tasks |
Fuzz test run_substring doesn't pass for me, not sure why. Probably this is a 16-byte read in a region that's shorter than 16 bytes. READ of size 16 at 0x602000000577 thread T0 It corresponds to the equivalent of |
I looked at pat128 as the first suspect, since it's a place where a plain memcpy is replaced by a simd load. Changing it to use an intermediate buffer and memcpy seems to postpone this problem, then a similar problem is reached again with longer inputs for some reason. |
Crux of the issue is avoiding to read past the end of the needle or text. The pcmpestri instruction in the old code must be accomplishing that? Or is it an artifact of how fuzzing is instrumented, that it only shows up in the new code with the intrinsics? Either way it needs a resolution. |
Yeah that's weird, I'd expect both to behave in the same way. I can recheck sometime later this week and get back to you if you don't find a solution earlier. |
Looking at |
I've pushed 2 commits on top of this to branch stable-simd-pr-8 / d4e91ea that should fix the fuzz test. Yes, the pcmp code is not in a super understandable state. Thanks again for getting this code over the hump to stable simd. |
See my pr to your branch for updates |
I'm currently still away, but should be able to check after the weekend if that's okay. |
Thanks! |
Actually already reviewed and left one nit, otherwise looks good, thanks! |
Thanks for working on this and cooperating on solving this! I'll merge this and then add my own changes, adjust the mask load like we talked about. You inspired me to work on bluss/matrixmultiply#22, which was a huge success I think. This std::arch stuff is fun ;) After that we still have the maintainerish questions to settle before a new version can go live:
|
The reason for the accelerated schedule is that I have time now and I'm trying to get this done. |
@RReverser I can put in one caveat, these pathologies were collected with the original two way implementation in mind and not the new. test allright::twoway_find ... bench: 92,791 ns/iter (+/- 657) = 1939 MB/s
test aaab_in_aab::twoway_find ... bench: 287,539 ns/iter (+/- 3,647) = 1043 MB/s
test aaabbb::twoway_find ... bench: 207,107 ns/iter (+/- 1,506) = 1448 MB/s
test allright::twoway_find ... bench: 92,776 ns/iter (+/- 353) = 1940 MB/s
test bb_in_aa::twoway_find ... bench: 15,349 ns/iter (+/- 501) = 6515 MB/s
test bbbaaa::twoway_find ... bench: 370,070 ns/iter (+/- 3,541) = 810 MB/s
test gllright::twoway_find ... bench: 100,767 ns/iter (+/- 490) = 1786 MB/s
test naive::twoway_find ... bench: 91 ns/iter (+/- 1) = 2747 MB/s
test naive_longpat::twoway_find ... bench: 13,692 ns/iter (+/- 201) = 7303 MB/s
test naive_longpat_reversed::twoway_find ... bench: 93,921 ns/iter (+/- 2,610) = 1064 MB/s
test naive_rev::twoway_find ... bench: 288 ns/iter (+/- 13) = 868 MB/s
test pathological_two_way::twoway_find ... bench: 8,153 ns/iter (+/- 37) = 7359 MB/s
test pathological_two_way_rev::twoway_find ... bench: 40,818 ns/iter (+/- 1,065) = 1469 MB/s
test periodic2::twoway_find ... bench: 6,543 ns/iter (+/- 78) = 3056 MB/s
test periodic5::twoway_find ... bench: 2,626 ns/iter (+/- 117) = 3046 MB/s
test short_1let_cy::twoway_find ... bench: 1,144 ns/iter (+/- 91) = 4486 MB/s
test short_1let_long::twoway_find ... bench: 69 ns/iter (+/- 0) = 36971 MB/s
test short_2let_common::twoway_find ... bench: 449 ns/iter (+/- 2) = 5681 MB/s
test short_2let_cy::twoway_find ... bench: 1,248 ns/iter (+/- 4) = 4112 MB/s
test short_2let_rare::twoway_find ... bench: 370 ns/iter (+/- 10) = 6894 MB/s
test short_3let_cy::twoway_find ... bench: 1,301 ns/iter (+/- 16) = 3944 MB/s
test short_3let_long::twoway_find ... bench: 399 ns/iter (+/- 11) = 6393 MB/s
test short_short::twoway_find ... bench: 34 ns/iter (+/- 0) = 1647 MB/s
test short_word1_long::twoway_find ... bench: 522 ns/iter (+/- 2) = 4886 MB/s
test short_word2_long::twoway_find ... bench: 447 ns/iter (+/- 16) = 5706 MB/s |
@bluss Could you perhaps do consecutive runs with old/new implementation on same machine and run logs through cargo-benchcmp? It should help a bit with visualising regressions (or improvements?). |
Be the benchmark you want to see :) |
Before this patch, multipart got into an impossible sitation with it's dependencies. It errs with: ``` error: failed to select a version for `lazy_static`. ... required by package `multipart v0.15.4` versions that meet the requirements `>= 1.0, < 1.2.0` are: 1.1.0, 1.0.2, 1.0.1, 1.0.0 all possible versions conflict with previously selected packages. previously selected package `lazy_static v1.2.0` ... which is depended on by `ring v0.13.5` ... which is depended on by `cookie v0.11.0` ... which is depended on by `rocket_http v0.4.0` ... which is depended on by `rocket v0.4.0` ... which is depended on by `multipart v0.15.4 ``` This is due to ring 0.13.3 bumping lazy_static to 1.2.0 to avoid a [soundness bug](rust-lang-nursery/lazy-static.rs#117). This patch fixes this problem by requiring at least rust 1.24.1. In addition, I noticed that the feature sse4 was depending on `twoway/pcmp`, but that has been [removed](bluss/twoway#8).
Fixes #7.