Use stable SIMD intrinsics with runtime detection #8

RReverser · 2018-09-19T00:20:11Z

Fixes #7.

Switch to a compile-time cfg that covers just x86/x86-64, plus runtime detection for SSE 4.2. Also fix the no_std fallbacks, including the one for detection: without `use_std`, CPU features can't be detected at runtime, but it is still possible to enable pcmp explicitly with `-C target-feature=+sse4.2`.

RReverser · 2018-09-19T11:07:39Z

Hmm, I didn't observe these CI errors locally, I'll look into it.

In the meantime, I noticed that the intrinsics version gets quite a bit slower even on the same compiler and machine. I found what I believe to be the reason in the generated assembly and raised it upstream with Rust: rust-lang/rust#54353

It should be compilation attribute to avoid that block altogether.

This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)
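As an illustration of the pattern described above, here is a minimal sketch (not the code from this PR) in which every function in the SIMD call chain carries `#[target_feature(enable = "sse4.2")]`, and a safe wrapper picks the path at runtime. The names `first_match`, `find_sse42` and `find_in_block` are hypothetical, and the sketch only handles needles and blocks of at most 16 bytes.

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Innermost helper (hypothetical). It carries the same attribute as its
/// caller so the compiler is allowed to inline it into that caller.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline]
#[target_feature(enable = "sse4.2")]
unsafe fn first_match(needle: __m128i, needle_len: i32, block: __m128i, block_len: i32) -> usize {
    // Equal-ordered mode: index of the first (possibly partial) match, or 16.
    _mm_cmpestri::<_SIDD_CMP_EQUAL_ORDERED>(needle, needle_len, block, block_len) as usize
}

/// SSE 4.2 path (hypothetical). Both inputs are full 16-byte buffers, so the
/// 16-byte loads stay in bounds; the explicit lengths say how much is valid.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.2")]
unsafe fn find_sse42(
    needle: &[u8; 16],
    needle_len: usize,
    block: &[u8; 16],
    block_len: usize,
) -> Option<usize> {
    let n = _mm_loadu_si128(needle.as_ptr() as *const __m128i);
    let b = _mm_loadu_si128(block.as_ptr() as *const __m128i);
    let idx = first_match(n, needle_len as i32, b, block_len as i32);
    // A match that runs off the end of the block is only partial: reject it.
    if idx + needle_len <= block_len { Some(idx) } else { None }
}

/// Safe entry point: runtime detection on x86, scalar fallback elsewhere.
pub fn find_in_block(needle: &[u8], block: &[u8]) -> Option<usize> {
    assert!(needle.len() <= 16 && block.len() <= 16);
    if needle.is_empty() {
        return Some(0);
    }
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Zero-pad both inputs to 16 bytes before handing them to SIMD code.
            let pad = |s: &[u8]| {
                let mut buf = [0u8; 16];
                buf[..s.len()].copy_from_slice(s);
                buf
            };
            // SAFETY: SSE 4.2 support was just confirmed at runtime.
            return unsafe { find_sse42(&pad(needle), needle.len(), &pad(block), block.len()) };
        }
    }
    block.windows(needle.len()).position(|w| w == needle)
}
```

Only the outer wrapper is free of the attribute; it is the single place where the `unsafe` call is justified by the runtime check.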

RReverser · 2018-09-19T18:28:02Z

@bluss Okay, based on the information in that thread, I had to change the code more significantly to take advantage of inlining. At least performance is now on par with the previous asm version (sometimes faster, sometimes slower, probably within the noise margin) rather than significantly slower as before.

bluss · 2018-09-29T21:08:03Z

Nice work, thanks a lot for working on this. Feel free to ping me, since I can be slow

RReverser · 2018-09-29T23:25:22Z

@bluss I was thinking of pinging you but saw that you didn't have any activity on GitHub in September, so I assumed you had taken some time off and decided to just wait and see. Good to see you back; let me know if any changes are required (although now I'll be away for a week myself).

RReverser · 2018-10-09T00:52:29Z

> Feel free to ping me, since I can be slow

Just to clarify - did you mean you're going to find time to review this or that you want me to do some more changes and ping afterwards? Aside from (suspiciously) failing Travis, this PR is ready as-is.

bluss · 2018-10-28T12:15:09Z

src/pcmp.rs

+/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU.
+pub fn is_supported() -> bool {
+    if cfg!(feature = "use_std") {
+        is_x86_feature_detected!("sse4.2")


I'm curious, what's the overhead of such a check and how is it cached?

We should put in some kind of compile-time override here so that we can test both implementations in travis and locally easily. A compile time environment variable like RUST_TWOWAY_DISABLE_SIMD? Something for testing mainly, but I guess it can be used for debugging in general.

This doesn't seem to compile as is with no std. This does:

/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU.
pub fn is_supported() -> bool {
    #[cfg(feature = "use_std")]
    return is_x86_feature_detected!("sse4.2");
    #[cfg(not(feature = "use_std"))]
    return cfg!(target_feature = "sse4.2");
}
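One way the compile-time override suggested above could look is sketched below. It assumes the `RUST_TWOWAY_DISABLE_SIMD` variable name floated in this thread and reads it with `option_env!`, which is evaluated when the crate is compiled; `simd_disabled_at_build` and `use_simd` are hypothetical names, not part of the crate.

```rust
/// Hypothetical switch: true when the crate was compiled with the
/// `RUST_TWOWAY_DISABLE_SIMD` environment variable set to any value.
pub fn simd_disabled_at_build() -> bool {
    // `option_env!` expands at compile time to `Some("...")` or `None`.
    option_env!("RUST_TWOWAY_DISABLE_SIMD").is_some()
}

/// Combines the CPU check with the build-time override (hypothetical helper).
/// `cpu_supports` would come from something like `pcmp::is_supported()`.
pub fn use_simd(cpu_supports: bool) -> bool {
    cpu_supports && !simd_disabled_at_build()
}
```

With this shape, building once with the variable set and once without would exercise both implementations on the same machine.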

bluss · 2018-10-28T12:28:55Z

src/lib.rs

@@ -11,7 +10,7 @@ use std::usize;
 extern crate memchr;

 mod tw;
-#[cfg(feature = "pcmp")]
+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
 pub mod pcmp;


This mod should now preferably not be public. We need some way to benchmark it still, or maybe not? Just use the simd enable/disable override?

I thought so too, but noticed you're exposing other algorithms as public modules and thought it's intentional and has to be preserved.

We'll need to think over it again now that it's going stable. It's not 1.0 though, so we can live. For example using doc(hidden).

IMO now that this is stable, ideally the SSE4 helpers should become just part of the normal two-way code, since the algorithm is the same between both and only the substring searches should be done in different ways based on runtime feature detection.

bluss · 2018-10-28T12:32:00Z

Thanks for doing all the hard work with the simd stuff, that's really nice. The rest should just be maintainership tasks

bluss · 2018-10-28T12:53:06Z

Fuzz test run_substring doesn't pass for me, not sure why. Probably this is a 16-byte read in a region that's shorter than 16 bytes.

READ of size 16 at 0x602000000577 thread T0
SUMMARY: AddressSanitizer: heap-buffer-overflow
Input was "0x7c,0x20,0x72,0x20,0x20,0x20,0x20"

It corresponds to the equivalent of `find_bytes(b"   ", b"  ")` (3 spaces, then 2 spaces). For some reason GitHub is not showing what I've input.

bluss · 2018-10-28T14:50:44Z

I looked at pat128 as the first suspect, since it's a place where a plain memcpy is replaced by a simd load. Changing it to use an intermediate buffer and memcpy seems to postpone this problem, then a similar problem is reached again with longer inputs for some reason.
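The intermediate-buffer idea mentioned above can be sketched in plain Rust as follows. This is only an illustration under assumptions, not the fix that was eventually pushed: `read_block` is a hypothetical helper that copies whatever remains of the text into a zero-padded 16-byte array, so a later 16-byte load (for example `_mm_loadu_si128` on the array) cannot touch memory past the end of the input.

```rust
/// Returns the 16 bytes of `text` starting at `pos`, zero-padded when fewer
/// than 16 bytes remain, together with the number of valid bytes
/// (hypothetical helper).
pub fn read_block(text: &[u8], pos: usize) -> ([u8; 16], usize) {
    let mut block = [0u8; 16];
    // `get` yields `None` when `pos` is past the end, so nothing is read then.
    let tail = text.get(pos..).unwrap_or(&[]);
    let n = tail.len().min(16);
    // The copy is bounded by the real length of the input.
    block[..n].copy_from_slice(&tail[..n]);
    (block, n)
}
```

The valid-byte count matters because the padding zeros must not be treated as part of the text; the explicit-length form of the compare instruction takes exactly such a count.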

bluss · 2018-10-28T19:52:04Z

The crux of the issue is avoiding reads past the end of the needle or text. The pcmpestri instruction in the old code must be accomplishing that? Or is it an artifact of how fuzzing is instrumented, that it only shows up in the new code with the intrinsics? Either way it needs a resolution.

RReverser · 2018-10-28T20:03:47Z

Yeah that's weird, I'd expect both to behave in the same way. I can recheck sometime later this week and get back to you if you don't find a solution earlier.

RReverser · 2018-10-30T17:38:56Z

Looking at pat128, I'm not even sure now why I made that change and assumed it would be ok when it's clearly wrong...

bluss · 2018-10-30T19:41:42Z

I've pushed 2 commits on top of this to branch stable-simd-pr-8 / d4e91ea that should fix the fuzz test. Yes, the pcmp code is not in a super understandable state. Thanks again for getting this code over the hump to stable simd.

bluss · 2018-11-09T17:22:18Z

See my pr to your branch for updates

RReverser · 2018-11-09T17:34:18Z

I'm currently still away, but should be able to check after the weekend if that's okay.

RReverser · 2018-11-09T17:34:23Z

Thanks!

RReverser · 2018-11-09T23:32:04Z

Actually already reviewed and left one nit, otherwise looks good, thanks!

bluss · 2018-11-17T17:22:51Z

Thanks for working on this and cooperating on solving this! I'll merge this and then add my own changes, adjust the mask load like we talked about.

You inspired me to work on bluss/matrixmultiply#22, which was a huge success I think. This std::arch stuff is fun ;)

After that we still have the maintainerish questions to settle before a new version can go live:

- testing without/with pcmp
- platform testing on x86-64 and non-x86
- which symbols are public

bluss · 2018-11-17T17:23:19Z

The reason for the accelerated schedule is that I have time now and I'm trying to get this done.

bluss · 2018-11-18T20:26:15Z

@RReverser
These are my benchmark results; they are not comparative and are slightly messy. They come from the SIMD-enabled stable twoway crate as it is now, on an AVX-enabled x86-64 laptop. I think having a worst case of > 800 MB/s on the pathological test cases we have is very good.

I can put in one caveat: these pathologies were collected with the original two-way implementation in mind, not the new one.

test allright::twoway_find                  ... bench:      92,791 ns/iter (+/- 657) = 1939 MB/s
test aaab_in_aab::twoway_find               ... bench:     287,539 ns/iter (+/- 3,647) = 1043 MB/s
test aaabbb::twoway_find                    ... bench:     207,107 ns/iter (+/- 1,506) = 1448 MB/s
test allright::twoway_find                  ... bench:      92,776 ns/iter (+/- 353) = 1940 MB/s
test bb_in_aa::twoway_find                  ... bench:      15,349 ns/iter (+/- 501) = 6515 MB/s
test bbbaaa::twoway_find                    ... bench:     370,070 ns/iter (+/- 3,541) = 810 MB/s
test gllright::twoway_find                  ... bench:     100,767 ns/iter (+/- 490) = 1786 MB/s
test naive::twoway_find                     ... bench:          91 ns/iter (+/- 1) = 2747 MB/s
test naive_longpat::twoway_find             ... bench:      13,692 ns/iter (+/- 201) = 7303 MB/s
test naive_longpat_reversed::twoway_find    ... bench:      93,921 ns/iter (+/- 2,610) = 1064 MB/s
test naive_rev::twoway_find                 ... bench:         288 ns/iter (+/- 13) = 868 MB/s
test pathological_two_way::twoway_find      ... bench:       8,153 ns/iter (+/- 37) = 7359 MB/s
test pathological_two_way_rev::twoway_find  ... bench:      40,818 ns/iter (+/- 1,065) = 1469 MB/s
test periodic2::twoway_find                 ... bench:       6,543 ns/iter (+/- 78) = 3056 MB/s
test periodic5::twoway_find                 ... bench:       2,626 ns/iter (+/- 117) = 3046 MB/s
test short_1let_cy::twoway_find             ... bench:       1,144 ns/iter (+/- 91) = 4486 MB/s
test short_1let_long::twoway_find           ... bench:          69 ns/iter (+/- 0) = 36971 MB/s
test short_2let_common::twoway_find         ... bench:         449 ns/iter (+/- 2) = 5681 MB/s
test short_2let_cy::twoway_find             ... bench:       1,248 ns/iter (+/- 4) = 4112 MB/s
test short_2let_rare::twoway_find           ... bench:         370 ns/iter (+/- 10) = 6894 MB/s
test short_3let_cy::twoway_find             ... bench:       1,301 ns/iter (+/- 16) = 3944 MB/s
test short_3let_long::twoway_find           ... bench:         399 ns/iter (+/- 11) = 6393 MB/s
test short_short::twoway_find               ... bench:          34 ns/iter (+/- 0) = 1647 MB/s
test short_word1_long::twoway_find          ... bench:         522 ns/iter (+/- 2) = 4886 MB/s
test short_word2_long::twoway_find          ... bench:         447 ns/iter (+/- 16) = 5706 MB/s
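For reading the MB/s column, the sketch below shows the arithmetic that relates it to ns/iter, assuming the common convention of bytes × 1000 / nanoseconds with integer division. The function name is hypothetical, and the byte counts used in the usage note are assumptions picked so the arithmetic lines up with the reported figures; the thread does not state the actual input sizes.

```rust
/// Converts a per-iteration byte count and time into MB/s
/// (hypothetical helper; 1 byte per nanosecond equals 1000 MB/s).
pub fn throughput_mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // Guard against a zero duration, then use integer division.
    bytes * 1000 / ns_per_iter.max(1)
}
```

Under these assumptions, a 250-byte input at 91 ns/iter works out to 2747 MB/s, and a 56-byte input at 34 ns/iter to 1647 MB/s.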

RReverser · 2018-11-18T22:25:04Z

@bluss Could you perhaps do consecutive runs with old/new implementation on same machine and run logs through cargo-benchcmp? It should help a bit with visualising regressions (or improvements?).

bluss · 2018-11-18T23:13:31Z

Be the benchmark you want to see :)

Before this patch, multipart got into an impossible situation with its dependencies. It errs with:

```
error: failed to select a version for `lazy_static`.
    ... required by package `multipart v0.15.4`
versions that meet the requirements `>= 1.0, < 1.2.0` are: 1.1.0, 1.0.2, 1.0.1, 1.0.0

all possible versions conflict with previously selected packages.

  previously selected package `lazy_static v1.2.0`
    ... which is depended on by `ring v0.13.5`
    ... which is depended on by `cookie v0.11.0`
    ... which is depended on by `rocket_http v0.4.0`
    ... which is depended on by `rocket v0.4.0`
    ... which is depended on by `multipart v0.15.4`
```

This is due to ring 0.13.3 bumping lazy_static to 1.2.0 to avoid a [soundness bug](rust-lang-nursery/lazy-static.rs#117).

This patch fixes this problem by requiring at least rust 1.24.1. In addition, I noticed that the feature sse4 was depending on `twoway/pcmp`, but that has been [removed](bluss/twoway#8).

RReverser added 6 commits September 18, 2018 22:49

Comment out func used by commented out funcs

7601647

cargo fix

d6ada81

Switch pcmp module to stable Rust intrinsics

7befdde

Add x86 fallback for absent _mm_extract_epi64

bf728b2

Preconstruct SIMD vector for needle

26ef116

RReverser added 2 commits September 19, 2018 18:47

Fix cfg for pcmp::is_supported

d94de1f

It should be compilation attribute to avoid that block altogether.

Add target_feature everywhere...

6dedfa2

This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)

bluss reviewed Oct 28, 2018

bluss merged commit 9a543c7 into bluss:master Nov 17, 2018

erickt mentioned this pull request Jan 19, 2019

Use lazy_static 1.2.0, remove twoway/pcmp and require rust 1.24.1+ abonander/multipart#118

Merged

RReverser deleted the stable-simd branch March 4, 2019 22:41
