Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use stable SIMD intrinsics with runtime detection #8

Merged
merged 8 commits into from
Nov 17, 2018

Conversation

RReverser
Copy link
Contributor

Fixes #7.

Switch to compile-time cfg for just x86/x86-64 and runtime detection for SSE 4.2.

Also fix no_std fallbacks, including for detection (without `use_std` CPU features can't be detected at runtime, but it will be still possible to enable pcmp with explicit `-C target-feature=+sse4.2`).
@RReverser
Copy link
Contributor Author

Hmm, I didn't observe these CI errors locally, I'll look into it.

In the meanwhile, I noticed that intrinsic versions gets quite a bit slower even on the same compiler/machine, and found what I believe to be a reason in the generated assembly and raised upstream to Rust: rust-lang/rust#54353

It should be compilation attribute to avoid that block altogether.
This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)
@RReverser
Copy link
Contributor Author

@bluss Okay, given some information given in that thread, had to change the code more significantly to take advantage of inlining, but at least performance is now on par (sometimes faster / sometimes slower, probably just within noise margin) with previous asm version and not significantly slower as before.

@bluss
Copy link
Owner

bluss commented Sep 29, 2018

Nice work, thanks a lot for working on this. Feel free to ping me, since I can be slow

@RReverser
Copy link
Contributor Author

@bluss I was thinking of pinging you but saw that you didn't have any activity on Github in September, assumed you have taken some time off and decided to just wait and see. Good to see you back, let me know if any changes required (although now I'll be away for a week myself).

@RReverser
Copy link
Contributor Author

Feel free to ping me, since I can be slow

Just to clarify - did you mean you're going to find time to review this or that you want me to do some more changes and ping afterwards? Aside from (suspiciously) failing Travis, this PR is ready as-is.

/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU.
pub fn is_supported() -> bool {
if cfg!(feature = "use_std") {
is_x86_feature_detected!("sse4.2")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, what's the overhead of such a check and how is it cached?

We should put in some kind of compile-time override here so that we can test both implementations in travis and locally easily. A compile time environment variable like RUST_TWOWAY_DISABLE_SIMD? Something for testing mainly, but I guess it can be used for debugging in general.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to compile as is with no std. This does:

/// `is_supported` checks whether necessary SSE 4.2 feature is supported on current CPU.
pub fn is_supported() -> bool {
    #[cfg(feature = "use_std")]
    return is_x86_feature_detected!("sse4.2");
    #[cfg(not(feature = "use_std"))]
    return cfg!(target_feature = "sse4.2");
}

@@ -11,7 +10,7 @@ use std::usize;
extern crate memchr;

mod tw;
#[cfg(feature = "pcmp")]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub mod pcmp;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mod should now preferably not be public. We need some way to benchmark it still, or maybe not? Just use the simd enable/disable override?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too, but noticed you're exposing other algorithms as public modules and thought it's intentional and has to be preserved.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to think over it again now that it's going stable. It's not 1.0 though, so we can live. For example using doc(hidden).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO now that this is stable, ideally SSE4 helers should become just part of normal two-way code since the algortithm is the same between both and only substring searches should be done in different ways based on runtime feature detection.

@bluss
Copy link
Owner

bluss commented Oct 28, 2018

Thanks for doing all the hard work with the simd stuff, that's really nice. The rest should just be maintainership tasks

@bluss
Copy link
Owner

bluss commented Oct 28, 2018

Fuzz test run_substring doesn't pass for me, not sure why. Probably this is a 16-byte read in a region that's shorter than 16 bytes.

READ of size 16 at 0x602000000577 thread T0
SUMMARY: AddressSanitizer: heap-buffer-overflow" Input was "0x7c,0x20,0x72,0x20,0x20,0x20,0x20"

It corresponds to the equivalent of find_bytes(b" ", b" "). (3 spaces then 2 spaces. For some reason github is not showing what I've input.

@bluss
Copy link
Owner

bluss commented Oct 28, 2018

I looked at pat128 as the first suspect, since it's a place where a plain memcpy is replaced by a simd load. Changing it to use an intermediate buffer and memcpy seems to postpone this problem, then a similar problem is reached again with longer inputs for some reason.

@bluss
Copy link
Owner

bluss commented Oct 28, 2018

Crux of the issue is avoiding to read past the end of the needle or text. The pcmpestri instruction in the old code must be accomplishing that? Or is it an artifact of how fuzzing is instrumented, that it only shows up in the new code with the intrinsics? Either way it needs a resolution.

@RReverser
Copy link
Contributor Author

Yeah that's weird, I'd expect both to behave in the same way. I can recheck sometime later this week and get back to you if you don't find a solution earlier.

@RReverser
Copy link
Contributor Author

Looking at pat128, I'm not even sure now why I made that change and assumed it would be ok when it's clearly wrong...

@bluss
Copy link
Owner

bluss commented Oct 30, 2018

I've pushed 2 commits on top of this to branch stable-simd-pr-8 / d4e91ea that should fix the fuzz test. Yes, the pcmp code is not in a super understandable state. Thanks again for getting this code over the hump to stable simd.

@bluss
Copy link
Owner

bluss commented Nov 9, 2018

See my pr to your branch for updates

@RReverser
Copy link
Contributor Author

I'm currently still away, but should be able to check after the weekend if that's okay.

@RReverser
Copy link
Contributor Author

Thanks!

@RReverser
Copy link
Contributor Author

Actually already reviewed and left one nit, otherwise looks good, thanks!

@bluss
Copy link
Owner

bluss commented Nov 17, 2018

Thanks for working on this and cooperating on solving this! I'll merge this and then add my own changes, adjust the mask load like we talked about.

You inspired me to work on bluss/matrixmultiply#22, which was a huge success I think. This std::arch stuff is fun ;)

After that we still have the maintainerish questions to settle before a new version can go live:

  • testing without/with pcmp
  • platform testing on x86-64 and non-x86
  • which symbols are public

@bluss bluss merged commit 9a543c7 into bluss:master Nov 17, 2018
@bluss
Copy link
Owner

bluss commented Nov 17, 2018

The reason for the accelerated schedule is that I have time now and I'm trying to get this done.

@bluss
Copy link
Owner

bluss commented Nov 18, 2018

@RReverser
These are my benchmarks results, they are not comparative and slightly messy. From the simd enabled stable twoway crate as it is now on an avx enabled x86-64 laptop. I think having a worst case of > 800 MB/s on the pathological test cases we have is very good.

I can put in one caveat, these pathologies were collected with the original two way implementation in mind and not the new.

test allright::twoway_find                  ... bench:      92,791 ns/iter (+/- 657) = 1939 MB/s
test aaab_in_aab::twoway_find               ... bench:     287,539 ns/iter (+/- 3,647) = 1043 MB/s
test aaabbb::twoway_find                    ... bench:     207,107 ns/iter (+/- 1,506) = 1448 MB/s
test allright::twoway_find                  ... bench:      92,776 ns/iter (+/- 353) = 1940 MB/s
test bb_in_aa::twoway_find                  ... bench:      15,349 ns/iter (+/- 501) = 6515 MB/s
test bbbaaa::twoway_find                    ... bench:     370,070 ns/iter (+/- 3,541) = 810 MB/s
test gllright::twoway_find                  ... bench:     100,767 ns/iter (+/- 490) = 1786 MB/s
test naive::twoway_find                     ... bench:          91 ns/iter (+/- 1) = 2747 MB/s
test naive_longpat::twoway_find             ... bench:      13,692 ns/iter (+/- 201) = 7303 MB/s
test naive_longpat_reversed::twoway_find    ... bench:      93,921 ns/iter (+/- 2,610) = 1064 MB/s
test naive_rev::twoway_find                 ... bench:         288 ns/iter (+/- 13) = 868 MB/s
test pathological_two_way::twoway_find      ... bench:       8,153 ns/iter (+/- 37) = 7359 MB/s
test pathological_two_way_rev::twoway_find  ... bench:      40,818 ns/iter (+/- 1,065) = 1469 MB/s
test periodic2::twoway_find                 ... bench:       6,543 ns/iter (+/- 78) = 3056 MB/s
test periodic5::twoway_find                 ... bench:       2,626 ns/iter (+/- 117) = 3046 MB/s
test short_1let_cy::twoway_find             ... bench:       1,144 ns/iter (+/- 91) = 4486 MB/s
test short_1let_long::twoway_find           ... bench:          69 ns/iter (+/- 0) = 36971 MB/s
test short_2let_common::twoway_find         ... bench:         449 ns/iter (+/- 2) = 5681 MB/s
test short_2let_cy::twoway_find             ... bench:       1,248 ns/iter (+/- 4) = 4112 MB/s
test short_2let_rare::twoway_find           ... bench:         370 ns/iter (+/- 10) = 6894 MB/s
test short_3let_cy::twoway_find             ... bench:       1,301 ns/iter (+/- 16) = 3944 MB/s
test short_3let_long::twoway_find           ... bench:         399 ns/iter (+/- 11) = 6393 MB/s
test short_short::twoway_find               ... bench:          34 ns/iter (+/- 0) = 1647 MB/s
test short_word1_long::twoway_find          ... bench:         522 ns/iter (+/- 2) = 4886 MB/s
test short_word2_long::twoway_find          ... bench:         447 ns/iter (+/- 16) = 5706 MB/s

@RReverser
Copy link
Contributor Author

@bluss Could you perhaps do consecutive runs with old/new implementation on same machine and run logs through cargo-benchcmp? It should help a bit with visualising regressions (or improvements?).

@bluss
Copy link
Owner

bluss commented Nov 18, 2018

Be the benchmark you want to see :)

erickt added a commit to erickt/multipart that referenced this pull request Jan 19, 2019
Before this patch, multipart got into an impossible sitation with
it's dependencies. It errs with:

```
error: failed to select a version for `lazy_static`.
    ... required by package `multipart v0.15.4`
versions that meet the requirements `>= 1.0, < 1.2.0` are: 1.1.0, 1.0.2, 1.0.1, 1.0.0

all possible versions conflict with previously selected packages.

  previously selected package `lazy_static v1.2.0`
    ... which is depended on by `ring v0.13.5`
    ... which is depended on by `cookie v0.11.0`
    ... which is depended on by `rocket_http v0.4.0`
    ... which is depended on by `rocket v0.4.0`
    ... which is depended on by `multipart v0.15.4
```

This is due to ring 0.13.3 bumping lazy_static to 1.2.0 to avoid
a [soundness bug](rust-lang-nursery/lazy-static.rs#117).
This patch fixes this problem by requiring at least rust 1.24.1.

In addition, I noticed that the feature sse4 was depending on
`twoway/pcmp`, but that has been [removed](bluss/twoway#8).
@RReverser RReverser deleted the stable-simd branch March 4, 2019 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants