SSE/Neon path for MSVC x86 and ARM #2680
Testing this PR (
TL;DR: the change to |
And here's the same test on an ARM Mac. The runs on this M1 Mini had little deviation:
The only really interesting takeaway is how much the M1 trounces the 3990X in this test (though the Threadripper here is under-clocked at the moment).
Maybe define our own ? or something like that in compiler.h?
example needs MSVC NEON support ... does MSVC need |
@aqrit On MSVC 64-bit ARM |
Closing since #2681 makes this redundant.
This takes what #2653 started and extends it to x86 and MS ARM64 targets. To do this I fake the `__SSE2__` or `__ARM_NEON` defines for MSVC (this was preferable to having the longer tests everywhere else) and change the signature for `ZSTD_Vec256_cmpMask8` (more on that later).

First some benchmarks! This is x86 without the SSE2 path, on a 3990X (with 127 idle cores!):
And this is with the SSE2 path enabled:
I took the best of five runs, and we see a 20-50% improvement. For this to work I needed to change `ZSTD_Vec256_cmpMask8` to take a pointer to the 256-bit type (since on 32-bit systems, depending on the version of MSVC, tested with 2010-2019, it errors with `formal parameter with requested alignment of 16 won't be aligned`). I worried this would affect performance by not making best use of the wider SSE registers, but after many runs comparing the x64 version with and without the change, the pointer variant was always slightly faster (there was variance in the numbers, but on a generally good run the pointer always bested the pass-by-value). I suspect this wouldn't be the case with a real 256-bit type.

The same run on 3990X as x64, for comparison:
Since I had one on my desk I also threw this at a Surface Pro X with ARM64. Here's the before, running the fallback path:
And here's after with the Neon path:
Around a 10% improvement.
I also ran the same benchmark on other x86 and x64 hardware with the same result. I haven't as of yet run this on Apple ARM hardware with Clang for comparison, but I will, and then update this PR.
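To make the signature change to `ZSTD_Vec256_cmpMask8` described above concrete, here is a minimal sketch of a compare-mask helper that takes its 256-bit operands by pointer. The names `sketch_vec256`, `sketch_load256` and `sketch_cmpMask8` are hypothetical and this is not the code in the PR; it only assumes the 256-bit type is emulated as two 128-bit SSE2 lanes, with a scalar fallback so the sketch builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: SSE2 is detected via __SSE2__ or MSVC's architecture macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define SKETCH_HAVE_SSE2 1
#endif

#ifdef SKETCH_HAVE_SSE2
/* Hypothetical emulated 256-bit vector: two 16-byte-aligned SSE2 lanes. */
typedef struct { __m128i lo, hi; } sketch_vec256;

static sketch_vec256 sketch_load256(const void* src)
{
    sketch_vec256 v;
    v.lo = _mm_loadu_si128((const __m128i*)src);
    v.hi = _mm_loadu_si128((const __m128i*)src + 1);
    return v;
}

/* Operands are passed by pointer, so no over-aligned formal parameter is
 * needed (the situation 32-bit MSVC rejects for pass-by-value). */
static uint32_t sketch_cmpMask8(const sketch_vec256* a, const sketch_vec256* b)
{
    uint32_t lo, hi;
    assert(a != NULL && b != NULL);
    lo = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a->lo, b->lo));
    hi = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a->hi, b->hi));
    return lo | (hi << 16);   /* bit i is set when byte i is equal */
}
#else
/* Scalar fallback with the same interface. */
typedef struct { uint8_t bytes[32]; } sketch_vec256;

static sketch_vec256 sketch_load256(const void* src)
{
    sketch_vec256 v;
    memcpy(v.bytes, src, sizeof v.bytes);
    return v;
}

static uint32_t sketch_cmpMask8(const sketch_vec256* a, const sketch_vec256* b)
{
    uint32_t mask = 0;
    int i;
    assert(a != NULL && b != NULL);
    for (i = 0; i < 32; i++) {
        if (a->bytes[i] == b->bytes[i]) mask |= (uint32_t)1 << i;
    }
    return mask;
}
#endif
```

A caller loads both operands into locals and passes their addresses, e.g. `sketch_cmpMask8(&x, &y)`; since the helper is `static`, the compiler is still free to inline it and keep the lanes in registers, which may be part of why the pointer variant did not cost anything in the runs above.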
The fake defines I'm not 100% happy with, but it's no different (IMO) to faking `__has_builtin()` and others. But suggestions welcome.
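For reference, the kind of fake defines discussed above could look roughly like the following. This is a sketch under assumptions, not the exact block from the PR: it relies on MSVC's documented `_M_X64`, `_M_IX86_FP` and `_M_ARM64` macros, and the `sketch_simd_path` helper and its enum are hypothetical, added only to show which path the preprocessor selected.

```c
/* Sketch: map MSVC's architecture macros onto the GCC/Clang-style feature
 * macros that the SIMD paths already test for. */
#if defined(_MSC_VER)
#  if !defined(__SSE2__) && (defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2))
#    define __SSE2__ 1      /* x64 always has SSE2; x86 needs /arch:SSE2 or higher */
#  endif
#  if !defined(__ARM_NEON) && defined(_M_ARM64)
#    define __ARM_NEON 1    /* assumption: MSVC ARM64 targets always provide Neon */
#  endif
#endif

/* Hypothetical helper reporting which code path would be compiled in. */
typedef enum {
    SKETCH_SIMD_SCALAR = 0,
    SKETCH_SIMD_NEON   = 1,
    SKETCH_SIMD_SSE2   = 2
} sketch_simd_kind;

static sketch_simd_kind sketch_simd_path(void)
{
#if defined(__SSE2__)
    return SKETCH_SIMD_SSE2;
#elif defined(__ARM_NEON)
    return SKETCH_SIMD_NEON;
#else
    return SKETCH_SIMD_SCALAR;
#endif
}
```

The alternative raised in the conversation, a project-specific macro in compiler.h, would avoid defining compiler-reserved names at the cost of touching every existing `__SSE2__` or `__ARM_NEON` test.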