-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add x86 SIMD optimizations to crypto datatypes #507
Conversation
f7f31d5
to
ae2a70c
Compare
Hi, @Lastique thanks for these. I guess these changes are based on real world usage, do you have any benchmarks on the speed improvements for a given configuration & platform? |
I have attached a unit benchmark to #508, but I don't have one at hand for the changes in this PR. I did bench the change when I originally wrote this patch for our in-house use a few years back, but unfortunately I didn't save it. I suppose, I could write one anew, if required. Basically, these two PRs came out of our attempts to reduce libsrtp entries in the profiling reports for our WebRTC SFU software. The software maintains lots of SRTP connections (hundreds), and these changes allowed to save a few percents of total CPU usage. The most CPU consuming part in this PR that came up in our reports is |
ae2a70c
to
ccd75ec
Compare
Rebased on top of the current master. |
ccd75ec
to
52d61dd
Compare
@Lastique thanks for updating, will take a new look at this now |
- The v128 operations are optimized for SSE2/SSSE3. - srtp_octet_string_is_eq is optimized for SSE2. When SSE2 is not available, use a pair of 32-bit accumulators to speed up the bulk of the operation. We use two accumulators to leverage instruction-level parallelism supported by most modern CPUs. - In srtp_cleanse, use memset and ensure it is not optimized away with a dummy asm statement, which can potentially consume the contents of the memory. - Endian conversion functions use gcc-style intrinsics, when possible. The SIMD code uses intrinsics, which are available on all modern compilers. For MSVC, config_in_cmake.h is modified to define gcc/clang-style SSE macros based on MSVC predefined macros. We enable all SSE versions when it indicates that AVX is enabled. SSE2 is always enabled for x86-64 or for x86 when SSE2 FP math is enabled.
52d61dd
to
faa7998
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will wait a few more day before merging in case any one has an objection.
@@ -272,9 +480,14 @@ int srtp_octet_string_is_eq(uint8_t *a, uint8_t *b, int len) | |||
|
|||
void srtp_cleanse(void *s, size_t len) | |||
{ | |||
#if defined(__GNUC__) | |||
memset(s, 0, len); | |||
__asm__ __volatile__("" : : "r"(s) : "memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a little unrelated to the rest ... unless it was performance problem.
thanks for the pr @Lastique , and sorry it so long to get in |
Thanks. |
srtp_octet_string_is_eq
is optimized for SSE2. When SSE2 is not available, use a pair of 32-bit accumulators to speed up the bulk of the operation. We use two accumulators to leverage instruction-level parallelism supported by most modern CPUs.srtp_cleanse
, usememset
and ensure it is not optimized away with a dummy asm statement, which can potentially consume the contents of the memory.base64_block_to_octet_triple
, prefermemchr
tostrchr
as it explicitly accepts the string length, which is known at compile time.The SIMD code uses intrinsics, which are available on all modern compilers. For MSVC, config_in_cmake.h is modified to define gcc/clang-style SSE macros based on MSVC predefined macros. We enable all SSE versions when it indicates that AVX is enabled. SSE2 is always enabled for x86-64 or for x86 when SSE2 FP math is enabled.