Abysmal multi-thread performance due to C11 atomics not being used by default #1723
Comments
Thanks for an excellent write-up, we're very interested in helping improve our multi-threaded performance. I will take a look at your suggestion to change the default C language standard. For my own curiosity, can you point me to the low overhead locking functionality you added to HAProxy? For reference, here is where our CMake defaults to C99, and here is where we detect the C standard and conditionally enable the C11 implementation of CRYPTO_refcount_*.
Hi Andrew!
Sure! It's here: https://github.com/haproxy/haproxy/blob/master/src/thread.c#L1002, which is a simple copy of the example provided in plock: https://github.com/wtarreau/plock/blob/master/examples/pth_rwl.c, and the pl_lorw_* functions are available here: https://github.com/wtarreau/plock/blob/master/plock.h. (BTW, please note that even pthread's generic unlock is costly: you have to waste a read cycle for a test before deciding to unlock for read or for write.) It's automatically enabled if you build haproxy with USE_PTHREAD_EMULATION=1.

A few other projects use it; the Intel QAT engine comes to mind, where it brought significant savings compared to default pthread rwlocks: https://github.com/intel/QAT_Engine/blob/master/plock.c. It's often dirty to proceed this way, but it's so easy to compare with/without that it's often worth trying ;-)
Oh, I'm sorry for not having placed the line number references. I was so impatient to dump all the info here once I had collected everything that I just forgot the pointers. Thanks for correcting that!
I opened #1729 to turn C11 on by default, which greatly improves the performance on my Mac. Did you have to pass any other flags to unlock the performance improvement on Linux? Looking at a Linux build, I confirmed it is using C11 and that the refcount_c11.c code is being used, but it doesn't seem to be any faster than the refcount_lock.c implementation.
What OS and compiler did you use in your testing?
I've tested it on multiple Linux boxes (Ubuntu 20/22, Debian I-don't-know-what-version, Slackware 15). The exact flags I used on the last build, on an Ubuntu 22 box on ARM, were:
And on this machine there is gcc-11.4; it's one place where it makes such a huge difference. I noticed you've passed -GNinja to your cmake; I've never used ninja, and I have no idea how it could interfere with options, if at all. I'm only using a standard make. Stupid question: why not just drop the -std= entry entirely? As I pointed out above, C11 has been the default since gcc-5. In addition, these atomics are even available since gcc-4.7 without having to force the standard anyway. In haproxy we enable them on gcc >= 4.7 and clang, and we never rely on the C standard version:
Thanks for seeing this through to the end, Andrew! Great to know that in the next version it should work out of the box with no changes; our users will be delighted!
Problem:
We at HAProxy have been testing various TLS libraries and were shocked to notice that AWS-LC had significantly worse performance than OpenSSL 1.1.1 on the client side and for TLS resumption, and was almost as bad as OpenSSL 3.x (!) in threaded environments, due to extreme use of pthread locks. We first noticed this about a year ago with v1.19 but didn't have time to investigate further back then. Now we've assigned more time to this and tested v1.29.0 and v1.32.0 (both of which show strictly identical performance on this point).
In order to overcome part of OpenSSL's slowness, we had already integrated a pthread lock emulation in haproxy that exposes much lighter locks ("lorw" for "low overhead rwlock") and that libraries can use transparently. So it was easy for us to try that with AWS-LC as well. It helped quite a bit, but the performance collapsed again past ~40 threads and remained significantly below OpenSSL.
A perf flamegraph showed that it was always the refcounting functions that were calling pthread_rwlock. See the attached flamegraphs below, the first one with standard pthread rwlocks, the second one with low-overhead rwlocks:
Looking at the code revealed that, aside from this heavy pthread-based implementation, there is an alternate implementation for C11-compatible compilers, using much cheaper C11 atomics based on fetch-and-add and friends.
We finally found that `CMakeLists.txt` forces `-std=c99` by default when the `CMAKE_C_STANDARD` variable is not set, which forces the fallback to the slow implementation. Just building with `-DCMAKE_C_STANDARD=11` completely solved the problem, making AWS-LC much faster than OpenSSL. Now here's the comparison of the performance of the different libs + locking mechanisms, running on an r8g-8xlarge (Graviton 4, 64 cores):

Solution:
The apparently cleanest solution would seem to be to just drop that counter-productive `-std=c99` from the CMakeLists file. C11 has apparently been the default since gcc-5 from what I'm seeing, so this leaves quite some margin, and those with older compilers are used to looking for such settings anyway. At the very least, the minimalistic approach would be to document that everyone with a compiler less than 13 years old should always add `-DCMAKE_C_STANDARD=11` to their `cmake` command line, otherwise the performance will be very low.

Another option could be to better detect the compiler support for C11 atomics instead of relying on the advertised standard. In haproxy we rely on the compiler version to choose the type of atomics: gcc >= 4.7 is OK with the C11 ones.
Requirements / Acceptance Criteria:
Any way to make sure that users and distro packagers don't need to read the code to figure out where the low performance comes from and how to address it. We've been telling some of our users for quite some time "beware, in client mode the performance can be very low" just because we hadn't taken the time to analyse the code. Most users likely don't inspect the code either to find hidden build options that don't appear in the doc (and this one should really be prominent).
Other
Thanks for making this great lib available ;-)