Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abysmal multi-thread performance due to C11 atomics not being used by default #1723

Closed
wtarreau opened this issue Jul 25, 2024 · 6 comments · Fixed by #1729
Closed

Abysmal multi-thread performance due to C11 atomics not being used by default #1723

wtarreau opened this issue Jul 25, 2024 · 6 comments · Fixed by #1729

Comments

@wtarreau
Copy link

Problem:

We at HAProxy have been testing various TLS libraries, and were shocked to notice that AWS-LC had significantly worse performance than OpenSSL 1.1.1 on client-side and TLS resumption and was almost as bad as OpenSSL-3.x(!) in threaded environments due to extreme use of pthread locks. We first noticed this about a year ago with v1.19 but didn't have time to investigate further by then. Now we've assigned more time to this and tested v1.29.0 and v1.32.0 (both of which show strictly identical performance on this point).

In order to overcome a part of OpenSSL's slowness, we had already integrated a pthread lock emulation in haproxy, that exposes much lighter locks ("lorw" for "low overhead rwlock") and that libraries can use transparently. So it was easy for us to try that as well with AWS-LC. It helped quite a bit but the performance collapsed again past ~40 threads and remained significantly below openssl.

A perf flamegraph showed that it was always the refcounting functions that were calling pthread_rwlock. See attached flamegraphs below, first one with standard pthread rwlocks, second one with low overhead rwlocks:

aws-lc-1 29
aws-lc-1 29-lorw

Looking at the code revealed that aside this heavy pthread-based implementation, there is an alternate implementation for C11 compatible compilers, using much cheaper C11 atomics based on fetch-and-add and friends.

We finally found that CMakeLists.txt forces -std=c99 by default when the CMAKE_C_STANDARD variable is not set, which forces the fallback to the slow implementation. Just building with -DCMAKE_C_STANDARD=11 completely solved the problem, making AWS-LC much faster than OpenSSL. Now here's the comparison of the performance of the different libs + locking mechanisms, running on an r8g-8xlarge (Graviton 4, 64 cores):

graviton4-64t-locking-comparison

Solution:

The apparently cleanest solution would seem to be to just drop that counter-productive -std=c99 from the CMakeLists file. C11 has apparently been the default since gcc-5 from what I'm seeing so this leaves quite some margin, and those with older compilers are used to looking for such settings anyway. At the very least, the minimalistic approach would be to document that everyone with a compiler less than 13 years old should always add -DCMAKE_C_STANDARD=11 to their cmake command line otherwise the performance will be very low.

Another option could be to better detect the compiler support for C11 atomics instead of relying on the advertised standard. In haproxy we rely on the compiler version to choose the type of atomics: gcc >= 4.7 is OK with the C11 ones.

Requirements / Acceptance Criteria:

Any way to make sure that users and distro pacakgers don't need to read the code to figure where the low performance comes from and how to address it. We've been telling some of our users for quite some time "beware, in the client mode the performance can be very low" just because we didn't take the time to analyse the code. Most users likely don't inspect the code either to figure some hidden build options that don't appear in the doc (and should really be prominent).

Other

Thanks for making this great lib available ;-)

@andrewhop
Copy link
Contributor

Thanks for an excellent write up, we're very interested in helping improve our multi-threaded performance. I will take a look at your suggestion to change the default C language standard. For my own curiosity can you point me to the low overhead locking functionality you added to HAProxy.

For reference here is where our CMake defaults to C 99, here is where we detect the C standard and conditionally enable the C11 implementation of CRYPTO_refcount_*.

@wtarreau
Copy link
Author

wtarreau commented Jul 30, 2024

Hi Andrew!

For my own curiosity can you point me to the low overhead locking functionality you added to HAProxy.

Sure! It's here: https://github.com/haproxy/haproxy/blob/master/src/thread.c#L1002, which is a simple copy of the example provided in plock: https://github.com/wtarreau/plock/blob/master/examples/pth_rwl.c and the pl_lorw_* functions are available here: https://github.com/wtarreau/plock/blob/master/plock.h (BTW please note that even pthread's generic unlock is costly, you have to waste a read cycle for a test before deciding to unlock for read or for write).

It's automatically enabled if you build haproxy with USE_PTHREAD_EMULATION=1

A few other projects use it, the intel QAT engine comes to mind, where it brought significant savings: https://github.com/intel/QAT_Engine/blob/master/plock.c compared to default pthread rwlocks.

It's often dirty to proceed this way but it's so easy to compare with/without that it's often worth trying ;-)

For reference here...

Oh I'm sorry for not having placed the line number references. I was so impatient to dump all the info here now that I had collected everything that I just forgot the pointers. Thanks for having corrected that!

@andrewhop
Copy link
Contributor

I opened #1729 to turn C11 on by default which greatly improves the performance on my Mac. Did you have to pass in any other flags to unlock the performance improvement on Linux? Looking at a Linux build I confirmed it is using C11 and that the refcount_c11.c code is being used, but it doesn't seem to be any faster than the refcount_lock.c implementation.

@andrewhop
Copy link
Contributor

What OS and compiler did you use in your testing?

@wtarreau
Copy link
Author

I've had it on multiple Linux boxes (ubuntu 20/22, debian I-don't-know-what-version, slackware 15). The exact flags I've used on last build on an ubuntu-22 on ARM were:

cmake -B build -DCMAKE_C_STANDARD=11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/ssl/aws-lc

And on this machine there is gcc-11.4, it's one place where it makes such a huge difference.

I noticed you've passed -Gninja to your cmake, I've never used ninja, I have no idea how it could interfere with options if at all, I'm only using a standard make.

Stupid question, why not just drop entirely the -std= entry ? As I pointed above, it's granted since gcc-5.

In addition, these atomics are even available since gcc-4.7 without having to force the standard anyway. In haproxy we enable them on gcc >= 4.7 and clang, and we never rely on the C standard version:

#if defined(__GNUC__) && (__GNUC__ < 4 || __GNUC__ == 4 && __GNUC_MINOR__ < 7) && !defined(__clang__)
   /* use older atomics (__sync_xxx) */
#else
  /* use C11 atomics */
#endif

@wtarreau
Copy link
Author

Thanks for going to the end of it, Andrew! Great to know that in next version it should work out of the box with no change, our users will be delighted!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants