
Conversation


@masaori335 masaori335 commented Feb 9, 2023

Abstract

Introduce ts::bravo::shared_mutex as an alternative to std::shared_mutex and pthread_rwlock_t.

As we know, std::shared_mutex and pthread_rwlock_t (which are almost equivalent on Linux) don't scale in a multi-threaded application like ATS. BRAVO is a reader-writer lock algorithm published by Dave Dice and Alex Kogan at USENIX ATC 2019 [*1]. The algorithm acts as an accelerator layer for the reader lock; it doesn't accelerate the writer lock, but it adds almost no penalty to it. This makes it useful for read-heavy use cases. For the details of the algorithm, please check the paper.
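At a high level, the reader fast path publishes the reader in a slot of a visible-readers table while a read-bias flag is set, so readers avoid touching the underlying lock; a writer revokes the bias and waits for the published slots to drain before proceeding. Below is a simplified sketch of that idea for illustration only, not the actual ts::bravo code; the slot hashing, the per-instance table, and the std::shared_mutex fallback are assumptions made for the sketch.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <functional>
#include <shared_mutex>
#include <thread>

// Simplified BRAVO-style reader acceleration, for illustration only.
struct BravoSketch {
  static constexpr std::size_t SLOTS = 256;

  std::atomic<bool> read_bias{true};
  std::array<std::atomic<BravoSketch *>, SLOTS> visible_readers{};
  std::shared_mutex underlying; // slow-path reader-writer lock

  // Hypothetical helper: map the calling thread to a slot index.
  static std::size_t slot_for_this_thread() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % SLOTS;
  }

  // Reader fast path: publish this reader in a slot while the bias is set.
  // Returns the slot used, or SIZE_MAX if it fell back to the slow path.
  std::size_t lock_shared() {
    if (read_bias.load(std::memory_order_acquire)) {
      std::size_t i = slot_for_this_thread();
      BravoSketch *expected = nullptr;
      if (visible_readers[i].compare_exchange_strong(expected, this)) {
        if (read_bias.load(std::memory_order_acquire)) {
          return i; // fast path succeeded
        }
        // A writer revoked the bias after we published; undo and fall back.
        visible_readers[i].store(nullptr, std::memory_order_release);
      }
    }
    underlying.lock_shared(); // slow path
    return SIZE_MAX;
  }

  void unlock_shared(std::size_t slot) {
    if (slot != SIZE_MAX) {
      visible_readers[slot].store(nullptr, std::memory_order_release);
    } else {
      underlying.unlock_shared();
    }
  }

  // Writer: take the underlying lock, revoke the bias, and wait for any
  // published fast-path readers to drain.
  void lock() {
    underlying.lock();
    read_bias.store(false, std::memory_order_release);
    for (auto &slot : visible_readers) {
      while (slot.load(std::memory_order_acquire) == this) {
        std::this_thread::yield();
      }
    }
  }

  void unlock() {
    // The real algorithm re-enables the bias based on a timing policy;
    // this sketch simply re-enables it on every unlock.
    read_bias.store(true, std::memory_order_release);
    underlying.unlock();
  }
};
```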

Implementation

puzpuzpuz/xsync's RBMutex[*2] is an implementation of the algorithm in Go. ts::bravo::shared_mutex follows it in C++.

Exclusive locking (writer)

For exclusive locking, ts::bravo::shared_mutex is a drop-in replacement.

ts::bravo::shared_mutex mutex; 
std::lock_guard lock(mutex);

Shared locking (reader)

Shared locking needs ts::bravo::shared_lock to handle the ts::bravo::Token, but it can be used just like std::shared_lock.

ts::bravo::shared_mutex mutex; 
ts::bravo::shared_lock lock(mutex);

Micro Benchmark

benchmark_shared_mutex.cc is a Catch2-based micro-benchmark tool. You can adjust the number of threads and the ratio of reads to writes.

$ taskset -c 0-63 ./benchmark_shared_mutex --ts-nthreads 64 --ts-nloop 100 --ts-nread 1000 --ts-nwrite 1
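For reference, the options above map to a read/write workload per thread. The sketch below only illustrates the shape of that workload; it uses std::shared_mutex as a stand-in and hand-rolled timing, and is not the actual Catch2-based benchmark_shared_mutex.cc.

```cpp
#include <chrono>
#include <cstdio>
#include <shared_mutex>
#include <thread>
#include <vector>

// Illustrative workload only: nread shared acquisitions and nwrite exclusive
// acquisitions per loop, repeated nloop times on each of nthreads threads.
int main() {
  const int nthreads = 64, nloop = 100, nread = 1000, nwrite = 1;
  std::shared_mutex mutex; // stand-in for ts::bravo::shared_mutex

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < nloop; ++i) {
        for (int r = 0; r < nread; ++r) {
          std::shared_lock lock(mutex);
        }
        for (int w = 0; w < nwrite; ++w) {
          std::lock_guard lock(mutex);
        }
      }
    });
  }
  for (auto &th : threads) {
    th.join();
  }
  auto elapsed = std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start);
  std::printf("%f ms\n", elapsed.count());
}
```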

The numbers below were measured on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (64 cores).

read: 1000, write: 0

threads   ts::shared_mutex   ts::bravo::shared_mutex
1         4.36742            1.09271
2         76.5058            1.08926
4         126.713            1.11642
8         232.651            1.18535
16        498.872            1.80362
32        1012.19            2.17977
64        2172.4             5.7694

[Screenshot: graph of the results above]

read: 1000, write: 1

threads   ts::shared_mutex   ts::bravo::shared_mutex
1         4.3714             3.19028
2         59.3886            17.5586
4         132.736            53.6457
8         249.857            112.068
16        586.322            128.144
32        1375.09            227.752
64        5240.33            1301.71

[Screenshot: graph of the results above]

References

[*1] Dave Dice and Alex Kogan. 2019. BRAVO: Biased Locking for Reader-Writer Locks. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC). USENIX Association, Renton, WA, 315–328.
https://www.usenix.org/conference/atc19/presentation/dice

[*2] xsync - Concurrent data structures for Go
https://github.com/puzpuzpuz/xsync


ywkaras commented Feb 10, 2023

Perhaps also compare to #9154.

@bryancall bryancall requested a review from ywkaras February 13, 2023 23:19
std::thread list[conf.nthreads];

for (int i = 0; i < conf.nthreads; i++) {
  new (&list[i]) std::thread{[](T &mutex) {
@ywkaras ywkaras Feb 18, 2023


Why not just:

list[i] = std::thread{...

? std::thread isn't copyable, but it's movable. What you're doing assumes that ~std::thread() is vacuous when the thread is constructed with the default constructor. That's probably true in this case, but it's better practice not to make such assumptions unless it's really necessary.
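A minimal sketch of the suggested move-assignment style (using std::vector instead of the variable-length array in the snippet above; the thread body is a placeholder):

```cpp
#include <thread>
#include <vector>

// std::thread isn't copyable but it is movable, so move-assigning over the
// default-constructed elements avoids the placement-new and any assumption
// about the destructor of the replaced object.
void start_threads(int nthreads) {
  std::vector<std::thread> list(nthreads);
  for (int i = 0; i < nthreads; i++) {
    list[i] = std::thread{[] { /* thread body */ }};
  }
  for (auto &t : list) {
    t.join();
  }
}
```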

Contributor


It seems like the only advantage of using a lambda here is that you avoid having to pick a name for the thread function. I think it would be more readable just using a conventional, named function for the thread function.

On the other hand, @SolidWallOfCode smiles and sighs with contentment whenever anyone uses a lambda, so there is that.
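For comparison, a minimal sketch of the named-function style being suggested (the function name and body are placeholders):

```cpp
#include <functional>
#include <shared_mutex>
#include <thread>

// A conventional, named thread function instead of a lambda.
void reader_body(std::shared_mutex &mutex) {
  std::shared_lock lock(mutex); // per-thread benchmark work goes here
}

int main() {
  std::shared_mutex mutex;
  std::thread t{reader_body, std::ref(mutex)};
  t.join();
}
```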


ywkaras commented Feb 18, 2023

With my ts::scalable_shared_mutex, with 16 threads, it had worse performance than ts::shared_mutex. So maybe check the performance of this mutex with 16 threads?


ywkaras commented Feb 18, 2023

ts::scalable_shared_mutex uses the class DenseThreadId ( https://github.com/apache/trafficserver/pull/9154/files#diff-5d4184d5004b8c73554cb88f0bfe7f0c4711c7394c5a15e24daaaff8962d1076R177 ). It provides a thread ID, for currently running threads, that is in the range from 0 to one less than the number of currently running threads. A new thread may reuse the ID of an exited thread. (This could be a performance issue with dynamic threading, but ATS seems to do little of that.) So dense thread ID % num slots would then be a reasonable hash function, with perhaps fewer collisions. ts::scalable_shared_mutex always assumes there are at least as many slots as currently running threads, which avoids complex logic to deal with collisions. I notice that you hard-coded 4096 as the number of slots. I suspect there are few if any ATS proxies in prod that run optimally with more than 4096 simultaneously active threads.
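A rough sketch of the dense-ID-to-slot mapping being described; this DenseThreadId is a simplified stand-in (it does not reuse the IDs of exited threads), not the actual class from #9154:

```cpp
#include <atomic>
#include <cstddef>

// Stand-in for the DenseThreadId idea: each running thread gets an ID in
// [0, running_threads), so id % num_slots (or id directly, when there are
// at least as many slots as threads) maps threads to slots without collisions.
class DenseThreadId {
public:
  static std::size_t self() {
    // Simplified: assigns IDs monotonically and never reuses them.
    thread_local std::size_t id = next_id.fetch_add(1);
    return id;
  }

private:
  static inline std::atomic<std::size_t> next_id{0};
};

constexpr std::size_t NUM_SLOTS = 4096;

std::size_t slot_for_current_thread() {
  return DenseThreadId::self() % NUM_SLOTS;
}
```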

using time_point = std::chrono::time_point<std::chrono::system_clock>;

#ifdef __cpp_lib_hardware_interference_size
using std::hardware_constructive_interference_size;
Contributor


It seems like this constant is only available in newer versions of gcc and clang. We could maybe get the cache line size out of /proc/cpuinfo when the target is Linux?

wkaras ~
$ grep -e cache_alignment -e processor /proc/cpuinfo | head -n 4
processor	: 0
cache_alignment	: 64
processor	: 1
cache_alignment	: 64
wkaras ~
$
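A sketch of the kind of fallback discussed above: use the standard constant when the feature-test macro is defined, otherwise fall back to a fixed 64-byte value (matching the cache_alignment shown above, but still an assumption for other targets); parsing /proc/cpuinfo at runtime would be another option.

```cpp
#include <cstddef>
#include <new>

// Compile-time cache line size with a fallback for toolchains that don't
// provide the interference-size constants.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_size = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t cache_line_size = 64; // assumption; see /proc/cpuinfo above
#endif

// Example use: pad each reader slot to its own cache line to avoid false sharing.
struct alignas(cache_line_size) PaddedSlot {
  void *owner = nullptr;
};
```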


@masaori335
Contributor Author

> With my ts::scalable_shared_mutex, with 16 threads, it had worse performance than ts::shared_mutex. So maybe check the performance of this mutex with 16 threads?

Please look at the table and graph above. The benchmark is measured with 1, 2, 4, 8, 16, 32, and 64 threads. ts::bravo::shared_mutex shows better performance even with fewer threads.

> ts::scalable_shared_mutex uses the class DenseThreadId...

IIUC, reducing collisions between threads by using the hash function (mix32) is one of the key ideas of BRAVO.
However, for ATS, we might be able to reduce the number of slots with your DenseThreadId, because we don't use dynamic threading. I'll try it and measure the performance.
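For reference, a sketch of a mix32-style hash of the kind mentioned (these particular constants are the MurmurHash3 32-bit finalizer; whether ts::bravo uses exactly this variant is an assumption):

```cpp
#include <cstdint>

// A mix32-style finalizer used to scatter thread identifiers over the
// reader slots and reduce collisions.
inline uint32_t mix32(uint32_t h) {
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}

inline uint32_t slot_index(uint32_t thread_id, uint32_t num_slots) {
  return mix32(thread_id) % num_slots;
}
```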


ywkaras commented Feb 21, 2023

It would be good if Apple people looked at this, maybe @vmamidi or @cmcfarlen. Apple was talking about having thread_local HostDBs. Maybe if HostDB had a faster shared_mutex, that complexity could be avoided. Yahoo has deprioritized exploration of the use of proxy hosts with many cores, so I won't get many brownie points for giving this the attention that it deserves.


ywkaras commented Feb 21, 2023

Here is a unit test I wrote in a previous effort to build a better shared_mutex: https://github.com/ywkaras/trafficserver/blob/OWMR/src/tscore/unit_tests/test_OneWriterMultiReader.cc .


masaori335 commented Feb 22, 2023

The micro-benchmark with DenseThreadId is below. We can go with it because it has no collisions and lets us reduce the slot size.

read: 1000, write: 0

threads   ts::bravo::shared_mutex (DenseThreadId)
1         1.02784
2         1.04234
4         1.06245
8         1.14418
16        1.33218
32        2.40103
64        3.26439

read: 1000, write: 1

threads   ts::bravo::shared_mutex (DenseThreadId)
1         1.25493
2         47.5607
4         85.9465
8         130.392
16        106.654
32        237.198
64        1259.23

[Screenshot: graph of the results above]


ywkaras commented Feb 22, 2023

It looks like there are still a few areas that need improvement. But I think the next step is to agree on what testing is needed to verify proper, logically correct locking. If you think your current logic testing is adequate, I think we should get at least one more reviewer to break the tie.


ywkaras commented Feb 26, 2023

A common problem with performance testing is that, since the code is executed only to time it, not to use any results, the optimizer may remove some or all of the code you're actually trying to time. One way to double-check is to use 'objdump -d' to disassemble and check whether there are clearly too few assembly instructions corresponding to the source code.

@masaori335
Contributor Author

For the logic, having more complicated tests would be better, but the current tests have enough coverage to start testing this reader-writer lock widely. Let's get more eyeballs on it.

For the benchmark, I noticed Catch2 Benchmark provides an option[*1] (returning a value from the benchmarked function) for exactly this optimizer issue, and I followed that approach. Also, I checked with objdump -d and ran the benchmark under lldb in "disassembly mode"[*2]; the timed code still contains its instructions.

[*1] https://github.com/catchorg/Catch2/blob/v2.13.8/docs/benchmarks.md#the-optimizer
[*2] https://github.com/llvm/llvm-project/blob/main/lldb/examples/python/disassembly_mode.py
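A minimal sketch of the Catch2 pattern referenced in [*1]: returning the computed value from the benchmark body so the optimizer cannot discard the timed work (the work() function is a placeholder):

```cpp
#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include <catch2/catch.hpp>

static long work()
{
  long sum = 0;
  for (long i = 0; i < 1000; ++i) {
    sum += i; // stand-in for the lock/unlock work being timed
  }
  return sum;
}

TEST_CASE("optimizer-resistant benchmark")
{
  // Returning a value from the benchmark body makes Catch2 consume the
  // result, so the optimizer cannot discard the timed computation.
  BENCHMARK("work") { return work(); };
}
```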

@bryancall bryancall self-requested a review March 13, 2023 22:17
@lzx404243
Collaborator

I created a test build replacing all RW locks (std::shared_mutex, ts::shared_mutex, ink_rwlock) with ts::bravo::shared_mutex and ran it on a prod box. I saw no performance degradation or crashes.

Comment on lines +332 to +334
struct Mutex {
  std::atomic<bool> read_bias = false;
  std::array<Slot, SLOT_SIZE> readers = {};

@lzx404243 lzx404243 Mar 21, 2023


It seems that, unlike the original BRAVO where a global reader table is used, your implementation has a reader table per lock instance. This will increase the per-lock memory footprint, with a potential performance benefit. It may be worth benchmarking the memory usage as well to see whether the trade-off is worthwhile/acceptable.

Contributor Author


Right, I followed the approach of xsync's RBMutex.

https://github.com/puzpuzpuz/xsync/blob/main/rbmutex.go
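For a rough sense of the per-lock footprint being discussed, assuming the hard-coded 4096 slots and one 64-byte cache line per slot (both assumptions about the layout, not measurements):

```cpp
#include <cstddef>
#include <cstdio>

// Back-of-the-envelope footprint of a per-lock reader table, assuming each
// slot is padded to a 64-byte cache line and SLOT_SIZE is 4096.
constexpr std::size_t SLOT_SIZE      = 4096;
constexpr std::size_t BYTES_PER_SLOT = 64;
constexpr std::size_t TABLE_BYTES    = SLOT_SIZE * BYTES_PER_SLOT; // 262144 bytes = 256 KiB

int main() {
  std::printf("per-lock reader table: %zu KiB\n", TABLE_BYTES / 1024);
}
```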

@bryancall bryancall requested a review from bneradt April 10, 2023 22:36
@ezelkow1
Member

[approve ci centos]


ywkaras commented May 22, 2023

Masaori and I don't agree as to whether this needs a longer unit test, so I'm bowing out and leaving it to other reviewers to break the tie.

@masaori335 masaori335 merged commit c331c72 into apache:master May 22, 2023
