
Conversation


@masaori335 masaori335 commented Feb 9, 2023

Abstract

Introduce ts::bravo::shared_mutex as an alternative to std::shared_mutex and pthread_rwlock_t.

As we know, std::shared_mutex and pthread_rwlock_t (which are almost equivalent on Linux) don't scale in a multi-threaded application like ATS. BRAVO is a reader-writer lock algorithm published by Dave Dice and Alex Kogan at USENIX ATC 2019 [*1]. The algorithm acts as an accelerator layer for the reader lock; it doesn't accelerate the writer lock, but it adds almost no penalty to it. This makes it useful for read-heavy use cases. For the details of the algorithm, please check the paper.
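At a high level, the reader fast path publishes the reader in a slot of a visible-readers table while a read-bias flag is set, so readers avoid touching the underlying lock; a writer revokes the bias and waits for the published slots to drain before proceeding. Below is a simplified sketch of that idea for illustration only, not the actual ts::bravo code; the slot hashing, the per-instance table, and the std::shared_mutex fallback are assumptions made for the sketch.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <functional>
#include <shared_mutex>
#include <thread>

// Simplified BRAVO-style reader acceleration, for illustration only.
struct BravoSketch {
  static constexpr std::size_t SLOTS = 256;

  std::atomic<bool> read_bias{true};
  std::array<std::atomic<BravoSketch *>, SLOTS> visible_readers{};
  std::shared_mutex underlying; // slow-path reader-writer lock

  // Hypothetical helper: map the calling thread to a slot index.
  static std::size_t slot_for_this_thread() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % SLOTS;
  }

  // Reader fast path: publish this reader in a slot while the bias is set.
  // Returns the slot used, or SIZE_MAX if it fell back to the slow path.
  std::size_t lock_shared() {
    if (read_bias.load(std::memory_order_acquire)) {
      std::size_t i = slot_for_this_thread();
      BravoSketch *expected = nullptr;
      if (visible_readers[i].compare_exchange_strong(expected, this)) {
        if (read_bias.load(std::memory_order_acquire)) {
          return i; // fast path succeeded
        }
        // A writer revoked the bias after we published; undo and fall back.
        visible_readers[i].store(nullptr, std::memory_order_release);
      }
    }
    underlying.lock_shared(); // slow path
    return SIZE_MAX;
  }

  void unlock_shared(std::size_t slot) {
    if (slot != SIZE_MAX) {
      visible_readers[slot].store(nullptr, std::memory_order_release);
    } else {
      underlying.unlock_shared();
    }
  }

  // Writer: take the underlying lock, revoke the bias, and wait for any
  // published fast-path readers to drain.
  void lock() {
    underlying.lock();
    read_bias.store(false, std::memory_order_release);
    for (auto &slot : visible_readers) {
      while (slot.load(std::memory_order_acquire) == this) {
        std::this_thread::yield();
      }
    }
  }

  void unlock() {
    // The real algorithm re-enables the bias based on a timing policy;
    // this sketch simply re-enables it on every unlock.
    read_bias.store(true, std::memory_order_release);
    underlying.unlock();
  }
};
```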

Implementation

puzpuzpuz/xsync's RBMutex[*2] is an implementation of the algorithm in Go. ts::bravo::shared_mutex follows it in C++.

Exclusive locking (writer)

For exclusive locking, ts::bravo::shared_mutex is a drop-in replacement.

ts::bravo::shared_mutex mutex; 
std::lock_guard lock(mutex);

Shared locking (reader)

Shared locking needs ts::bravo::shared_lock to handle the ts::bravo::Token, but it can be used just like std::shared_lock.

ts::bravo::shared_mutex mutex; 
ts::bravo::shared_lock lock(mutex);

Micro Benchmark

benchmark_shared_mutex.cc is a Catch2-based micro-benchmark tool. You can adjust the number of threads and the ratio of reads to writes.

$ taskset -c 0-63 ./benchmark_shared_mutex --ts-nthreads 64 --ts-nloop 100 --ts-nread 1000 --ts-nwrite 1
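For reference, the options above map to a read/write workload per thread. The sketch below only illustrates the shape of that workload; it uses std::shared_mutex as a stand-in and hand-rolled timing, and is not the actual Catch2-based benchmark_shared_mutex.cc.

```cpp
#include <chrono>
#include <cstdio>
#include <shared_mutex>
#include <thread>
#include <vector>

// Illustrative workload only: nread shared acquisitions and nwrite exclusive
// acquisitions per loop, repeated nloop times on each of nthreads threads.
int main() {
  const int nthreads = 64, nloop = 100, nread = 1000, nwrite = 1;
  std::shared_mutex mutex; // stand-in for ts::bravo::shared_mutex

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < nloop; ++i) {
        for (int r = 0; r < nread; ++r) {
          std::shared_lock lock(mutex);
        }
        for (int w = 0; w < nwrite; ++w) {
          std::lock_guard lock(mutex);
        }
      }
    });
  }
  for (auto &th : threads) {
    th.join();
  }
  auto elapsed = std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start);
  std::printf("%f ms\n", elapsed.count());
}
```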

The numbers below were measured on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (64 cores).

read: 1000, write: 0

threads   ts::shared_mutex   ts::bravo::shared_mutex
1         4.36742            1.09271
2         76.5058            1.08926
4         126.713            1.11642
8         232.651            1.18535
16        498.872            1.80362
32        1012.19            2.17977
64        2172.4             5.7694

[Screenshot: graph of the results above]

read: 1000, write: 1

threads   ts::shared_mutex   ts::bravo::shared_mutex
1         4.3714             3.19028
2         59.3886            17.5586
4         132.736            53.6457
8         249.857            112.068
16        586.322            128.144
32        1375.09            227.752
64        5240.33            1301.71

[Screenshot: graph of the results above]

References

[*1] Dave Dice and Alex Kogan. 2019. BRAVO: Biased Locking for Reader-Writer Locks. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC). USENIX Association, Renton, WA, 315–328.
https://www.usenix.org/conference/atc19/presentation/dice

[*2] xsync - Concurrent data structures for Go
https://github.com/puzpuzpuz/xsync


ywkaras commented Feb 10, 2023

Perhaps also compare to #9154.

@bryancall bryancall requested a review from ywkaras February 13, 2023 23:19
std::thread list[conf.nthreads];

for (int i = 0; i < conf.nthreads; i++) {
  new (&list[i]) std::thread{[](T &mutex) {
@ywkaras ywkaras Feb 18, 2023


Why not just:

list[i] = std::thread{...

? std::thread isn't copyable, but it's movable. What you're doing assumes that ~std::thread() is vacuous when the thread is constructed with the default constructor. That's probably true in this case, but it's better practice not to make such assumptions unless it's really necessary.
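A minimal sketch of the suggested move-assignment style (using std::vector instead of the variable-length array in the snippet above; the thread body is a placeholder):

```cpp
#include <thread>
#include <vector>

// std::thread isn't copyable but it is movable, so move-assigning over the
// default-constructed elements avoids the placement-new and any assumption
// about the destructor of the replaced object.
void start_threads(int nthreads) {
  std::vector<std::thread> list(nthreads);
  for (int i = 0; i < nthreads; i++) {
    list[i] = std::thread{[] { /* thread body */ }};
  }
  for (auto &t : list) {
    t.join();
  }
}
```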

Contributor


It seems like the only advantage of using a lambda here is that you avoid having to pick a name for the thread function. I think it would be more readable just using a conventional, named function for the thread function.

On the other hand, @SolidWallOfCode smiles and sighs with contentment whenever anyone uses a lambda, so there is that.
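For comparison, a minimal sketch of the named-function style being suggested (the function name and body are placeholders):

```cpp
#include <functional>
#include <shared_mutex>
#include <thread>

// A conventional, named thread function instead of a lambda.
void reader_body(std::shared_mutex &mutex) {
  std::shared_lock lock(mutex); // per-thread benchmark work goes here
}

int main() {
  std::shared_mutex mutex;
  std::thread t{reader_body, std::ref(mutex)};
  t.join();
}
```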


ywkaras commented Feb 18, 2023

With my ts::scalable_shared_mutex, with 16 threads, it had worse performance than ts::shared_mutex. So maybe check the performance of this mutex with 16 threads?


ywkaras commented Feb 18, 2023

ts::scalable_shared_mutex uses the class DenseThreadId ( https://github.com/apache/trafficserver/pull/9154/files#diff-5d4184d5004b8c73554cb88f0bfe7f0c4711c7394c5a15e24daaaff8962d1076R177 ). It provides a thread ID, for currently running threads, that is in the range from 0 to one less than the number of currently running threads. A new thread may reuse the ID of an exited thread. (This could be a performance issue with dynamic threading, but ATS seems to do little of that.) So dense thread ID % num slots would then be a reasonable hash function, with perhaps fewer collisions. ts::scalable_shared_mutex always assumes there are at least as many slots as currently running threads, which avoids complex logic to deal with collisions. I notice that you hard-coded 4096 as the number of slots. I suspect there are few if any ATS proxies in prod that run optimally with more than 4096 simultaneously active threads.
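A rough sketch of the dense-ID-to-slot mapping being described; this DenseThreadId is a simplified stand-in (it does not reuse the IDs of exited threads), not the actual class from #9154:

```cpp
#include <atomic>
#include <cstddef>

// Stand-in for the DenseThreadId idea: each running thread gets an ID in
// [0, running_threads), so id % num_slots (or id directly, when there are
// at least as many slots as threads) maps threads to slots without collisions.
class DenseThreadId {
public:
  static std::size_t self() {
    // Simplified: assigns IDs monotonically and never reuses them.
    thread_local std::size_t id = next_id.fetch_add(1);
    return id;
  }

private:
  static inline std::atomic<std::size_t> next_id{0};
};

constexpr std::size_t NUM_SLOTS = 4096;

std::size_t slot_for_current_thread() {
  return DenseThreadId::self() % NUM_SLOTS;
}
```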

using time_point = std::chrono::time_point<std::chrono::system_clock>;

#ifdef __cpp_lib_hardware_interference_size
using std::hardware_constructive_interference_size;
Contributor


It seems like this constant is only available in newer versions of gcc and clang. We could maybe get the cache line size out of /proc/cpuinfo when the target is Linux?

wkaras ~
$ grep -e cache_alignment -e processor /proc/cpuinfo | head -n 4
processor	: 0
cache_alignment	: 64
processor	: 1
cache_alignment	: 64
wkaras ~
$
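A sketch of the kind of fallback discussed above: use the standard constant when the feature-test macro is defined, otherwise fall back to a fixed 64-byte value (matching the cache_alignment shown above, but still an assumption for other targets); parsing /proc/cpuinfo at runtime would be another option.

```cpp
#include <cstddef>
#include <new>

// Compile-time cache line size with a fallback for toolchains that don't
// provide the interference-size constants.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_size = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t cache_line_size = 64; // assumption; see /proc/cpuinfo above
#endif

// Example use: pad each reader slot to its own cache line to avoid false sharing.
struct alignas(cache_line_size) PaddedSlot {
  void *owner = nullptr;
};
```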


@masaori335
Contributor Author

> With my ts::scalable_shared_mutex, with 16 threads, it had worse performance than ts::shared_mutex. So maybe check the performance of this mutex with 16 threads?

Please look at the table and graph above. The benchmark is measured with 1, 2, 4, 8, 16, 32, and 64 threads. ts::bravo::shared_mutex shows better performance even with fewer threads.

> ts::scalable_shared_mutex uses the class DenseThreadId...

IIUC, reducing collisions between threads by using the hash function (mix32) is one of the key ideas of BRAVO.
However, for ATS, we might be able to reduce the number of slots with your DenseThreadId, because we don't use dynamic threading. I'll try it and measure the performance.
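For reference, a sketch of a mix32-style hash of the kind mentioned (these particular constants are the MurmurHash3 32-bit finalizer; whether ts::bravo uses exactly this variant is an assumption):

```cpp
#include <cstdint>

// A mix32-style finalizer used to scatter thread identifiers over the
// reader slots and reduce collisions.
inline uint32_t mix32(uint32_t h) {
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}

inline uint32_t slot_index(uint32_t thread_id, uint32_t num_slots) {
  return mix32(thread_id) % num_slots;
}
```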


ywkaras commented Feb 21, 2023

It would be good if Apple people looked at this, maybe @vmamidi or @cmcfarlen. Apple was talking about having thread_local HostDBs. Maybe if HostDB had a faster shared_mutex, that complexity could be avoided. Yahoo has deprioritized exploration of the use of proxy hosts with many cores, so I won't get many brownie points for giving this the attention that it deserves.


ywkaras commented Feb 21, 2023

Here is a unit test I wrote in a previous effort to build a better shared_mutex: https://github.com/ywkaras/trafficserver/blob/OWMR/src/tscore/unit_tests/test_OneWriterMultiReader.cc .


masaori335 commented Feb 22, 2023

The micro-benchmark with DenseThreadId is below. We can go with it because it has no collisions and lets us reduce the slot size.

read: 1000, write: 0

threads   ts::bravo::shared_mutex (DenseThreadId)
1         1.02784
2         1.04234
4         1.06245
8         1.14418
16        1.33218
32        2.40103
64        3.26439

read: 1000, write: 1

threads   ts::bravo::shared_mutex (DenseThreadId)
1         1.25493
2         47.5607
4         85.9465
8         130.392
16        106.654
32        237.198
64        1259.23

[Screenshot: graph of the results above]


ywkaras commented Feb 22, 2023

It looks like there are still a few areas that need improvement. But I think the next step is to agree on what testing is needed to verify proper, logically correct locking. If you think your current logic testing is adequate, I think we should get at least one more reviewer to break the tie.


ywkaras commented Feb 26, 2023

A common problem with performance testing is that, since the code is executed only to time it, not to use any results, the optimizer may remove some or all of the code you're actually trying to time. One way to double-check is to use 'objdump -d' to disassemble and check whether there are clearly too few assembly instructions corresponding to the source code.

@masaori335
Contributor Author

For the logic, having more complicated tests would be better, but the current tests have enough coverage to start testing this reader-writer lock widely. Let's get more eyeballs on it.

For the benchmark, I noticed Catch2 Benchmark provides an option[*1] (returning a value from the benchmarked function) for exactly this optimizer issue, and I followed that approach. Also, I checked with objdump -d and ran the benchmark under lldb in "disassembly mode"[*2]; the timed code still contains its instructions.

[*1] https://github.com/catchorg/Catch2/blob/v2.13.8/docs/benchmarks.md#the-optimizer
[*2] https://github.com/llvm/llvm-project/blob/main/lldb/examples/python/disassembly_mode.py
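A minimal sketch of the Catch2 pattern referenced in [*1]: returning the computed value from the benchmark body so the optimizer cannot discard the timed work (the work() function is a placeholder):

```cpp
#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include <catch2/catch.hpp>

static long work()
{
  long sum = 0;
  for (long i = 0; i < 1000; ++i) {
    sum += i; // stand-in for the lock/unlock work being timed
  }
  return sum;
}

TEST_CASE("optimizer-resistant benchmark")
{
  // Returning a value from the benchmark body makes Catch2 consume the
  // result, so the optimizer cannot discard the timed computation.
  BENCHMARK("work") { return work(); };
}
```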

@bryancall bryancall self-requested a review March 13, 2023 22:17
@lzx404243
Collaborator

I created a test build replacing all RW locks (std::shared_mutex, ts::shared_mutex, ink_rwlock) with ts::bravo::shared_mutex and ran it on a prod box. I saw no performance degradation or crashes.

Comment on lines +332 to +334
struct Mutex {
  std::atomic<bool> read_bias = false;
  std::array<Slot, SLOT_SIZE> readers = {};

@lzx404243 lzx404243 Mar 21, 2023


It seems that, unlike the original BRAVO where a global reader table is used, your implementation has a reader table per lock instance. This will increase the per-lock memory footprint, with a potential performance benefit. It may be worth benchmarking the memory usage as well to see whether the trade-off is worthwhile/acceptable.

Contributor Author


Right, I followed the approach of xsync's RBMutex.

https://github.com/puzpuzpuz/xsync/blob/main/rbmutex.go
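For a rough sense of the per-lock footprint being discussed, assuming the hard-coded 4096 slots and one 64-byte cache line per slot (both assumptions about the layout, not measurements):

```cpp
#include <cstddef>
#include <cstdio>

// Back-of-the-envelope footprint of a per-lock reader table, assuming each
// slot is padded to a 64-byte cache line and SLOT_SIZE is 4096.
constexpr std::size_t SLOT_SIZE      = 4096;
constexpr std::size_t BYTES_PER_SLOT = 64;
constexpr std::size_t TABLE_BYTES    = SLOT_SIZE * BYTES_PER_SLOT; // 262144 bytes = 256 KiB

int main() {
  std::printf("per-lock reader table: %zu KiB\n", TABLE_BYTES / 1024);
}
```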

@bryancall bryancall requested a review from bneradt April 10, 2023 22:36
@ezelkow1
Member

[approve ci centos]


ywkaras commented May 22, 2023

Masaori and I don't agree as to whether this needs a longer unit test, so I'm bowing out and leaving it to other reviewers to break the tie.

@masaori335 masaori335 merged commit c331c72 into apache:master May 22, 2023
