A lot of modern applications are multi-threaded to improve performance and utilise the computational power available. However, a lot of programs work on shared resources (e.g. files, data structures) that must have safeguards to prevent concurrency bugs from occurring. The contention between threads to access these resources can hamper the benefits of parallel programming, even degrading the performance of an application in certain scenarios.
Intuition will often suggest that having more threads working on a task will decrease the execution time of it, as tasks can be distributed to several parallel workers. Examine the following three implementations of a function that increments a counter
a given number of times.
The first implementation is for a single thread and does not use any concurrency primitives:
static void Increment(int &counter, int iterations) {
for (int i = 0; i < iterations; i++) {
counter++;
}
}
This multi-threaded implementation of the Increment
function uses atomic
variables to prevent multiple threads reading/writing to the variable simultaneously, using "memory barriers" to enforce a certain ordering when accessing the counter
which prevents data races:
static void IncrementAtomic(std::atomic<int> &counter, int iterations) {
for (int i = 0; i < iterations; i++) {
counter.fetch_add(1);
}
}
This alternative multi-threaded implementation uses a mutex
(otherwise known as a "lock"), which only allows the thread that "holds" the mutex to operate on the counter:
static void IncrementMutex(int &counter, int iterations, std::mutex &mtx) {
for (int i = 0; i < iterations; i++) {
mtx.lock();
counter++;
mtx.unlock();
}
}
Benchmarking these functions shows considerable performance degradation when having more than one thread working on the task. These were the results when using these functions to increment a counter 10 million times:
Function | Execution Time (ns) |
---|---|
Increment (one thread) | 12713615 |
IncrementAtomic (one thread) | 54969640 |
IncrementAtomic (two threads) | 151883027 |
IncrementMutex (one thread) | 162426559 |
IncrementMutex (two threads) | 472225654 |
Contrary to the idea that more threads corresponds to increased performance, the results show that having two threads compete for a single resource caused the execution time to triple for both IncrementAtomic
and IncrementMutex
. Not only that, significant overhead is induced from having additional concurrency primitives protecting the counter
variable from creating a data race.
The broader issue that we are observing in the Increment
implementations is write-contention - two threads are attempting to modify a variable, but to preserve the correctness of the data concurrency primitives must be used to maintain "mutual exclusion". In other examples, read-contention may also be an issue though the overheads associated with it are less pronounced. Read/write locks are designed to give threads access to variables whenever possible, restricting access to variables unless absolutely necessary.
This level of overhead is significant for low-latency applications, and there have been designs to minimise contention where possible. A common design pattern is to have a single threaded "producer" which writes to a data structure. As demonstrated in the previous section, having a single thread writing to data storage performs better than multiple threads that need to manage mutual exclusion between themselves. The "consumers" only need read-access, so they are able to concurrently read objects without a concern for exclusivity.
The LMAX Disruptor is an example of an open-source concurrent structure that serves as an alternative to queues, which are limited when used in designs with producers/consumers and their data. More information about the design (as well as locks and concurrency) can be found in the technical paper.
If an application must use mutexes/locks and atomic operations, it is important to use the synchronisation primitives that are most suitable for the task. For example, spinlocks are an implementation of a lock that are useful for short critical sections (a section of code executed by multiple threads where different sequences of execution affects the result) and/or when the lock will not often be contested... However, the worst case behaviour of spinlocks are extremely expensive and can cause massive overheads when used incorrectly.
An alternative to a spinlocks (among many others) are sleeping locks, which ask the OS to put the thread to sleep if it does not initially obtain the lock. When the lock is freed, the thread is woken back up using a system call. The problem is that the system calls to make the thread sleep are expensive, making this solution suboptimal for locks that are not held for long periods of time.
There are lock implementations that combine the advantages of both spinlocks and sleeping locks (e.g. futexes), otherwise known as hybrid locks. It is important for the programmer to understand the advantages and disadvantages with each lock to choose the most appropriate one for each use case.