Added missing bli_init_once() in bli_thread API.
Details:
- Fixed an issue with specifying threading globally at runtime via
  bli_thread_set_num_threads() (the automatic way) or via
  bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
  also affected. These functions were not calling bli_init_once() prior
  to acting, and therefore their effects on the global rntm_t structure
  were being wiped out by the eventual call to bli_init_once(), by some
  other BLIS function. Thanks to Ali Emre Gülcü for reporting the
  behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
  choosing between OpenMP and pthreads, and specifying affinity via
  OpenMP.
- CREDITS file update.
fgvanzee committed Dec 18, 2018
1 parent f808d82 commit 93d5631
Showing 3 changed files with 57 additions and 0 deletions.
1 change: 1 addition & 0 deletions CREDITS
@@ -25,6 +25,7 @@ but many others have contributed code and feedback, including
Richard Goldschmidt @SuperFluffy
Chris Goodyer
John Gunnels (IBM, T.J. Watson Research Center)
Ali Emre Gülcü @Lephar
Jeff Hammond @jeffhammond (Intel)
Jacob Gorm Hansen @jacobgorm
Jean-Michel Hautbois @jhautbois
47 changes: 47 additions & 0 deletions docs/Multithreading.md
@@ -1,8 +1,13 @@
# Contents

## Choosing OpenMP vs pthreads
## Specifying thread-to-core affinity

* **[Contents](Multithreading.md#contents)**
* **[Introduction](Multithreading.md#introduction)**
* **[Enabling multithreading](Multithreading.md#enabling-multithreading)**
* [Choosing OpenMP vs pthreads](Multithreading.md#choosing-openmp-vs-pthreads)
* [Specifying thread-to-core affinity](Multithreading.md#specifying-thread-to-core-affinity)
* **[Specifying multithreading](Multithreading.md#specifying-multithreading)**
* [Globally via environment variables](Multithreading.md#globally-via-environment-variables)
* [The automatic way](Multithreading.md#environment-variables-the-automatic-way)
@@ -46,6 +51,46 @@ For more complete and up-to-date information on the `--enable-threading` option,
$ ./configure --help
```

## Choosing OpenMP vs pthreads

While we provide the ability to implement multithreading in BLIS in terms of either OpenMP or pthreads, we typically encourage users to opt for OpenMP:
```
$ ./configure -t openmp auto
```
The reason mostly comes down to the fact that most OpenMP implementations (most notably GNU) allow the user to conveniently bind threads to cores via one or more environment variables set prior to running the application. This is important because when the operating system causes a thread to migrate from one core to another, the thread will typically leave behind the data it was using in the L1 and L2 caches. That data may not be present in the caches of the destination core. Once the thread resumes execution on the new core, it will experience a period of frequent cache misses as the data it was previously using is transmitted once again through the cache hierarchy. If migration happens frequently enough, it can pose a significant (and unnecessary) drag on performance.

Note that binding threads to cores is possible in pthreads, but it requires a runtime call to the operating system, such as `sched_setaffinity()`, to convey the thread binding information, and BLIS does not yet implement this behavior for pthreads.
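
To make the "runtime call to the operating system" concrete, here is a minimal, Linux-specific sketch of pinning the calling thread with `sched_setaffinity()`. This is not something BLIS does; the helper name `pin_calling_thread_to_core()` is hypothetical, and the sketch assumes glibc with `_GNU_SOURCE` enabled.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for cpu_set_t macros and sched_setaffinity() */
#endif
#include <sched.h>

/* Hypothetical helper: bind the calling thread to a single core.
   Returns 0 on success and -1 on failure. */
int pin_calling_thread_to_core(int core_id)
{
    cpu_set_t mask;

    if (core_id < 0 || core_id >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);

    /* A pid of 0 refers to the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

Each worker thread would have to make a call like this itself (or be pinned by its creator), which is the bookkeeping that the OpenMP environment variables take care of on the user's behalf.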

## Specifying thread-to-core affinity

The solution to thread migration is setting *processor affinity*. In this context, affinity refers to the tendency for a thread to remain bound to a particular compute core. There are at least two ways to set affinity in OpenMP. The first way offers more control, but requires you to understand a bit about the processor topology and how core IDs are mapped to physical cores, while the second way is simpler but less powerful.

Let's start with an example. Suppose I have a two-socket system with a total of eight cores, four cores per socket. By setting `GOMP_CPU_AFFINITY` as follows
```
$ export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
```
I am communicating to OpenMP that the first thread to be created should be spawned on core 0, from which it should not migrate. The second thread to be created should be spawned on core 1, from which it should not migrate, and so forth. If socket 0 has cores 0-3 and socket 1 has 4-7, this would result in the first four threads on socket 0 and the second four threads on socket 1. (And if more than eight threads are spawned, the mapping wraps back around, starting from the beginning.) So with `GOMP_CPU_AFFINITY`, you are doing more than just preventing threads from migrating once they are spawned--you are specifying the cores on which they will be spawned in the first place.

Another example: Suppose the hardware numbers the cores alternatingly between sockets, such that socket 0 gets even-numbered cores and socket 1 gets odd-numbered cores. In such a scenario, you might want to use `GOMP_CPU_AFFINITY` as follows
```
$ export GOMP_CPU_AFFINITY="0 2 4 6 1 3 5 7"
```
Because the first four entries are `0 2 4 6`, threads 0-3 would be spawned on the first socket, since that is where cores 0, 2, 4, and 6 are located. Similarly, the subsequent `1 3 5 7` would cause threads 4-7 to be spawned on the second socket, since that is where cores 1, 3, 5, and 7 reside. Of course, setting `GOMP_CPU_AFFINITY` in this way implies that BLIS benefits from this kind of grouping of threads--which, generally, it does. As a general rule, you should try to fill up a socket with one thread per core before moving to the next socket.
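
The thread-to-core mapping described in these two examples, including the wraparound, can be modeled in a few lines of C. This is only an illustrative sketch: `core_for_thread()` is a hypothetical helper, and it accepts just the plain space-separated form used above, not the range or stride syntax that `GOMP_CPU_AFFINITY` also supports.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: given a plain whitespace-separated core list such as
   "0 2 4 6 1 3 5 7", return the core that the thread with index thread_idx
   would be bound to, wrapping around when there are more threads than
   entries. Returns -1 if the list is empty, too long, or not a plain list
   of non-negative integers. */
int core_for_thread(const char *affinity_list, int thread_idx)
{
    int cores[256];
    int n = 0;
    const char *p = affinity_list;

    while (n < 256) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p) break;        /* no further number could be parsed */
        if (v < 0) return -1;       /* rejects input such as "0-3"       */
        cores[n++] = (int)v;
        p = end;
    }

    /* Anything left other than trailing whitespace means bad input. */
    while (isspace((unsigned char)*p)) ++p;
    if (*p != '\0' || n == 0 || thread_idx < 0)
        return -1;

    return cores[thread_idx % n];
}
```

With the list `"0 2 4 6 1 3 5 7"`, thread 3 maps to core 6 and thread 4 maps to core 1, and a ninth thread (index 8) wraps back around to core 0.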

A second method of specifying affinity is via `OMP_PROC_BIND`, which is much simpler to set:
```
$ export OMP_PROC_BIND=close
```
This binds the threads close to the master thread, in contiguous "place" partitions. (There are other valid values aside from `close`.) Places are specified by another variable, `OMP_PLACES`:
```
$ export OMP_PLACES=cores
```
The `cores` value is most appropriate for BLIS since we usually want to ignore hardware threads (simultaneous multithreading, or "hyperthreading" on Intel systems) and instead map threads to physical cores.

Setting these two variables is often enough. However, it does not offer the level of control that `GOMP_CPU_AFFINITY` does. Sometimes it takes some experimentation to determine whether a particular mapping is performing as expected. If multithreaded performance on eight cores is only twice the observed single-threaded performance, the affinity mapping may be to blame. But if performance is six or seven times higher than sequential execution, then the mapping you chose is probably working fine.
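
The rule of thumb above is plain arithmetic, sketched here in C. The function names are hypothetical, and the rates can be in any unit (GFLOPS, for example) as long as both measurements use the same one.

```c
/* Speedup of the multithreaded run over the single-threaded run. */
double speedup(double single_rate, double multi_rate)
{
    return multi_rate / single_rate;
}

/* Fraction of ideal linear scaling achieved on n_cores cores. */
double parallel_efficiency(double single_rate, double multi_rate, int n_cores)
{
    return speedup(single_rate, multi_rate) / (double)n_cores;
}
```

A 2x speedup on eight cores corresponds to an efficiency of 0.25, while a 7x speedup corresponds to 0.875.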

Unfortunately, a full treatment of thread-to-core affinity is well beyond the scope of this document. (A web search will uncover many [great resources](http://www.nersc.gov/users/software/programming-models/openmp/process-and-thread-affinity/) discussing the use of [GOMP_CPU_AFFINITY](https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html) and [OMP_PROC_BIND](https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html#OMP_005fPROC_005fBIND).) It's up to you to determine an appropriate affinity mapping, and then choose your preferred method of expressing that mapping to the OpenMP implementation.


# Specifying multithreading

@@ -59,6 +104,8 @@ This pattern--automatic or manual--holds regardless of which of the three method

Regardless of which method is employed, and which specific way within each method, after setting the number of threads, the application may call the desired level-3 operation (via either the [typed API](docs/BLISTypedAPI.md) or the [object API](docs/BLISObjectAPI.md)) and the operation will execute in a multithreaded manner. (When calling BLIS via the BLAS API, only the first two (global) methods are available.)

NOTE: Please be aware of what happens if you specify both the automatic and manual ways, since the outcome can confuse new users. Regardless of which broad method is used, **if multithreading is specified via both the automatic and manual ways, the manual way will always take precedence.** Also, specifying parallelism for even *one* loop counts as specifying the manual way (in which case the ways of parallelism for the remaining loops will be assumed to be 1).
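
The precedence rule in this note can be written out as a small decision function. The sketch below is a model of the rule only, not BLIS code: `thread_request_t`, `thread_plan_t`, `resolve_request()`, and the `UNSET` sentinel are all hypothetical. It also assumes that the total thread count in the manual case is the product of the five ways, and that nothing being specified means single-threaded execution.

```c
#include <stdbool.h>

#define UNSET (-1)   /* hypothetical sentinel: value was not specified */

typedef struct {
    long num_threads;           /* the automatic way              */
    long jc, pc, ic, jr, ir;    /* the manual way, one per loop   */
} thread_request_t;

typedef struct {
    bool manual;                /* true if the manual way took effect        */
    long jc, pc, ic, jr, ir;    /* resolved ways; 0 means "left to library"  */
    long total_threads;
} thread_plan_t;

static long way_or_one(long w) { return w == UNSET ? 1 : w; }

thread_plan_t resolve_request(thread_request_t r)
{
    thread_plan_t p = {0};

    /* Specifying even one loop counts as choosing the manual way. */
    p.manual = r.jc != UNSET || r.pc != UNSET || r.ic != UNSET ||
               r.jr != UNSET || r.ir != UNSET;

    if (p.manual) {
        /* Manual wins; unspecified loops default to 1 way. */
        p.jc = way_or_one(r.jc);
        p.pc = way_or_one(r.pc);
        p.ic = way_or_one(r.ic);
        p.jr = way_or_one(r.jr);
        p.ir = way_or_one(r.ir);
        p.total_threads = p.jc * p.pc * p.ic * p.jr * p.ir;
    } else {
        /* Automatic: the per-loop factorization is not modeled here. */
        p.total_threads = (r.num_threads == UNSET) ? 1 : r.num_threads;
    }
    return p;
}
```

For example, a request for eight threads the automatic way combined with two ways for a single loop resolves to the manual way with two threads in total, which is the surprise the note warns about.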

## Globally via environment variables

The most common method of specifying multithreading in BLIS is globally via environment variables. With this method, the user sets one or more environment variables in the shell before launching the BLIS-linked executable.
9 changes: 9 additions & 0 deletions frame/thread/bli_thread.c
@@ -1303,6 +1303,9 @@ static bli_pthread_mutex_t global_rntm_mutex = BLIS_PTHREAD_MUTEX_INITIALIZER;

void bli_thread_set_ways( dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir )
{
	// We must ensure that global_rntm has been initialized.
	bli_init_once();

	// Acquire the mutex protecting global_rntm.
	bli_pthread_mutex_lock( &global_rntm_mutex );

@@ -1314,6 +1317,9 @@ void bli_thread_set_ways( dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir )

void bli_thread_set_num_threads( dim_t n_threads )
{
	// We must ensure that global_rntm has been initialized.
	bli_init_once();

	// Acquire the mutex protecting global_rntm.
	bli_pthread_mutex_lock( &global_rntm_mutex );

@@ -1327,6 +1333,9 @@ void bli_thread_set_num_threads( dim_t n_threads )

void bli_thread_init_rntm( rntm_t* rntm )
{
	// We must ensure that global_rntm has been initialized.
	bli_init_once();

	// Acquire the mutex protecting global_rntm.
	bli_pthread_mutex_lock( &global_rntm_mutex );
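
The ordering problem fixed in these three functions can be reproduced in miniature. The following is a toy model, not BLIS code: `toy_rntm_t` and every `toy_` function are hypothetical stand-ins, and a plain flag replaces the thread-safe once mechanism that a real library would use.

```c
/* Toy stand-in for the global rntm_t; only one field is modeled. */
typedef struct { long num_threads; } toy_rntm_t;

static toy_rntm_t toy_global_rntm;
static int        toy_initialized = 0;

/* One-time initialization: resets the global state to its defaults.
   (Not thread-safe; this is a single-threaded illustration only.) */
static void toy_init_once(void)
{
    if (!toy_initialized) {
        toy_global_rntm.num_threads = 1;
        toy_initialized = 1;
    }
}

/* Pre-fix pattern: write to the global without forcing initialization. */
void toy_set_num_threads_unsafe(long n)
{
    toy_global_rntm.num_threads = n;
}

/* Post-fix pattern: initialize first, then write. */
void toy_set_num_threads(long n)
{
    toy_init_once();
    toy_global_rntm.num_threads = n;
}

/* Stand-in for any later library call that triggers initialization
   and then reads the global state. */
long toy_query_num_threads(void)
{
    toy_init_once();
    return toy_global_rntm.num_threads;
}
```

If `toy_set_num_threads_unsafe(4)` is the first call made, the later `toy_query_num_threads()` runs the initializer and returns the default rather than 4. Calling `toy_set_num_threads(4)` instead runs the initializer before the write, so the value survives.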

