Request refactoring test
Per the discussion on the 2016-07-26 webex, we decided to test several aspects of request refactoring.
Below is a proposal for running various tests / collecting data to evaluate the performance of OMPI with and without threading, and to evaluate the performance after the request code refactoring. Once we all agree on the plan, the idea is that several organizations would run these tests and collect the data specified. It would be great if running these tests could be scripted up to make it easy to gather all this data consistently on a variety of platforms at different organizations.
- Verify that all the threading work / request revamp has not harmed baseline / single-threaded performance. If it has, fix it.
- Tests 1 and 2 (below) support this goal.
- Verify in which cases the request revamp improves performance.
- Test 3 (below) supports this goal.
All the tests below should be run with the `vader` BTL. Rationale:
- Shared memory is the lowest latency transport, and should easily show any performance differences / problems
- `vader` can be run on any platform
- There are large differences between the other BTLs and MTLs in v1.10, v2.0.x, and v2.1.x, making it difficult to do performance comparisons that solely highlight threading improvements and/or the request refactor
The tests can be run on other networks of course, but `vader` results can be the baseline.
The intent of this test is to compare single-threaded performance over several Open MPI versions to ensure that performance has not degraded since v1.10.x.
Run the `osu_mbw_mr` benchmark within a single NUMA domain on a single server (using the `vader` BTL).
- 1.10.3
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
NOTE: Per https://github.com/open-mpi/ompi/issues/1902, we currently expect there to be some performance degradation. This will be fixed.
The intent of this test is to compare single-threaded performance with `MPI_THREAD_MULTIPLE` performance over several Open MPI versions in order to evaluate the cost of locking/atomics/etc. in the `MPI_THREAD_MULTIPLE` code paths.
Spawn an even number of threads on a single server (using the `vader` BTL). Each thread runs the message rate benchmark with another thread in the same NUMA domain.
- N processes/1 process per core, each process uses `MPI_THREAD_SINGLE`
  - Use the stock `osu_mbw_mr` benchmark.
  - This is the baseline performance measurement.
- N processes/1 process per core, each process uses `MPI_THREAD_MULTIPLE`
  - Use the stock `osu_mbw_mr` benchmark, but set the environment variable `OMPI_MPI_THREAD_LEVEL` to 3, thereby setting `MPI_THREAD_MULTIPLE`.
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because of https://github.com/open-mpi/ompi/issues/1902).
- 1 process/N threads/1 thread per core (obviously using `MPI_THREAD_MULTIPLE`).
  - Use Arm's test (which essentially creates N threads and runs `osu_mbw_mr` in each).
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because of https://github.com/open-mpi/ompi/issues/1902).
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
The goals of the request refactoring were to:
- Decrease lock contention when multiple threads are blocking in `MPI_WAIT*`
- Decrease CPU cycle / scheduling contention between threads that are blocking in `MPI_WAIT*` and threads that are not blocking in `MPI_WAIT*`
Traditional MPI applications are written/executed as one MPI process (and thread) per CPU core. The request refactoring intended to reduce lock contention between threads, specifically targeted at enabling new forms of `MPI_THREAD_MULTIPLE`-enabled programming models and applications. It is therefore unlikely that the request refactoring will show much improvement in "traditional" MPI applications (i.e., one process/thread per CPU core).
With the old request code, if there are N threads blocking in `MPI_WAIT*` in a single MPI process (even if the process is bound to N cores), each thread will be vying for the lock to enter the progression loop for a single iteration, and for every completed request, all the threads in `MPI_WAIT*` will check the status of all their requests. Upon exit from the progress loop, if a thread still has requests to wait for, it will repeat the process: vie for the lock, enter for a single progression iteration, check the status of all requests upon exit, and so on. Meaning: there are N threads all actively contending for a lock, and each of the N threads is continually entering/exiting the progress loop. There is much overhead in this approach.
With the new request code, the thread that succeeds in entering the progress loop will stay in the progress loop until all of its requests have completed (vs. just performing a single progression iteration). In the current implementation, all other threads will remain blocked/asleep until the synchronization object they are blocked on has been satisfied (the synchronization object accounts for the number of requests that must complete before the thread should be awakened). Thus, the thread in the progress loop will selectively wake individual threads as their requests complete (vs. waking all blocked threads to check whether their requests have completed). Once the thread in the progress loop completes all of its own requests, it will wake a single thread to take its place inside the progress loop and then exit.
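The hand-off scheme described above can be illustrated with a toy simulation. This is not Open MPI's implementation: the names `Engine` and `WaitSync` are hypothetical, a pre-filled queue stands in for the network, and the sketch assumes every waited-on request id eventually appears in that queue.

```python
import threading
from collections import deque


class WaitSync:
    """Per-thread sync object: the requests one blocked thread still waits for."""
    def __init__(self):
        self.pending = set()
        self.wakeup = threading.Event()


class Engine:
    def __init__(self, completion_order):
        self.lock = threading.Lock()
        self.incoming = deque(completion_order)  # simulated network completions
        self.done = set()           # request ids that have completed
        self.owner_of = {}          # request id -> WaitSync waiting on it
        self.progress_busy = False  # True while some thread owns the progress loop
        self.sleepers = []          # sync objects of threads asleep in wait_all
        self.wakeups = 0            # number of times a sleeping thread was woken

    def wait_all(self, request_ids):
        sync = WaitSync()
        with self.lock:
            sync.pending = set(request_ids) - self.done
            if not sync.pending:
                return
            for r in sync.pending:
                self.owner_of[r] = sync
            must_sleep = self.progress_busy
            if must_sleep:
                self.sleepers.append(sync)
            else:
                self.progress_busy = True
        if must_sleep:
            sync.wakeup.wait()        # asleep: no polling, no lock contention
            with self.lock:
                if not sync.pending:  # woken because all requests completed
                    return
            # Otherwise this thread was handed the progress role.
        self._progress(sync)

    def _wake(self, sync):
        self.wakeups += 1
        sync.wakeup.set()

    def _progress(self, mine):
        while True:
            with self.lock:
                if not mine.pending:
                    # Finished: hand the progress role to exactly one sleeper.
                    if self.sleepers:
                        self._wake(self.sleepers.pop(0))
                    else:
                        self.progress_busy = False
                    return
                r = self.incoming.popleft()
                self.done.add(r)
                owner = self.owner_of.pop(r, None)
                if owner is None:
                    continue  # nobody is waiting on this request yet
                owner.pending.discard(r)
                if owner is not mine and not owner.pending:
                    self.sleepers.remove(owner)
                    self._wake(owner)  # wake only the thread that is finished


# Four threads, three requests each, completing in interleaved order.
requests = {t: [f"t{t}r{i}" for i in range(3)] for t in range(4)}
order = [requests[t][i] for i in range(3) for t in range(4)]
engine = Engine(order)
threads = [threading.Thread(target=engine.wait_all, args=(requests[t],))
           for t in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join(timeout=10)
print("wakeups:", engine.wakeups)
```

In this model a sleeping thread is woken at most once, either because its last request completed or because it inherits the progress role, which is the contrast with the old scheme where every waiter repeatedly contends for the lock.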
Why is this useful?
The goal is to enable `MPI_THREAD_MULTIPLE`-enabled applications that are bound to multiple cores (e.g., an entire NUMA domain), and that have more threads than cores. Consider: if an MPI process is bound to N cores, and M threads are blocking in `MPI_WAIT*`, only one of those threads will be active inside the progress loop. This means that there are still (N-1) cores available for other threads (regardless of the value of M!): threads that could be computing, or threads that could be performing non-blocking operations in MPI.
Meaning: one of the things the request refactor will enable is the "Got a long MPI operation to perform? Let it block in a progress thread" type of programming model. This is in addition to the drastic decrease in contention, relief of the load on the memory bus, the opportunity for async progress, and more.
Existing benchmarks will therefore tend to not show any improvement because they do not have code that will execute during `MPI_WAIT*`. We need to create a new benchmark to show the programming model and performance benefits from this approach.
We need to write a new benchmark that does the following:
- Launches 2 MPI processes on a single server (using the `vader` BTL between the two)
- Each process is bound to a NUMA domain
- Each process creates 2*(num_cores in the NUMA domain) threads (i.e., twice the number of cores that are in the NUMA domain)
- Have half the threads continually waiting on non-blocking MPI operations, in two ways (i.e., run these as two separate tests -- not both at the same time):
  - Test 1: measure the bandwidth by continually `MPI_WAITALL`ing on a large number of sends/receives of large messages.
  - Test 2: measure the message rate by continually `MPI_WAITALL`ing on a large number of sends/receives of small messages.
- Have the other half of the threads perform CPU-based computations (e.g., DGEMM)
- We might need/want to experiment with binding the various threads in different ways within the NUMA domain...? TBD.
The performance of both types of metrics (MPI performance and non-MPI performance) should be greatly improved after the request refactoring.
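As a concrete picture of the thread layout in the list above, the sketch below assigns roles for one process. The binding choice (one waiting thread and one compute thread per core) is purely an assumption, since the binding strategy is still TBD, and `thread_plan` is a hypothetical helper name.

```python
def thread_plan(cores):
    """Plan 2 * len(cores) threads: half wait in MPI, half compute.

    `cores` is the list of core ids in the NUMA domain the process is bound to.
    Assumption: each core hosts one waiting thread and one compute thread.
    """
    plan = []
    for i, core in enumerate(cores):
        plan.append({"thread": 2 * i, "role": "mpi_wait", "core": core})
        plan.append({"thread": 2 * i + 1, "role": "compute", "core": core})
    return plan


# Example: a NUMA domain with 4 cores yields 8 threads.
plan = thread_plan([0, 1, 2, 3])
for entry in plan:
    print(entry)
```

On Linux, a thread could apply its entry with `os.sched_setaffinity(0, {entry["core"]})`, though other bindings may well turn out to perform better.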
- The prototype proposed by Mellanox (derived from OSU and threaded-test) is located at https://github.com/artpol84/poc/tree/master/benchmarks/message_rate (should we put it somewhere else?).
- Help message:
./mr_th_nb -h
Options: ./mr_th_nb
-h Display this help
Test description:
-s Message size (default: 0)
-n Number of measured iterations (default: 100)
-w Number of warmup iterations (default: 10)
-W Window size - number of send/recvs between sync (default: 256)
Test options:
-Dthrds Disable threaded support (call MPI_Init) (default: MPI_Init_thread)
-t Number of threads (default: 1)
-B Blocking send (default: non-blocking)
-S SMP mode - intra-node performance - pairwise exchanges (default: disabled)
-d Use separate communicator for each thread (default: disabled)
-b Enable fine-grained binding of the threads inside the set provided by MPI
benchmark is able to discover node-local ranks and exchange existing binding
(default: disabled)
- Benchmark capabilities:
  - The default values of `-n`, `-w` and `-W` correspond to the OSU message rate test defaults.
  - The `-B` option provides the ability to switch between the OSU (non-blocking MPI_Isend/MPI_Irecv) and threaded-test (blocking MPI_Send/MPI_Recv) versions of the message rate test.
  - `-Dthrds` is a "hacky" option that is pre-scanned in `argv` before calling MPI_Init. We need this because we want a way to select between `MPI_Init` and `MPI_Init_thread` on the fly. As an alternative, this can be solved using the preprocessor and 2 separate binaries. FWIW this approach works fine with Open MPI and Intel MPI.
  - The `-d` option provides the ability to create a separate communicator for each pair of communicating ranks (some implementations may benefit from this).
  - The `-S` switch tells the benchmark what process placement you expect to evaluate. To avoid unnoticed configuration errors, the benchmark verifies (double checks) the process placement.
    - By default (no `-S` option), the benchmark expects a 2-node configuration with an equal number of processes on each node; regardless of rank mapping, the benchmark will create its own pairing of ranks on those two nodes.
    - If the `-S` option is given, the benchmark expects to find an even number of processes on one node; those procs will be split into 2 sets: senders and receivers.
  - The `-b` option tells the benchmark to perform additional "fine-binding" of threads. Since it already knows the process placement, it performs an additional exchange of cpusets among procs located on the same node. Based on this information, it binds threads to individual cores without overlap.
- Examples:
  - A 2-proc case of OSU: `mpirun -np 2 --map-by node ./mr_th_nb`
  - A 2-proc per node OSU: `mpirun -np 4 --map-by node ./mr_th_nb`
  - A 2-proc per node, 2-thread case: `mpirun -np 4 --map-by node ./mr_th_nb -t 2`
  - A 2-proc per node, 2-thread case with fine-binding: `mpirun -np 4 --map-by node --bind-to socket ./mr_th_nb -t 2 -b`
  - 2-proc SMP case described in the previous paragraph: `mpirun -np 2 --map-by socket --bind-to socket ./mr_th_nb -t 2 -b -S`
  - 4-proc SMP case: `mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./mr_th_nb -t 2 -b -S`
  - 2 2-threaded procs per node with fine-binding and comm dup: `mpirun -np 4 --map-by node --bind-to socket ./mr_th_nb -t 2 -b -d`
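The placement check and rank pairing that the `-S` switch controls might look roughly like the following. This is a sketch of the idea only, not code from the prototype: `pair_ranks` is a hypothetical helper, and which half sends and how ranks are matched up are assumptions.

```python
from collections import defaultdict


def pair_ranks(rank_hosts, smp=False):
    """Pair senders with receivers.

    rank_hosts[i] is the hostname on which rank i runs.
    Returns a list of (sender_rank, receiver_rank) tuples.
    """
    by_host = defaultdict(list)
    for rank, host in enumerate(rank_hosts):
        by_host[host].append(rank)
    hosts = sorted(by_host)

    if smp:
        # -S: an even number of processes on a single node, split in two sets.
        if len(hosts) != 1 or len(rank_hosts) % 2:
            raise ValueError("-S expects an even number of processes on one node")
        ranks = by_host[hosts[0]]
        half = len(ranks) // 2
        senders, receivers = ranks[:half], ranks[half:]
    else:
        # Default: two nodes with the same number of processes on each.
        if len(hosts) != 2 or len(by_host[hosts[0]]) != len(by_host[hosts[1]]):
            raise ValueError("expected 2 nodes with equal process counts")
        senders, receivers = by_host[hosts[0]], by_host[hosts[1]]
    return list(zip(senders, receivers))


# Pairing is independent of how ranks were mapped onto the two nodes.
print(pair_ranks(["a", "b", "a", "b"]))        # ranks mapped by node
print(pair_ranks(["a", "a", "b", "b"]))        # ranks mapped by slot
print(pair_ranks(["n0"] * 4, smp=True))        # intra-node (-S) case
```

Rejecting unexpected placements up front, rather than silently pairing whatever ranks exist, mirrors the stated goal of avoiding unnoticed configuration errors.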
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)