Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes: 42.467 MB
Arithmetic intensity: 576.000 FLOP/byte
Theoretical peak single-core: 118.400 GFLOP/s
Theoretical peak multi: 4262.400 GFLOP/s
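For context, the header numbers above follow directly from the matrix shape. Below is a minimal sketch of how they can be derived, assuming float32 elements, counting only the A and B inputs as byte traffic, and using a machine model (3.7 GHz x 32 FLOP/cycle per core, 36 threads) picked so the reported peaks come out; these are assumptions, not figures taken from the benchmark source.

#include <stdio.h>

int main(void) {
    const double M = 2304, N = 2304, K = 2304;

    double flops = 2.0 * M * N * K;                  /* one mul + one add per inner-product term: ~24461.181 millions */
    double bytes = (M * K + K * N) * sizeof(float);  /* A and B in float32: ~42.467 MB */
    double ai    = flops / bytes;                    /* 576 FLOP/byte */

    /* Assumed machine model (chosen to reproduce the reported peaks, not verified specs):
       3.7 GHz x 32 FLOP/cycle per core (AVX2, 2 FMA ports), 36 hardware threads. */
    double peak_single = 3.7e9 * 32.0 / 1e9;         /* 118.4 GFLOP/s */
    double peak_multi  = peak_single * 36.0;         /* 4262.4 GFLOP/s */

    printf("Operations: %.3f millions\n", flops / 1e6);
    printf("Bytes: %.3f MB\n", bytes / 1e6);
    printf("Arithmetic intensity: %.3f FLOP/byte\n", ai);
    printf("Peak single-core: %.3f GFLOP/s, peak multi: %.3f GFLOP/s\n",
           peak_single, peak_multi);
    return 0;
}

The Perf lines in the results below are the same FLOP count divided by the measured average time, e.g. 24461.181e6 FLOP / 6.211 ms gives roughly 3938 GFLOP/s for MKL.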
Make sure not to bench Apple Accelerate or the default Linux BLAS.
Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev time: 2.274 ms
Min time: 5.648 ms
Max time: 28.398 ms
Perf: 3938.203 GFLOP/s
Display output[0] to make sure it's not optimized away
566.68505859375
Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev time: 12.542 ms
Min time: 35.367 ms
Max time: 121.945 ms
Perf: 606.927 GFLOP/s
Display output[0] to make sure it's not optimized away
566.68505859375
PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev time: 3.071 ms
Min time: 362.655 ms
Max time: 380.193 ms
Perf: 66.403 GFLOP/s
Display output[0] to make sure it's not optimized away
566.6849975585938
According to the paper [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication, Smith et al.
Unfortunately, parallelizing over a single loop doesn't scale well (unless we multiply bigger matrices).
The BLIS multithreading README mentions multithreading at multiple levels.
Regarding nested parallelism in OpenMP, at first glance it seems quite tricky, with a real risk of oversubscription, or of OpenMP not spawning new threads on the second loop if we use a dynamic schedule.
Intel suggests using the recent OpenMP task construct (see the sketch below).
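To make the oversubscription vs. task trade-off concrete, here is a small OpenMP sketch in C. It uses a toy doubly-nested loop standing in for the GEMM loops; do_block and the loop counts are placeholders, not Laser's actual code. It contrasts nested parallel regions with the taskloop-based formulation.

#include <omp.h>

/* Stand-in for the packing + micro-kernel work on one (jc, ic) block. */
static void do_block(int jc, int ic) { (void)jc; (void)ic; }

void nested_regions(int n_jc, int n_ic) {
    /* Nested parallel regions: if nesting is enabled (omp_set_max_active_levels >= 2)
       the 2 x 4 teams can oversubscribe the machine; if it is disabled the inner
       loop silently runs on a single thread. */
    #pragma omp parallel for num_threads(2)
    for (int jc = 0; jc < n_jc; ++jc) {
        #pragma omp parallel for num_threads(4)
        for (int ic = 0; ic < n_ic; ++ic)
            do_block(jc, ic);
    }
}

void task_based(int n_jc, int n_ic) {
    /* Task-based alternative (OpenMP 4.5 taskloop): both loop levels become
       tasks executed by a single thread pool, so nesting cannot oversubscribe. */
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for (int jc = 0; jc < n_jc; ++jc) {
            #pragma omp taskloop
            for (int ic = 0; ic < n_ic; ++ic)
                do_block(jc, ic);
        }
    }
}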
cc @Laurae2
Benchmark on a dual Xeon Gold 6154 vs MKL:
According to the paper [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication, Smith et al:
Parallelism should be done around jc (dimension nc). Note that nc is often 4096, so we might need another distribution scheme.
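A rough C sketch of what placing the thread-level split on jc looks like in the BLIS-style loop structure; the blocking sizes, row-major layout and the naive inner kernel are placeholders, not Laser's implementation. With N = 2304 and nc = 4096 the jc loop has a single iteration, which is exactly why it cannot feed many threads on its own.

#include <omp.h>

enum { NC = 4096, KC = 512, MC = 128 };           /* placeholder blocking sizes */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Row-major C[M][N] += A[M][K] * B[K][N], with BLIS-style outer loops
   and the OpenMP split placed on jc as the paper recommends. */
void gemm_jc_parallel(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    #pragma omp parallel for
    for (int jc = 0; jc < N; jc += NC) {          /* jc loop: NC-wide column panels of B/C */
        int nc = MIN(NC, N - jc);
        for (int pc = 0; pc < K; pc += KC) {      /* pc loop: KC slices, B panel would be packed here */
            int kc = MIN(KC, K - pc);
            for (int ic = 0; ic < M; ic += MC) {  /* ic loop: MC blocks, A block would be packed here */
                int mc = MIN(MC, M - ic);
                /* jr/ir loops + micro-kernel are replaced by a naive kernel in this sketch */
                for (int i = 0; i < mc; ++i)
                    for (int j = 0; j < nc; ++j) {
                        float acc = 0.0f;
                        for (int p = 0; p < kc; ++p)
                            acc += A[(ic + i) * K + (pc + p)] * B[(pc + p) * N + (jc + j)];
                        C[(ic + i) * N + (jc + j)] += acc;
                    }
            }
        }
    }
}

Splitting the ic loop (or the jr loop over the packed B panel) in addition to jc would be the natural next step, matching the multi-level scheme described in the BLIS multithreading README.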