Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes: 42.467 MB
Arithmetic intensity: 576.000 FLOP/byte
Theoretical peak single-core: 118.400 GFLOP/s
Theoretical peak multi: 4262.400 GFLOP/s
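For context, the header numbers above follow directly from the matrix shape. Below is a minimal sketch of how they can be derived, assuming float32 elements, counting only the A and B inputs as byte traffic, and using a machine model (3.7 GHz x 32 FLOP/cycle per core, 36 threads) picked so the reported peaks come out; these are assumptions, not figures taken from the benchmark source.

#include <stdio.h>

int main(void) {
    const double M = 2304, N = 2304, K = 2304;

    double flops = 2.0 * M * N * K;                  /* one mul + one add per inner-product term: ~24461.181 millions */
    double bytes = (M * K + K * N) * sizeof(float);  /* A and B in float32: ~42.467 MB */
    double ai    = flops / bytes;                    /* 576 FLOP/byte */

    /* Assumed machine model (chosen to reproduce the reported peaks, not verified specs):
       3.7 GHz x 32 FLOP/cycle per core (AVX2, 2 FMA ports), 36 hardware threads. */
    double peak_single = 3.7e9 * 32.0 / 1e9;         /* 118.4 GFLOP/s */
    double peak_multi  = peak_single * 36.0;         /* 4262.4 GFLOP/s */

    printf("Operations: %.3f millions\n", flops / 1e6);
    printf("Bytes: %.3f MB\n", bytes / 1e6);
    printf("Arithmetic intensity: %.3f FLOP/byte\n", ai);
    printf("Peak single-core: %.3f GFLOP/s, peak multi: %.3f GFLOP/s\n",
           peak_single, peak_multi);
    return 0;
}

The Perf lines in the results below are the same FLOP count divided by the measured average time, e.g. 24461.181e6 FLOP / 6.211 ms gives roughly 3938 GFLOP/s for MKL.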
Make sure not to bench Apple Accelerate or the default Linux BLAS.
Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev time: 2.274 ms
Min time: 5.648 ms
Max time: 28.398 ms
Perf: 3938.203 GFLOP/s
Display output[0] to make sure it's not optimized away
566.68505859375
Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev time: 12.542 ms
Min time: 35.367 ms
Max time: 121.945 ms
Perf: 606.927 GFLOP/s
Display output[0] to make sure it's not optimized away
566.68505859375
PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev time: 3.071 ms
Min time: 362.655 ms
Max time: 380.193 ms
Perf: 66.403 GFLOP/s
Display output[0] to make sure it's not optimized away
566.6849975585938
According to the paper [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication, Smith et al.
Unfortunately, parallelizing over a single loop doesn't scale well (unless we multiply bigger matrices).
The BLIS multithreading README mentions multithreading at multiple levels.
Regarding nested parallelism in OpenMP, at first glance it seems quite tricky, with a real risk of oversubscription, or of OpenMP not spawning new threads on the second loop if we use a dynamic schedule.
Intel suggests using the recent OpenMP task construct (see the sketch below).
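To make the oversubscription vs. task trade-off concrete, here is a small OpenMP sketch in C. It uses a toy doubly-nested loop standing in for the GEMM loops; do_block and the loop counts are placeholders, not Laser's actual code. It contrasts nested parallel regions with the taskloop-based formulation.

#include <omp.h>

/* Stand-in for the packing + micro-kernel work on one (jc, ic) block. */
static void do_block(int jc, int ic) { (void)jc; (void)ic; }

void nested_regions(int n_jc, int n_ic) {
    /* Nested parallel regions: if nesting is enabled (omp_set_max_active_levels >= 2)
       the 2 x 4 teams can oversubscribe the machine; if it is disabled the inner
       loop silently runs on a single thread. */
    #pragma omp parallel for num_threads(2)
    for (int jc = 0; jc < n_jc; ++jc) {
        #pragma omp parallel for num_threads(4)
        for (int ic = 0; ic < n_ic; ++ic)
            do_block(jc, ic);
    }
}

void task_based(int n_jc, int n_ic) {
    /* Task-based alternative (OpenMP 4.5 taskloop): both loop levels become
       tasks executed by a single thread pool, so nesting cannot oversubscribe. */
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for (int jc = 0; jc < n_jc; ++jc) {
            #pragma omp taskloop
            for (int ic = 0; ic < n_ic; ++ic)
                do_block(jc, ic);
        }
    }
}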
cc @Laurae2
Benchmark on a dual Xeon Gold 6154 vs MKL:
According to the paper [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication, Smith et al:
Parallelism should be done around jc (dimension nc). Note that nc is often 4096, so we might need another distribution scheme.
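A rough C sketch of what placing the thread-level split on jc looks like in the BLIS-style loop structure; the blocking sizes, row-major layout and the naive inner kernel are placeholders, not Laser's implementation. With N = 2304 and nc = 4096 the jc loop has a single iteration, which is exactly why it cannot feed many threads on its own.

#include <omp.h>

enum { NC = 4096, KC = 512, MC = 128 };           /* placeholder blocking sizes */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Row-major C[M][N] += A[M][K] * B[K][N], with BLIS-style outer loops
   and the OpenMP split placed on jc as the paper recommends. */
void gemm_jc_parallel(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    #pragma omp parallel for
    for (int jc = 0; jc < N; jc += NC) {          /* jc loop: NC-wide column panels of B/C */
        int nc = MIN(NC, N - jc);
        for (int pc = 0; pc < K; pc += KC) {      /* pc loop: KC slices, B panel would be packed here */
            int kc = MIN(KC, K - pc);
            for (int ic = 0; ic < M; ic += MC) {  /* ic loop: MC blocks, A block would be packed here */
                int mc = MIN(MC, M - ic);
                /* jr/ir loops + micro-kernel are replaced by a naive kernel in this sketch */
                for (int i = 0; i < mc; ++i)
                    for (int j = 0; j < nc; ++j) {
                        float acc = 0.0f;
                        for (int p = 0; p < kc; ++p)
                            acc += A[(ic + i) * K + (pc + p)] * B[(pc + p) * N + (jc + j)];
                        C[(ic + i) * N + (jc + j)] += acc;
                    }
            }
        }
    }
}

Splitting the ic loop (or the jr loop over the packed B panel) in addition to jc would be the natural next step, matching the multi-level scheme described in the BLIS multithreading README.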