Dirac ITT Benchmarks
Build instructions for Grid are available at https://github.com/paboyle/Grid

The key Grid benchmark is located in the branch `release/dirac-ITT` under `benchmarks/Benchmark_ITT`, and in the corresponding release:

https://github.com/paboyle/Grid/releases
It should be run on:

- A single node
- 128 nodes
- (Optionally 2, 4, 8, 16, 32 and 64 nodes; an illustrative `--mpi` geometry sketch follows this list.)
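The node counts map onto the benchmark's `--mpi X.Y.Z.T` processor-grid option, whose entries must multiply to the total number of MPI ranks. A minimal sketch (the geometries and one-rank-per-node layout are illustrative, following the invocations later on this page):

```
# Single node, one rank:
mpirun -np 1 ./Benchmark_ITT --mpi 1.1.1.1

# 128 nodes, one rank per node (2 x 4 x 4 x 4 = 128):
mpirun -np 128 -ppn 1 ./Benchmark_ITT --mpi 2.4.4.4
```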
The code is hybrid OpenMP + MPI with NUMA/socket-aware optimisations. The relevant options can make large differences to delivered performance.

Log files should be collected after the compile options and the run and threading parameters have been optimised.

Some example configurations, invocation commands, and expected results are given below. The best options will vary from system to system and compiler to compiler. Our guidance documents the best currently known approaches, but you will have to tweak and run whatever configuration and invocation gives the best performance.
Information (compile instructions and our own results) is provided for:

- Intel Knights Landing processors, with Intel Omnipath interconnect
- Intel Skylake processors, single node, dual socket
- AMD EPYC processors, single node, dual socket
- ARM Neon nodes (compile instructions only; we have not benchmarked specific nodes)
- Other processor technologies, which will need to use the "generic" vectorisation target (sketched below this list)
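For the generic target, a configure line along the following lines is a reasonable starting point. This is a sketch we have not benchmarked, and the `mpicxx` compiler wrapper is an assumption:

```
../configure --enable-simd=GEN --enable-precision=single --enable-comms=mpi3 CXX=mpicxx
```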
The benchmark uses two strategies: overlapping communication with computation, and performing communication then computation sequentially. The best result is taken.

We use hybrid OpenMP and MPI. We recommend one MPI rank per NUMA domain in a multi-socket or multi-die context. We recommend compiling with the

`--enable-comms=mpi3`

or

`--enable-comms=mpit`

comms targets, using the runtime command line option

`--comms-threads <N>`

to control how many threads try to enter MPI concurrently. A minimal sketch follows.
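As a minimal end-to-end sketch of the `mpit` route (the SIMD target, rank count, and thread counts are placeholders to be tuned per system):

```
../configure --enable-simd=GEN --enable-precision=single --enable-comms=mpit CXX=mpicxx
make
# Two ranks, each allowing 4 threads to enter MPI concurrently:
mpirun -np 2 ./benchmarks/Benchmark_ITT --mpi 2.1.1.1 --comms-threads 4
```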
A global comms buffer is allocated with either MMAP (the default) or SHMGET (`--enable-comms=mpi3`), and sized with the runtime option `--shm` (the runs below use `--shm 1024`). If

`--shm-hugepages`

is specified then the software requests that Linux provide 2MB huge pages. This requires system administrator assistance to reserve the pages and (for mpi3) to enable the user to map them.
The following advice for the Intel Omnipath interconnect will probably carry over to Mellanox EDR/HDR and Cray Aries interconnects. However, other interconnects may not require as many threads to be devoted to communication as is recommended for OPA below.
For best performance with Intel Omnipath interconnects it is essential that 512 huge 2MB pages be preallocated by the system administrator using

`echo 512 > /proc/sys/vm/nr_hugepages`
In a system with multiple sockets or NUMA domains we find that:

- One MPI rank per NUMA domain works best.
- Use OpenMP within each NUMA domain, and bind the threads to that NUMA domain.
- Use MPI3 comms (`--enable-comms=mpi3`) so that shared memory is used for the intranode comms between NUMA domains on the same node. You will need a sysadmin to set up a Unix group permitted to use huge pages for this region; a sketch of the setup follows this list.
- MPI itself is then used only for the internode communications.
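A sketch of the sysadmin-side huge page setup (run as root; the group name `hugeshm` is a hypothetical placeholder for whatever Unix group contains the benchmark users):

```
# Reserve 512 x 2MB huge pages:
echo 512 > /proc/sys/vm/nr_hugepages

# Permit the (hypothetical) hugeshm group to allocate SysV shared memory
# from the huge page pool, as the SHMGET/mpi3 comms buffer requires:
echo $(getent group hugeshm | cut -d: -f3) > /proc/sys/vm/hugetlb_shm_group

# Verify the reservation:
grep -i hugepages /proc/meminfo
```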
Knights Landing (KNL), single node. Configuration:

`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc`
Invocation:
`env KMP_HW_SUBSET=1T ./Benchmark_ITT --shm 1024 --shm-hugepages`
Results (key section of output, at the end, for a single node):
```
==================================================================================
 Memory benchmark
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L         bytes            GB/s       Gflop/s    seconds    GB/s / node
----------------------------------------------------------
  8         393216.000       30.966     5.161      1.646      30.966
 12         1990656.000      129.703    21.617     0.393      129.703
 16         6291456.000      256.614    42.769     0.199      256.614
 20         15360000.000     345.245    57.541     0.148      345.245
 24         31850496.000     390.747    65.124     0.130      390.747
 28         59006976.000     293.532    48.922     0.173      293.532
 32         100663296.000    280.259    46.710     0.182      280.259
 36         161243136.000    278.244    46.374     0.183      278.244
 40         245760000.000    293.138    48.856     0.174      293.138
 44         359817216.000    296.198    49.366     0.171      296.198
 48         509607936.000    300.027    50.005     0.170      300.027
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L         Wilson         DWF4           DWF5
  8         100474.977     584630.187     703124.323
 12         366189.052     306451.774     497540.725
 16         340541.524     368178.044     659709.592
 24         309891.310     484440.886     745885.026
==================================================================================
 Comparison point result: 337315 Mflop/s per node
 Comparison point robustness: 0.556
==================================================================================
```
Knights Landing, multinode. Configuration:

`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc`

Invocation (example run on 16 nodes):
```
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61]
mpirun -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages
```
or
```
# either 8 comms cores; 1 HT + (1 or 2)HT x 54 cores = (62 or 116) threads
# empirically, leave a tile free for O/S, daemons etc...
# export OMP_NUM_THREADS=116
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export PSM2_MULTI_EP=1
export I_MPI_FABRICS=ofi
export I_MPI_THREAD_MAX=8
export I_MPI_PIN_DOMAIN=256
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125]
mpirun -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages
```
Results (key section of output, at the end, for multinode runs):
```
==================================================================================
 Memory benchmark
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L         bytes             GB/s        Gflop/s     seconds    GB/s / node
----------------------------------------------------------
  8         6291456.000       495.862     82.644      1.644      30.991
 12         31850496.000      2111.790    351.965     0.386      131.987
 16         100663296.000     4078.515    679.753     0.200      254.907
 20         245760000.000     5311.453    885.242     0.153      331.966
 24         509607936.000     6186.956    1031.159    0.132      386.685
 28         944111616.000     4528.025    754.671     0.180      283.002
 32         1610612736.000    4487.177    747.863     0.182      280.449
 36         2579890176.000    4624.351    770.725     0.176      289.022
 40         3932160000.000    4698.816    783.136     0.173      293.676
 44         5757075456.000    4668.114    778.019     0.174      291.757
 48         8153726976.000    4667.113    777.852     0.175      291.695
==================================================================================
 Communications benchmark
==================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
==================================================================================
  L   Ls    bytes       MB/s uni (err/min/max)               MB/s bidi (err/min/max)
  4    8    49152       2739.5    79.3    437.9    4183.1    5479.0    158.6   875.8    8366.3
  8    8    393216      13591.7   164.0   4256.7   15051.3   27183.4   328.1   8513.5   30102.7
 12    8    1327104     18694.4   208.4   8876.9   19956.5   37388.8   416.8   17753.9  39912.9
 16    8    3145728     20960.3   45.9    15006.5  21490.9   41920.5   91.9    30012.9  42981.8
 20    8    6144000     20944.4   42.0    15723.6  21652.9   41888.8   83.9    31447.2  43305.7
 24    8    10616832    20637.2   264.8   5903.6   21733.5   41274.4   529.7   11807.1  43467.1
 28    8    16859136    21219.5   40.4    18327.6  22005.7   42439.0   80.7    36655.3  44011.4
 32    8    25165824    21146.0   74.2    14737.3  22126.2   42292.1   148.5   29474.6  44252.5
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L         Wilson      DWF4        DWF5
  8         10561.8     59548.3     95783.9
 12         40422.0     129713.9    204889.2
 16         60763.9     210209.7    322914.2
 24         141442.3    293747.1    440702.5
==================================================================================
 Comparison point result: 169962 Mflop/s per node
 Comparison point robustness: 0.595
==================================================================================
```
Configuration: as above.

Invocation (dual rail run on 16 nodes; see the note on the second rail below):
```
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125,136-191,200-255]
export COMMS_THREADS=8
export OMP_NUM_THREADS=62
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export I_MPI_FABRICS=ofi
export I_MPI_PIN_DOMAIN=256
export I_MPI_THREAD_MAX=8
export PSM2_MULTI_EP=1
export FI_PSM2_LOCK_LEVEL=0
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --shm 1024 --comms-threads $COMMS_THREADS
```
Results:
```
==================================================================================
 Memory benchmark
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L         bytes             GB/s        Gflop/s     seconds    GB/s / node
----------------------------------------------------------
  8         6291456.000       471.722     78.620      1.729      29.483
 12         31850496.000      2234.161    372.360     0.365      139.635
 16         100663296.000     4916.119    819.353     0.166      307.257
 20         245760000.000     7531.977    1255.330    0.108      470.749
 24         509607936.000     6649.536    1108.256    0.123      415.596
 28         944111616.000     6119.038    1019.840    0.133      382.440
 32         1610612736.000    5558.231    926.372     0.147      347.389
 36         2579890176.000    5172.548    862.091     0.158      323.284
 40         3932160000.000    6004.183    1000.697    0.136      375.261
 44         5757075456.000    6139.139    1023.190    0.132      383.696
 48         8153726976.000    6160.498    1026.750    0.132      385.031
==================================================================================
 Communications benchmark
==================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
==================================================================================
  L   Ls    bytes       MB/s uni (err/min/max)               MB/s bidi (err/min/max)
  4    8    49152       2508.7    30.2     1374.9   3817.6    5017.4    60.5     2749.8   7635.3
  8    8    393216      6596.4    1128.2   188.3    8548.2    13192.8   2256.4   376.6    17096.3
 12    8    1327104     8812.4    330.9    1042.3   9669.2    17624.9   661.8    2084.6   19338.5
 16    8    3145728     9312.5    247.3    1483.7   9807.4    18625.1   494.6    2967.3   19614.8
 20    8    6144000     8897.3    207.3    2741.6   9891.7    17794.5   414.5    5483.3   19783.5
 24    8    10616832    8784.3    167.9    3405.8   10149.9   17568.7   335.7    6811.7   20299.9
 28    8    16859136    8880.7    127.9    4390.8   9932.5    17761.3   255.8    8781.7   19865.0
 32    8    25165824    8787.4    96.8     5748.6   10122.0   17574.8   193.6    11497.1  20244.0
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L         Wilson      DWF4        DWF5
  8         9042.3      51265.4     82154.6
 12         32063.5     125126.9    195947.9
 16         52761.9     199410.9    308859.9
 24         131042.1    264027.0    418236.2
==================================================================================
 Comparison point result: 162269 Mflop/s per node
 Comparison point robustness: 0.606
==================================================================================
```
So far, the above data suggests that the second rail does not deliver much more application performance, despite substantial effort to exploit it.

There are several metrics that we can extract from the 7210 log above:
- The memory bandwidth we obtain is 385 GB/s.
- The large packet bidirectional bandwidth is 18 GB/s.
- The code execution metric was 199 Gflop/s per node.
- The performance robustness measure was 0.606.

The last is derived from the worst case / best case ratio on 16^4 for the 4D vectorised DWF kernels: a robustness of 0.606 means the worst case achieved about 61% of the best.
The following configuration is recommended for the Intel Skylake platform:
```
../configure --enable-precision=single \
             --enable-simd=AVX512 \
             --enable-comms=mpi3 \
             --enable-mkl \
             CXX=mpiicpc
```
In some cases AVX2 will perform better than AVX512:

```
../configure --enable-precision=single \
             --enable-simd=AVX2 \
             --enable-comms=mpi3 \
             --enable-mkl \
             CXX=mpiicpc
```
The `--enable-mkl` flag enables use of BLAS and FFTW from the Intel Math Kernel Library.
If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

```
../configure --enable-precision=single \
             --enable-simd=AVX512 \
             --enable-comms=mpi3 \
             --enable-mkl \
             CXX=CC CC=cc
```
Since dual socket nodes are commonplace, we recommend MPI-3 as the default, with one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using

`export I_MPI_PIN=1`

This is the default; a sketch of a fully pinned invocation follows.
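A sketch of a pinned dual-socket run under these recommendations (the thread count is illustrative for a 26-core socket; `I_MPI_PIN_DOMAIN=socket` is one standard Intel MPI way to confine each rank to a socket):

```
export I_MPI_PIN=1                 # default, shown for clarity
export I_MPI_PIN_DOMAIN=socket     # one pinning domain per socket
export OMP_NUM_THREADS=26          # one thread per core on a 26-core socket
mpirun -n 2 benchmarks/Benchmark_ITT --mpi 2.1.1.1 --shm 1024
```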
- Expected Skylake Platinum 8170 dual socket (single prec, single node, 26+26 cores) performance, using NUMA MPI mapping:

```
export KMP_HW_SUBSET=48c1t
mpirun -n 2 benchmarks/Benchmark_ITT --mpi 2.1.1.1 --shm 1024
```
```
==================================================================================
 Memory benchmark
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L         bytes             GB/s       Gflop/s    seconds    GB/s / node
----------------------------------------------------------
  8         786432.000        86.924     14.487     1.173      86.924
 12         3981312.000       268.574    44.762     0.379      268.574
 16         12582912.000      376.029    62.672     0.271      376.029
 20         30720000.000      330.720    55.120     0.308      330.720
 24         63700992.000      384.694    64.116     0.265      384.694
 28         118013952.000     389.749    64.958     0.261      389.749
 32         201326592.000     263.108    43.851     0.387      263.108
 36         322486272.000     247.029    41.172     0.413      247.029
 40         491520000.000     228.836    38.139     0.445      228.836
 44         719634432.000     217.325    36.221     0.467      217.325
 48         1019215872.000    211.450    35.242     0.482      211.450
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L         Wilson         DWF4           DWF5
  8         120793.084     661006.094     606747.933
 12         436396.981     896169.271     833171.916
 16         759287.473     980449.360     941381.756
 24         558486.691     520478.501     641010.858
==================================================================================
 Comparison point result: 938309 Mflop/s per node
==================================================================================
```
- Expected Skylake Platinum 8170 dual socket (single prec, multinode, 26+26 cores) performance, using NUMA MPI mapping and a single rail OPA network. On smaller volumes the performance is communication bound, and for higher core count parts it is likely wise to also investigate dual rail configurations.

Like KNL, Skylake nodes using OPA in multinode simulation appear to require huge pages to be reserved by the system administrator in order to obtain the best performance from OPA. We have not been able to access a Skylake system with huge pages reserved.

The core count, and price, of Skylake parts covers a large range (unlike for KNL). As a (poor) proxy for varying the Skylake part number to investigate the optimum, we have run a single rail OPA system on the 8170 part while varying the number of active threads per socket:
```
==================================================================================
  L         Wilson      DWF4        DWF5
  8         32253.9     154101.5    183347.2
 12         109684.1    262507.3    292317.9
 16         194774.2    249036.2    281591.4
 24         230182.8    254606.6    291424.2
==================================================================================
 Comparison point result: 249036.2 Mflop/s per node
 Comparison point robustness: 0.690
==================================================================================

==================================================================================
  L         Wilson      DWF4        DWF5
  8         32079.1     152026.3    182516.0
 12         108190.3    269583.6    306923.0
 16         212425.1    275198.8    342756.1
 24         238965.6    248059.5    321454.7
==================================================================================
 Comparison point result: 275198.8 Mflop/s per node
 Comparison point robustness: 0.703
==================================================================================

==================================================================================
  L         Wilson      DWF4        DWF5
  8         30231.8     154613.2    188445.1
 12         104854.1    273350.6    330746.8
 16         223157.8    308796.0    379889.6
 24         249450.3    299750.2    374744.2
==================================================================================
 Comparison point result: 308796.0 Mflop/s per node
 Comparison point robustness: 0.715
==================================================================================

==================================================================================
  L         Wilson      DWF4        DWF5
  8         30362.8     155402.7    186385.9
 12         106136.0    281990.1    328575.3
 16         227842.8    320035.4    395040.0
 24         268774.4    281899.5    384608.6
==================================================================================
 Comparison point result: 320035.4 Mflop/s per node
 Comparison point robustness: 0.736
==================================================================================

==================================================================================
  L         Wilson      DWF4        DWF5
  8         28321.5     152194.9    187719.0
 12         101859.6    282493.1    335522.2
 16         218783.2    322475.3    403770.4
 24         265264.5    287054.6    391544.9
==================================================================================
 Comparison point result: 322475.3 Mflop/s per node
 Comparison point robustness: 0.738
==================================================================================

==================================================================================
  L         Wilson      DWF4        DWF5
  8         20621.0     145140.7    185795.4
 12         90771.4     287326.0    335079.4
 16         209516.7    337189.0    413882.4
 24         285756.5    298828.3    415698.2
==================================================================================
 Comparison point result: 337189.0 Mflop/s per node
 Comparison point robustness: 0.744
==================================================================================
```
We have not run the Benchmark_ITT programme on EPYC, as we do not have continuous access to nodes. However, we have run the similar Benchmark_memory_bandwidth and Benchmark_dwf codes on a single dual socket EPYC node.

The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips, each with 8 cores. So even a single socket node contains a quad-chip module; dual socket nodes with 64 cores in total are common. Each chip within the module exposes a separate NUMA domain, giving four NUMA domains per socket, and we recommend one MPI rank per NUMA domain. MPI-3 is recommended, with four ranks per socket and 8 threads per rank. The NUMA layout can be confirmed as sketched below.
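Before choosing the rank/thread mapping, the NUMA layout can be checked with standard tools (this assumes the numactl package is installed):

```
numactl --hardware     # should list eight NUMA nodes on a dual socket EPYC
lscpu | grep -i numa   # summary of NUMA node count and per-node CPU lists
```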
The best advice we have is as follows.

- Configuration:

```
../configure --enable-precision=single \
             --enable-simd=AVX2 \
             --enable-comms=mpi3 \
             CXX=mpicxx
```
- Invocation:

Using MPICH and g++ v4.9.2, the best performance can be obtained with explicit GOMP_CPU_AFFINITY settings for each MPI rank. This can be done by invoking MPI on a wrapper script, omp_bind.sh.

It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank, using MPI3 and shared memory to communicate within the node:

`mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4`

where omp_bind.sh does the following:
```
#!/bin/bash
# Map this MPI rank to one of the 8 NUMA domains, and bind its 8 OpenMP
# threads to every second logical core in that domain's 16-core block.
numanode=`expr $PMI_RANK % 8`
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0`
core1=`expr $basecore + 2`
core2=`expr $basecore + 4`
core3=`expr $basecore + 6`
core4=`expr $basecore + 8`
core5=`expr $basecore + 10`
core6=`expr $basecore + 12`
core7=`expr $basecore + 14`
export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY
# Run the command passed as arguments (e.g. ./Benchmark_dwf ...):
$@
```
Since the optimal cacheblocking is non-default behaviour, the blocking in Benchmark_ITT.cc must be modified prior to compiling.
- Results: expected AMD EPYC 7601 dual socket (single prec, single node, 32+32 cores) with NUMA MPI:

```
Average mflops/s per call per node (full): 420235 : 4d vec
Average mflops/s per call per node (full): 437617 : 4d vec, fp16 comms
Average mflops/s per call per node (full): 522988 : 5d vec
Average mflops/s per call per node (full): 588984 : 5d vec, red black
Average mflops/s per call per node (full): 508423 : 4d vec, red black
```
Memory test:

`mpirun -np 8 ./omp_bind.sh ./Benchmark_memory_bandwidth --threads 8 --mpi 1.2.2.2`
Results:
```
====================================================================================================
  L         bytes       GB/s    Gflop/s    seconds
----------------------------------------------------------
  8         3.15e+06    516     86.1       0.158
 16         5.03e+07    886     148        0.0921
 24         2.55e+08    332     55.3       0.246
 32         8.05e+08    254     42.3       0.321
 40         1.97e+09    254     42.3       0.317
 48         4.08e+09    254     42.3       0.321
 56         7.55e+09    255     42.5       0.297
 64         1.29e+10    254     42.3       0.304
 72         2.06e+10    254     42.4       0.244
 80         3.15e+10    255     42.5       0.247
 88         4.61e+10    254     42.4       0.181
```
Read bandwidth over two streams exceeded 290 GB/s using Benchmark_memory_bandwidth.

- Performance was somewhat brittle, with the above NUMA optimisation required to obtain good performance.
The following configuration is recommended for the Intel Haswell platform:

```
../configure --enable-precision=double \
             --enable-simd=AVX2 \
             --enable-comms=mpi3-auto \
             --enable-mkl \
             CXX=icpc MPICXX=mpiicpc
```
ARM Neon nodes are supported courtesy of work by Nils Meyer and Guido Cossu. ARM is part of our TeamCity continuous integration structure, thanks to assistance and cycle provisioning by the University of Regensburg. The code is thus expected to work on multi-core ARM servers, but performance results are presently absent.
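As a starting point for an ARM Neon build, a sketch only: we have no performance results, and the NEONv8 SIMD target name and the mpicxx wrapper are assumptions to be checked against your Grid version.

```
../configure --enable-precision=single \
             --enable-simd=NEONv8 \
             --enable-comms=mpi3 \
             CXX=mpicxx
```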