Benchmark Multiply

Matrix multiplication benchmark implemented with several optimizations, including blocking, OpenMP, and GPU offloading. Both matrices have the same size.

Compile

To compile the benchmark, first uncomment the corresponding line in the root CMake configuration file: add_subdirectory (benchmark_multiply).

Then, several options are available.

OpenMP

The benchmark can run on multiple cores by using OpenMP. To compile the benchmark with OpenMP, pass the option -DBM_OMP=TRUE to CMake.
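As an illustration of what an OpenMP build changes, here is a minimal sketch of a row-parallel multiplication. The function name multiply_omp, the square row-major layout, and the double element type are assumptions made for the example and are not taken from the benchmark sources. BM_OMP is the macro that appears on the manual compilation command line; guarding the pragma with it is also an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not the benchmark's own code.
// Computes C = A * B for square n x n row-major matrices.
// Each iteration of the outer loop writes a distinct row of C,
// so the rows can be distributed over threads without data races.
std::vector<double> multiply_omp(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 int n) {
    std::vector<double> c(static_cast<std::size_t>(n) * n, 0.0);
#ifdef BM_OMP  // assumed guard: defined when OpenMP support is requested
#pragma omp parallel for
#endif
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const double a_ik = a[static_cast<std::size_t>(i) * n + k];
            for (int j = 0; j < n; ++j) {
                c[static_cast<std::size_t>(i) * n + j] +=
                    a_ik * b[static_cast<std::size_t>(k) * n + j];
            }
        }
    }
    return c;
}
```

Without -fopenmp and -DBM_OMP the same function compiles to a sequential loop, which makes it easy to compare both builds on identical input.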

OpenMP GPU offloading

The benchmark can run on a GPU with OpenMP (version > 4.5). To offload the benchmark to the GPU, pass the option -DBM_OMP_TARGET_GPU=TRUE to CMake.
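A rough sketch of how a multiplication loop can be offloaded with OpenMP target directives is shown below. The function name multiply_target, the square row-major layout, and the double element type are assumptions for the example; the benchmark's actual kernels may be organized differently. BM_OMP_TARGET_GPU is the macro from the manual compilation command line, and using it as a guard here is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not the benchmark's own code.
// Computes C = A * B for square n x n row-major matrices and, when the
// guard macro is defined, asks OpenMP to run the loop nest on a device.
std::vector<double> multiply_target(const std::vector<double>& a_vec,
                                    const std::vector<double>& b_vec,
                                    int n) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<double> c_vec(count, 0.0);
    // Raw pointers are used because map clauses take array sections.
    const double* a = a_vec.data();
    const double* b = b_vec.data();
    double* c = c_vec.data();
#ifdef BM_OMP_TARGET_GPU  // assumed guard: defined for offloading builds
#pragma omp target teams distribute parallel for collapse(2) \
    map(to: a[0:count], b[0:count]) map(from: c[0:count])
#endif
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[static_cast<std::size_t>(i) * n + k] *
                       b[static_cast<std::size_t>(k) * n + j];
            }
            c[static_cast<std::size_t>(i) * n + j] = sum;
        }
    }
    return c_vec;
}
```

The map(to: ...) clauses copy both inputs to the device before the kernel runs, and map(from: ...) copies the result back afterwards. If no device is available, an OpenMP runtime typically falls back to running the region on the host.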

Manual Compilation

The benchmark can also be compiled without CMake:

/opt/rocm/llvm/bin/clang++ -DBM_OMP -DBM_OMP_TARGET_GPU -fopenmp ../src/benchmark_multiply/benchmark_multiply.cpp ../src/benchmark_multiply/multiply_version.cpp -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 -o mult_manocompiled

Example

Cray sequential: cmake .. -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn -DBM_OMP=FALSE -DBM_OMP_TARGET_GPU=FALSE
Cray OpenMP GPU: cmake .. -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn -DBM_OMP=TRUE -DBM_OMP_TARGET_GPU=TRUE

Execute

Available options can be printed with the -h option.

-L 100 number of lines (rows) of matrix A
-C 100 number of columns of matrix A
-V 1 version to use
-B 20 block size for the blocking optimization
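To make the role of the block size concrete, here is a minimal sketch of a blocked (tiled) multiplication. The function name multiply_blocked, the square row-major layout, and the double element type are assumptions for the example and are not taken from the benchmark sources. The block parameter plays the role of the -B value and must be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the benchmark's own code.
// Computes C = A * B for square n x n row-major matrices, walking the
// iteration space in tiles of at most block x block elements so that the
// data touched by the inner loops is more likely to stay in cache.
std::vector<double> multiply_blocked(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     std::size_t n, std::size_t block) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t kk = 0; kk < n; kk += block) {
            for (std::size_t jj = 0; jj < n; jj += block) {
                // Clamp tile bounds so sizes that are not a multiple
                // of the block size are handled correctly.
                const std::size_t i_end = std::min(ii + block, n);
                const std::size_t k_end = std::min(kk + block, n);
                const std::size_t j_end = std::min(jj + block, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

The result does not depend on the block size; only the memory access order changes. This is why it makes sense to sweep several block sizes and compare timings.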

Typical execution:

export OMP_PROC_BIND=close ; export OMP_PLACES=cores ; export OMP_NUM_THREADS=128 ; time ./bin/benchmark_multiply/benchmark_multiply -V 6 -L 4000 -C 4000

Script

Several scripts can be found in the script folder.

bm_execute.sh

A helper script that simplifies running the benchmark and your own tests.

bm_omp_benchmark.sh

This script tests different configurations of the OpenMP binding variables (OMP_NUM_THREADS, OMP_PROC_BIND, OMP_PLACES) and different block sizes. For each combination of thread count, binding policy, and place strategy, the script prints the measured performance. Below is the result of running benchmark_multiply -V 6 -L 4000 -C 4000 -B 40 on an AMD EPYC 7542 32-Core Processor (see the full results in the results folder):

THREADS PROC_BIND OMP_PLACES VERSION BLOCK RES             TIME      
1       true      cores      6       40    140100658790400 585.44    
2       true      cores      6       40    140100658790400 182.41    
3       true      cores      6       40    140100658790400 116.60    
.......     
126     true      cores      6       40    140100658790400 6.04
127     true      cores      6       40    140100658790400 5.96
128     true      cores      6       40    140100658790400 6.09

Strong scaling on a dual-socket server for 1 to 128 threads with OMP_PROC_BIND=true and OMP_PLACES=cores.
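The timings in the table can be turned into the usual strong-scaling metrics: speedup S(p) = T(1) / T(p) and parallel efficiency E(p) = S(p) / p. The sketch below shows the arithmetic; the Scaling struct and the strong_scaling function are hypothetical names introduced for the example. With the values from the table, T(1) = 585.44 and T(128) = 6.09 give a speedup of about 96 and an efficiency of about 75%.

```cpp
// Hypothetical helper types for the example, not part of the benchmark.
struct Scaling {
    double speedup;     // T(1) / T(p)
    double efficiency;  // speedup / p
};

// t1: time measured with one thread, tp: time measured with `threads` threads.
Scaling strong_scaling(double t1, double tp, int threads) {
    const double speedup = t1 / tp;
    return {speedup, speedup / threads};
}
```

For example, strong_scaling(585.44, 6.09, 128) corresponds to the first and last rows of the table.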
