This article provides a recipe for obtaining, configuring, compiling, and running an optimized version of olb-0.8r0 with the cylinder2d workload on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The source for this version of OpenLB can be downloaded from:
Optimizations made to get the best performance on Intel® Xeon® (CPU) and Intel® Xeon Phi™ processors:
- Inter-procedural Optimization by Intel Compiler
- Fine-tuned number of MPI processes and hybrid configuration
- Good use of MCDRAM through numactl
- Profile-guided optimization (PGO) with the Intel Compiler
- Optimized the hot function (bgkCollision, which accounts for 30% of the time) with AVX-512 and FMA intrinsics
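To illustrate the last item, here is a minimal sketch of a textbook D2Q9 BGK collision step written with fused multiply-adds. It is not the optimized bgkCollision from this package: the function name `bgkCollideCell`, the per-cell array layout, and the velocity ordering are assumptions made for the example, and it uses scalar `std::fma` rather than AVX-512 intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// D2Q9 lattice: 9 discrete velocities (CX, CY) with weights W.
// The ordering below is an assumption for this sketch.
constexpr int Q = 9;
constexpr std::array<int, Q> CX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr std::array<int, Q> CY = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr std::array<double, Q> W = {4.0 / 9,  1.0 / 9,  1.0 / 9,
                                     1.0 / 9,  1.0 / 9,  1.0 / 36,
                                     1.0 / 36, 1.0 / 36, 1.0 / 36};

// Relaxes the populations f of one cell toward local equilibrium:
//   f_i <- f_i + omega * (feq_i - f_i)
// with feq_i = w_i * rho * (1 + 3 c.u + 4.5 (c.u)^2 - 1.5 u.u).
void bgkCollideCell(std::array<double, Q>& f, double omega) {
    // Moments: density and momentum.
    double rho = 0.0, jx = 0.0, jy = 0.0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        jx += CX[i] * f[i];
        jy += CY[i] * f[i];
    }
    assert(rho > 0.0);
    const double ux = jx / rho;
    const double uy = jy / rho;
    const double uSqr = ux * ux + uy * uy;

    for (int i = 0; i < Q; ++i) {
        const double cu = CX[i] * ux + CY[i] * uy;
        // cu * (4.5 * cu + 3) + (1 - 1.5 * uSqr), as two fused multiply-adds.
        const double poly =
            std::fma(cu, std::fma(4.5, cu, 3.0), 1.0 - 1.5 * uSqr);
        const double feq = W[i] * rho * poly;
        // Relaxation step, again as one fused multiply-add.
        f[i] = std::fma(omega, feq - f[i], f[i]);
    }
}
```

A vectorized variant would typically apply the same arithmetic to eight doubles per 512-bit register, for example with `_mm512_fmadd_pd`, which is easier when the populations of neighboring cells are stored contiguously.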
Other modifications: The lattice size of the cylinder2d example workload was enlarged so that it benefits from many cores. A timing function was added to measure the computation time. Note that the first iteration (iT=0) contains a lot of serial computation and takes a long time, so it is better to exclude it from performance measurements (for details, see examples/cylinder2d/cylinder2d.cpp).
You may get an official copy of openlb_v0.8 from Link
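The idea of leaving iT=0 out of the measurement can be sketched as follows. This is not the timing code in cylinder2d.cpp: the helper name `timeWithoutFirstIteration` and the callback interface are hypothetical, and the sketch uses `std::chrono` so that it stands alone without MPI.

```cpp
#include <cassert>
#include <chrono>

// Runs step(iT) for iT = 0 .. numIters-1 and returns the wall-clock
// seconds spent in iterations 1 .. numIters-1. Iteration 0 still runs,
// but it is not included in the returned time.
template <typename Step>
double timeWithoutFirstIteration(int numIters, Step step) {
    assert(numIters >= 1);
    using Clock = std::chrono::steady_clock;

    step(0);  // untimed: the serial-heavy first iteration (iT = 0)

    const Clock::time_point start = Clock::now();
    for (int iT = 1; iT < numIters; ++iT) {
        step(iT);
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

In an MPI run, one would typically take the timestamps with `MPI_Wtime` after an `MPI_Barrier`, so that all ranks start the measurement together; that part is left out here.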
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
To build this package, install the Intel® MPI Library and Intel® C++ Composer XE 2016.2.181 or higher products on your host system.
mkdir openlb_v0.8
cd openlb_v0.8
git clone
source /opt/intel/impi/<version>/bin64/
source /opt/intel/composer_xe_<version>/bin/ intel64
Modify the build configuration file in the openlb directory to set the compiler, the compile options, and the parallel mode (here we use the pure MPI version).
Change the compile options as follows:
CXX := mpiicpc
OPTIM := -O3 -g -Wall -fp-model fast=2 -xCORE_AVX2 -fma -ipo
ARPRG := xiar
Then build the library and the cylinder2d executable in examples/cylinder2d:
cd openlb_v0.8
make clean && make
cd examples/cylinder2d/
The cylinder2d executable is now in examples/cylinder2d/. Run the application on the Intel® Xeon® processor:
mpirun -np 36 ./cylinder2d
Next, we demonstrate profile-guided optimization (PGO). First, compile with -prof-gen so that the compiler creates and links an instrumented program from the source code. Details on PGO with the Intel Compiler are discussed in this Link
Change the compile options as follows:
CXX := mpiicpc
OPTIM := -O3 -g -Wall -xMIC-AVX512 -fp-model fast=2 -fma -ipo -prof-gen -DWITH_AVX512
ARPRG := xiar
Then compile and run the application on the Intel® Xeon Phi™ processor:
make clean && make
cd examples/cylinder2d
numactl --membind=1 mpirun -np 272 ./cylinder2d
The instrumented program generates several dynamic information files, one per source file, in the same location as the source code. These files are used in the second compilation. For this second compilation, change the compiler options for the Intel® Xeon Phi™ processor, replacing -prof-gen with -prof-use.
Now change the options again so that the compiler uses the dynamic information files:
CXX := mpiicpc
OPTIM := -O3 -g -Wall -xMIC-AVX512 -fp-model fast=2 -fma -ipo -prof-use -DWITH_AVX512
ARPRG := xiar
Final run, after recompiling with these options:
numactl --membind=1 mpirun -np 272 ./cylinder2d
Contacts: Jun Jin, Shan Zhou
This version is based on the official openlb_v0.8. You may get an official copy of openlb_v0.8 from Link
This project is licensed under the GPL2 License