
Multi-SIMD-mode executables? #177

Open
valassi opened this issue Apr 25, 2021 · 2 comments
Labels
enhancement A feature we want to develop idea Possible new development (may need further discussion) performance How fast is it? Make it go faster!

Comments

@valassi
Member

valassi commented Apr 25, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.


(The first part of this description also serves as documentation of what is available there now!).

The current vectorisation infrastructure supports five SIMD modes, which correspond to different -march options:

  • none, i.e. no SIMD, with no -march flag
  • sse4, i.e. SSE4.2 with 128-bit (xmm) registers, via -march=nehalem
  • avx2, i.e. AVX2 with 256-bit (ymm) registers, via -march=haswell
  • 512y, i.e. AVX512VL with 256-bit (ymm) registers, via -march=skylake-avx512 -mprefer-vector-width=256
  • 512z, i.e. AVX512VL with 512-bit (zmm) registers, via -march=skylake-avx512 -DMGONGPU_PVW512

Note that the above flags are GLOBAL: they are applied to all files in src and in the P*_Sigma* subprocess directory.

In the code, #ifdef's for __SSE4_2__, __AVX2__, __AVX512VL__ and MGONGPU_PVW512 determine how the code is built (i.e. they determine the neppV parameter, see issue #176). Note in particular that

  • AVX512VL is needed for '512y' because it is the mode that adds some AVX512 extensions to the xmm and ymm registers
  • AVX512VL is used also in '512z', even if strictly speaking AVX512F would be enough: using AVX512VL and passing -march=skylake-avx512 excludes, for instance, KNL's AVX512 implementation, which does not support AVX512VL
  • the MG-specific parameter MGONGPU_PVW512 has to be passed because -mprefer-vector-width=256 is not exposed in a macro (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96476); in other words, the #ifdef flags available with -march=skylake-avx512 are the same with or without -mprefer-vector-width=256 (check with gcc -E -dM)
  • eventually, we may think of replacing 'nehalem', 'haswell', 'skylake-avx512' by the microarchitecture feature levels x86-64-v2, x86-64-v3, x86-64-v4 (https://www.phoronix.com/scan.php?page=news_item&px=GCC-11-x86-64-Feature-Levels), which are more or less equivalent, as suggested by MarcoCle, but this will be much later

Note also that the code already has a basic check to fail gracefully if the desired AVX mode is not supported by the present hardware. (This was added after a few tests crashed on the GitHub CI, which seems to have some nodes with AVX512 support and some without it.)

Presently, the build infrastructure for the vectorised builds is controlled by two optional external parameters, AVX and USEBUILDDIR, and it works as follows:

  • By default (if USEBUILDDIR is not set equal to 1), a single build of object files and executables is done in the current directory, for one and only one value of the AVX tag.
  • The AVX tag can be chosen externally (as one of none, sse4, avx2, 512y, 512z), e.g. "make AVX=none". If no AVX parameter is given, the Makefile chooses a default: 512y for gcc (unless the host does not support AVX512VL, in which case avx2 is used), and avx2 for clang.
  • If the user tries "make AVX=none" followed by "make AVX=512y" (or just "make" on gcc, which uses 512y), the build fails. This is to avoid mixing files built with different AVX features. The implementation of this guard is quite basic, using a .build.tag_ file that is added during the build.
  • If USEBUILDDIR is set equal to 1, then the build is performed in a build subdirectory whose name encodes the AVX tag, which makes it possible to build several modes in parallel.
  • In particular, 'make avxall' builds all AVX modes in separate build directories, and 'make cleanall' removes all separate build directories (and also cleans the present directory).

(The second part of this description below is a possible proposed change.)

An alternative to the model above is to build a single (larger) executable supporting multiple SIMD modes:

  • One would link all 5 implementations (none to 512z) into the same executable, and then decide at runtime through a command-line parameter (e.g. ./check.exe -avx 512z) which AVX mode is used.
  • If no parameter is passed, the best AVX mode supported by the hardware would be chosen.

In practice, however, one would have to choose how much of the implementation must be duplicated:

  • Should one build only CPPProcess.o (i.e. only the code using neppV) with the given -march?
  • One would then need to add different namespaces for the different AVX implementations.

It should be noted, however, that these large multi-mode binaries are not typically what the LHC experiments use in their builds.

One advantage of this multi-mode build could be for benchmarking studies (see issue #157).

  • One could build a benchmark container with an executable supporting several AVX modes, where the -avx parameter can be passed to the executable to benchmark different options.
  • HOWEVER, the benchmarking project also makes it easy to create a benchmark container encapsulating separate executables, where a -avx option to the container simply determines which executable is used: there is no real need for multi-SIMD executables.

Very low priority. This will probably not be implemented at all. I am filing it in any case so I do not forget (and especially because I added the documentation of how this works now).

@valassi valassi added enhancement A feature we want to develop performance How fast is it? Make it go faster! idea Possible new development (may need further discussion) labels Apr 25, 2021
@valassi
Member Author

valassi commented Jan 28, 2022

Hi @oliviermattelaer as we briefly discussed today:

  • ok for the moment to stay with the present approach, defining one "AVX" mode at build time, so that the right options can be switched on in the Makefiles

  • interesting eventually to use this "fat binary" approach, where the binaries support more than one AVX mode, all linked together

This is also related to Makefile cleanup #362

@valassi
Member Author

valassi commented Jan 28, 2022

One important point, relevant to the Bridge:

  • each SIMD mode has a different AOSOA structure for momenta
  • however, the Fortran MadEvent side is always the same, independent of the SIMD mode in C++: this is possible because the Fortran momenta array is copied (and transposed if needed) into a C++ array owned by the Bridge

So in practice a multi-SIMD C++ library is perfectly compatible with the Fortran MadEvent integration through the Bridge.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 25, 2022
…n disabled)

Note the following compilation warning
ccache /usr/local/cuda-11.6/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_HARDCODE_CIPC -Xcompiler -fPIC -c gCPPProcess.cu -o gCPPProcess.o
gCPPProcess.cu(53): warning madgraph5#177-D: variable "mg5amcGpu::cIPC" was declared but never referenced