
Multi-SIMD-mode executables? #177

Open
valassi opened this issue Apr 25, 2021 · 2 comments
Labels
enhancement A feature we want to develop idea Possible new development (may need further discussion) performance How fast is it? Make it go faster!

Comments

@valassi
Member

valassi commented Apr 25, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.


(The first part of this description also serves as documentation of what is available there now!).

The current vectorisation infrastructure supports five SIMD modes, which correspond to different -march options:

  • none, i.e. no SIMD, with no -march flag
  • sse4, i.e. SSE4.2 with 128-bit (xmm) registers, via -march=nehalem
  • avx2, i.e. AVX2 with 256-bit (ymm) registers, via -march=haswell
  • 512y, i.e. AVX512VL with 256-bit (ymm) registers, via -march=skylake-avx512 -mprefer-vector-width=256
  • 512z, i.e. AVX512VL with 512-bit (zmm) registers, via -march=skylake-avx512 -DMGONGPU_PVW512

Note that the above flags are GLOBAL: they are applied to all files in src and in the P*_Sigma* subprocess directory.

In the code, #ifdef's for __SSE4_2__, __AVX2__, __AVX512VL__ and MGONGPU_PVW512 determine how the code is built (i.e. they determine the neppV parameter, see issue #176). Note in particular that

  • AVX512VL is needed for '512y' because it is the mode that adds some AVX512 extensions to the xmm and ymm registers
  • AVX512VL is used also in '512z', even if strictly speaking AVX512F would be enough: using AVX512VL and passing -march=skylake-avx512 excludes, for instance, KNL's AVX512 implementation, which does not support AVX512VL
  • the MG-specific parameter MGONGPU_PVW512 has to be passed because -mprefer-vector-width=256 is not exposed in a macro (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96476); in other words, the #ifdef flags available with -march=skylake-avx512 are the same with or without -mprefer-vector-width=256 (check with gcc -E -dM)
  • eventually, we may think of replacing 'nehalem', 'haswell', 'skylake-avx512' by the microarchitecture feature levels x86-64-v2, x86-64-v3, x86-64-v4 (https://www.phoronix.com/scan.php?page=news_item&px=GCC-11-x86-64-Feature-Levels), which are more or less equivalent, as suggested by MarcoCle, but this will be much later

Note also that the code already has a basic check to fail gracefully if the desired AVX mode is not supported by the present hardware. (This was added after a few tests crashed on the GitHub CI, which seems to have some nodes with AVX512 support and some without it.)

Presently, the build infrastructure for the vectorised builds is controlled by two optional external parameters, AVX and USEBUILDDIR, and it works as follows:

  • By default (if USEBUILDDIR is not set equal to 1), a single build of object files and executables is done in the current directory, for one and only one value of the AVX tag.
  • The AVX tag can be chosen externally (as one of none, sse4, avx2, 512y, 512z), e.g. "make AVX=none". If no AVX parameter is given, the Makefile chooses a default: 512y for gcc (unless the host does not support AVX512VL, in which case avx2 is used), and avx2 for clang.
  • If the user tries "make AVX=none" followed by "make AVX=512y" (or just "make" on gcc, which uses 512y), the build fails. This is to avoid mixing files built with different AVX features. The implementation of this guard is quite basic, using a .build.tag_ file that is added during the build.
  • If USEBUILDDIR is set equal to 1, then the build is performed in a build subdirectory whose name encodes the AVX tag, which makes it possible to build several modes in parallel.
  • In particular, 'make avxall' builds all AVX modes in separate build directories, and 'make cleanall' removes all separate build directories (and also cleans the present directory).

(The second part of this description below is a possible proposed change.)

An alternative to the model above is to build a single (larger) executable supporting multiple SIMD modes:

  • One would link all 5 implementations (none to 512z) into the same executable, and then decide at runtime through a command-line parameter (e.g. ./check.exe -avx 512z) which AVX mode is used.
  • If no parameter is passed, the best AVX mode supported by the hardware would be chosen.

In practice, however, one would have to choose how much of the implementation must be duplicated:

  • Should one build only CPPProcess.o (i.e. only the code using neppV) with the given -march?
  • One would then need to add different namespaces for the different AVX implementations.

It should be noted, however, that these large multi-mode binaries are not typically what the LHC experiments use in their builds.

One advantage of this multi-mode build could be for benchmarking studies (see issue #157).

  • One could build a benchmark container with an executable supporting several AVX modes, where the -avx parameter can be passed to the executable to benchmark different options.
  • HOWEVER, the benchmarking project also makes it easy to create a benchmark container encapsulating separate executables, where a -avx option to the container simply determines which executable is used: there is no real need for multi-SIMD executables.

Very low priority. This will probably not be implemented at all. I am filing it in any case so I do not forget (and especially because I added the documentation of how this works now).

@valassi valassi added enhancement A feature we want to develop performance How fast is it? Make it go faster! idea Possible new development (may need further discussion) labels Apr 25, 2021
@valassi
Member Author

valassi commented Jan 28, 2022

Hi @oliviermattelaer as we briefly discussed today:

  • ok for the moment to stay with the present approach, defining one "AVX" mode at build time, so that the right options can be switched on in the Makefiles

  • interesting eventually to use this "fat binary" approach, where the binaries support more than one AVX mode, all linked together

This is also related to Makefile cleanup #362

@valassi
Member Author

valassi commented Jan 28, 2022

One important point, relevant to the Bridge:

  • each SIMD mode has a different AOSOA structure for momenta
  • however, the Fortran MadEvent side is always the same, independent of the SIMD mode in C++: this is possible because the Fortran momenta array is copied (and transposed if needed) into a C++ array owned by the Bridge

So in practice a multi-SIMD C++ library is perfectly compatible with the Fortran MadEvent integration through the Bridge.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 25, 2022
…n disabled)

Note the following compilation warning
ccache /usr/local/cuda-11.6/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_HARDCODE_CIPC -Xcompiler -fPIC -c gCPPProcess.cu -o gCPPProcess.o
gCPPProcess.cu(53): warning madgraph5#177-D: variable "mg5amcGpu::cIPC" was declared but never referenced