Manual vectorization of the spreader #452
Replies: 10 comments 5 replies
-
I updated the fork branch with comments and some minor optimizations. @lu1and10 provided some optimized Horner code that is even faster than my version.
-
Both spread and interp comparison (master vs spreader-vectorization-v2 commit 3e24326), GCC 11.4, AMD Ryzen 5700U laptop, one thread. 3D interp shows a great speed-up (1D is less dramatic):
-
Tuning the compiler flags leads to some interesting results. Xeon Gold: Clang gets better in 1D but GCC gets much worse; in 2D/3D Clang gets much worse but the other compilers get faster. There is not much difference in 1D compared to previous results; in 2D and 3D all the compilers improve and reach the level of GCC 13, which was the fastest before. EDIT: Added a test on the old AMD machine. In this case performance is better on average; Clang is slower than GCC in most cases.
-
For my AMD Ryzen 5700U laptop, 1 thread, GCC 11.4: 1D is much faster, but low-accuracy double-precision 2D and 3D are slower (although single-precision 2D and 3D are faster). Blue is master of 7-1-24; yellow is Marco's fork, interp-vectorization branch:
-
I fixed some things and re-ran the tests. TL;DR: manual vectorization improves performance on all CPUs except one. Old AMD Rome: gains across the board; 2D and 3D with ns=2 get a bit slower. New AMD Genoa: same as the old AMD, and in general performance is much better. Old Intel Xeon Gold: weird behavior, I wonder if there is something wrong with the machine. In 1D Clang is fast and GCC gets slower; in 2D/3D Clang dies and GCC gets faster. New Intel Xeon W5: good performance increase across the board.
-
I ran my benchmarker on my AMD laptop with the new flags in the makefile (interp-vec branch; yellow) against master (blue). We see a slight worsening for the spreader, and varying performance for interp (worse for small w, better for larger):
-
My master flags are as in master:
-
I attach the results for Gencoeff with 1.25: great improvement in 1D with the new coefficients and manual vectorization combined.
-
Manually vectorizing Horner evaluation and the spreader can give speedups, sometimes doubling performance. In particular, the Clang auto-vectorizer seems not to be aggressive, leaving performance on the table.

The vectorization is based on the spreader width (`ns`). `ns` is a runtime variable, hence it is not possible to rely on it to do any fine-grained vectorization. However, the supported range for `ns` is `[2, 16]`. Under these assumptions it is possible to generate compile-time kernels (using C++17 and templates) for each of these values; at runtime the correct kernel is then selected. This improves performance by a large margin (up to 100% in certain cases, more details below). The drawback is more verbose and less clear code. By vectorizing manually it is also possible to do extra optimizations that rely on knowing that some values are "don't cares": even if garbage or zeros are written to them, the final result is unaffected because these values are ignored. Another drawback is that padding is required in certain places to guarantee that no out-of-bounds memory accesses take place.

The vectorization relies on xsimd. The reason is that intrinsics are not portable, while this library is well maintained and supports ARM, macOS, Windows and Linux.
The source code is available here: fork
I would greatly appreciate any suggestions to improve it.
EDIT: 06 Jun 2024: I updated the branch to the most recent one.
Performance measurements:
Older AMD processor (does not have AVX512)
Performance improved in all cases, especially in 1D and 2D where it doubles; 3D shows minor improvements.
Older Intel Gold (supports AVX512)
Similar to the AMD machine, but with higher improvements in 3D.
Recent Intel Xeon CPU (supports AVX512)
Improvements across the board, but some minor slowdown in 3D with `ns = [13-15]`.

It is worth noting that manually vectorized code compiled with Clang usually offers the best performance; auto-vectorization in Clang is much slower than in GCC.