Manual vectorization of the spreader #452
Replies: 10 comments 5 replies
-
I updated the fork branch with comments and some minor optimizations. @lu1and10 provided some optimized Horner code that is even faster than my version.
-
Both spread and interp comparison (master vs spreader-vectorization-v2 commit 3e24326), GCC 11.4, AMD Ryzen 5700U laptop, one thread. 3D interp shows a great speed-up (1D is less dramatic):
-
Tuning the compiler flags leads to some interesting results. Xeon Gold: Clang gets better in 1D but GCC gets much worse; in 2D/3D Clang gets much worse but the other compilers get faster. There is not much difference in 1D compared to previous results; in 2D and 3D all the compilers improve and reach the level of GCC 13, which was the fastest before. EDIT: Added a test on the old AMD machine. In this case performance is better on average; Clang is slower than GCC in most cases.
-
For my AMD Ryzen 5700U laptop, 1 thread, GCC 11.4: 1D is much faster, but low-accuracy double-precision 2D and 3D are slower (although single-precision 2D and 3D are faster). Blue is master of 7-1-24; yellow is Marco's fork, interp-vectorization branch:
-
I fixed some things and re-ran the tests. TL;DR: manual vectorization improves performance on all CPUs except one. Old AMD Rome: gains across the board; 2D and 3D with ns=2 get a bit slower. New AMD Genoa: same as the old AMD, and in general performance is much better. Old Intel Xeon Gold: weird behavior, I wonder if there is something wrong with the machine. In 1D Clang is fast and GCC gets slower; in 2D/3D Clang dies and GCC gets faster. New Intel Xeon W5: good performance increase across the board.
-
I ran my benchmarker on my AMD laptop with the new flags in the makefile (interp-vec branch; yellow) against master (blue). We see a slight worsening for the spreader, and varying performance for interp (worse for small w, better for larger):
-
My master flags are as in master:
-
I attach the results for Gencoeff with 1.25: great improvement in 1D with the new coefficients and manual vectorization combined.
-
Manually vectorizing Horner evaluation and the spreader can give speedups, sometimes doubling performance. In particular, the Clang auto-vectorizer seems not to be aggressive, leaving performance on the table.

The vectorization is based on the spreader width (`ns`). `ns` is a runtime variable, hence it is not possible to rely on it to do any fine-grained vectorization. However, the supported range for `ns` is `[2, 16]`. Under these assumptions it is possible to generate compile-time kernels (using C++17 and templates) for each of these values; at runtime the correct kernel is then selected. This improves performance by a large margin (up to 100% in certain cases, more details below). The drawback is more verbose and less clear code. By vectorizing manually it is also possible to do extra optimizations that rely on knowing that some values are "don't cares": even if garbage or zeros are written to them, the final result is unaffected because these values are ignored. Another drawback is that padding is required in certain places to guarantee that no out-of-bounds memory accesses take place.

The vectorization relies on xsimd. The reason is that intrinsics are not portable, while this library is well maintained and supports ARM, macOS, Windows and Linux.
The source code is available here: fork
I would greatly appreciate any suggestions to improve it.
EDIT: 06 Jun 2024: I updated the branch to the most recent one.
Performance measurements:
Older AMD processor (does not have AVX512)
Performance improved in all cases, especially in 1D and 2D where it doubles; 3D shows minor improvements.
Older Intel Gold (supports AVX512)
Similar to the AMD machine, but with higher improvements in 3D.
Recent Intel Xeon CPU (supports AVX512)
Improvements across the board, but some minor slowdown in 3D with `ns = [13-15]`.

It is worth noting that manually vectorized code compiled with Clang usually offers the best performance; auto-vectorization in Clang is much slower than in GCC.