Optimized large SIMD butterflies for better register usage #134
This PR rewrites the size-16, size-24, and size-32 SSE, NEON, and WASM SIMD butterflies. It optimizes their use of temporary variables to reduce the number of register spills, similar to the way the bigger AVX butterflies do. Each butterfly now loads data as late as possible, does as much work as possible on a single piece of data before moving on, then stores it as soon as possible.
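To illustrate the ordering described above, here is a minimal scalar sketch, not the code from this PR. The `Complex` type and the functions `butterfly4`, `columns_load_all`, and `columns_load_late` are hypothetical names invented for the example, and a scalar `Complex` stands in for what would be a SIMD vector in the real butterflies. Both variants run the column pass of a 4x4 decomposition and produce the same values. The only difference is how many temporaries are live at once.

```rust
// Hypothetical scalar stand-in for a SIMD vector of complex numbers.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex {
    re: f32,
    im: f32,
}

impl Complex {
    fn add(self, o: Complex) -> Complex {
        Complex { re: self.re + o.re, im: self.im + o.im }
    }
    fn sub(self, o: Complex) -> Complex {
        Complex { re: self.re - o.re, im: self.im - o.im }
    }
    // Multiply by -i: (re + i*im) * (-i) = im - i*re.
    fn rotate_neg_i(self) -> Complex {
        Complex { re: self.im, im: -self.re }
    }
}

// Size-4 forward DFT on four values.
fn butterfly4(x: [Complex; 4]) -> [Complex; 4] {
    let s02 = x[0].add(x[2]);
    let d02 = x[0].sub(x[2]);
    let s13 = x[1].add(x[3]);
    let d13 = x[1].sub(x[3]).rotate_neg_i();
    [s02.add(s13), d02.add(d13), s02.sub(s13), d02.sub(d13)]
}

// Variant A: load all 16 inputs up front, compute, then store everything.
// All 16 inputs and all 16 outputs are live at the same time.
fn columns_load_all(buf: &mut [Complex; 16]) {
    let input = *buf;
    let mut out = [Complex { re: 0.0, im: 0.0 }; 16];
    for c in 0..4 {
        let r = butterfly4([input[c], input[c + 4], input[c + 8], input[c + 12]]);
        for k in 0..4 {
            out[c + 4 * k] = r[k];
        }
    }
    *buf = out;
}

// Variant B: load one column as late as possible, finish all work on it,
// and store it immediately. Only about four values are live at any time.
fn columns_load_late(buf: &mut [Complex; 16]) {
    for c in 0..4 {
        let col = [buf[c], buf[c + 4], buf[c + 8], buf[c + 12]];
        let r = butterfly4(col);
        for k in 0..4 {
            buf[c + 4 * k] = r[k];
        }
    }
}
```

In scalar Rust the optimizer may well compile both variants to the same machine code, so this only shows the shape of the transformation. The register-pressure benefit is expected when each value occupies a full vector register and the register file is small.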
I tried applying the same transformation to the scalar butterflies and to the AVX butterflies that don't have it, or don't do it completely, but didn't see any benefit. For AVX at least, the optimizer aggressively resisted the transformation and reordered the code into a less performant form. I didn't look into the code generation for the scalar code, but it wouldn't surprise me if it just doesn't use registers very well in the first place.
SSE benefited the most from this change because it's the most register-constrained: it only has 16 FP registers, while NEON has 32. I assume WASM SIMD is JIT-compiled and doesn't use registers optimally at all. The size-32 butterfly is faster on all 3 instruction sets, though. It is fast enough that I was able to remove the section of the planner that switches between butterfly8 and butterfly32 for radix 4. All 3 now use butterfly32 unconditionally.
Benchmarks: in each pair of results, the first value is from before this change and the second is from after it.
SSE
NEON
WASM SIMD
This is independent of the WIP deinterleaving project in the other PR, but I did stumble across this optimization while testing different options for butterfly optimization. In particular, I determined that if the butterflies start out interleaved, deinterleaving them was not faster than just computing them interleaved. That holds at least on SSE, which is probably the best at handling interleaved multiplies due to the addsub instruction. I intend to test deinterleaved butterflies on the other platforms, but I'm reluctant to implement different algorithms on different platforms, so even if it's marginally faster to deinterleave the NEON or WASM SIMD butterflies, I will probably leave them interleaved just to keep the implementations identical.
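For readers unfamiliar with why addsub helps interleaved multiplies, the following is a lane-by-lane scalar model of the usual pattern, written as a sketch rather than taken from this PR. The function names `addsub` and `mul_interleaved` are hypothetical, and plain `[f32; 4]` arrays stand in for a 128-bit register holding two complex numbers as `[re0, im0, re1, im1]`. The comments name the SSE3 instructions each step is modeled on.

```rust
// Models SSE3 `addsubps`: subtract in even lanes, add in odd lanes.
fn addsub(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]
}

// Multiplies two pairs of interleaved complex numbers without deinterleaving.
fn mul_interleaved(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // Duplicate the real parts of b into both lanes of each pair (like `movsldup`).
    let b_re = [b[0], b[0], b[2], b[2]];
    // Duplicate the imaginary parts of b (like `movshdup`).
    let b_im = [b[1], b[1], b[3], b[3]];
    // Swap re and im within each complex number of a (a shuffle).
    let a_swapped = [a[1], a[0], a[3], a[2]];

    // t1 = [a.re*b.re, a.im*b.re, ...]
    let t1 = [a[0] * b_re[0], a[1] * b_re[1], a[2] * b_re[2], a[3] * b_re[3]];
    // t2 = [a.im*b.im, a.re*b.im, ...]
    let t2 = [
        a_swapped[0] * b_im[0],
        a_swapped[1] * b_im[1],
        a_swapped[2] * b_im[2],
        a_swapped[3] * b_im[3],
    ];

    // Even lanes: a.re*b.re - a.im*b.im. Odd lanes: a.im*b.re + a.re*b.im.
    addsub(t1, t2)
}
```

The final step produces the real and imaginary parts in their interleaved positions with a single instruction. An instruction set without a mixed add/subtract operation would need an extra negation or sign flip at that point.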
I got an Apple M1 laptop so I could do NEON coding directly. This is my first time benchmarking with it, and I'm very impressed with its performance compared to SSE on my relatively new Intel PC.