Autogenerate entire prime butterfly files, rather than just chunks #137

ejmahler · 2024-03-19T06:47:09Z

This PR replaces the old Python script for generating chunks of prime butterflies with a Rust program (under tools/gen_simd_butterflies) that generates the entire file. It's a single program capable of generating output for SSE, NEON, and Wasm SIMD, selected by a command line argument. When we support FCMA, it will be trivial to add that as an option. More instruction sets like AVX could be added, but that will take more work, so I didn't include them here.

I've been thinking about the long-term sustainability of our current approach to SIMD. We're talking about adding a new code path for the FCMA instruction set, there's AVX512 to plan around, AVX10 was announced, etc. I don't think it's sustainable to keep copying as much code as we're copying if we want implementations for each of these, which I do. I think the biggest challenge is the butterflies. Even with our excellent boilerplate macros, they are a massive pain to work with and it's very easy to make mistakes. Any system-wide change we want to make has to be duplicated dozens of times: once per length, and then again per instruction set.

The high-level goal of this PR is to wrangle some of the complexity of our butterfly system - to take something that is extremely repetitive and verbose and replace it with something concise - all while keeping everything human-readable. This code generation functions very similarly to a macro, but unlike a macro, we can easily see the expanded code, contributing to the goal of human readability I mentioned above.

I also don't think the O(n^2) pile of multiplies and adds in the body of these FFTs is easy to replicate with a macro, and that's the most important part to automate!
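To illustrate why this part suits a code generator better than a macro, here is a minimal sketch of a generator loop that emits the multiply-add pile as text. It is not the code in tools/gen_simd_butterflies: the function names and the emitted identifiers (`fma`, `nfma`, `tw1`, `sum1`, `rot1`, `zero`, `x0`) are hypothetical. The point is the index arithmetic: each term needs the twiddle index `j * k mod n`, folded into the first half of the twiddle table, with a sign flip on the sine term when the fold happens.

```rust
/// Fold the twiddle index j*k (mod n) into the range 1..=(n-1)/2.
/// Returns the folded index and whether the sine (imaginary) term changes sign,
/// using cos(2*pi*(n-i)/n) = cos(2*pi*i/n) and sin(2*pi*(n-i)/n) = -sin(2*pi*i/n).
fn fold_twiddle(n: usize, j: usize, k: usize) -> (usize, bool) {
    let idx = (j * k) % n;
    if idx > n / 2 {
        (n - idx, true)
    } else {
        (idx, false)
    }
}

/// Emit the text of the O(n^2) math for an odd prime length n.
/// All identifiers in the emitted text are hypothetical placeholders.
fn emit_math_pile(n: usize) -> String {
    let half = (n - 1) / 2;
    let mut out = String::new();
    for j in 1..=half {
        out.push_str(&format!("let mut a{j} = x0;\n"));
        out.push_str(&format!("let mut b{j} = zero;\n"));
        for k in 1..=half {
            let (t, negate) = fold_twiddle(n, j, k);
            out.push_str(&format!("a{j} = fma(a{j}, tw{t}.re, sum{k});\n"));
            // A folded index flips the sign of the sine term only.
            let op = if negate { "nfma" } else { "fma" };
            out.push_str(&format!("b{j} = {op}(b{j}, tw{t}.im, rot{k});\n"));
        }
    }
    out
}
```

Calling `emit_math_pile(5)` produces 12 lines of text; the same loop scales to any odd prime without anyone writing the terms by hand.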

If this ends up making our code easier to reason about, I would like to expand it to perhaps include good-thomas butterflies and the bigger mixed radix butterflies. There will always be some hand-written ones we don't want to automate (1, 2, 3, 4, 8 being the most likely sizes that we always hand-write) but all others operate on a pretty strict formula that we don't need to trouble ourselves with writing over and over and over again.

This PR also rearranges the body of these FFTs so that the O(n^2) pile of math is expressed as FMAs, meaning we can use explicit FMA on NEON while keeping the implementation of each instruction set identical. It mildly helps Wasm SIMD and largely doesn't change the performance of SSE, but the rearranging alone brings solid performance wins to NEON. Converting the math ops to FMA helps even more.

This PR also moves the rotations before the big math pile - that's because for FCMA, we'll be able to omit the rotations entirely as long as we make sure to start from the "rotations first" format, and I want to keep all the implementations as similar as possible, so they all do rotations first.
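As a rough scalar illustration of the "rotations first, then a pile of FMAs" shape, here is a sketch of a prime-length butterfly. It is not the generated SIMD code: the `Complex` type and every function name are hypothetical, twiddles are computed inline rather than precomputed, and no vectorization is attempted. It relies on the identity X[j] = x[0] + sum_k cos(2*pi*j*k/n) * (x[k] + x[n-k]) + sum_k sin(2*pi*j*k/n) * (-i) * (x[k] - x[n-k]), with X[n-j] obtained by subtracting the second sum instead of adding it.

```rust
use std::f64::consts::PI;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex {
    re: f64,
    im: f64,
}

/// Multiply by -i: (re, im) -> (im, -re).
fn rotate_neg90(c: Complex) -> Complex {
    Complex { re: c.im, im: -c.re }
}

/// acc + scalar * v, using one fused multiply-add per component.
fn fma(acc: Complex, scalar: f64, v: Complex) -> Complex {
    Complex {
        re: scalar.mul_add(v.re, acc.re),
        im: scalar.mul_add(v.im, acc.im),
    }
}

/// Forward DFT of odd prime length, structured as: sums/differences,
/// rotations, then nothing but real-by-complex FMAs.
fn prime_butterfly(x: &[Complex]) -> Vec<Complex> {
    let n = x.len();
    let half = (n - 1) / 2;
    let sums: Vec<Complex> = (1..=half)
        .map(|k| Complex { re: x[k].re + x[n - k].re, im: x[k].im + x[n - k].im })
        .collect();
    // Rotations happen here, before the math pile.
    let rots: Vec<Complex> = (1..=half)
        .map(|k| rotate_neg90(Complex { re: x[k].re - x[n - k].re, im: x[k].im - x[n - k].im }))
        .collect();

    let mut out = vec![x[0]; n];
    for s in &sums {
        out[0].re += s.re;
        out[0].im += s.im;
    }
    for j in 1..=half {
        let mut a = x[0];
        let mut b = Complex { re: 0.0, im: 0.0 };
        for k in 1..=half {
            // Real code would read precomputed twiddle constants instead.
            let angle = 2.0 * PI * ((j * k) % n) as f64 / n as f64;
            a = fma(a, angle.cos(), sums[k - 1]);
            b = fma(b, angle.sin(), rots[k - 1]);
        }
        out[j] = Complex { re: a.re + b.re, im: a.im + b.im };
        out[n - j] = Complex { re: a.re - b.re, im: a.im - b.im };
    }
    out
}
```

Because the rotation is applied to the differences up front, the inner loop contains only real-scalar-times-complex multiply-adds, which is the form that maps directly onto FMA instructions.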

Attached are performance measurements.


ejmahler · 2024-03-19T06:52:50Z

Aside from fixing the build, there are 2 more tasks to complete before I can merge this:

Evaluate using handlebars templates instead of tinytemplate, with the hope that it removes the need to preprocess the file
Set up a CI step to verify that the autogenerated output matches the actual contents of the file, similar to the way rustfmt works, to make sure that code changes are properly propagated to the generation script
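A minimal sketch of what such a verification step could look like, in the spirit of `rustfmt --check`: regenerate the output in memory, compare it with the checked-in file, and return a nonzero exit code on mismatch so CI fails. The function names, the `CheckResult` type, and the exit-code convention are assumptions for illustration, not the tool's actual interface.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing freshly generated source with the checked-in file.
#[derive(Debug, PartialEq)]
enum CheckResult {
    UpToDate,
    Stale,
}

/// Compare the generated text with the file on disk, byte for byte.
fn check_generated(path: &Path, generated: &str) -> io::Result<CheckResult> {
    let on_disk = fs::read_to_string(path)?;
    Ok(if on_disk == generated {
        CheckResult::UpToDate
    } else {
        CheckResult::Stale
    })
}

/// In check mode, never write: report a mismatch through the exit code.
/// Otherwise overwrite the file with the generated text.
fn run(path: &Path, generated: &str, check: bool) -> io::Result<i32> {
    if check {
        match check_generated(path, generated)? {
            CheckResult::UpToDate => Ok(0),
            CheckResult::Stale => {
                eprintln!("{} is out of date; rerun the generator", path.display());
                Ok(1)
            }
        }
    } else {
        fs::write(path, generated)?;
        Ok(0)
    }
}
```

A CI job would call the check variant for each instruction set and fail the build on any nonzero result, which catches hand edits to generated files as well as template changes that were never regenerated.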

HEnquist · 2024-03-20T22:53:58Z

This looks great!
It would be neat to avoid preprocessing. I'm not familiar with tinytemplate and handlebars, but I use Jinja2 in Python, and it supports custom delimiters. There is minijinja for Rust, which also supports this:
https://docs.rs/minijinja/latest/minijinja/syntax/index.html#custom-delimiters
This could maybe be used to avoid escaping.

ejmahler added 13 commits March 16, 2024 02:30

Created a rust tool to autogenerate the full SSE prime butterfly file

f84d887

Removed unnecessary include

95fcb2d

added prime-specific benches

f954b64

formatting fix

9b4dad2

Make template architecture-agnostic

ddad578

Added wasm simd prime butterfly benches

f2ba41d

wasm simd refactor: rename the non-newtype util methods to _v128

46cad6f

Hook up wasm simd to use the prime butterfly generator

3ed4c09

add neon prime butterfly benchmarks

508c085

Check in cargo.lock for gen_simd_butterflies

a9d5ac9

Converted neon prime butterflies over to the autogenerated system and…

5316ca6

… FMA

Use the same testing feature detection as neon

61e8e36

Delete gen_sse_butterflies.py

a4eddae

Exclude wasm simd benchmarks from rustfmt

550b625

ejmahler changed the title ~~Complete autogeneration of entire prime butterfly files, rather than just chunks~~ Autogenerate entire prime butterfly files, rather than just chunks Mar 19, 2024

ejmahler added 9 commits March 22, 2024 02:38

Simplify the template by pre-packaging the struct names

aca46a6

Simplified butterfly macro invocation by removing unused parameters

6577d93

Fix macro parameter

acecf49

Fix struct name

b037ca4

Use handlebars templates to avoid preprocessing

92158f9

Added "--check" mode to the simd butterfly script

409cdd3

Added a CI step to verify autogenerated code match

94b5cef

Test new CI step by submitting mismatching code

d80c31e

Back out test code, remove handlebars import

f8f33fc

ejmahler merged commit 801d6a4 into master Mar 25, 2024
19 checks passed

ejmahler deleted the autogenerate-prime-simd branch March 25, 2024 09:20
