-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Autogenerate entire prime butterfly files, rather than just chunks #137
Conversation
Aside from fixing the build, there are 2 more tasks to complete before I can merge this:
|
This looks great! |
This PR replaces the old python script for generating chunks of prime butterflies with a Rust program (under
tools/gen_simd_butterflies
) that generates the entire file. It's a single program capable of generating output for SSE, NEON, and Wasm Simd, determined by a command line argument. When we support FCMA, it will be trivial to add that as an option. More instruction sets like AVX could be added, but it will take more work so I didn't include it here.I've been thinking about long-term sustainability of current approach to SIMD. We're talking about adding a new code path for the FCMA instruction set, there's AVX512 to consider planning around, AVX10 was announced, etc, and I don't think it's sustainable to keep copying as much code as we're copying if we want implementations for each of these - which I do. I think the biggest challenge is the butterflies. Even with our excellent boilerplate macros, they are a massive pain to work with and it's very easy to make mistakes. Any system-wide change we want to make has to be duplicated dozens of times - once per length, and then repeated per instruction set.
The high-level goal of this PR is to wrangle some of the complexity of our butterfly system - to take something that is extremely repetitive and verbose and replace it with something concise - all while keeping everything human-readable. This code generation functions very similarly to a macro, but unlike a macro, we can easily see the expanded code, contributing to the goal of human readability I mentioned above.
I also don't think our O(n^2) pile of multiplies and adds we do in the body of these FFTs is easy to replicate with a macro, and that's the most important part to automate!
If this ends up making our code easier to reason about, I would like to expand it to perhaps include good-thomas butterflies and the bigger mixed radix butterflies. There will always be some hand-written ones we don't want to automate (1, 2, 3, 4, 8 being the most likely sizes that we always hand-write) but all others operate on a pretty strict formula that we don't need to trouble ourselves with writing over and over and over again.
This PR also rearranges the body of these FFTs so that the O(n^2) pile of math expressed as FMAs, meaning we can use explicit FMA on neon while keeping the implementation of each instruction set identical. It mildly helps Wasm Simd and largely doesn't change the performance of SSE, but the rearranging alone brings solid performance wins to Neon. Converting the math ops to FMA helps even more.
This PR also moves the rotations before the big math pile - that's because for FCMA, we'll be able to omit the rotations entirely as long as we make sure to start from the "rotations first" format, and I want to keep all the implementations as similar as possible, so they all do rotations first.
Attached are performance measurements.