RadixN: Like Radix4, but supports size 2,3,4,5,6, and 7 cross-FFTs all in the same instance #132
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR implements a new FFT algorithm called RadixN, and integrates it into the planner. It efficiently handles FFT sizes containing small prime factors. So far, it's scalar-only, but I intend to port it over to the SIMD implementations next.
The key insight here is that even though we use the same cross-FFT size at every "level" of Radix4, there's no intrinsic limitation that forces us to do so - that's just what we need for powers of four, so that's what radix4 does.
RadixN removes this restriction, allowing every cross-FFT to have a different butterfly size. I picked 2,3,4,5,6, and 7 as the initial butterflies, but if it improves performance to add 9 or 10 or 11 for example, I'm open to adding it. If we add a lot of composite sizes, I suspect there will be a combinatoric explosion of complexity of which overlapping sets of factors to choose from (Ie, if we have 2 * 3 * 5, we could collapse that to 3 * 10, or 6 * 5, or 2 * 15, which is fastest? it gets messy really fast, so if we start adding more composites than we have now i'd like it to come with an architecture for managing that complexity.
RadixN can handle factors larger than its largest butterfly length, by using those larger factors as its base FFT length. For example, RadixN can gracefully handle something like 4 * 9 * 5 * 7 * 101, by processing the 4 * 9 * 5 * 7 internally, and using 101 as the base FFT size.
Benchmarking shows that RadixN is marginally worse than Radix4 at power of two sizes, as you might expect given that it has more branches, etc. So the planner still uses Radix4 when computing 4*k * base, and still tries to maneuver the base to make 4^k happen.
Where the performance of RadixN really shines is in collections of small factors - For 44100 (2^2 * 3^2 * 5^2 * 7^2), we get a 60% improvement:
For 48000 (2^7 * 3 * 5^3) we get a 45% improvement:
These improvements come from eliminating the overhead associated with the multiple transposes that MixedRadix does. I was inspired by a comment from #65 - "With this shuffler, there is no need to use mixer radix for long ffts, since radix4 is now the fastest also there." I remembered this quote, and wondered - if it's true for powers of two, why wouldn't it also be true for non powers of two?
I plan to do one more pass of cleanup and clarity before submitting, and then like I said I'll port it over to the SIMD implementations.