ARM support #4

Open
jpaasen opened this issue Jul 7, 2021 · 4 comments

Comments

@jpaasen

jpaasen commented Jul 7, 2021

Has any work been done to enable ARM NEON support, either in this repo or in any forks?

@V-Kuzmin

Do you have any plans to support ARM in the near future?

@Pflugshaupt

I added ARM64 NEON support here: https://github.com/Pflugshaupt/muFFT
I used a different CMake setup, but the actual SIMD parts work; it would just need proper patches to the internal CMake files.
Unfortunately, I found the result to be slower than other NEON FFT libraries (especially pffft). My use case was 1D real-to-complex FFTs on macOS with M1 CPUs.
I guess the main reason for the slowness is the way the complex numbers are arranged in the registers. Packing two complex numbers into each 128-bit register leads to a lot of permutations and shuffles that could be avoided with a layout that keeps the real parts in one 128-bit register and the imaginary parts in another.

@jpaasen
Author

jpaasen commented Nov 19, 2022

I'm curious: did you also benchmark the vDSP FFT against pffft (and muFFT) on an M1 Mac?

@Pflugshaupt

Pflugshaupt commented Nov 19, 2022

Yes, but I only benchmark inside my current project, so this is not general at all. I do heaps of 8192-point real-to-complex 1D FFTs. For this workload, pffft on ARM64 with NEON is faster than vDSP and faster than my patch of muFFT.
As far as I know, vDSP has an FFT weakness on M1 Macs. pffft with NEON is much faster than vDSP for real-to-complex FFTs of size 2^10 to 2^16. It's possible vDSP has improved since I tested it on the initial M1 systems, but it definitely had a weakness when it came to FFTs on ARM (using the old calls that allow FFT sizes > 2^12).
My patch of muFFT probably falls somewhere in the middle. I hoped it would be faster, but it wasn't. Maybe it could be optimized more (for instance, by using NEON FMA), but pffft doesn't use that either.
My guess is that the big difference comes from how the complex numbers are arranged in memory and in the registers. muFFT uses a strict interleaved scheme, where each 128-bit register holds 2 complex numbers with 32-bit real and imaginary parts. pffft uses two 128-bit registers to hold 4 complex numbers, with all real parts in one register and all imaginary parts in the other. The pffft scheme leads to fewer shuffle and permute operations - especially in complex multiplications, where the muFFT routine does more shuffling than calculating.
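
To make the layout argument concrete, here is a minimal scalar sketch in C of a complex multiplication in both schemes. It is not code from muFFT or pffft: the `vec4` type (four float lanes standing in for one 128-bit register) and the function names `cmul_interleaved` and `cmul_split` are hypothetical, chosen only for illustration. On real NEON hardware each commented "shuffle" would be a separate permute instruction; here they are simply lane copies.

```c
/* A 128-bit SIMD register modeled as four 32-bit float lanes.
 * Hypothetical type for illustration; no actual NEON intrinsics are used. */
typedef struct { float lane[4]; } vec4;

/* Interleaved layout: one register holds 2 complex numbers
 * as [re0, im0, re1, im1]. Multiplies 2 complex pairs. */
vec4 cmul_interleaved(vec4 a, vec4 b)
{
    /* Shuffle 1: duplicate the real parts of b      -> [br0, br0, br1, br1] */
    vec4 b_re = {{ b.lane[0], b.lane[0], b.lane[2], b.lane[2] }};
    /* Shuffle 2: duplicate the imaginary parts of b -> [bi0, bi0, bi1, bi1] */
    vec4 b_im = {{ b.lane[1], b.lane[1], b.lane[3], b.lane[3] }};
    /* Shuffle 3: swap re/im inside each pair of a   -> [ai0, ar0, ai1, ar1] */
    vec4 a_sw = {{ a.lane[1], a.lane[0], a.lane[3], a.lane[2] }};
    /* Alternating sign: subtract in the real lanes, add in the imaginary lanes. */
    static const float sign[4] = { -1.0f, 1.0f, -1.0f, 1.0f };

    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b_re.lane[i] + sign[i] * a_sw.lane[i] * b_im.lane[i];
    return r;
}

/* Split layout: two registers hold 4 complex numbers,
 * all real parts in one register, all imaginary parts in the other. */
typedef struct { vec4 re, im; } cvec4_split;

/* Multiplies 4 complex pairs using only lane-wise multiply/add/subtract;
 * no lane ever has to move to a different position. */
cvec4_split cmul_split(cvec4_split a, cvec4_split b)
{
    cvec4_split r;
    for (int i = 0; i < 4; ++i) {
        r.re.lane[i] = a.re.lane[i] * b.re.lane[i] - a.im.lane[i] * b.im.lane[i];
        r.im.lane[i] = a.re.lane[i] * b.im.lane[i] + a.im.lane[i] * b.re.lane[i];
    }
    return r;
}
```

In this model the interleaved version needs three lane rearrangements plus a sign pattern to produce 2 products, while the split version produces 4 products with four multiplies, one add and one subtract per register pair. The split form also maps directly onto fused multiply-add, which is the kind of further optimization mentioned above.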

3 participants