SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

f4lc0n-asm · 2024-12-09T16:33:30Z

Hello,

here are the 20% and up to 32% faster ASM drop-in replacements for dsps_fft2r_fc32_ae32_.S and dsps_fft2r_fc32_aes3_.S from ESP-DSP repository.

I achieved this by:

Replacing slli / add.n pair with addx8 on 3 occasions in radix-2 routines.
Carefully reordering instructions for much fewer pipeline stalls and preventing the assembler to automatically insert a NOP in front of a hw loop so that the first instruction in the hw loop and branch target instructions entirely fit within a naturally aligned four byte region. Sometines the assembler fails to achieve this, which incurs 1-cycle penalty (the instruction fetch takes 2 cycles instead of 1). Consult Xtensa ISA Summary (LOOPNEZ description) for more info - it's vital for achieving the maximum loop/branch/jump speed!
Removing some superfluous instructions (forgotten debugging stuff).

Note 1: Passing the 3rd argument float* dsps_fft_w_table_fc32 to FFT functions is actually a good idea. Because when you do parallel computations of say 2 FFTs with different Ns, you need 2 different table pointers for them. The table pointer cannot be thus hardcoded in the FFT functions, as attempted.

Note 2: Optimized radix-2 ESP32-S3 ASM version is just 11-15% faster than optimized ESP32 ASM version because of a pipeline stall, which cannot be avoided.

Note 3: Optimized radix-4 ESP32-S3 ASM version is just 17-27% faster than the original ESP32 ASM version because only 4 fp32 complex reads and 4 fp32 complex writes can be sped up with 64-bit load/store instructions.

Note 4: When int is used for non-negative numbers only, the GCC does not know this and produces more complex code, which correctly handles negative values, which never happen - it just slows things down! Just have a look at what assembly GCC produces for the for cycle in dsps_fft4r_fc32_ae32.c when using int or uint for the j counter and m variables…

Note 5: As for an integer FFT, even some 30 years ago, when I programmed a full MP3 decoder in pure assembly (including a custom made DCT matrix decomposition - basically a custom made FDCT for maximum performance), I used a 2MHz fp32 DSP, and mostly never looked back since! As for the new ESP32-P4, I still wait for it to overcome its teething problems!

Happy Holiday!

f4lc0n signing off with the saying: "Est rerum omnium magister usus."

fft_fp32_xtensa_v2.1.zip (7-Zip)
Improved: Faster checking for N to be a power of 4 in radix-4 FFTs + branch/jump target alignment optimizations.
Added: Optimized radix-4 ESP32 ASM version of the original radix-4 ESP32 ASM version. Just comment out / disable the original FFT ASM source in dsps_fft4r_fc32_ae32.c and add dsps_fft4r_fc32_ae32_.S to your project - it will be called instead.
Added: Optimized radix-4 ESP32-S3 ASM version. Just add dsps_fft4r_fc32_aes3_.S to your project, add its prototype and #define dsps_fft4r_fc32_aes3(data, N) dsps_fft4r_fc32_aes3_(data, N, dsps_fft4r_w_table_fc32, dsps_fft4r_w_table_size) to dsps_fft4r.h and #define dsps_fft4r_fc32_aes3_enabled 1 to dsps_fft4r_platform.h and you are basically set.
Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. Now it is up to 32% faster (N=4096)!

The new performance table:
(R2 stands for radix-2, R4 for radix-4, and O for the new optimized version. The measurements are in CPU cycles.)

   N ‖   ANSI |  ESP32 |  ESP32 |ESP32-S3|ESP32-S3‖   ANSI |  ESP32 |  ESP32 |ESP32-S3
     ‖    R2  |    R2  |  O R2  |    R2  |  O R2  ‖    R4  |    R4  |  O R4  |  O R4
  16 ‖   1277 |    978 |    818 |    931 |    737 ‖    997 |    630 |    560 |    496
  32 ‖   2972 |   2319 |   1935 |   2208 |   1726 ‖        |        |        |
  64 ‖   6835 |   5380 |   4484 |   5125 |   3971 ‖   4996 |   3109 |   2987 |   2603
 128 ‖  15514 |  12265 |  10217 |  11690 |   9000 ‖        |        |        |
 256 ‖  34785 |  27566 |  22958 |  26287 |  20141 ‖  25090 |  15536 |  15266 |  13218
 512 ‖  77160 |  61237 |  50995 |  58420 |  44594 ‖        |        |        |
1024 ‖ 169583 | 134712 | 112184 | 128569 |  97847 ‖ 122224 |  75531 |  74729 |  64489
2048 ‖ 369785 | 293949 | 244801 | 280638 | 213058 ‖        |        |        |
4096 ‖ 800893 | 636994 | 530498 | 608323 | 460865 ‖ 578204 | 356902 | 354032 | 304880

The text was updated successfully, but these errors were encountered:

github-actions bot changed the title ~~SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3~~ SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 9, 2024

f4lc0n-asm changed the title ~~SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)~~ SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 13, 2024

f4lc0n-asm changed the title ~~SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)~~ SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

f4lc0n-asm commented Dec 9, 2024 •

edited

Loading

SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

Comments

f4lc0n-asm commented Dec 9, 2024 • edited Loading

f4lc0n-asm commented Dec 9, 2024 •

edited

Loading