-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizing 6-step FFT algorithm #967
Comments
RFFT (Real FFT)We first start with RFFT algorithm in which we use 6-step FFT algorithm as provided here . We test it by comparing its output against code
Performance (+ Assertion)
This Colab notebook contains the code. [Update] |
Matrix TranspositionIn the RFFT code provided in previous comment, the function CodeCurrently, this is happening via the following code:
However, as discussed in #938, we can use cache-oblivious algorithm (see #965) for matrix transposition as follows:
Performance( + Assertion)
This Colab notebook contains the code. |
fft_blockIn the step 2 and 5 of the 6-step FFT algorithm, we use the following recursive function:
We can think of the following ways to speed this up: (1) Avoid calling
(2) In (1), we have:
An initial (3) The factor For a given Note that
Therefore, in this version, in addition to (2), we can replace the following for-loop
with this code:
(4) we can precompute all the factors and pass the precomputed array as argument to the recursive function. We can use the following code to compute the array:
Performance (+ Assertion)In this Colab notebook, the performance of these four new versions are checked and compared with the baseline. In the 6-step FFT, the recursive function |
transpose + twiddle factorThis is regarding step 3 and 4 of the 6-step algorithm. Originally, we have the following code:
which tranpose the matrix and multiple its element by twiddle factor. The twiddle factor We can avoid calling
We still need to call
Performance ( + Assertion)This Colab Notebook contains the code that shows the performance of these three versions. As observed, the last version mentioned above outperforms the others. |
Function
|
Eight-step functionWe now work on the eight-step function as provided in this comment
We focus on the following part of the code:
Note 1: We can move Note 2: The factor, Performance (+ Assertion)This Colab Notebook contains the code. The following result is obtained: As observed, the version 1 and 2 show close performance. The code in version 2 is more complicated though. So, we go with version 1.
|
Put pieces together...It is now the time to put the pieces together, and compare our so-far-optimized RFFT with the one provided in this comment. Code is available in this Colab notebook. The running time of RFFT is recorded for input with length 2^2...2^20. For each length, the function is called 5000 times. Since the running time is small, small deviation may result in considerable difference when calculating the speed-up ratio (w.r.t the running time of
and, we can calculate the mean-based speed-up as follows:
Not sure if this approach is science-backed. Still, it can give us some idea about the range. In Colab, I got the following result (lower is better): |
[Recap] [Now] We are now getting closer to Scipy's performance! The code is available in this colab notebook and it has assertion to make sure the output is correct. |
@seanlaw
Notes: (2) In the combined steps 3 & 4, we do matrix transpose but we do not use COA. So, IMO, there might be some opportunity for performance boost here. I am going to provide the original version below to better understand this combined step:
(3) In each of the steps 2 and 4, the function
and, in the We can make this function cleaner by removing the need for the parameter
Still, this does not improve the performance. We can also convert the recursive function to a for-loop:
And still, there is no boost in performance. However, if we remove the outer for-loop I will update this post by adding a Colab notebook that contains the code. |
Is this the same as loop unrolling? |
Yes! Thank you!! I am going to go through that wikipedia page and see if I can find more resources to get more insight (I recently read in Itamar's book that one shouldn't rely TOO much on the compiler's capabilities in optimizing the performance 😅) |
[Update]
Couldn't find a way to tell compiler to unroll a loop! Still looking for a clean way to get that performance!
I have been experimenting different versions of my code. Need to prune them and just keep two or three versions. We can move forward from there. [NOTE] Please ignore the following as the performance gain is different when we add Recall that steps 3 & 4 of six-fft algorithm contains transpose & twiddle factor multiplication. Originally, it was:
However, it was slow. I then tried to use trigonometry trick to avoid calling
I think it is still slow! So, I tried to leverage
And the result shows:
|
@NimaSarajpoor Thanks for the update. Have you tried using |
According to this open issue in Numba, providing optional argument(s) in a NJIT-decorated function can result in a considerable performance gain. @NimaSarajpoor |
This issue is to optimize the 6-step FFT algorithm, as discussed initially in #938. We will try to improve the performance of each block of the algorithm. Each result MUST come with an end-to-end code, and the code MUST have assertion to make sure the output is correct.
The text was updated successfully, but these errors were encountered: