Real FFT? #28
I needed the same thing and ended up making this: https://crates.io/crates/realfft It's a simple wrapper of RustFFT that does a real-to-complex FFT and a complex-to-real inverse FFT. And yes, it runs about twice as fast.
Thanks for this great crate as well 😄 Just want to mention that this was maybe up next after the AVX stabilization (awelkie/RustFFT#4 (comment)), crossing fingers. In any case, I'm using RealFFT for my project as well; I just thought it'd be great to have this directly in the RustFFT crate :)
Once 5.0 is out, I plan to turn my attention to RustDCT for a little while. I don't think I'll be implementing much AVX stuff there, but I at least want to get it compiling with 5.0, and using the same scratch space setup that RustFFT uses. After that, I can start looking into real-only FFTs. What sort of API exactly do you need for a real-only FFT? What's the application? Do you need access to complex numbers on one side of the FFT, but not the other? Or is it real-only on both sides?
I think these two are by far the most common use cases:
For even lengths, it's possible to use a complex-to-complex (i)FFT of half the length, but I haven't found any such trick for odd lengths. That means that an odd-length real FFT will most likely be much more work. The existing implementations in other libraries do things in a few ways. The simple approach of just converting the input to complex gives:

[(x0r, 0), (x1r, 0), (x2r, 0), (x3r, 0), (x4r, 0), (x5r, 0)]

The FFT of x becomes:

[(X0r, X0i), (X1r, X1i), (X2r, X2i), (X3r, X3i), (X4r, X4i), (X5r, X5i)]

Because x was real, this simplifies to:

[(X0r, 0), (X1r, X1i), (X2r, X2i), (X3r, 0), (X2r, -X2i), (X1r, -X1i)]

The last two values are redundant, so let's remove them:

[(X0r, 0), (X1r, X1i), (X2r, X2i), (X3r, 0)]

Starting from here, I have seen these different solutions:
I think in-place real-to-complex/complex-to-real transforms are a bad idea, since they mean the complex values of the spectrum have to be split and stored as separate real values in the input vector. In RealFFT I went with option 1, and only out-of-place transforms. As a starting point I would propose adding the same functionality as RealFFT has:
There's a #4: FFTW's R2HC format
Oh, I didn't see that one. Not sure which one is worse then, my #3 or the R2HC format...
I've been using FFTS until now, The Fastest Fourier Transform in the South (it's faster than FFTW). Specifically, I've been using the 1D real function for the real-to-complex FFT:
@ejmahler @HEnquist See also this: https://www.smcm.iqfr.csic.es/docs/intel/mkl/mkl_manual/fft/fft_PackedFormats.htm
I can just mention that the RealFFT crate is now updated to use RustFFT 5.0.
For my use case I convert real input to complex, then take the magnitude of the first half of the resulting output:

```rust
// Switch to complex input
for i in 0..FFT_LEN {
    self.0[i].re = input[i];
    self.0[i].im = 0.0;
}
FFT_FULL.process_with_scratch(&mut self.0[..], &mut self.1[..]);
// Calculate the magnitude of the non-redundant half
for i in 0..FFT_LEN / 2 {
    input[i] = self.0[i].norm();
}
```

where `Self` is:

```rust
/// FFT from the `rustfft` crate
pub struct RsFft(Box<[Complex32; FFT_LEN]>, Box<[Complex32; FFT_LEN]>);
```

When I last benchmarked with RustFFT 4.0, IPP (using the ipp-sys crate) was much faster; for that I used the functions
The main thing I dislike with the in-place solutions is that they all offer very poor ergonomics, since they need to store complex values in a real vector. The advantages of the Complex type are lost.

@Cocalus what lengths are you typically using? Can they be odd? How much faster was IPP?
The in-place solutions are great if you can hide the ugliness behind an API. I kind of wish we could do something like return a boxed iterator over the complex output, so we don't actually need storage for it. Specifically returning a boxed iterator isn't great because of the allocation, but are there any applications where lazily generating the complex output would be useful? Or, for C2R, taking an iterator as a parameter?
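As a rough illustration of that lazy idea (hypothetical code, not a proposed RustFFT API; it assumes an even length n and a stored half-spectrum of n/2 + 1 bins):

```rust
use rustfft::num_complex::Complex;

// Hypothetical adapter: lazily yields all n bins of a real FFT's spectrum
// from the n/2 + 1 stored ones, mirroring and conjugating the upper half.
fn full_spectrum(
    half: &[Complex<f32>],
    n: usize,
) -> impl Iterator<Item = Complex<f32>> + '_ {
    debug_assert_eq!(half.len(), n / 2 + 1);
    half.iter()
        .copied()
        .chain(half[1..n / 2].iter().rev().map(|c| c.conj()))
}
```

Returning `impl Iterator` like this avoids the boxed-iterator allocation, but it can't be returned from a trait object, which is presumably why a real API would end up boxing it.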
I was using 4096; I think IPP only supports powers of two. I retested my case of real-to-magnitudes at 4096. In my benchmark, IPP 2019 is about 5.5x faster than RustFFT 5.0. With 10000 iterations and only initializing the FFTs once (plan_fft_forward + ippsFFTInit_R_32f), RustFFT took 175ms and IPP took 31ms. RustFFT has to move data into and out of the real buffer, on top of computing an FFT of twice the effective size, so adding a real API should probably fix a decent chunk of that difference. The IPP magnitude calculation is also vectorized, which probably helps. I could upload the hacked-together benchmark if someone wants to poke at it, but I think you need to register an email with Intel to download IPP.
Have you looked at the RealFFT crate? At 4096 the FFT should be about twice as fast, and you can skip the conversion to complex.
To explain my earlier comment in more detail: the core values of Rust are performance, safety, and productivity. The ideal real FFT API would have all three.

I think we're all uncomfortable with the "keep the data in place and do some weird trick to pack the results" output formats, because they would heavily sacrifice productivity. I think they also hurt safety: users will have to write loops with random array access to pull data out of the packed formats, and in my experience it's easy to write loops like that correctly, but hard to write them in a way that avoids bounds checks. Many users will feel tempted to work around the bounds checks with unsafe indexing.

So we're talking about doing the half-complex output instead, but I think that also hurts productivity and safety, just less so. Users won't have to remember a bizarre layout, but as I think about it, I think we're asking too much of the average RustFFT user if we ask them to remember "we omit the last n/2 - 1 elements of the array; if you want them, iterate from element 1 to element n/2 - 1 in reverse order and negate them all".

So why not make the default output the full complex array? Do the reversal and negation ourselves, and return a size-n array instead of size n/2 + 1, so that the caller doesn't have to remember the exact specifics of which subsection of the array to reverse.

Once we have that idea, my next thought is: how often do callers need the output in an array? In the case of RustDCT, you compute the DCT2 with an inner real-only FFT. Once you have the full complex output, you do some extra processing anyway, and the final output is just real numbers. So you never actually store the complex output; you only keep it around long enough to compute the real output, and then throw it away.
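The postprocessing being proposed is small; here is a sketch of it (my code, assuming an even length n with the n/2 + 1 computed bins already sitting at the front of a size-n buffer; the "negation" is conjugation of the mirrored bins):

```rust
use rustfft::num_complex::Complex;

// Fill in the redundant upper half of a real FFT's spectrum in place,
// using the conjugate symmetry X[n - k] = conj(X[k]).
fn expand_half_complex(spectrum: &mut [Complex<f32>]) {
    let n = spectrum.len();
    for k in 1..n / 2 {
        spectrum[n - k] = spectrum[k].conj();
    }
}
```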
I on the other hand definitely prefer the half-complex format. My main use case is processing of audio data, doing stuff like convolution. There I have a real-valued signal and a real-valued impulse response that I want to convolve. So I transform them both, multiply the complex results, and transform back (see the sketch just below). Using the half-complex format means I have half as many coefficients to multiply and store.

I think most people who know they could use a Real2Complex FFT in their application already have a good understanding of how FFTs work, and have no problem with the half-complex format (as long as it doesn't use any weird packing tricks, of course). But it would make sense to provide some convenient way of filling in the second half of the data when someone needs the full result.

Another thought: if the Real2Complex FFT returns the full complex result, what should the Complex2Real inverse FFT do? Ignore half of the data? Or assert that it's conjugate symmetric? Neither seems very attractive...
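For that use case, the per-bin multiply over half-complex spectra might look something like this (a sketch against plain slices, not any particular crate's API):

```rust
use rustfft::num_complex::Complex;

// Fast-convolution core: multiply the half-complex spectra of the signal
// and the impulse response bin by bin. With the half-complex format this
// is n/2 + 1 multiplies instead of n.
fn multiply_spectra(signal: &mut [Complex<f32>], impulse: &[Complex<f32>]) {
    assert_eq!(signal.len(), impulse.len());
    for (s, h) in signal.iter_mut().zip(impulse) {
        *s *= *h;
    }
}
```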
Yeah, the point about convolution makes sense. |
Although - an implicit part of the complex-to-real transform, even with the redundant elements omitted, is that the imaginary part of the first and last values is 0. Presumably if you do any sort of convolution involving multiplication with a complex number, your input to C2R is going to have a nonzero imaginary component. How does that affect the output?
Doing convolution that way only works when both signals are real. Then they both have zeros in those places of their transforms, and the zeros remain after the element-wise complex multiplication of the two transforms.
Would it ever make sense to do a R2C where the inner FFT is inverse? Or only forward? Would it ever make sense to do a C2R where the inner FFT is forward? Or only inverse? |
I think only forward R2C and inverse C2R are useful in practice. I have never seen the other two (fftw doesn't have them for example) and I can't think of any reason for ever using them. |
I did some investigation into why it's so hard to find non-even real FFT algorithms. In order to compute a real FFT, you basically have to apply mixed radix, or Rader's, or Bluestein's - but a "real FFT" version of those algorithms. You have to derive a version of the algorithm that manually cuts out the wasted work.

Size 2 is a special case, because it takes advantage of the fact that you can compute 2 real FFTs in parallel by computing a single C2C DFT, with one set of real data in the reals and the other in the imaginaries. You can completely recover the separate outputs of the two FFTs with an O(n) postprocess (sketched at the end of this comment). So the algorithm in the PR and elsewhere is basically built around merging that size-2 trick into the transform.

For any non-even number, we'd need a mixed radix algorithm, Rader's, Bluestein's, butterflies, planning, the works. Having to switch back and forth between the real array and the "n/2 + 1" array seems like a nightmare, but apparently a real-only FFT is trivially convertible to a transform called a "Discrete Hartley Transform", which is real-only, has all the same algorithms, etc., but with the advantage that all the outputs are the same size.

So my current thinking is to make a new planner specifically for real FFTs, and put all this stuff in there. It can have an FFT planner internally if we find that it can speed things up to use an actual FFT, but otherwise it'll stay off in its own little Discrete Hartley world.
That's not a final plan obviously, but seeing how many different classes are going to be involved, I think it shouldn't be a part of the main planner. Given that we're going to almost double the code size, we might even want to split it into a separate crate entirely. |
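The O(n) recovery step mentioned above is a standard identity; here is a sketch of it (my code, assuming z = FFT(x + i*y) for real sequences x and y of length n):

```rust
use rustfft::num_complex::Complex;

// Given z = FFT(x + i*y) with x and y real, recover X = FFT(x) and Y = FFT(y):
//   X[k] = (z[k] + conj(z[(n - k) % n])) / 2
//   Y[k] = (z[k] - conj(z[(n - k) % n])) / 2i
fn split_two_real_ffts(z: &[Complex<f32>]) -> (Vec<Complex<f32>>, Vec<Complex<f32>>) {
    let n = z.len();
    let half = Complex::new(0.5, 0.0);
    let neg_half_i = Complex::new(0.0, -0.5); // 1 / (2i)
    (0..n)
        .map(|k| {
            let z_rev = z[(n - k) % n].conj();
            ((z[k] + z_rev) * half, (z[k] - z_rev) * neg_half_i)
        })
        .unzip()
}
```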
This explains a lot, thanks! Building the DHT will be a pretty long-term project. What to do in the meanwhile? One easy fix that I think is still better than nothing is to provide R2C/C2R for odd lengths by just using a C2C FFT of the same length (see the sketch below). This would at least mean there is a single interface that works for all lengths, even if there is no speed advantage for the odd ones. I could add this to RealFFT for now, and include the optimizations from #50.
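A sketch of that stopgap (the wrapper and names here are mine, not RealFFT's actual code):

```rust
use rustfft::{num_complex::Complex, FftPlanner};

// R2C for any length n via a C2C FFT of the same length: no speed win,
// but the caller still only sees the n/2 + 1 non-redundant bins.
fn r2c_via_c2c(input: &[f32]) -> Vec<Complex<f32>> {
    let n = input.len();
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);

    let mut buffer: Vec<Complex<f32>> =
        input.iter().map(|&x| Complex::new(x, 0.0)).collect();
    fft.process(&mut buffer);

    buffer.truncate(n / 2 + 1); // drop the redundant mirrored half
    buffer
}
```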
And even if it's not faster, there would be the added convenience of not having to keep track of the redundant part of the complex output.
Yes, the convenience itself makes it worth having. Are you adding this to RustFFT now-ish? If yes, I won't do anything more to RealFFT, and will just mark it as deprecated once this is done.
It will be a while; I recommend porting it to RealFFT.
Maybe it makes sense to take inspiration from FFTS? https://github.com/linkotec/ffts |
From what I can tell, it's using the same trick as us: https://github.com/linkotec/ffts/blob/2c8da4877588e288ff4cd550f14bec2dc7bf668c/src/ffts_real.c#L407 The documentation of FFTS is pretty thin, but I think its main algorithms are only for powers of two; for anything else it uses Bluestein's. If that is true, it might be faster for powers of two, and in the (quite few) special cases where the fastest solution is Bluestein's. For everything else it will most likely be quite a bit slower.
I made a big rewrite of RealFFT: HEnquist/realfft#7 |
I dug into FFTS, and it's even worse: If you pass in an odd size, it crashes on unaligned SSE loads. If you disable SSE, it computes the wrong output. So yeah, they just assume your input is even without asserting it. |
Well that explains it. I was confused when I couldn't find any check for if the length is even. I thought I just didn't see it, didn't expect it to be missing... |
Who uses odd FFT sizes anyway?
If you're computing an even-sized FFT that isn't a power of two, then it must have at least some odd prime factors. In that case, the library you're using will need to handle odd sizes internally in order to handle the odd prime factors of your size. |
The new RealFFT is now published on crates.io as version 1.0.0! |
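For anyone landing here later, basic usage looks roughly like this (a minimal sketch based on my reading of the 1.0 API; error handling skipped):

```rust
use realfft::RealFftPlanner;

fn main() {
    let length = 4096;
    let mut planner = RealFftPlanner::<f32>::new();
    let r2c = planner.plan_fft_forward(length);

    // Real-valued input; the output holds length / 2 + 1 complex bins
    let mut indata = r2c.make_input_vec();
    let mut spectrum = r2c.make_output_vec();
    indata[0] = 1.0; // an impulse, just to have something to transform
    r2c.process(&mut indata, &mut spectrum).unwrap();
}
```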
Update: Ever since mentioning the Discrete Hartley Transform, I've been working on deriving a variant of the six-step mixed-radix FFT algorithm for the DHT. As far as I know, this algorithm is original research, and it was pretty tough to derive:

```rust
// Computes a DHT via the six-step mixed-radix algorithm
fn compute_dht_mixedradix(buffer: &mut [f32], width: usize, height: usize) {
    let len = buffer.len();
    let mut scratch = vec![0.0; len];
    assert_eq!(len, width * height);

    // Step 1: Transpose the width x height array to height x width
    transpose::transpose(buffer, &mut scratch, width, height);

    // Step 2: Compute DHTs of size `height` down the rows of our transposed array
    for chunk in scratch.chunks_exact_mut(height) {
        compute_dht(chunk);
    }

    // Step 3: Apply twiddle factors
    for k in 0..height {
        // we need -k % height, but k is unsigned, so do it without actual negatives
        let k_rev = subtract_mod(0, k, height);
        for i in 0..width {
            // -i % width, but i is unsigned, so do it without actual negatives
            let i_bot = subtract_mod(0, i, width);

            let top_twiddle = compute_dft_twiddle_inverse(i * k, len);
            let bot_twiddle = compute_dft_twiddle_inverse(i_bot * k, len);

            // Instead of just multiplying a single input value with a single complex number like we do in the DFT,
            // we need to combine 4 numbers, determined by mirroring the input number across the horizontal and vertical axes of the array
            let top_fwd = scratch[i * height + k];
            let top_rev = scratch[i * height + k_rev];
            let bot_fwd = scratch[i_bot * height + k];
            let bot_rev = scratch[i_bot * height + k_rev];

            // Since we're overwriting data that our mirrored input values will need when they compute their own twiddles,
            // we currently can't apply twiddles in place. An obvious optimization here is to compute all 4 values at once and write them all out at once.
            // That would cut down on the number of flops by 75%, and would let us do this in place
            buffer[i * height + k] = 0.5
                * (top_fwd * top_twiddle.re - top_fwd * top_twiddle.im
                    + top_rev * top_twiddle.re
                    + top_rev * top_twiddle.im
                    + bot_fwd * bot_twiddle.re
                    + bot_fwd * bot_twiddle.im
                    - bot_rev * bot_twiddle.re
                    + bot_rev * bot_twiddle.im);
        }
    }

    // Step 4: Transpose the height x width array back to width x height
    transpose::transpose(&buffer, &mut scratch, height, width);

    // Step 5: Compute DHTs of size `width` down the rows of the array
    for chunk in scratch.chunks_exact_mut(width) {
        compute_dht(chunk);
    }

    // Step 6: Transpose the width x height array to height x width one last time
    transpose::transpose(&scratch, buffer, width, height);
}
```

Pairing this with some trivial preprocessing or postprocessing, we can use this to implement R2C and C2R:

```rust
use rustfft::num_complex::Complex;

// Computes a DFT of real-only input by converting the problem to a DHT
fn compute_r2c_via_dht(input: &mut [f32], output: &mut [Complex<f32>]) {
    assert_eq!(output.len(), input.len() / 2 + 1);
    compute_dht(input);

    output[0] = Complex::from(input[0]);
    for (k, output_cell) in output.iter_mut().enumerate().skip(1) {
        *output_cell = Complex {
            re: (input[input.len() - k] + input[k]) * 0.5,
            im: (input[input.len() - k] - input[k]) * 0.5,
        }
    }
}

// Computes a DFT of real-only output by converting the problem to a DHT
fn compute_c2r_via_dht(input: &[Complex<f32>], output: &mut [f32]) {
    assert_eq!(input.len(), output.len() / 2 + 1);

    output[0] = input[0].re;
    for (k, input_cell) in input.iter().enumerate().skip(1) {
        output[k] = input_cell.re + input_cell.im;
        output[output.len() - k] = input_cell.re - input_cell.im;
    }
    compute_dht(output);
}
```

The hardest part was applying twiddle factors: instead of just a pairwise multiplication like in the FFT mixed-radix algorithm, you have to combine data from 4 different points that are mirrors of each other across the 2D array.

I find this interesting, because it's sort of a reflection of the postprocessing you do to convert an R2C to a DHT. In both cases, you're looping over the front of the array and the back of the array simultaneously, and meeting in the middle. Except I didn't derive DHT mixed radix in terms of the conversion to a DFT - I derived it entirely in terms of the O(n^2) definition of the DHT. So even though the reflection doesn't show up in the O(n^2) definition of the DHT, that reflection still seems like an inherent part of the transform. The fact that the reversal shows up both in the mixed-radix algorithm and in the conversion to R2C reminds me of a fractal, where you zoom in and end up seeing copies of the original fractal, and it gives you the sense that that shape is very important to the computation of that fractal for some reason. Fun stuff to think about.

I know variants of Rader's algorithm and Bluestein's algorithm have already been derived, and I THINK Good-Thomas has as well, so with those 3 we have everything we need to compute every DHT size.

So next steps are: before committing to using the DHT, I also want to see if I can use the Discrete Cosine Transform Type 2 as the internal transform, since that also has a pretty trivial conversion to a DFT. I'm not sure how long that research will take, but it's entirely possible that a pure mixed-radix algorithm like this for the DCT isn't a good idea, in which case it may become clear pretty quickly. After that, I'll start a new crate called RustDHT and start building it. Then I'll make a crate called "RustFFT-Real" which depends on RustDHT and encapsulates it all, so that the DHT is just an implementation detail of the real-valued FFT, rather than an API-guaranteed part of the process. Finally, once RustFFT-Real is released, I can convert RustDCT to use RustDHT instead of RustFFT. Long term, I'm considering making RustFFT a wrapper crate, that pulls in
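The snippets above call compute_dht, subtract_mod, and compute_dft_twiddle_inverse without showing them. These are my guesses at minimal versions consistent with how the snippets use them (a naive O(n^2) DHT and two small helpers), not the author's actual implementations:

```rust
use rustfft::num_complex::Complex;

// Naive O(n^2) DHT: H[k] = sum over i of x[i] * (cos(2*pi*i*k/n) + sin(2*pi*i*k/n))
fn compute_dht(buffer: &mut [f32]) {
    let n = buffer.len();
    let input = buffer.to_vec();
    for (k, out) in buffer.iter_mut().enumerate() {
        *out = input
            .iter()
            .enumerate()
            .map(|(i, x)| {
                let angle = 2.0 * std::f32::consts::PI * ((i * k) % n) as f32 / n as f32;
                x * (angle.cos() + angle.sin())
            })
            .sum();
    }
}

// (a - b) mod m, computed without going negative on unsigned integers
fn subtract_mod(a: usize, b: usize, m: usize) -> usize {
    (a + m - (b % m)) % m
}

// Inverse-direction DFT twiddle factor e^(2*pi*i*exponent/len)
fn compute_dft_twiddle_inverse(exponent: usize, len: usize) -> Complex<f32> {
    let angle = 2.0 * std::f32::consts::PI * (exponent % len) as f32 / len as f32;
    Complex::new(angle.cos(), angle.sin())
}
```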
This is really good work! |
Update: I've been chugging away at this over the last few weeks. I have a repo for RustDHT, although it isn't published to crates.io yet: https://github.com/ejmahler/RustDHT

There's no planner or Bluestein's or Rader's equivalent yet; I've been focusing on getting the performance for small prime factors up to where it should be. Prime butterflies can be computed with more or less the exact same algorithm as they are in an FFT, with a manual factorization of the naive DHT matrix. Radix4 works just like RustFFT's Radix4, although the twiddle factor function is more complicated.

MixedRadix in RustDHT ends up being almost the exact same speed as MixedRadix in RustFFT. My theory is that this is because, in the transposes, the compiler can copy both elements of a complex number with the same instruction, so it can optimize away half the instructions involved in a complex transpose. But you can't optimize away half the instructions of a real-valued transpose, so the real one ends up with the same instruction count as the complex one, and thus takes the same amount of time. So I'm designing the library with the notion that we won't be doing MixedRadix, ever. That also rules out PFA, because the main value of PFA is that it's mixed radix without the twiddle factors.
Agreed that these would be great additions to RustFFT. I'm also using FFTs for audio processing (a VST plugin), which uses 1D real-to-complex and complex-to-real transforms. I did some simple benchmarks: RustFFT C2C is about 1.5x-2x slower than FFTW R2C and C2R for 4096 single-precision values; for double precision the difference is even bigger, with RustFFT about 3x slower. For now I'm using RealFFT, which is as fast as FFTW for single precision and even faster for double precision, so big thanks for implementing that! (All benchmarks were run on an AVX2 CPU, of course.)
Btw @jesnor, I just published my bindings to FFTS (Fastest Fourier Transform in the South, which is faster than FFTW) here: |
How does the FFTS real-only FFT compare to the RealFFT crate?
I also wonder about non-powers-of-two in FFTS. Since it uses Bluestein's for those, I would expect it to be far slower there than both FFTW and RustFFT.
Thanks for this great crate!
It would be great to have a faster way of doing a real-to-complex FFT (such as going from the time domain to the frequency domain), which is a very common use case.
It would approximately halve the computation time compared to doing it via a complex-to-complex FFT.
https://www.dspguide.com/ch12/1.htm
For my use case I need to replace another FFT crate (which uses CMake), because I want to cross-compile to macOS, and I only need the real-to-complex FFT for power-of-two sizes.