Iterator-based approach performs 10x worse than manual implementation #80416
Perhaps it's because of those `cycle()`s?
Try compiling benchmarks with
Possible. Edit: never mind, it can't implement that because TRA (`TrustedRandomAccess`) must have a finite size, which
I've found
@JulianKnodt and @the8472, if zip + cycle is the likely culprit and there's nothing that can be done to improve it, is the best solution a manual implementation, like I did? Is it still worth it for me to dig deeper into debugging?
This is a purely personal POV, so take it as you will, but I think the manual implementation is certainly fine, and it's not always worth making very neat abstractions due to the increased burden on the compiler.
For the particular use case at hand, I think it probably makes sense to stick with the manual implementation. I need to calculate the FFT coefficients for a single frequency in an embedded context with a real-time deadline. However, before I discovered the poor performance, I remember being impressed by this specific iteration example, because I was able to express the entire problem in terms of iterator adapters (cycle, skip and zip). I was excited to share it with some of my coworkers as a great example of the zero-cost abstractions that Rust could offer. But then I found out it's not so zero-cost, which is disappointing. I might look into it more, but I don't know that I really have the expertise to improve the implementation. I would think these issues have already been considered in depth by people with a lot more experience in compilers.
Very smart and dedicated people have worked on the stdlib, but the stdlib isn't perfect nor complete. If you see a possible hole, then see if it's worth filling, if you have time to do it :-) Perhaps cycle + zip is worth improving.
I played around with the example just now and found something interesting. Based on that, does anyone have any immediate insight into what the problem might be?
Looks like it's the interaction between

This compiles to a constant:

```rust
pub fn skip_manual() -> i16 {
    let mut sum = 0;
    for i in 0..LUT_LEN {
        sum += LUT[(i + 2) % LUT_LEN];
    }
    sum
}
```

But this does not:

```rust
pub fn skip() -> i16 {
    LUT.iter().cycle().skip(2).take(8).sum()
}
```

Yet this does:

```rust
pub fn skip() -> i16 {
    LUT.iter().skip(2).sum()
}
```
Interesting, I think. You can also remove the skip just by iterating through eagerly, and it removes some branches.
Can someone explain what's going on here? On both Godbolt and with In this example,
Another approach. I haven't benchmarked it, but at O3 it gives branch-free assembly consisting only of memory and simple arithmetic operations, similar to @JulianKnodt's approach with:

```rust
const LUT: [i16; 8] = [0, 1, 2, 3, 4, 5, 6, 7];

pub fn fft(samples: &[i16; 16]) -> (i32, i32) {
    let sin = (0..).map(|i| LUT[i % LUT.len()]);
    let cos = ((LUT.len() / 4)..).map(|i| LUT[i % LUT.len()]);
    sin.zip(cos).zip(samples).fold(
        (0, 0),
        |(mut real, mut imag), ((sin, cos), &sample)| {
            real += sample as i32 * cos as i32;
            imag += sample as i32 * sin as i32;
            (real, imag)
        },
    )
}
```

In this case:
rust/library/core/src/iter/adapters/zip.rs Line 386 in 32da90b
It can only be exact (upper bound == lower bound) if an upper bound exists, which means it needs to fit into `usize`.
I would guess that LLVM is simply making different inlining decisions based on the number of calls to one of the functions in the iterator chain, and duplicating parts of that code goes over a threshold. Iterator optimizations are heavily dependent on inlining.
@the8472, I just want to clarify, in case there was any miscommunication. In that example, the same function gives two different results based on the presence or absence of a different function. It doesn't seem like inlining could explain that. It seems more like it's skipping an optimization pass when another function is present, or something.
That seems very similar to a situation I've encountered when debugging #79246
@marmeladema, do you have a minimal example of that? Do you think it's worth making a separate issue about it? I can add my example from above.
@the8472, I think the

```rust
pub fn fft<N>(samples: &GenericArray<i16, Prod<N, LutLen>>) -> (I16F16, I16F16)
where
    N: Mul<LutLen> + Unsigned,
    Prod<N, LutLen>: ArrayLength<i16>,
{
    const ZERO: I16F16 = I16F16::from_bits(0);
    let sin = LUT.iter().cycle();
    let mut cos = LUT.iter().cycle();
    cos.nth(LUT.len() / 4 - 1); // Use nth, not skip, so the iterator is optimized
    sin.zip(cos).zip(samples).fold(
        (ZERO, ZERO),
        |(mut real, mut imag), ((&sin, &cos), &sample)| {
            let sample = I1F15::from_bits(sample);
            real += I16F16::from_num(sample.wide_mul(cos));
            imag += I16F16::from_num(sample.wide_mul(sin));
            (real, imag)
        },
    )
}
```
I was talking about inlining of the callees. If you uncomment the function, the iterator methods are used in more places, which can affect inlining decisions.
To help demonstrate what was said above
@bradleyharden I suspect that is caused by the way that codegen unit (CGU) partitioning ends up happening. My guess is that the presence/absence of the other functions ends up changing which functions get placed in the same CGU, which may in turn affect which optimizations LLVM decides to perform. Can you try setting
@JulianKnodt, adding
@Aaron1011, adding
Is this expected behavior? Or should I create a separate issue about it? I'm still a bit skeptical of @the8472's callee-inlining explanation.
Both

Right now,

```rust
pub struct CustomSkip<I> {
    iter: I,
    n: Option<usize>,
}

impl<I> CustomSkip<I> {
    pub fn new(iter: I, n: usize) -> CustomSkip<I> {
        CustomSkip { iter, n: Some(n) }
    }
}

impl<I: Iterator> Iterator for CustomSkip<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<I::Item> {
        if let Some(n) = self.n.take() {
            self.iter.nth(n)
        } else {
            self.iter.next()
        }
    }
}

trait SkipExt: Iterator + Sized {
    fn custom_skip(self, n: usize) -> CustomSkip<Self> {
        CustomSkip::new(self, n)
    }
}

impl<I: Iterator> SkipExt for I {}
```

Indeed, this does seem to improve all of the examples discussed so far. But it only improves the original
Another approach I have seen in some places in std is unrolling the first loop iteration, so that the compiler has an easier time with the loop for the remainder. That could be tried by implementing
@bradleyharden do you think you could see if
with
Change branching in `iter.skip()`

Optimize branching in `Skip`, which was brought up in rust-lang#80416. This assumes that if `next` is called, it's likely that there will be more calls to `next`, and the branch for skip will only be hit once, so it's unlikely to take that path. Even without the `unlikely` intrinsic, it compiles more efficiently, I believe because the path where `next` is called is always taken. It should be noted there are very few places in the compiler where `Skip` is used, so this probably won't have a noticeable perf impact. [New impl](https://godbolt.org/z/85rdj4), [Old impl](https://godbolt.org/z/Wc74rh), [Some additional asm examples](https://godbolt.org/z/feKzoz), although they really don't have a ton of difference between them.
I'm not sure if this qualifies as a bug or not.

In short, the following two functions give the same output, but `fft` performs about 10x worse than `fft_manual`.

I originally ran into this in an embedded context, targeting `thumbv7em-none-eabihf`. The original function took a `GenericArray` and used fixed-point types from the `fixed` crate, but I was able to reproduce the issue using only primitive types targeting `x86_64-pc-windows-msvc`.

Here are the results of benchmarks from selected `nightly` versions from the past 18 months. The benchmarks were run using the code here.

```
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
test tests::bench_fft_manual ... bench: 4 ns/iter (+/- 2)
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
```

(with `#![feature(const_slice_len)]`)

```
test tests::bench_fft_manual ... bench: 5 ns/iter (+/- 1)
```

I took a look at the output of `cargo asm` for each version. The output of `fft` has 28 branch instructions, while the output of `fft_manual` has none. I guess the compiler is able to completely unroll the loop when it's written manually, but not when it's written using `Iterator`s.

Unfortunately, this is about the limit of my debugging ability currently. With some guidance, I might be able to do more of the leg work on my own.

Is this a known/expected performance difference? I hope not. I was really impressed by the elegance of the `Iterator`-based approach. I would be disappointed if the performance is really that bad.