Optimized DCT Implementation #24
Seeking an optimized DCT implementation with a compatible license.

Options:

- RustDCT: needs benchmarking, as the current API supports 1D slices only, so it needs a transpose step that the in-tree impl doesn't. (Multidimensional DCTs: ejmahler/rust_dct#2)

Dubious:

- …: can provide an interface to link in, like the old UserDCT implementation, but composable with the new 3.0-alpha API.

cc @ejmahler

Comments
As a quick test, I benchmarked this with a test implementation based on RustDCT, using the following code:

```rust
fn external_dct(data: &mut [f64], width: usize, height: usize) {
    let mut planner = rustdct::DCTplanner::new();
    let width_dct = planner.plan_dct2(width);
    let height_dct = planner.plan_dct2(height);

    let mut scratch = vec![0f64; data.len()];

    // width DCT
    for (src, dest) in data.chunks_mut(width).zip(scratch.chunks_mut(width)) {
        width_dct.process_dct2(src, dest);
    }

    // transpose
    unsafe { transpose(width, height, &scratch, data) };

    // height DCT
    for (src, dest) in data.chunks_mut(height).zip(scratch.chunks_mut(height)) {
        height_dct.process_dct2(src, dest);
    }

    // transpose back
    unsafe { transpose(height, width, &scratch, data) };
}
```

Here `transpose` is copied from the RustFFT project. `external_dct_no_transpose` is the same as `external_dct`, but with the transpose lines commented out. This represents the theoretical fastest possible execution; in reality, cache problems might prevent anything from ever going that fast.
The results show that using RustDCT over the current algorithm is a clear win, but that's not what we're testing; we want to know how much the transpose hurts runtime. For 8x8, transposing takes 8% longer; for 16x16, 12% longer; for 256x256, 15% longer. So the theoretical maximum improvement from eliminating the transposes is 8-15%.
This is a small variation where the same DCT planner (and thus the same DCT instance) is re-used every time. For 8x8, the computation takes 50% longer when we have to recompute the DCT instance every time. So, separately from the discussion of transposing vs. striding the column DCT: if you end up using RustDCT, I recommend re-architecting your hash so that the same DCT instance(s) are re-used across calls.
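For illustration, a minimal sketch of that re-architecture, assuming the planner/DCT API exactly as used in the snippet above (`DctContext` is an illustrative name, not a type from either crate):

```rust
use rustdct::DCTplanner;

/// Illustrative only: hold the planner (which hands back its cached DCT
/// instance when asked for a size it has already planned) across hash
/// computations instead of rebuilding it per call.
struct DctContext {
    planner: DCTplanner<f64>,
    scratch: Vec<f64>,
}

impl DctContext {
    fn new() -> Self {
        DctContext { planner: DCTplanner::new(), scratch: Vec::new() }
    }

    /// Row pass of the 2D DCT from `external_dct` above; the column pass
    /// and transposes would follow the same pattern.
    fn dct_rows(&mut self, data: &mut [f64], width: usize) {
        let row_dct = self.planner.plan_dct2(width);
        self.scratch.resize(data.len(), 0.0);
        for (src, dest) in data.chunks_mut(width).zip(self.scratch.chunks_mut(width)) {
            row_dct.process_dct2(src, dest);
        }
        data.copy_from_slice(&self.scratch);
    }
}
```

A hasher would then own one `DctContext` and feed every image through it, so repeated 8x8 hashes never re-plan.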
Those numbers are compelling, but I have a few questions:

If the benchmark was properly vectorized, I wouldn't expect the naive implementation to be an order of magnitude slower than one using doubles. I also want to clarify that the DCT data should be getting reused outside of the loop; the coefficients are calculated ahead of time and then reused by the same instance. My column-indexing adapter may also be introducing some unnecessary bounds checks; I wonder if I can eliminate those without unsafe code (see the sketch below). Note that the latest HEAD on …
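As an illustration of the kind of bounds-check-friendly column access in question (a hypothetical sketch, not the actual adapter in img_hash):

```rust
/// Visit one column of a row-major, `width`-wide matrix. Slicing once and
/// striding with `step_by` gives the optimizer a single up-front bound to
/// check instead of an index check on every element.
fn column(data: &[f64], width: usize, col: usize) -> impl Iterator<Item = f64> + '_ {
    data[col..].iter().step_by(width).copied()
}

fn column_sum(data: &[f64], width: usize, col: usize) -> f64 {
    column(data, width, col).sum()
}
```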
You can see the full test code in my fork of img_hash: ejmahler@0262388
```rust
fn bench_naive(b: &mut test::Bencher, width: usize, height: usize) {
    let mut signal = vec![0f64; width * height];
    b.iter(|| { img_hash::dct_2d(&mut signal, width); });
}

#[bench] fn naive_impl_004x004(b: &mut test::Bencher) { bench_naive(b, 4, 4); }
#[bench] fn naive_impl_008x008(b: &mut test::Bencher) { bench_naive(b, 8, 8); }
#[bench] fn naive_impl_016x016(b: &mut test::Bencher) { bench_naive(b, 16, 16); }
//#[bench] fn naive_impl_256x256(b: &mut test::Bencher) { bench_naive(b, 256, 256); }
```

And I modified img_hash to mark the `dct_2d` function as public.
Yeah, you're using the implementation from … Cache thrashing shouldn't be an issue, as the coefficients matrix as well as the scratch space should all fit in L1. However, trig operations don't vectorize, so I think they're dragging down the loop. Also, I'm using …

I'm also going to benchmark on my 7820X, which supports AVX2 and fused multiply-add. I think by default the compiler only assumes SSE2 or something like that.
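To make the vectorization point concrete, here is a hedged sketch of what precomputed coefficients buy (function names are illustrative): once the cosine table exists, the hot loop is a pure multiply-add reduction the compiler can auto-vectorize, whereas calling `cos` per element would block that.

```rust
use std::f64::consts::PI;

/// DCT-II cosine table, computed once per size: row k holds the basis
/// vector for output bin k.
fn dct2_coefficients(n: usize) -> Vec<f64> {
    let mut coeffs = vec![0.0; n * n];
    for k in 0..n {
        for i in 0..n {
            coeffs[k * n + i] =
                (PI * k as f64 * (2 * i + 1) as f64 / (2.0 * n as f64)).cos();
        }
    }
    coeffs
}

/// Naive O(n^2) 1D DCT-II over the precomputed table: no trig in the
/// inner loop, just multiply-adds.
fn dct2_1d(input: &[f64], output: &mut [f64], coeffs: &[f64]) {
    let n = input.len();
    for (k, out) in output.iter_mut().enumerate() {
        *out = input
            .iter()
            .zip(&coeffs[k * n..(k + 1) * n])
            .map(|(x, c)| x * c)
            .sum();
    }
}
```

(And on the AVX2/FMA point: building the benchmarks with `RUSTFLAGS="-C target-cpu=native"` lets rustc assume more than the x86-64 baseline of SSE2.)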
I noticed now that RustDCT is generic over the floating-point types and can operate on `f32` as well. I should have benchmark results tonight.
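Presumably that just means instantiating the planner at `f32`; a hedged sketch mirroring the earlier f64 snippet (the turbofish on the planner is an assumption):

```rust
// Same planner API as the earlier snippet, but at f32.
let mut planner = rustdct::DCTplanner::<f32>::new();
let dct = planner.plan_dct2(8);

let mut row = [0f32; 8];
let mut out = [0f32; 8];
dct.process_dct2(&mut row, &mut out);
```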
Yeah, RustDCT + transpose beats my naive DCT implementation hands-down even in the smallest case, so I'm completely fine with switching over.

I've spent all night trying to optimize a safe transpose, but your implementation beats my best effort by 20% on larger matrices: 0764a76

Looking at the optimized assembly, I haven't completely eliminated bounds checks, but the ones that remain don't seem to be in the hottest part of the loop, because replacing the index with … I think this technique could be adaptable to your tiled implementation, though; with some more tweaking, I might be able to eliminate the bounds checks entirely in safe code. I also tried doubling …

Speaking of which, I bet we could squeeze out a good bit more performance with either implementation if we used allocations that were aligned to cache lines. My transpose seems to be a lot more sensitive to cache alignment. However, it worsens performance on the smaller matrices if I add a leading loop that processes the unaligned elements and then passes the remaining aligned slice to …
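For reference, a minimal sketch of the general shape under discussion: a safe, tiled transpose where slicing each row once is meant to let the compiler hoist bounds checks out of the inner loop (illustrative, not the code in either repo):

```rust
/// Transpose a row-major `width` x `height` matrix into `output`, walking
/// TILE x TILE blocks so reads and writes stay cache-friendly.
fn transpose_tiled(input: &[f64], output: &mut [f64], width: usize, height: usize) {
    const TILE: usize = 8;
    assert_eq!(input.len(), width * height);
    assert_eq!(output.len(), width * height);
    for y0 in (0..height).step_by(TILE) {
        for x0 in (0..width).step_by(TILE) {
            for y in y0..(y0 + TILE).min(height) {
                // One slice per row: the bound is checked here, not per element.
                let row = &input[y * width..(y + 1) * width];
                for x in x0..(x0 + TILE).min(width) {
                    output[x * height + y] = row[x];
                }
            }
        }
    }
}
```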
I've done a little more research on this, and have learned some things:
These are benchmark results with scratch allocations:
And these are the same benchmarks but with the scratch preallocated or omitted:
As you can see, allocation seems to add about 1750ns all on its own. (The 256x256 results are slower for the no-allocation version, but I suspect that's just noise.) Note that these benchmarks were all run on a Windows PC, and it looks like allocations are pretty expensive on Windows.
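For context, the preallocated variant just hoists the scratch buffer out of the timed closure, along these lines (`external_dct_with_scratch` is a hypothetical variant of the earlier `external_dct` that takes its scratch as a parameter):

```rust
fn bench_preallocated(b: &mut test::Bencher, width: usize, height: usize) {
    let mut signal = vec![0f64; width * height];
    // Allocated once, outside the closure, so b.iter() times only the DCT
    // and transpose work.
    let mut scratch = vec![0f64; width * height];
    b.iter(|| external_dct_with_scratch(&mut signal, &mut scratch, width, height));
}
```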
So the transposeless version is 4x faster than getting trait objects from the DCT planner and doing transposes etc., and 3x faster than using the size-8 butterflies directly but still doing transposes. So eliminating the transposes would definitely speed things up.
Finally, I'm annoyed at having to copy+paste transposes everywhere, so I intend to publish a crate this week with a few different transpose routines in it. I'll let you know when it's up.
https://crates.io/crates/transpose

If you come up with a way to write this code safely, I'd be happy to accept a pull request.
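A quick usage sketch of the crate (hedged: the parameter order shown here, input width then input height, should be double-checked against the crate docs):

```rust
// Transpose a 3-wide, 2-tall row-major matrix into a 2-wide, 3-tall one.
let input = vec![1.0f64, 2.0, 3.0,
                 4.0, 5.0, 6.0];
let mut output = vec![0.0f64; 6];
transpose::transpose(&input, &mut output, 3, 2);
assert_eq!(output, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
```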
I may look at it closer at a later date, but as far as I … I unfortunately can't avoid the allocation when resizing the image; the … However, I can combine the allocation of the image data when converting it to floats (probably what you meant) with the allocation of the scratch space in the DCT function. That's quite trivial.
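A sketch of that combination (names are illustrative; this assumes nothing about img_hash's actual types): one allocation sized for both halves, split into float image data and DCT scratch:

```rust
fn hash_buffer(pixels: &[u8], width: usize, height: usize) -> Vec<f64> {
    let len = width * height;
    // One allocation covers both the float image data and the DCT scratch.
    let mut buf = vec![0f64; len * 2];
    {
        let (data, _scratch) = buf.split_at_mut(len);
        // Convert the resized grayscale pixels to floats.
        for (d, &p) in data.iter_mut().zip(pixels) {
            *d = p as f64;
        }
        // ... run the row/column DCT passes over `data` using `_scratch`.
    }
    buf
}
```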
Calling this good with …