Improve YUV to RGB Performance #66
Conversation
Thanks for following up on this! I have no problem with integer math there if the results are equal-ish, but some suggestions:
I wouldn't do feature flags here. These should just be different methods, not different implementations of the same method, and if it's just a bit of logic (e.g., no crate dependencies or so) features are too heavyweight.
Sure, let's start with another method.
"Unchecked" is a rather technical term in Rust and usually means strictly no bounds checks.
This reverts commit 5b4bdec.
Unfortunately I made some mistakes during testing. The integer math implementation was not correct.
When running the benchmark with …
Again, I recommend a profiler. If you are on Windows, Superluminal might be worth a shot (they have a free trial version). As a wild guess, running as …
I tried some of the recommended profilers from the profiling section in the Rust book, but I need to understand them properly. The granularity of the profilers I tried is functions, not instructions, so that is not immediately helpful. I have never done any serious profiling, so please bear with me here :)
I just ran this on my machine (Ryzen 9 7950X3D) with
Using Superluminal, the "hot" sections for each are these: … The bad news is that both functions are pretty poor assembly-wise (no vectorization, just single scalar operations). The good news is that it should be possible to get a good speed boost beyond both the floaty one and the int one. What I'd do (without having actually tried, YMMV) is vaguely:
Once you've done these steps you have a really good chance that by then the compiler groks that it should vectorize that unrolled loop, giving you hopefully a 4x speedup (you can try the same with 8x or 16x), and possibly even more. You'd pay the overhead of copying the data into and out of these overaligned buffers, but that might be much less than the cost of doing unaligned conversion math on all of them.
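To make that concrete, here is a minimal sketch of how I read the "copy into small fixed-size buffers, then let the compiler unroll and vectorize" idea; the function name and the BT.601-style coefficients are illustrative, not taken from this PR:

fn convert_chunk_x8(y: &[f32; 8], u: &[f32; 8], v: &[f32; 8], rgb: &mut [u8]) {
    assert!(rgb.len() >= 8 * 3);
    // A fixed trip count over fixed-size arrays gives the optimizer the best
    // chance to auto-vectorize this loop.
    for i in 0..8 {
        let r = y[i] + 1.402 * (v[i] - 128.0);
        let g = y[i] - 0.344 * (u[i] - 128.0) - 0.714 * (v[i] - 128.0);
        let b = y[i] + 1.772 * (u[i] - 128.0);
        rgb[i * 3] = r.clamp(0.0, 255.0) as u8;
        rgb[i * 3 + 1] = g.clamp(0.0, 255.0) as u8;
        rgb[i * 3 + 2] = b.clamp(0.0, 255.0) as u8;
    }
}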
I tested this and copying the relevant parts of the buffer (
I like the idea of tricking the compiler into optimizing. The copy_planes_x8 function is not used yet, because I keep messing up the indices in the write_rgb8 loop. What do you think of wide or ndarray? I have had good experiences with ndarray for easy implicit SIMD optimizations, but the U and V planes would probably have to be scaled for that.
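For illustration, a hypothetical shape of such a copy_planes_x8-style gather (the real function is not shown in this thread); it assumes 4:2:0 subsampling, where the U and V planes are half resolution and each chroma sample is reused for neighbouring luma samples, which is the scaling mentioned above:

fn gather_x8(
    y_plane: &[u8],
    u_plane: &[u8],
    v_plane: &[u8],
    y_stride: usize,
    uv_stride: usize,
    row: usize,
    col: usize, // assumed to be a multiple of 8
) -> ([f32; 8], [f32; 8], [f32; 8]) {
    let (mut y, mut u, mut v) = ([0.0f32; 8], [0.0f32; 8], [0.0f32; 8]);
    for i in 0..8 {
        y[i] = y_plane[row * y_stride + col + i] as f32;
        // In 4:2:0 one chroma sample covers a 2x2 block of luma samples.
        let uv = (row / 2) * uv_stride + (col + i) / 2;
        u[i] = u_plane[uv] as f32;
        v[i] = v_plane[uv] as f32;
    }
    (y, u, v)
}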
I haven't looked into this, but I wouldn't be surprised if you have to re-think memory access, cache use and loop order to get the best performance. Not sure how relevant it is given its age, but I found a paper on the problem: …
The paper contains some interesting tricks. I implemented the lookup table and it slightly reduces the benchmark time.
Some ideas from the paper rely on pointer magic, which at the very least requires unsafe { ... } or is simply not possible in Rust. I really liked the idea of replacing the clamping with a lookup (Sec. 3.3). The conversion & clamping from f32 and i32 back to u8 takes a considerable amount of time:
The clamping tests use a 512x512x3 buffer, so the measured numbers align with the other benchmarks. When removing the … I will look further into the clamping and will research branchless alternatives.
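For reference, one branchless option is a plain min/max clamp; this is my own sketch rather than code from the PR, and on most targets f32::clamp lowers to min/max instructions instead of branches:

#[inline]
fn clamp_to_u8(x: f32) -> u8 {
    // Rust's `as u8` cast already saturates, but clamping in f32 keeps the
    // comparison out of the narrowing path and avoids branches.
    x.clamp(0.0, 255.0) as u8
}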
I only skimmed the code, but I don't really understand how the pointer magic works. For now this is totally fine, but we probably want a safe alternative for later (maybe a safe transmute on a slice level if really needed). Also, it appears your benchmark does these lookups in an incremental fashion, which probably means you get lots of caching, but in reality your access might be "random" and thus perform worse.
It behaves like a hard-sigmoid activation function. In the example, the YUV->RGB formula has a limited range of possible values for each color; for blue that is [-227, 480]. E. Dupuis builds an array of (227 + 480) values and uses the color value as an index to map the unclamped value to a clamped one.
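A sketch of that table (after E. Dupuis), using the blue range quoted above; the exact bounds and names are illustrative:

const OFFSET: i32 = 227;
const TABLE_LEN: usize = 227 + 480 + 1; // covers the inclusive range [-227, 480]

fn build_clamp_table() -> [u8; TABLE_LEN] {
    let mut table = [0u8; TABLE_LEN];
    for (i, entry) in table.iter_mut().enumerate() {
        *entry = (i as i32 - OFFSET).clamp(0, 255) as u8;
    }
    table
}

// Usage: clamped_blue = table[(unclamped_blue + OFFSET) as usize];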
You are right! Especially the lookup is significantly worse after using random numbers.
I figured out the indices for the x8 implementation, so the test passes now (a deviation of 1 per pixel per color is allowed). The integer math is using i16s now, since the accuracy is similar. I did not test the entire spectrum of YUV values, but for the included h264 frame the RGB values are the same. This will be very interesting, because we can do i16x16 operations for add, sub, mul and clamping using AVX. The existing x8 implementation is now using wide::f32x8 (wide) and is a bit faster than the original function.
When compiling with …
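For context, a hedged sketch of what the f32x8 core with mul_add might look like; the coefficients below are common BT.601-style values and not necessarily the constants the crate uses:

use wide::f32x8;

fn yuv_to_rgb_x8(y: f32x8, u: f32x8, v: f32x8) -> (f32x8, f32x8, f32x8) {
    let u = u - f32x8::splat(128.0);
    let v = v - f32x8::splat(128.0);
    // mul_add maps to a fused multiply-add where the target supports it.
    let r = v.mul_add(f32x8::splat(1.402), y);
    let g = v.mul_add(f32x8::splat(-0.714), u.mul_add(f32x8::splat(-0.344), y));
    let b = u.mul_add(f32x8::splat(1.772), y);
    (r, g, b)
}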
My numbers look a bit different, relatively speaking:
The hot paths are: … with the disassembly in question being: … The issue you're probably running into is that you're manually (serially) trying to convert f32 values into u8 (
I get a total time of
That said, not sure if that's the fastest / best way of converting.
Apparently … Crossing my fingers, this feels like it's on the home stretch!
Interesting! I just did the benchmark again after a reboot and still have the same numbers. It sounds silly, but were you actually using the computer when running the benchmark? Starting a YouTube video during the benchmark skews the results for me. But I also only have 4 physical cores :') Good catch with …

let upper_bound = wide::f32x8::splat(255.0);
let lower_bound = wide::f32x8::splat(0.0);
...
let (r_pack, g_pack, b_pack) = (
    r_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    g_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    b_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
);
let (r_pack, g_pack, b_pack) = (r_pack.as_array_ref(), g_pack.as_array_ref(), b_pack.as_array_ref());

The numbers are still really good though:
I am super curious about the i16x16 results now :)
Not really, the results are pretty consistent for me. I just reran the previous commit; the numbers are more or less the same (+/- 20us), but x8 was definitely by far the worst. That said, this is probably the impact of a slightly different CPU architecture (7950X3D), cache size, ... The new commit is fantastic, x8_mul_add beats everything:
How do you want to proceed with this? From my side, this is almost ready to merge. How about
Would that be missing something? I assume …
i16x16 is not faster than f32x8 and considering all of the weird code introduced by the integer math idea, I would drop the topic entirely and do f32 math exclusively.
All yes
Technically the video can be any size, especially when I think about web and responsive design. But the underlying encoder will probably pad the buffers internally. I will look into specialized buffers with padding
I will check the entire spectrum
Yes, it will check: avx, sse, simd128 (for wasm32), neon (for aarch64) and fall back to scalar math if none are found. It's a compile time check.
If performance is linear and my calculations are correct, my CPU should now be able to convert a 4k30fps video in real time. I think that's a good milestone :)
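The back-of-the-envelope budget behind that claim, assuming the conversion scales linearly with pixel count (my own numbers, not from the thread):

const PIXELS_4K: u64 = 3840 * 2160; // 8_294_400 pixels per 4K frame
const PIXELS_BENCH: u64 = 512 * 512; // 262_144 pixels in the benchmark frame
const FRAME_BUDGET_US: u64 = 1_000_000 / 30; // ~33_333 us per frame at 30 fps
// Real-time 4K30 therefore needs the 512x512 conversion to finish in roughly
// FRAME_BUDGET_US * PIXELS_BENCH / PIXELS_4K, i.e. about 1_050 us.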
The benchmarks would not work anymore in that case. I made both methods pub for now. Is there a cfg or feature flag for benchmarks?
I added a simple check (
Turns out that the results differ by 1 on some occasions. I have not found a pattern yet, but I assume it's a rounding error.
Hm, it appears this would only be an issue if you changed resolution during decoding, which probably happens so rarely that noticing a 1 pixel difference is not a problem, given the entire pipeline is already lossy in the first place. It's also unclear which of the two is more correct, but that in turn raises some related questions w.r.t. color profiles. I think in the long term we want to move RGB conversion away from being an impl on the YUV buffer, and instead have it as a separate struct / traits so people can pick and / or implement their own conversion logic, address color profiles, ..., but that's for another PR. For now I'd just go ahead and land this as discussed; later this can be refactored to give people more flexibility.
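Purely as a hypothetical sketch of that "separate struct / trait" idea (none of these names exist in the crate today):

trait WriteRgb8 {
    /// Writes interleaved RGB8 pixels for the given planar YUV input.
    fn write_rgb8(&self, y: &[u8], u: &[u8], v: &[u8], dims: (usize, usize), target: &mut [u8]);
}

/// Each implementation could carry its own color matrix, range and SIMD path.
struct Bt601FullRange;

impl WriteRgb8 for Bt601FullRange {
    fn write_rgb8(&self, y: &[u8], u: &[u8], v: &[u8], dims: (usize, usize), target: &mut [u8]) {
        // ... scalar or f32x8 conversion as discussed in this PR ...
        let _ = (y, u, v, dims, target);
    }
}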
Seems like the CI/CD pipeline does not have an issue with the test either. I added a max diff of one per color value. I am happy with the state of this PR; if there is anything you want, let me know.
Since this is merged now I wanted to say it again: thank you so much! This finally fixed a long-standing performance issue, and I'm extremely happy with how neat the resulting solution is! I did a few smaller clean-up commits that I didn't want to stall the PR with (mostly about the macro use, running …). I'll release a new version (0.6.4) with your improvements in a bit.
0.6.4 has been released.
Benchmarks
cargo +nightly bench convert_yuv_to_rgb_512x512
cargo +nightly bench convert_yuv_to_rgb_1920x1080
Tested against 2588499
I hope you like funky math :)
If you are ok with this, I can continue with the other functions and clean up afterwards