
Improve YUV to RGB Performance #66

Merged · 18 commits · Dec 9, 2024

Conversation

@jnnks commented Nov 21, 2024

  • avoids floating point math
  • it delivers the same result for the two benchmark test cases, but I imagine it does not deliver the same result for all inputs

Benchmarks

cargo +nightly bench convert_yuv_to_rgb_512x512:

  • before: 1,318,003.51 ns/iter (+/- 222,491.23)
  • after: 866,946.60 ns/iter (+/- 41,533.46)

cargo +nightly bench convert_yuv_to_rgb_1920x1080:

  • before: 10,523,719.80 ns/iter (+/- 2,798,988.23)
  • after: 7,002,009.60 ns/iter (+/- 1,819,635.86)

Tested against 2588499

I hope you like funky math :)
If you are ok with this, I can continue with the other functions and clean up afterwards

@ralfbiedert (Owner)

Thanks for following up on this! I have no problem with integer math there if the results are equal-ish, but some suggestions:

  • Since other people might feel differently, there should be a way to choose what math to do
  • That "way" could be having multiple methods, maybe even some type-trait-parameter magic like xxx.write_rgb::<IntegerMath>(&mut buffer) that might allow for multiple implementations (that conceivably even users could do, then maybe even with Rayon in their own code)
  • I don't get the unchecked here, you don't seem to be doing unsafe (and you shouldn't, at least not in the default implementation; if we had type'ed impls, maybe we could have one)?
  • I'd also recommend running this with a profiler / disassembler to see if you get AVX2+ instructions when compiling for native CPUs. If not, it might be worthwhile to make sure the buffer is allocated with the proper alignment, and maybe introduce some aligned wrapper types?
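The aligned-wrapper idea in the last bullet could look roughly like this. A minimal sketch; `Aligned32` is a placeholder name invented here, not anything in the crate:

```rust
// Hypothetical overaligned wrapper; name and size are illustrative only.
#[repr(C, align(32))]
#[derive(Clone, Copy)]
struct Aligned32([u8; 32]);

fn main() {
    // A Vec of Aligned32 is guaranteed to start on a 32-byte boundary,
    // which lets the compiler emit aligned AVX2 loads over the raw bytes.
    let buffer = vec![Aligned32([0u8; 32]); 64];
    assert_eq!(std::mem::align_of::<Aligned32>(), 32);
    assert_eq!(buffer.as_ptr() as usize % 32, 0);
}
```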

@jnnks (Author) commented Nov 22, 2024

Since other people might feel differently, there should be a way to choose what math to do

  • I can add a feature flag

That "way" could be having multiple methods, maybe even some type-trait-parameter magic ...

  • I would prefer a different method name, tbh
  • from the perspective of someone looking at the lib for the first time, a suffix on the method name is much easier to understand

I don't get the unchecked here, ...

  • unchecked as in no asserts to check buffer bounds

I'd also recommend to run this with a profiler...

  • that will take some time I assume

@ralfbiedert (Owner)

I can add a feature flag

I wouldn't do feature flags here. These should just be different methods, not different implementations of the same method, and if it's just a bit of logic (e.g., no crate dependencies or so) features are too heavyweight.

I would prefer a different method name, tbh

Sure, let's start with another method.

unchecked as in no asserts to check buffer bounds

Unchecked is a rather technical term in Rust and usually means strictly no bound checks via unsafe. At least the code you have so far still has bounds checks (e.g., self.y[base_y] checks base_y is inside the array), and I prefer to keep it that way.
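To illustrate the distinction (a toy example written for this note, not code from the PR):

```rust
// Checked: plain indexing keeps the bounds check and panics if out of range.
fn luma_checked(y_plane: &[u8], i: usize) -> u8 {
    y_plane[i]
}

// "Unchecked" in the strict Rust sense: requires `unsafe`, skips the bounds
// check entirely, and is undefined behavior if `i` is out of range.
fn luma_unchecked(y_plane: &[u8], i: usize) -> u8 {
    unsafe { *y_plane.get_unchecked(i) }
}

fn main() {
    let y = [16u8, 128, 235];
    assert_eq!(luma_checked(&y, 2), 235);
    assert_eq!(luma_unchecked(&y, 2), 235);
}
```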

Jannik Schleicher added 2 commits November 24, 2024 12:40
@jnnks (Author) commented Nov 24, 2024

Unfortunately I made some mistakes during testing. The integer math implementation was not correct.

cargo +nightly bench convert_yuv_to_rgb_512x512

test convert_yuv_to_rgb_512x512          ... bench:   1,320,918.10 ns/iter (+/- 116,723.12)
test convert_yuv_to_rgb_512x512_int_math ... bench:   1,015,732.15 ns/iter (+/- 66,961.28)

RUSTFLAGS='-C target-cpu=native' cargo +nightly bench convert_yuv_to_rgb_512x512

test convert_yuv_to_rgb_512x512          ... bench:   1,218,058.07 ns/iter (+/- 230,080.26)
test convert_yuv_to_rgb_512x512_int_math ... bench:   1,581,914.55 ns/iter (+/- 364,022.30)

When running the benchmark with RUSTFLAGS='-C target-cpu=native', the int math becomes significantly slower, but I am not sure why.

@ralfbiedert (Owner)

When running the benchmark with RUSTFLAGS='-C target-cpu=native', the int math becomes significantly slower, but I am not sure why.

Again, I recommend a profiler. If you are on Windows, Superluminal might be worth a shot (they have a free trial version). As a wild guess, running as native might unlock float vector instructions (and / or maybe even some FMA), while at the same time your integer loop might be harder to optimize and / or do more work (e.g., clamping, which probably adds 2 * 3 = 6 branches per pixel).

@jnnks (Author) commented Nov 24, 2024

I tried some of the recommended profilers from the profiling section in the Rust book, but I need to understand them properly.

The granularity of the profilers I tried is functions, not instructions, so that is not immediately helpful.
RUSTFLAGS='-C target-cpu=native' samply record cargo +nightly bench convert_yuv_to_rgb_512x512 generates the following call tree:
[screenshot: samply call tree]
This suggests to me that write_rgb8_int_math takes half as much time, since its sample count is roughly half. But in reality that is not the case at all.

I have never done any serious profiling, so please bear with me here :)
I am using Linux.

@ralfbiedert (Owner)

I just ran this on my machine (Ryzen 9 7950X3D) with native CPU.

test convert_yuv_to_rgb_512x512          ... bench:     643,150.00 ns/iter (+/- 32,933.50)
test convert_yuv_to_rgb_512x512_int_math ... bench:     588,450.00 ns/iter (+/- 46,479.00)

Using Superluminal the "hot" sections for each are this:

Float math: [profiler screenshot]

Int math: [profiler screenshot]

The bad news is, both functions are pretty poor assembly-wise (no vectorization, just single scalar operations). The good news is, it should be possible to get a good speed boost beyond both the floaty one and the int one.

What I'd do (without having actually tried, YMMV) is vaguely:

  • Check what's the cost of just memcpy-ing the entire RGB / YUV texture. If that time is "very small" relative to the conversion, proceed to next step
  • Have internal buffers for YUV / RGB that are overaligned (e.g., to 4 / 8 / 16 / 32 bytes) and that hold multiple elements (e.g., 4); you'll probably want a special transparent struct for that (e.g., PixelComponents([u8; 4])). The trick here is to make it really easy for the compiler to see 1) that all data is aligned and 2) how you access that aligned data.
  • Do your YUV -> RGB conversion in a loop like before, but instead of doing 1 element, do all 4 elements in that loop "unrolled". You might need some extra handling of excess data at the end of each row (e.g., just access the next bytes but discard them), and at the end of the buffer (stop the loop early and do the last elements manually).

Once you've done these steps you have a really good chance that by then the compiler groks that it should vectorize that unrolled loop, giving you hopefully a 4x speedup (you can try the same with 8x or 16x), and possibly get even more of a speedup.

You'd pay the overhead of copying the data into and out of these overaligned buffers, but that might be much less than the cost of doing unaligned conversion math on all of them.
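The unrolled-loop step could be sketched like this. A toy single-plane example under assumptions made up here (an invented `scale_plane_x4` that just biases one plane); the real conversion also reads the U/V planes and applies the YUV coefficients:

```rust
// Process 4 elements per iteration so the compiler can vectorize the body,
// then handle the remainder scalar-wise.
fn scale_plane_x4(src: &[u8], dst: &mut [u8]) {
    let n = src.len() / 4 * 4;
    for i in (0..n).step_by(4) {
        // Unrolled body: four independent operations, easy to auto-vectorize.
        dst[i] = src[i].saturating_add(16);
        dst[i + 1] = src[i + 1].saturating_add(16);
        dst[i + 2] = src[i + 2].saturating_add(16);
        dst[i + 3] = src[i + 3].saturating_add(16);
    }
    for i in n..src.len() {
        // Tail: leftover elements that don't fill a full group of 4.
        dst[i] = src[i].saturating_add(16);
    }
}

fn main() {
    let src = [250u8, 1, 2, 3, 4];
    let mut dst = [0u8; 5];
    scale_plane_x4(&src, &mut dst);
    assert_eq!(dst, [255, 17, 18, 19, 20]);
}
```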

Jannik Schleicher added 2 commits November 28, 2024 21:35
@jnnks (Author) commented Nov 28, 2024

Check what's the cost of just memcpy-ing the entire RGB / YUV texture. If that time is "very small" relative to the conversion, proceed to next step

I tested this and copying the relevant parts of the buffer (width * height not stride * height) is <2% of the conversion time. When packing the color values into groups of 8, it takes <3%.

RUSTFLAGS='-C target-cpu=native' cargo +nightly bench convert_yuv_to_rgb_512x512
test convert_yuv_to_rgb_512x512                ... bench:   1,476,245.10 ns/iter (+/- 429,006.75)
test convert_yuv_to_rgb_512x512_copy_planes    ... bench:      20,203.08 ns/iter (+/- 2,335.37)
test convert_yuv_to_rgb_512x512_copy_planes_x8 ... bench:      35,961.64 ns/iter (+/- 5,412.43)
test convert_yuv_to_rgb_512x512_int_math       ... bench:   1,396,169.80 ns/iter (+/- 184,486.73)
test convert_yuv_to_rgb_512x512_x8             ... bench:   1,373,674.05 ns/iter (+/- 45,413.79)

Have internal buffers for YUV / RGB that are overaligned (e.g., to 4 / 8 / 16 / 32 bytes) and that hold multiple elements (e.g., 4); you'll probably want a special transparent struct for that (e.g., PixelComponents([u8; 4])). The trick here is to make it really easy for the compiler to see 1) that all data is aligned and 2) how you access that aligned data.

I like the idea of tricking the compiler into optimizing. write_rgb8_x8 is a minimum-effort implementation (the test test_write_rgb8_x8 fails because the values diverge too much).

The copy_planes_x8 function is not used yet, because I keep messing up the indices in the write_rgb8 loop.
With the U and V planes at a quarter of the size, it is a bit tricky to wrap my head around the indexing. Right now, we process the image row-wise. While we get ordered access to the Y plane, we have to look up the same U and V values twice, correct?


What do you think of wide or ndarray? I have had good experiences with ndarray for easy implicit SIMD optimizations, but the U and V planes would probably have to be scaled for that.

@ralfbiedert (Owner)

With the U and V planes a quarter of the size, it is a bit tricky to wrap my head around the indexing. Right now, we process the image row-wise. While we get ordered access to the Y plane, we have to lookup the same U and V values twice, correct?

I haven't looked into this, but I wouldn't be surprised if you have to re-think memory access, cache and loop order to get the best performance. Not sure how relevant it is given its age, but I found a paper on the problem:

http://lestourtereaux.free.fr/papers/data/yuvrgb.pdf

@jnnks (Author) commented Dec 4, 2024

The paper contains some interesting tricks. I implemented the lookup table and it slightly reduces the benchmark time.

test convert_yuv_to_rgb_512x512            ... bench:   1,240,672.85 ns/iter (+/- 229,919.14)
test convert_yuv_to_rgb_512x512_int_math   ... bench:   1,398,108.16 ns/iter (+/- 121,577.64)
test convert_yuv_to_rgb_512x512_x8         ... bench:   1,370,854.05 ns/iter (+/- 75,570.61)

test convert_yuv_to_rgb_512x512_int_lookup ... bench:   1,242,725.61 ns/iter (+/- 85,690.72)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,167,169.96 ns/iter (+/- 95,340.19)

Some ideas from the paper rely on pointer magic, which is at least unsafe {...} or unsupported in Rust. I really liked the idea of replacing the clamping with a lookup (Sec 3.3). The conversion & clamping from f32 and i32 back to u8 takes a considerable amount of time:

test clamping_f32_u8                       ... bench:     472,408.60 ns/iter (+/- 104,557.68)
test clamping_i32_u8                       ... bench:     563,045.05 ns/iter (+/- 146,890.94)
test clamping_lookup                       ... bench:     514,645.23 ns/iter (+/- 73,549.45)
  • clamping_i32_u8: from i32 to u8 with i32.clamp(0, 255) as u8
    • directly casting i32 as u8 does not work, as it will over/underflow
  • clamping_f32_u8: from f32 to u8 with f32 as u8
  • clamping_lookup: from isize to u8 with lookup table as described in the paper, using unsafe pointer manipulation
    • I probably did something wrong here, Rust does not support pointer math as elegantly as the C code in the paper

The clamping tests use a 512x512x3-element buffer, so the measured numbers align with the other benchmarks. When removing the clamp(0, 255) from the int_math algorithm, the time is reduced by roughly the amount clamping_i32_u8 takes: about 40% of the time is spent clamping.

I will look further into the clamping and will research branchless alternatives. For f32s it's mandatory, but maybe there is some bit magic for the i32s.
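Minimal sketches of the three clamping strategies being compared. These are reconstructions written for this note, not the benchmark code; the table size assumes the [-227, 480] range quoted from the paper:

```rust
// i32 clamp: two compares per value; a bare `as u8` would wrap instead.
fn clamp_i32(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

// f32 cast: float-to-int `as` casts in Rust saturate to the target range,
// so no explicit clamp is needed on the float path.
fn clamp_f32(v: f32) -> u8 {
    v as u8
}

// Lookup as in the paper: precompute clamped values for every possible raw
// result in [-227, 480] (708 values) and index with an offset of 227.
fn clamp_lookup(table: &[u8; 708], v: isize) -> u8 {
    table[(v + 227) as usize]
}

fn main() {
    let mut table = [0u8; 708];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = (i as isize - 227).clamp(0, 255) as u8;
    }
    assert_eq!(clamp_i32(300), 255);
    assert_eq!(clamp_i32(-5), 0);
    assert_eq!(clamp_f32(300.7), 255);
    assert_eq!(clamp_f32(-3.0), 0);
    assert_eq!(clamp_lookup(&table, 480), 255);
    assert_eq!(clamp_lookup(&table, 100), 100);
}
```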

@ralfbiedert (Owner)

I only skimmed the code, but I don't really understand how the pointer magic works. For now this is totally fine, but we probably want a safe alternative for later (maybe a safe transmute on a slice level if really needed).

Also, it appears your benchmark does these lookups in an incremental fashion, which probably means you get lots of caching, but in reality your access might be "random" and thus perform worse.

@jnnks (Author) commented Dec 5, 2024

I only skimmed the code, but I don't really understand how the pointer magic works

It behaves like a hard-sigmoid activation function. In the example, the YUV->RGB formula has a limited range of possible raw values for each color; for blue that is [-227, 480]. E. Dupuis builds an array of (227 + 480) values and uses the raw color value as an index to map the unclamped value to a clamped one.
I changed it to not use undefined behavior.

Also, it appears your benchmark does these lookups in an incremental fashion, which probably means you get lots of caching, but in reality your access might be "random" and thus perform worse.

You are right! Especially the lookup is significantly worse after using random numbers.

test clamping_f32_u8                       ... bench:     467,216.20 ns/iter (+/- 62,301.58)
test clamping_i32_u8                       ... bench:     626,011.79 ns/iter (+/- 74,282.14)
test clamping_lookup                       ... bench:     736,682.20 ns/iter (+/- 63,212.27)

@jnnks (Author) commented Dec 6, 2024

I figured out the indices for the x8 implementation, so the test passes now (a deviation of 1 per pixel per color is allowed).

The integer math is using i16s now, since the accuracy is similar. I did not test the entire spectrum of YUV values, but for the included h264 frame the RGB values are the same. This will be very interesting, because we can do i16x16 operations for add, sub, mul and clamping using AVX.

The existing x8 implementation is now using wide::f32x8 (wide) and is a bit faster than the original function.

test convert_yuv_to_rgb_512x512            ... bench:   1,353,391.38 ns/iter (+/- 154,339.35)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,352,759.49 ns/iter (+/- 244,724.19)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:   1,566,020.30 ns/iter (+/- 213,608.03)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:   1,488,072.48 ns/iter (+/- 194,859.50)
test convert_yuv_to_rgb_512x512_x8         ... bench:   1,260,816.24 ns/iter (+/- 113,147.20)

When compiling with RUSTFLAGS='-C target-cpu=native' I can see that AVX instructions are generated, but the improvement is a bit disappointing to be honest.

@ralfbiedert (Owner) commented Dec 6, 2024

My numbers look a bit different relatively speaking, x8 performing the worst:

test convert_yuv_to_rgb_512x512            ... bench:     620,730.00 ns/iter (+/- 22,816.00)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:     557,630.00 ns/iter (+/- 14,977.00)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:     613,970.00 ns/iter (+/- 17,916.00)
test convert_yuv_to_rgb_512x512_lookup     ... bench:     446,850.00 ns/iter (+/- 26,981.00)
test convert_yuv_to_rgb_512x512_x8         ... bench:     715,435.00 ns/iter (+/- 62,842.00)

The hot paths are:

[profiler screenshot: hot paths]

with the disassembly in question being:

[screenshot: disassembly]

The issue you're probably running into is that you're manually (serially) trying to convert f32 values into u8 (r_pack[i] as u8). If I instead do this on my machine

let (r_pack, g_pack, b_pack) = (r_pack.round_int(), g_pack.round_int(), b_pack.round_int());
let (r_pack, g_pack, b_pack) = (r_pack.as_array_ref(), g_pack.as_array_ref(), b_pack.as_array_ref());

I get a total time of

test convert_yuv_to_rgb_512x512_x8         ... bench:     228,533.33 ns/iter (+/- 17,070.33)

That said, not sure if that's the fastest / best way of converting.

@ralfbiedert (Owner)

Apparently x_pack.fast_trunc_int() is the way to go. Maybe also do mul_add instead of the manual multiplication and addition for the x_pack values, which might be faster and improve the math.
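The mul_add idea, shown on scalar f32 for brevity (the actual code would use the wide vector equivalent so the compiler can emit FMA instructions; the 1.402 coefficient and function name here are illustrative, not taken from the PR):

```rust
// Fused multiply-add computes v * c + y with a single rounding step,
// which is both faster (one FMA instruction) and slightly more accurate
// than a separate multiply and add.
fn r_from_yv(y: f32, v: f32) -> f32 {
    v.mul_add(1.402, y)
}

fn main() {
    let r = r_from_yv(100.0, 50.0);
    // 50 * 1.402 + 100 ≈ 170.1
    assert!((r - 170.1).abs() < 1e-4);
}
```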

Crossing my fingers, this feels like it's on the home stretch!

@jnnks (Author) commented Dec 7, 2024

My numbers look a bit different relatively speaking, x8 performing the worst:

Interesting! I just ran the benchmark again after a reboot and still get the same numbers. It sounds silly, but were you actually using the computer while running the benchmark? Starting a YouTube video during the benchmark skews the results for me. But I also only have 4 physical cores :')


Good catch with round_int and trunc_int, I completely missed those!
Unfortunately the tests for x8 fail when using round_int, trunc_int, and the fast_ variants. From the trunc_int docs: "... This saturates out of range values ..." — I am afraid it saturates to the i32 range, not u8, so we need to clamp first:

let upper_bound = wide::f32x8::splat(255.0);
let lower_bound = wide::f32x8::splat(0.0);
...
let (r_pack, g_pack, b_pack) = (
    r_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    g_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    b_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
);
let (r_pack, g_pack, b_pack) = (r_pack.as_array_ref(), g_pack.as_array_ref(), b_pack.as_array_ref());

The numbers are still really good though:

test convert_yuv_to_rgb_512x512            ... bench:   1,258,478.84 ns/iter (+/- 233,945.48)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:   1,325,969.98 ns/iter (+/- 72,279.72)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:   1,407,387.25 ns/iter (+/- 82,439.34)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,179,533.52 ns/iter (+/- 69,437.23)
test convert_yuv_to_rgb_512x512_x8         ... bench:     594,461.93 ns/iter (+/- 33,532.46)
test convert_yuv_to_rgb_512x512_x8_mul_add ... bench:     560,746.76 ns/iter (+/- 35,957.46)

Crossing my fingers, this feels like it's on the home stretch!

I am super curious about the i16x16 results now :)

@ralfbiedert (Owner) commented Dec 7, 2024

It sounds silly, but were you actually using the computer when running the benchmark?

Not really; the results are pretty consistent for me. I just reran the previous commit and the numbers are more or less the same (+/- 20us), but x8 was definitely by far the worst. That said, this is probably the impact of a slightly different CPU architecture (7950X3D), cache size, ...

New commit is fantastic, x8_mul_add beats everything:

test convert_yuv_to_rgb_512x512            ... bench:     651,320.00 ns/iter (+/- 16,189.00)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:     560,112.50 ns/iter (+/- 7,469.12)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:     638,930.00 ns/iter (+/- 13,611.00)
test convert_yuv_to_rgb_512x512_lookup     ... bench:     461,365.00 ns/iter (+/- 20,584.00)
test convert_yuv_to_rgb_512x512_x8         ... bench:     218,943.33 ns/iter (+/- 4,399.67)
test convert_yuv_to_rgb_512x512_x8_mul_add ... bench:     207,442.50 ns/iter (+/- 4,412.88)

How do you want to proceed with this? From my side, this is almost ready to merge. How about

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well
  • have a new write_rgb8 that checks whether the image size is eligible for SIMD: if yes, it dispatches to SIMD; if not, it uses the scalar one (or maybe all of that isn't necessary because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)
  • add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

Would that be missing something? I assume wide is cross platform and just falls back to scalar operations and the overhead of automatic selection is practically nil?

@jnnks (Author) commented Dec 7, 2024

i16x16 is not faster than f32x8 and considering all of the weird code introduced by the integer math idea, I would drop the topic entirely and do f32 math exclusively.

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well

All yes

  • have a new write_rgb8 that checks if the image size is eligible for SIMD, if yes, it dispatches to SIMD, if not it uses scalar one (or maybe all of that isn't necessary at all because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)

Technically the video can be any size, especially when I think about the web and responsive design. But the underlying encoder will probably pad the buffers internally. I will look into specialized buffers with padding.

  • add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

I will check the entire spectrum

I assume wide is cross platform and just falls back to scalar operations and the overhead of automatic selection is practically nil?

Yes, it will check for avx, sse, simd128 (for wasm32), and neon (for aarch64), and fall back to scalar math if none are found. It's a compile-time check.

Would that be missing something?

If performance is linear and my calculations are correct, my CPU should now be able to convert a 4K 30 fps video in real time. I think that's a good milestone :)

Jannik Schleicher added 3 commits December 9, 2024 13:53
- remove experimental implementations
- add benchmarks for 4k resolution
@jnnks (Author) commented Dec 9, 2024

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well

The benchmarks would not work anymore in that case, so I made both methods pub for now. Is there a cfg or feature flag for benchmarks?

have a new write_rgb8 that checks if the image size is eligible for SIMD, if yes, it dispatches to SIMD, if not it uses scalar one (or maybe all of that isn't necessary at all because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)

I added a simple check (width % 8 == 0) for now. That should be good enough until there are specialized, aligned buffers.
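The dispatch this describes amounts to something like the following (a sketch; `pick_path` is a made-up helper, the real method is write_rgb8 with the branch inlined):

```rust
// Each SIMD iteration consumes 8 pixels of a row, so rows must divide by 8
// for the f32x8 path to cover the image exactly.
fn pick_path(width: usize) -> &'static str {
    if width % 8 == 0 { "simd" } else { "scalar" }
}

fn main() {
    assert_eq!(pick_path(512), "simd");
    assert_eq!(pick_path(1920), "simd");
    assert_eq!(pick_path(511), "scalar");
}
```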

add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

Turns out the results differ by 1 in some cases. I have not found a pattern yet, but I assume it's a rounding error.
See test_write_rgb8_f32x8_spectrum.
What do you want to do here?

@ralfbiedert (Owner)

Turns out, that the results differ by 1 on some occasions. I have not found a pattern yet, but I assume it's a rounding error. See test_write_rgb8_f32x8_spectrum. What do you want to do here?

Hm, it appears this would only be an issue if you changed resolution during decoding, which probably happens so rarely that noticing a 1-pixel difference is not an issue, given the entire pipeline is already lossy in the first place.

It's also unclear if any one of these is more correct, but that in turn has some related questions w.r.t color profiles.

I think in the long term we want to move RGB conversion away from being an impl on the YUV buffer, and instead have it as a separate struct / traits so people can pick and / or implement their own conversion logic, address color profiles, ..., but that's for another PR.

For now I'd just go ahead and land this as discussed, and then later this can just be refactored to give people more flexibility.

Jannik Schleicher added 2 commits December 9, 2024 15:42
@jnnks (Author) commented Dec 9, 2024

Seems like the CI/CD pipeline does not have an issue with the test either. I added a max diff of one per color value.

I am happy with the state of this PR; if there is anything else you want, let me know.

@ralfbiedert ralfbiedert marked this pull request as ready for review December 9, 2024 18:51
@ralfbiedert ralfbiedert merged commit 937e248 into ralfbiedert:master Dec 9, 2024
17 checks passed
@ralfbiedert (Owner)

Since this is merged now I wanted to say again: thank you so much! This finally fixed a long-standing performance issue, and I'm extremely happy with how neat the resulting solution is!

I did a few smaller clean-up commits that I didn't want to stall the PR with (mostly about the macro use, running fmt, and a note on target-cpu=native to actually get the benefits in here). I also took the liberty of crediting you in the README. Please let me know if you object to either.

I'll release a new version (0.6.4) with your improvements in a bit.

@ralfbiedert (Owner)

0.6.4 has been released.

@jnnks jnnks deleted the yuv2rgb_perf branch December 11, 2024 10:17