
Improve YUV to RGB Performance #66

Merged · 18 commits · Dec 9, 2024

Conversation

@jnnks commented Nov 21, 2024

  • avoids floating point math
  • it delivers the same result for the two benchmark test cases, but I imagine it does not deliver the same result for all inputs

Benchmarks

cargo +nightly bench convert_yuv_to_rgb_512x512:

  • before: 1,318,003.51 ns/iter (+/- 222,491.23)
  • after: 866,946.60 ns/iter (+/- 41,533.46)

cargo +nightly bench convert_yuv_to_rgb_1920x1080:

  • before: 10,523,719.80 ns/iter (+/- 2,798,988.23)
  • after: 7,002,009.60 ns/iter (+/- 1,819,635.86)

Tested against 2588499

I hope you like funky math :)
If you are ok with this, I can continue with the other functions and clean up afterwards

@ralfbiedert (Owner)

Thanks for following up on this! I have no problem with integer math there if the results are equal-ish, but some suggestions:

  • Since other people might feel differently, there should be a way to choose what math to do
  • That "way" could be having multiple methods, maybe even some type-trait-parameter magic like xxx.write_rgb::<IntegerMath>(&mut buffer) that might allow for multiple implementations (that conceivably even users could do, then maybe even with Rayon in their own code)
  • I don't get the unchecked here, you don't seem to be doing unsafe (and you shouldn't, at least not in the default implementation; if we had type'ed impls, maybe we could have one)?
  • I'd also recommend running this with a profiler / disassembler to see if you get AVX2+ instructions when compiling for native CPUs. If not, it might be worthwhile to make sure the buffer is allocated with the proper alignment, and maybe introduce some aligned wrapper types?
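The aligned-wrapper idea in the last bullet could look roughly like this. A minimal sketch; `Aligned32` is a placeholder name invented here, not anything in the crate:

```rust
// Hypothetical overaligned wrapper; name and size are illustrative only.
#[repr(C, align(32))]
#[derive(Clone, Copy)]
struct Aligned32([u8; 32]);

fn main() {
    // A Vec of Aligned32 is guaranteed to start on a 32-byte boundary,
    // which lets the compiler emit aligned AVX2 loads over the raw bytes.
    let buffer = vec![Aligned32([0u8; 32]); 64];
    assert_eq!(std::mem::align_of::<Aligned32>(), 32);
    assert_eq!(buffer.as_ptr() as usize % 32, 0);
}
```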

@jnnks (Author) commented Nov 22, 2024

Since other people might feel differently, there should be a way to choose what math to do

  • I can add a feature flag

That "way" could be having multiple methods, maybe even some type-trait-parameter magic ...

  • I would prefer a different method name, tbh
  • from the perspective of someone looking at the lib for the first time, a suffix on the method name is much easier to understand

I don't get the unchecked here, ...

  • unchecked as in no asserts to check buffer bounds

I'd also recommend to run this with a profiler...

  • that will take some time I assume

@ralfbiedert (Owner)

I can add a feature flag

I wouldn't do feature flags here. These should just be different methods, not different implementations of the same method, and if it's just a bit of logic (e.g., no crate dependencies or so) features are too heavyweight.

I would prefer a different method name, tbh

Sure, let's start with another method.

unchecked as in no asserts to check buffer bounds

Unchecked is a rather technical term in Rust and usually means strictly no bound checks via unsafe. At least the code you have so far still has bounds checks (e.g., self.y[base_y] checks base_y is inside the array), and I prefer to keep it that way.
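To illustrate the distinction (a toy example written for this note, not code from the PR):

```rust
// Checked: plain indexing keeps the bounds check and panics if out of range.
fn luma_checked(y_plane: &[u8], i: usize) -> u8 {
    y_plane[i]
}

// "Unchecked" in the strict Rust sense: requires `unsafe`, skips the bounds
// check entirely, and is undefined behavior if `i` is out of range.
fn luma_unchecked(y_plane: &[u8], i: usize) -> u8 {
    unsafe { *y_plane.get_unchecked(i) }
}

fn main() {
    let y = [16u8, 128, 235];
    assert_eq!(luma_checked(&y, 2), 235);
    assert_eq!(luma_unchecked(&y, 2), 235);
}
```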

Jannik Schleicher added 2 commits November 24, 2024 12:40
@jnnks (Author) commented Nov 24, 2024

Unfortunately I made some mistakes during testing. The integer math implementation was not correct.

cargo +nightly bench convert_yuv_to_rgb_512x512

test convert_yuv_to_rgb_512x512          ... bench:   1,320,918.10 ns/iter (+/- 116,723.12)
test convert_yuv_to_rgb_512x512_int_math ... bench:   1,015,732.15 ns/iter (+/- 66,961.28)

RUSTFLAGS='-C target-cpu=native' cargo +nightly bench convert_yuv_to_rgb_512x512

test convert_yuv_to_rgb_512x512          ... bench:   1,218,058.07 ns/iter (+/- 230,080.26)
test convert_yuv_to_rgb_512x512_int_math ... bench:   1,581,914.55 ns/iter (+/- 364,022.30)

When running the benchmark with RUSTFLAGS='-C target-cpu=native', the int math becomes significantly slower, but I am not sure why.

@ralfbiedert (Owner)

When running the benchmark with RUSTFLAGS='-C target-cpu=native', the int math becomes significantly slower, but I am not sure why.

Again, I recommend a profiler. If you are on Windows, Superluminal might be worth a shot (they have a free trial version). As a wild guess, running as native might unlock float vector instructions (and / or maybe even some FMA), while at the same time your integer loop might be harder to optimize and / or do more work (e.g., clamping, which probably adds 2 * 3 = 6 branches per pixel).

@jnnks (Author) commented Nov 24, 2024

I tried some of the recommended profilers from the profiling section in the Rust book, but I need to understand them properly.

The granularity of the profilers I tried is functions, not instructions, so that is not immediately helpful.
RUSTFLAGS='-C target-cpu=native' samply record cargo +nightly bench convert_yuv_to_rgb_512x512 generates the following call tree:
[screenshot: samply call tree]
This suggests to me that write_rgb8_int_math takes half as much time, since its sample count is roughly half. But in reality that is not the case at all.

I have never done any serious profiling, so please bear with me here :)
I am using Linux.

@ralfbiedert (Owner)

I just ran this on my machine (Ryzen 9 7950X3D) with native CPU.

test convert_yuv_to_rgb_512x512          ... bench:     643,150.00 ns/iter (+/- 32,933.50)
test convert_yuv_to_rgb_512x512_int_math ... bench:     588,450.00 ns/iter (+/- 46,479.00)

Using Superluminal the "hot" sections for each are this:

Float math: [profiler screenshot]

Int math: [profiler screenshot]

The bad news is, both functions are pretty poor assembly-wise (no vectorization, just single scalar operations). The good news is, it should be possible to get a good speed boost beyond both the floaty one and the int one.

What I'd do (without having actually tried, YMMV) is vaguely:

  • Check what's the cost of just memcpy-ing the entire RGB / YUV texture. If that time is "very small" relative to the conversion, proceed to next step
  • Have internal buffers for YUV / RGB that are overaligned (e.g., to 4 / 8 / 16 / 32 bytes) and that hold multiple elements (e.g., 4); you'll probably want a special transparent struct for that (e.g., PixelComponents([u8; 4])). The trick here is to make it really easy for the compiler to see 1) that all data is aligned and 2) how you access that aligned data.
  • Do your YUV -> RGB conversion in a loop like before, but instead of doing 1 element, do all 4 elements in that loop "unrolled". You might need some extra handling of excess data at the end of each row (e.g., just access the next bytes but discard them), and at the end of the buffer (stop the loop early and do the last elements manually).

Once you've done these steps you have a really good chance that by then the compiler groks that it should vectorize that unrolled loop, giving you hopefully a 4x speedup (you can try the same with 8x or 16x), and possibly get even more of a speedup.

You'd pay the overhead of copying the data into and out of these overaligned buffers, but that might be much less than the cost of doing unaligned conversion math on all of them.
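The unrolled-loop step could be sketched like this. A toy single-plane example under assumptions made up here (an invented `scale_plane_x4` that just biases one plane); the real conversion also reads the U/V planes and applies the YUV coefficients:

```rust
// Process 4 elements per iteration so the compiler can vectorize the body,
// then handle the remainder scalar-wise.
fn scale_plane_x4(src: &[u8], dst: &mut [u8]) {
    let n = src.len() / 4 * 4;
    for i in (0..n).step_by(4) {
        // Unrolled body: four independent operations, easy to auto-vectorize.
        dst[i] = src[i].saturating_add(16);
        dst[i + 1] = src[i + 1].saturating_add(16);
        dst[i + 2] = src[i + 2].saturating_add(16);
        dst[i + 3] = src[i + 3].saturating_add(16);
    }
    for i in n..src.len() {
        // Tail: leftover elements that don't fill a full group of 4.
        dst[i] = src[i].saturating_add(16);
    }
}

fn main() {
    let src = [250u8, 1, 2, 3, 4];
    let mut dst = [0u8; 5];
    scale_plane_x4(&src, &mut dst);
    assert_eq!(dst, [255, 17, 18, 19, 20]);
}
```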

Jannik Schleicher added 2 commits November 28, 2024 21:35
@jnnks (Author) commented Nov 28, 2024

Check what's the cost of just memcpy-ing the entire RGB / YUV texture. If that time is "very small" relative to the conversion, proceed to next step

I tested this and copying the relevant parts of the buffer (width * height not stride * height) is <2% of the conversion time. When packing the color values into groups of 8, it takes <3%.

RUSTFLAGS='-C target-cpu=native' cargo +nightly bench convert_yuv_to_rgb_512x512
test convert_yuv_to_rgb_512x512                ... bench:   1,476,245.10 ns/iter (+/- 429,006.75)
test convert_yuv_to_rgb_512x512_copy_planes    ... bench:      20,203.08 ns/iter (+/- 2,335.37)
test convert_yuv_to_rgb_512x512_copy_planes_x8 ... bench:      35,961.64 ns/iter (+/- 5,412.43)
test convert_yuv_to_rgb_512x512_int_math       ... bench:   1,396,169.80 ns/iter (+/- 184,486.73)
test convert_yuv_to_rgb_512x512_x8             ... bench:   1,373,674.05 ns/iter (+/- 45,413.79)

Have internal buffers for YUV / RGB that are overaligned (e.g., to 4 / 8 / 16 / 32 bytes) and that hold multiple elements (e.g., 4); you'll probably want a special transparent struct for that (e.g., PixelComponents([u8; 4])). The trick here is to make it really easy for the compiler to see 1) that all data is aligned and 2) how you access that aligned data.

I like the idea of tricking the compiler into optimizing. write_rgb8_x8 is a minimum-effort implementation (the test test_write_rgb8_x8 fails because the values diverge too much).

The copy_planes_x8 function is not used yet, because I keep messing up the indices in the write_rgb8 loop.
With the U and V planes at a quarter of the size, it is a bit tricky to wrap my head around the indexing. Right now, we process the image row-wise. While we get ordered access to the Y plane, we have to look up the same U and V values twice, correct?


What do you think of wide or ndarray? I have had good experiences with ndarray for easy implicit SIMD optimizations, but the U and V planes would probably have to be scaled for that.

@ralfbiedert (Owner)

With the U and V planes a quarter of the size, it is a bit tricky to wrap my head around the indexing. Right now, we process the image row-wise. While we get ordered access to the Y plane, we have to lookup the same U and V values twice, correct?

I haven't looked into this, but I wouldn't be surprised if you have to re-think memory access, cache and loop order to get the best performance. Not sure how relevant it is given its age, but I found a paper on the problem:

http://lestourtereaux.free.fr/papers/data/yuvrgb.pdf

@jnnks (Author) commented Dec 4, 2024

The paper contains some interesting tricks. I implemented the lookup table and it slightly reduces the benchmark time.

test convert_yuv_to_rgb_512x512            ... bench:   1,240,672.85 ns/iter (+/- 229,919.14)
test convert_yuv_to_rgb_512x512_int_math   ... bench:   1,398,108.16 ns/iter (+/- 121,577.64)
test convert_yuv_to_rgb_512x512_x8         ... bench:   1,370,854.05 ns/iter (+/- 75,570.61)

test convert_yuv_to_rgb_512x512_int_lookup ... bench:   1,242,725.61 ns/iter (+/- 85,690.72)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,167,169.96 ns/iter (+/- 95,340.19)

Some ideas from the paper rely on pointer magic, which is at least unsafe {...} or unsupported in Rust. I really liked the idea of replacing the clamping with a lookup (Sec 3.3). The conversion & clamping from f32 and i32 back to u8 takes a considerable amount of time:

test clamping_f32_u8                       ... bench:     472,408.60 ns/iter (+/- 104,557.68)
test clamping_i32_u8                       ... bench:     563,045.05 ns/iter (+/- 146,890.94)
test clamping_lookup                       ... bench:     514,645.23 ns/iter (+/- 73,549.45)
  • clamping_i32_u8: from i32 to u8 with i32.clamp(0, 255) as u8
    • directly casting i32 as u8 does not work, as it will over/underflow
  • clamping_f32_u8: from f32 to u8 with f32 as u8
  • clamping_lookup: from isize to u8 with lookup table as described in the paper, using unsafe pointer manipulation
    • I probably did something wrong here, Rust does not support pointer math as elegantly as the C code in the paper

The clamping tests use a 512x512x3-element buffer, so the measured numbers align with the other benchmarks. When removing the clamp(0, 255) from the int_math algorithm, the time is reduced by roughly the amount clamping_i32_u8 takes: about 40% of the time is spent clamping.

I will look further into the clamping and will research branchless alternatives. For f32s it's mandatory, but maybe there is some bit magic for the i32s.
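Minimal sketches of the three clamping strategies being compared. These are reconstructions written for this note, not the benchmark code; the table size assumes the [-227, 480] range quoted from the paper:

```rust
// i32 clamp: two compares per value; a bare `as u8` would wrap instead.
fn clamp_i32(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

// f32 cast: float-to-int `as` casts in Rust saturate to the target range,
// so no explicit clamp is needed on the float path.
fn clamp_f32(v: f32) -> u8 {
    v as u8
}

// Lookup as in the paper: precompute clamped values for every possible raw
// result in [-227, 480] (708 values) and index with an offset of 227.
fn clamp_lookup(table: &[u8; 708], v: isize) -> u8 {
    table[(v + 227) as usize]
}

fn main() {
    let mut table = [0u8; 708];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = (i as isize - 227).clamp(0, 255) as u8;
    }
    assert_eq!(clamp_i32(300), 255);
    assert_eq!(clamp_i32(-5), 0);
    assert_eq!(clamp_f32(300.7), 255);
    assert_eq!(clamp_f32(-3.0), 0);
    assert_eq!(clamp_lookup(&table, 480), 255);
    assert_eq!(clamp_lookup(&table, 100), 100);
}
```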

@ralfbiedert (Owner)

I only skimmed the code, but I don't really understand how the pointer magic works. For now this is totally fine, but we probably want a safe alternative for later (maybe a safe transmute on a slice level if really needed).

Also, it appears your benchmark does these lookups in an incremental fashion, which probably means you get lots of caching, but in reality your access might be "random" and thus perform worse.

@jnnks (Author) commented Dec 5, 2024

I only skimmed the code, but I don't really understand how the pointer magic works

It behaves like a hard-sigmoid activation function. In the example, the YUV->RGB formula has a limited range of possible raw values for each color; for blue that is [-227, 480]. E. Dupuis builds an array of (227 + 480) values and uses the raw color value as an index to map the unclamped value to a clamped one.
I changed it to not use undefined behavior.

Also, it appears your benchmark does these lookups in an incremental fashion, which probably means you get lots of caching, but in reality your access might be "random" and thus perform worse.

You are right! Especially the lookup is significantly worse after using random numbers.

test clamping_f32_u8                       ... bench:     467,216.20 ns/iter (+/- 62,301.58)
test clamping_i32_u8                       ... bench:     626,011.79 ns/iter (+/- 74,282.14)
test clamping_lookup                       ... bench:     736,682.20 ns/iter (+/- 63,212.27)

@jnnks (Author) commented Dec 6, 2024

I figured out the indices for the x8 implementation, so the test passes now (a deviation of 1 per pixel per color is allowed).

The integer math is using i16s now, since the accuracy is similar. I did not test the entire spectrum of YUV values, but for the included h264 frame the RGB values are the same. This will be very interesting, because we can do i16x16 operations for add, sub, mul and clamping using AVX.

The existing x8 implementation is now using wide::f32x8 (wide) and is a bit faster than the original function.

test convert_yuv_to_rgb_512x512            ... bench:   1,353,391.38 ns/iter (+/- 154,339.35)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,352,759.49 ns/iter (+/- 244,724.19)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:   1,566,020.30 ns/iter (+/- 213,608.03)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:   1,488,072.48 ns/iter (+/- 194,859.50)
test convert_yuv_to_rgb_512x512_x8         ... bench:   1,260,816.24 ns/iter (+/- 113,147.20)

When compiling with RUSTFLAGS='-C target-cpu=native' I can see that AVX instructions are generated, but the improvement is a bit disappointing to be honest.

@ralfbiedert (Owner) commented Dec 6, 2024

My numbers look a bit different relatively speaking, x8 performing the worst:

test convert_yuv_to_rgb_512x512            ... bench:     620,730.00 ns/iter (+/- 22,816.00)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:     557,630.00 ns/iter (+/- 14,977.00)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:     613,970.00 ns/iter (+/- 17,916.00)
test convert_yuv_to_rgb_512x512_lookup     ... bench:     446,850.00 ns/iter (+/- 26,981.00)
test convert_yuv_to_rgb_512x512_x8         ... bench:     715,435.00 ns/iter (+/- 62,842.00)

The hot paths are:

[profiler screenshot: hot paths]

with the disassembly in question being:

[screenshot: disassembly]

The issue you're probably running into is that you're manually (serially) trying to convert f32 values into u8 (r_pack[i] as u8). If I instead do this on my machine

let (r_pack, g_pack, b_pack) = (r_pack.round_int(), g_pack.round_int(), b_pack.round_int());
let (r_pack, g_pack, b_pack) = (r_pack.as_array_ref(), g_pack.as_array_ref(), b_pack.as_array_ref());

I get a total time of

test convert_yuv_to_rgb_512x512_x8         ... bench:     228,533.33 ns/iter (+/- 17,070.33)

That said, not sure if that's the fastest / best way of converting.

@ralfbiedert (Owner)

Apparently x_pack.fast_trunc_int() is the way to go. Maybe also do mul_add instead of the manual multiplication and addition for the x_pack values, which might be faster and improve the math.
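The mul_add idea, shown on scalar f32 for brevity (the actual code would use the wide vector equivalent so the compiler can emit FMA instructions; the 1.402 coefficient and function name here are illustrative, not taken from the PR):

```rust
// Fused multiply-add computes v * c + y with a single rounding step,
// which is both faster (one FMA instruction) and slightly more accurate
// than a separate multiply and add.
fn r_from_yv(y: f32, v: f32) -> f32 {
    v.mul_add(1.402, y)
}

fn main() {
    let r = r_from_yv(100.0, 50.0);
    // 50 * 1.402 + 100 ≈ 170.1
    assert!((r - 170.1).abs() < 1e-4);
}
```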

Crossing my fingers, this feels like it's on the home stretch!

@jnnks (Author) commented Dec 7, 2024

My numbers look a bit different relatively speaking, x8 performing the worst:

Interesting! I just ran the benchmark again after a reboot and still get the same numbers. It sounds silly, but were you actually using the computer while running the benchmark? Starting a YouTube video during the benchmark skews the results for me. But I also only have 4 physical cores :')


Good catch with round_int and trunc_int, I completely missed those!
Unfortunately the tests for x8 fail when using round_int, trunc_int, and the fast_ variants. From the trunc_int docs: "... This saturates out of range values ..." — I am afraid it saturates to the i32 range, not u8, so we need to clamp first:

let upper_bound = wide::f32x8::splat(255.0);
let lower_bound = wide::f32x8::splat(0.0);
...
let (r_pack, g_pack, b_pack) = (
    r_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    g_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
    b_pack.fast_min(upper_bound).fast_max(lower_bound).fast_trunc_int(),
);
let (r_pack, g_pack, b_pack) = (r_pack.as_array_ref(), g_pack.as_array_ref(), b_pack.as_array_ref());

The numbers are still really good though:

test convert_yuv_to_rgb_512x512            ... bench:   1,258,478.84 ns/iter (+/- 233,945.48)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:   1,325,969.98 ns/iter (+/- 72,279.72)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:   1,407,387.25 ns/iter (+/- 82,439.34)
test convert_yuv_to_rgb_512x512_lookup     ... bench:   1,179,533.52 ns/iter (+/- 69,437.23)
test convert_yuv_to_rgb_512x512_x8         ... bench:     594,461.93 ns/iter (+/- 33,532.46)
test convert_yuv_to_rgb_512x512_x8_mul_add ... bench:     560,746.76 ns/iter (+/- 35,957.46)

Crossing my fingers, this feels like it's on the home stretch!

I am super curious about the i16x16 results now :)

@ralfbiedert (Owner) commented Dec 7, 2024

It sounds silly, but were you actually using the computer when running the benchmark?

Not really; the results are pretty consistent for me. I just reran the previous commit and the numbers are more or less the same (+/- 20us), but x8 was definitely by far the worst. That said, this is probably the impact of a slightly different CPU architecture (7950X3D), cache size, ...

New commit is fantastic, x8_mul_add beats everything:

test convert_yuv_to_rgb_512x512            ... bench:     651,320.00 ns/iter (+/- 16,189.00)
test convert_yuv_to_rgb_512x512_i16_lookup ... bench:     560,112.50 ns/iter (+/- 7,469.12)
test convert_yuv_to_rgb_512x512_i16_math   ... bench:     638,930.00 ns/iter (+/- 13,611.00)
test convert_yuv_to_rgb_512x512_lookup     ... bench:     461,365.00 ns/iter (+/- 20,584.00)
test convert_yuv_to_rgb_512x512_x8         ... bench:     218,943.33 ns/iter (+/- 4,399.67)
test convert_yuv_to_rgb_512x512_x8_mul_add ... bench:     207,442.50 ns/iter (+/- 4,412.88)

How do you want to proceed with this? From my side, this is almost ready to merge. How about

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well
  • have a new write_rgb8 that checks whether the image size is eligible for SIMD: if yes, it dispatches to SIMD; if not, it uses the scalar one (or maybe all of that isn't necessary because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)
  • add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

Would that be missing something? I assume wide is cross platform and just falls back to scalar operations and the overhead of automatic selection is practically nil?

@jnnks (Author) commented Dec 7, 2024

i16x16 is not faster than f32x8 and considering all of the weird code introduced by the integer math idea, I would drop the topic entirely and do f32 math exclusively.

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well

All yes

  • have a new write_rgb8 that checks if the image size is eligible for SIMD, if yes, it dispatches to SIMD, if not it uses scalar one (or maybe all of that isn't necessary at all because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)

Technically the video can be any size, especially when I think about the web and responsive design. But the underlying encoder will probably pad the buffers internally. I will look into specialized buffers with padding.

  • add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

I will check the entire spectrum

I assume wide is cross platform and just falls back to scalar operations and the overhead of automatic selection is practically nil?

Yes, it will check for avx, sse, simd128 (for wasm32), and neon (for aarch64), and fall back to scalar math if none are found. It's a compile-time check.

Would that be missing something?

If performance is linear and my calculations are correct, my CPU should now be able to convert a 4K 30 fps video in real time. I think that's a good milestone :)

Jannik Schleicher added 3 commits December 9, 2024 13:53
- remove experimental implementations
- add benchmarks for 4k resolution
@jnnks (Author) commented Dec 9, 2024

  • we remove all other YUV conversions except SIMD and original scalar one
  • rename the scalar and make it pub(crate) only
  • make the x8 pub(crate) as well

The benchmarks would not work anymore in that case, so I made both methods pub for now. Is there a cfg or feature flag for benchmarks?

have a new write_rgb8 that checks if the image size is eligible for SIMD, if yes, it dispatches to SIMD, if not it uses scalar one (or maybe all of that isn't necessary at all because only certain powers of 2 are acceptable for video sizes anyway, tbh no idea)

I added a simple check (width % 8 == 0) for now. That should be good enough until there are specialized, aligned buffers.
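The dispatch this describes amounts to something like the following (a sketch; `pick_path` is a made-up helper, the real method is write_rgb8 with the branch inlined):

```rust
// Each SIMD iteration consumes 8 pixels of a row, so rows must divide by 8
// for the f32x8 path to cover the image exactly.
fn pick_path(width: usize) -> &'static str {
    if width % 8 == 0 { "simd" } else { "scalar" }
}

fn main() {
    assert_eq!(pick_path(512), "simd");
    assert_eq!(pick_path(1920), "simd");
    assert_eq!(pick_path(511), "scalar");
}
```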

add one more unit test in yuv.rs that ensures (maybe on a random-ish YUV) that the results of SIMD and scalar remain the same in the future

Turns out the results differ by 1 in some cases. I have not found a pattern yet, but I assume it's a rounding error.
See test_write_rgb8_f32x8_spectrum.
What do you want to do here?

@ralfbiedert (Owner)

Turns out, that the results differ by 1 on some occasions. I have not found a pattern yet, but I assume it's a rounding error. See test_write_rgb8_f32x8_spectrum. What do you want to do here?

Hm, it appears this would only be an issue if you changed resolution during decoding, which probably happens so rarely that noticing a 1-pixel difference is not an issue, given the entire pipeline is already lossy in the first place.

It's also unclear if any one of these is more correct, but that in turn has some related questions w.r.t color profiles.

I think in the long term we want to move RGB conversion away from being an impl on the YUV buffer, and instead have it as a separate struct / traits so people can pick and / or implement their own conversion logic, address color profiles, ..., but that's for another PR.

For now I'd just go ahead and land this as discussed, and then later this can just be refactored to give people more flexibility.

Jannik Schleicher added 2 commits December 9, 2024 15:42
@jnnks (Author) commented Dec 9, 2024

Seems like the CI/CD pipeline does not have an issue with the test either. I added a max diff of one per color value.

I am happy with the state of this PR; if there is anything else you want, let me know.

@ralfbiedert ralfbiedert marked this pull request as ready for review December 9, 2024 18:51
@ralfbiedert ralfbiedert merged commit 937e248 into ralfbiedert:master Dec 9, 2024
17 checks passed
@ralfbiedert (Owner)

Since this is merged now I wanted to say again: thank you so much! This finally fixed a long-standing performance issue, and I'm extremely happy with how neat the resulting solution is!

I did a few smaller clean-up commits that I didn't want to stall the PR with (mostly about the macro use, running fmt, and a note on target-cpu=native to actually get the benefits in here). I also took the liberty of crediting you in the README. Please let me know if you object to either.

I'll release a new version (0.6.4) with your improvements in a bit.

@ralfbiedert (Owner)

0.6.4 has been released.

@jnnks jnnks deleted the yuv2rgb_perf branch December 11, 2024 10:17