Speed up RemoteFX encoding #571

elmarco · 2024-11-04T11:15:27Z

- introduce an rfx encoding benchmark with criterion
- speed up the DWT a bit with const generics
- use rayon to encode tiles in parallel
- speed up the rgb->yuv conversion a bit

Unfortunately, it's not simple to use hand-written assembly from a project like yuvutils, because RDP uses an odd bias and precision. Something left for another day.

CBenoit

Nice work!! Do you have some numbers from criterion? How much did the performance improve?

suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

crates/ironrdp-server/src/encoder/mod.rs

crates/ironrdp-graphics/Cargo.toml

crates/ironrdp-graphics/src/color_conversion.rs

crates/ironrdp-server/src/encoder/rfx.rs

crates/ironrdp-server/src/lib.rs

elmarco · 2024-11-04T13:56:57Z

fwiw, I opened awxkee/yuvutils-rs#3

CBenoit · 2024-11-04T13:59:39Z

> fwiw, I opened awxkee/yuvutils-rs#3

Thank you for following up on this. I'll watch the issue.

elmarco · 2024-11-04T18:26:45Z

> Nice work!! Do you have some numbers from criterion? How much the performance was improved?

Well, criterion is a huge time saver depending on your hardware, but it still uses the same amount of compute:
rfx_enc time: [6.4386 ms 6.4813 ms 6.5296 ms]
change: [-85.105% -85.013% -84.895%] (p = 0.00 < 0.05)
Performance has improved.

> suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

What's the rationale for putting tests and benchmarks in separate crates? Is the idea to ensure a common framework across the crates that way?

This can help wall-clock time a lot, but it depends on the CPU.
rfx_enc time: [9.7885 ms 10.123 ms 10.439 ms]
change: [-80.484% -79.847% -79.208%] (p = 0.00 < 0.05)
Performance has improved.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
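As a rough illustration of per-tile parallelism, here is a minimal sketch that swaps rayon for the standard library's `std::thread::scope`, so it builds without extra dependencies. `encode_tile` and `encode_tiles_parallel` are hypothetical names, and the "encoding" is a placeholder byte sum rather than anything from IronRDP. With rayon, the same shape would typically be a `par_iter().map(...).collect()` chain.

```rust
use std::thread;

/// Hypothetical stand-in for per-tile encoding work: sums the bytes of a tile.
fn encode_tile(tile: &[u8]) -> u64 {
    tile.iter().map(|&b| u64::from(b)).sum()
}

/// Encodes every tile, splitting the tiles across up to `workers` threads.
/// Results come back in the same order as the input tiles.
fn encode_tiles_parallel(tiles: &[Vec<u8>], workers: usize) -> Vec<u64> {
    if tiles.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so that every tile lands in exactly one chunk.
    let chunk_len = (tiles.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = tiles
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|t| encode_tile(t)).collect::<Vec<u64>>()
                })
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|h| h.join().expect("tile worker panicked"))
            .collect()
    })
}
```

As the commit message notes, this kind of change reduces wall-clock time on multi-core machines but not the total amount of compute.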

CBenoit · 2024-11-05T10:59:18Z

>> Nice work!! Do you have some numbers from criterion? How much the performance was improved?

> Well, criterion is huge time saver depending on your HW, but it still uses same amount of compute:
> rfx_enc time: [6.4386 ms 6.4813 ms 6.5296 ms]
> change: [-85.105% -85.013% -84.895%] (p = 0.00 < 0.05)
> Performance has improved.

Good numbers! By "criterion" here, you mean "rayon" for parallelizing the encoding?

>> suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

> what's the rationale to put tests and benchmark in various crates? We try to ensure a common framework for the various crates that way?

For the tests, the rationale is explained in several places:

ARCHITECTURE.md: https://github.com/Devolutions/IronRDP/blob/master/ARCHITECTURE.md#testing

tests/main.rs:

IronRDP/crates/ironrdp-testsuite-core/tests/main.rs

Lines 3 to 12 in 58f31b8

    
    //! Integration Tests (IT)
    //!
    //! Integration tests are all contained in this single crate, and organized in modules.
    //! This is to prevent `rustc` to re-link the library crates with each of the integration
    //! tests (one for each *.rs file / test crate under the `tests/` folder).
    //! Performance implication: https://github.com/rust-lang/cargo/pull/5022#issuecomment-364691154
    //!
    //! This is also good for execution performance.
    //! Cargo will run all tests from a single binary in parallel, but
    //! binaries themselves are run sequentally.

EDIT: Also a recommended read that makes similar points: https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html

I'll also add some less objective opinions and personal tastes:

- The lint unused_crate_dependencies will report fewer false positives.
- We can put code that is useful across tests in the library part of the crate, for instance the macros: https://github.com/Devolutions/IronRDP/blob/58f31b88e06209a0cf24bb17213ae081438e3e57/crates/ironrdp-testsuite-core/src/macros.rs
- The Cargo.toml is kept free of test-only elements that are not relevant from a consumer's point of view (although I think having them is typically fine).

For the benchmarks, in addition to the above arguments:

- It's easier for someone not very familiar with the codebase to discover the benchmarks.
- It's possible to see the list of all the benchmarks in the Cargo.toml.
- As for the tooling, you get autocompletion with `cargo bench --bench ^<TAB>`, which also lists the targets (assuming you have completers for that in your shell).

We also have a single ironrdp-fuzz crate in the fuzz folder at the root of the workspace for similar reasons. It's also easier to have a single README.md documenting things at one easy to discover place.

That's about it for the rationale behind this suggestion.

CBenoit

LGTM! Really nice work on the performance.

elmarco force-pushed the bench branch from 35aac2e to e9825d7 · November 4, 2024 12:03

CBenoit reviewed Nov 4, 2024

elmarco added 9 commits November 5, 2024 12:19

feat(server): warn if encoding takes >10ms

c01c382

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
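A minimal sketch of what such a timing check could look like, using only the standard library. The 10 ms budget comes from the commit title; `timed`, `over_budget`, and `encode_with_warning` are hypothetical names, and `eprintln!` stands in for whatever logging facility the server actually uses.

```rust
use std::time::{Duration, Instant};

/// Budget taken from the commit title; everything else here is illustrative.
const ENCODE_BUDGET: Duration = Duration::from_millis(10);

/// Runs `encode` and returns its result together with the elapsed time.
fn timed<T>(encode: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = encode();
    (out, start.elapsed())
}

/// True when an encode took strictly longer than the budget.
fn over_budget(elapsed: Duration) -> bool {
    elapsed > ENCODE_BUDGET
}

/// Hypothetical call site: run the encoder and warn when it was too slow.
fn encode_with_warning<T>(encode: impl FnOnce() -> T) -> T {
    let (out, elapsed) = timed(encode);
    if over_budget(elapsed) {
        // Assumption: a real server would use its logging framework here.
        eprintln!(
            "encoding took {:?}, over the {:?} budget",
            elapsed, ENCODE_BUDGET
        );
    }
    out
}
```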

refactor(server): factor out remotefx tile encoding

1b03948

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

feat(bench): benchmark the remotefx encoder

fd1fd39

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

feat(bench): benchmark rgb2yuv tile encoding

5aee8e6

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

refactor(graphics): use fixed-size slices in to_64x64_ycbcr_tile

d185b13

In theory, this could help the compiler unroll loops. That doesn't seem to be the case, but it at least allows dropping the assert_eq!().

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
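To illustrate the trade-off described in this commit message, here is a minimal sketch contrasting a slice parameter, which needs a runtime length assertion, with a fixed-size array reference, where the length is part of the type. The function names and the `i16` element type are assumptions for illustration; the actual signature of `to_64x64_ycbcr_tile` is not shown in this thread.

```rust
use std::convert::TryFrom;

const TILE_PIXELS: usize = 64 * 64;

/// Slice version: the length is only known at run time, so it is checked.
fn sum_checked(y: &[i16]) -> i32 {
    assert_eq!(y.len(), TILE_PIXELS);
    y.iter().map(|&v| i32::from(v)).sum()
}

/// Array version: the length is part of the type, so no assertion is needed.
fn sum_fixed(y: &[i16; TILE_PIXELS]) -> i32 {
    y.iter().map(|&v| i32::from(v)).sum()
}

/// Callers holding a slice convert once at the boundary; a wrong length
/// becomes `None` instead of a panic deep inside the hot loop.
fn sum_from_slice(y: &[i16]) -> Option<i32> {
    let fixed = <&[i16; TILE_PIXELS]>::try_from(y).ok()?;
    Some(sum_fixed(fixed))
}
```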

perf(graphics): use const generics for DWT

92fe437

That seems to speed up the code a bit:
rfxenc time: [46.040 µs 46.288 µs 46.698 µs]
change: [-9.2580% -8.6663% -7.8304%] (p = 0.00 < 0.05)
Performance has improved.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
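A minimal sketch of the general idea behind using const generics for a transform pass: the buffer length `N` becomes a compile-time constant, so loop bounds are known to the optimizer for each instantiation. The filter below is a simple sum/difference split chosen for brevity; it is not the RemoteFX DWT, and `haar_pass` is a hypothetical name.

```rust
/// One pass of a simple integer sum/difference split (NOT the RemoteFX filter).
/// The first half of `out` receives pair sums, the second half pair differences.
/// `N` is expected to be even; it is a compile-time constant, so the loop
/// bound `N / 2` is known when each instantiation is compiled.
fn haar_pass<const N: usize>(input: &[i32; N], out: &mut [i32; N]) {
    let half = N / 2;
    for i in 0..half {
        let a = input[2 * i];
        let b = input[2 * i + 1];
        out[i] = a + b;
        out[half + i] = a - b;
    }
}
```

Each distinct `N` produces its own specialized copy of the function, which is where any speedup would come from; whether it materializes depends on the compiler and the surrounding code, as the benchmark numbers in this thread show.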

refactor(graphics): const pixel_format_to_rgb_fn

93dc066

That doesn't change the speed though, code isn't inlined afaict.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

refactor(graphics): make sure Rust uses const YUV matrix values

bfcf7d0

Apparently it already did, I do not observe perf improvements.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

elmarco force-pushed the bench branch from e9825d7 to 8abd94c · November 5, 2024 08:19

elmarco added 2 commits November 5, 2024 13:45

refactor(graphics): use an ExactSizeIterator for iter_to_ycbcr

d5a81a1

Unfortunately, that doesn't seem to help unrolling & vectorizing: no perf improvements.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>

perf(graphics): help Rust to inline iter_to_ycbcr with format

a5e7a8e

rgb2yuv time: [11.706 µs 11.716 µs 11.727 µs]
change: [-24.083% -23.682% -23.394%] (p = 0.00 < 0.05)

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
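One common way to help the compiler inline a per-pixel conversion is to make the pixel format a type parameter, so each format gets its own monomorphized loop instead of going through a function pointer. The sketch below shows that general technique only; it is an assumption about the approach, not the code from this commit. The `PixelFormat` trait, the `Rgba`/`Bgra` types, and the BT.601-style 8-bit fixed-point luma weights are all illustrative and do not reflect the bias and precision RDP uses.

```rust
/// Hypothetical pixel-format abstraction; not IronRDP's actual types.
trait PixelFormat {
    /// Extracts (r, g, b) from one 4-byte pixel.
    fn rgb(px: [u8; 4]) -> (u8, u8, u8);
}

struct Rgba;
struct Bgra;

impl PixelFormat for Rgba {
    #[inline]
    fn rgb(px: [u8; 4]) -> (u8, u8, u8) {
        (px[0], px[1], px[2])
    }
}

impl PixelFormat for Bgra {
    #[inline]
    fn rgb(px: [u8; 4]) -> (u8, u8, u8) {
        (px[2], px[1], px[0])
    }
}

/// Luma only, with illustrative 8-bit fixed-point weights (77 + 150 + 29 = 256).
/// `F` is resolved at compile time, so `F::rgb` can be inlined into the loop.
fn luma<F: PixelFormat>(pixels: &[[u8; 4]]) -> Vec<u8> {
    pixels
        .iter()
        .map(|&px| {
            let (r, g, b) = F::rgb(px);
            let y = 77 * u32::from(r) + 150 * u32::from(g) + 29 * u32::from(b);
            (y >> 8) as u8
        })
        .collect()
}
```

A call site picks the format statically, for example `luma::<Bgra>(&pixels)`.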

elmarco force-pushed the bench branch from 8abd94c to a5e7a8e · November 5, 2024 09:45

CBenoit approved these changes Nov 5, 2024

CBenoit merged commit 2a4d357 into Devolutions:master Nov 5, 2024
9 checks passed
