Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up RemoteFX encoding #571

Merged
merged 11 commits into from
Nov 5, 2024
Merged

Speed up RemoteFX encoding #571

merged 11 commits into from
Nov 5, 2024

Conversation

elmarco
Copy link
Contributor

@elmarco elmarco commented Nov 4, 2024

  • introduce rfx encoding benchmark with criterion
  • speed up a bit DWT with const generics
  • use rayon to parallel encode each tile
  • speed up a bit rgb->yuv conversion

Unfortunately, it's not simple to use hand-written assembly from a project like yuvutils, because RDP uses odd bias and precision. Something left for the another day.

Copy link
Member

@CBenoit CBenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!! Do you have some numbers from criterion? How much the performance was improved?

suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

crates/ironrdp-server/src/encoder/mod.rs Outdated Show resolved Hide resolved
crates/ironrdp-graphics/Cargo.toml Outdated Show resolved Hide resolved
crates/ironrdp-graphics/src/color_conversion.rs Outdated Show resolved Hide resolved
crates/ironrdp-graphics/src/color_conversion.rs Outdated Show resolved Hide resolved
crates/ironrdp-server/src/encoder/rfx.rs Outdated Show resolved Hide resolved
crates/ironrdp-server/src/lib.rs Outdated Show resolved Hide resolved
@elmarco
Copy link
Contributor Author

elmarco commented Nov 4, 2024

fwiw, I opened awxkee/yuvutils-rs#3

@CBenoit
Copy link
Member

CBenoit commented Nov 4, 2024

fwiw, I opened awxkee/yuvutils-rs#3

Thank you for following up on this. I'll watch the issue.

@elmarco
Copy link
Contributor Author

elmarco commented Nov 4, 2024

Nice work!! Do you have some numbers from criterion? How much the performance was improved?

Well, criterion is huge time saver depending on your HW, but it still uses same amount of compute:
rfx_enc time: [6.4386 ms 6.4813 ms 6.5296 ms]
change: [-85.105% -85.013% -84.895%] (p = 0.00 < 0.05)
Performance has improved.

suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

what's the rationale to put tests and benchmark in various crates? We try to ensure a common framework for the various crates that way?

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
In theory, this could help the compiler to unroll loops.. doesn't seem
to be the case though, but it allows to drop the assert_eq!() at least.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
That seems to speed up a bit the code:
rfxenc                  time:   [46.040 µs 46.288 µs 46.698 µs]
                        change: [-9.2580% -8.6663% -7.8304%] (p = 0.00 < 0.05)
                        Performance has improved.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
That doesn't change the speed though, code isn't inlined afaict.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
This can help a lot wall-clock time, but depends on CPU.

rfx_enc                 time:   [9.7885 ms 10.123 ms 10.439 ms]
                        change: [-80.484% -79.847% -79.208%] (p = 0.00 < 0.05)
                        Performance has improved.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Apparently it already did, I do not observe perf improvements.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Unfortunately, that doesn't seem to help unrolling & vectorizing: no
perf improvements.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
rgb2yuv                 time:   [11.706 µs 11.716 µs 11.727 µs]
                        change: [-24.083% -23.682% -23.394%] (p = 0.00 < 0.05)

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
@CBenoit
Copy link
Member

CBenoit commented Nov 5, 2024

Nice work!! Do you have some numbers from criterion? How much the performance was improved?

Well, criterion is huge time saver depending on your HW, but it still uses same amount of compute: rfx_enc time: [6.4386 ms 6.4813 ms 6.5296 ms] change: [-85.105% -85.013% -84.895%] (p = 0.00 < 0.05) Performance has improved.

Good numbers! By "criterion" here, you mean "rayon" for parallelizing the encoding?

suggestion: Put the benchmarks in a separate crate ironrdp-benchmarks or benchmarks.

what's the rationale to put tests and benchmark in various crates? We try to ensure a common framework for the various crates that way?

For the tests, the rationale is explained at several places:

  • ARCHITECTURE.md: https://github.com/Devolutions/IronRDP/blob/master/ARCHITECTURE.md#testing
  • tests/main.rs:
    //! Integration Tests (IT)
    //!
    //! Integration tests are all contained in this single crate, and organized in modules.
    //! This is to prevent `rustc` to re-link the library crates with each of the integration
    //! tests (one for each *.rs file / test crate under the `tests/` folder).
    //! Performance implication: https://github.com/rust-lang/cargo/pull/5022#issuecomment-364691154
    //!
    //! This is also good for execution performance.
    //! Cargo will run all tests from a single binary in parallel, but
    //! binaries themselves are run sequentally.

EDIT: Also a recommend read with similar points made: https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html

I'll also add less objective opinions and personal tastes:

For the benchmarks, in addition to the above arguments:

  • It's easier for someone not super familiar with the codebase to discover about the benchmarks
  • It's possible to see the list of all the benchmarks in the Cargo.toml
  • As for the tooling, you have the autocompletion with cargo bench --bench ^<TAB> which also returns you the list of targets. (Assuming you have completers for that in your shell.)

We also have a single ironrdp-fuzz crate in the fuzz folder at the root of the workspace for similar reasons. It's also easier to have a single README.md documenting things at one easy to discover place.

That's about it for the rationale behind this suggestion.

Copy link
Member

@CBenoit CBenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Really nice work on the performance.

@CBenoit CBenoit merged commit 2a4d357 into Devolutions:master Nov 5, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants