Evaluate more advanced optimizations like LTO, PGO, PLO #141

zamazan4ik · 2024-09-05T17:51:28Z

Hi!

I just read an article about Harper on Reddit - nice work! I have a few ideas that might be worth trying to improve Harper's performance and binary size.

First, I noticed that Link-Time Optimization (LTO) is not enabled. Have you tried enabling it for the project? It can noticeably reduce the binary size and lets the compiler perform more aggressive optimizations. If you think that enabling LTO in the default "Release" profile would hurt the developer experience too much (LTO increases build times), you can create a dedicated build profile like "advanced_release" or "dist" - many projects enable LTO exactly this way.
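A minimal sketch of such a dedicated profile, assuming it is added to the workspace root Cargo.toml. The name "dist" is just the example name from above, and the lto and codegen-units values are a common starting point rather than something measured on Harper:

```toml
# Cargo.toml (workspace root) - hypothetical "dist" profile for release artifacts.
[profile.dist]
inherits = "release"   # start from the normal Release settings
lto = "fat"            # whole-program LTO; "thin" is a cheaper-to-build alternative
codegen-units = 1      # optional: gives the optimizer a wider view, but slows the build
```

With this in place, cargo build --profile dist builds with LTO and puts the artifacts under target/dist/, while the regular cargo build --release stays as fast as before.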

Second, after LTO I highly recommend taking a look at PGO (Profile-Guided Optimization). PGO gives the compiler information about how the program actually runs. Based on this profile, the compiler can optimize more aggressively and produce code with better runtime performance. I collect as many materials about PGO as I can in my repo - https://github.com/zamazan4ik/awesome-pgo . There you can read more about actual PGO benchmarks in various software (parsers, compilers, databases, etc.). I also highly recommend reading the (still unfinished) article/book about PGO - it can answer many of your possible questions.

I also performed some quick PGO benchmarks for the project based on its built-in benchmarks.

Test environment

Fedora 40
Linux kernel 6.10.7
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.80.1
harper version: master branch on commit ccf14d1535c2f1450b42027afac2a8446f98e11d
Disabled Turbo boost

taskset -c 0 is used to reduce the OS scheduler's noise during the benchmarks (as far as I can guarantee that, of course). For PGO I use the cargo-pgo tool.
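For anyone who wants to reproduce this, the sequence of commands looks roughly like the sketch below. The benchmark commands are the ones used in this post. The setup lines, the run helper, the DRY_RUN variable and the function name are illustrative additions, and the ordering assumes Criterion's default behavior of reporting "change" against the previous run.

```shell
#!/usr/bin/env bash
# Sketch of the PGO benchmark workflow with cargo-pgo.
# By default the script only prints the commands (DRY_RUN=1);
# run it with DRY_RUN=0 to actually execute them.
set -eu

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_bench_workflow() {
    # One-time setup: cargo-pgo relies on the LLVM tools shipped via rustup.
    run rustup component add llvm-tools-preview
    run cargo install cargo-pgo

    # 1. Instrumented run: much slower, but it collects the profiles.
    run taskset -c 0 cargo pgo bench -- --workspace --all-features

    # 2. Plain Release baseline, pinned to one core. It runs right before
    #    the optimized build so that Criterion's "change" is measured
    #    against Release and not against the instrumented run.
    run taskset -c 0 cargo bench --workspace --all-features

    # 3. PGO-optimized run, built with the profiles from step 1.
    run taskset -c 0 cargo pgo optimize bench -- --workspace --all-features
}

pgo_bench_workflow
```

In dry-run mode the script prints five lines starting with "+ ", one per command, so the sequence can be reviewed before anything is installed or built.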

I got the following results.
Release (taskset -c 0 cargo bench --workspace --all-features):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-04215e47acae334a)
parse_demo              time:   [31.299 µs 31.409 µs 31.569 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

lint_demo               time:   [397.99 µs 398.06 µs 398.13 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

lint_demo_uncached      time:   [36.969 ms 36.974 ms 36.979 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

PGO optimized compared to Release (taskset -c 0 cargo pgo optimize bench -- --workspace --all-features):

     Running benches/parse_demo.rs (target/x86_64-unknown-linux-gnu/release/deps/parse_demo-e8f3360d7fa72eaf)
Benchmarking parse_demo
Benchmarking parse_demo: Warming up for 3.0000 s
Benchmarking parse_demo: Collecting 100 samples in estimated 5.0619 s (192k iterations)
Benchmarking parse_demo: Analyzing
parse_demo              time:   [26.031 µs 26.092 µs 26.187 µs]
                        change: [-17.014% -16.795% -16.606%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking lint_demo
Benchmarking lint_demo: Warming up for 3.0000 s
Benchmarking lint_demo: Collecting 100 samples in estimated 5.9111 s (15k iterations)
Benchmarking lint_demo: Analyzing
lint_demo               time:   [400.42 µs 400.76 µs 401.27 µs]
                        change: [+0.6026% +0.6648% +0.7513%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking lint_demo_uncached
Benchmarking lint_demo_uncached: Warming up for 3.0000 s
Benchmarking lint_demo_uncached: Collecting 100 samples in estimated 9.0357 s (200 iterations)
Benchmarking lint_demo_uncached: Analyzing
lint_demo_uncached      time:   [45.079 ms 45.095 ms 45.112 ms]
                        change: [+21.919% +21.966% +22.014%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

(just for reference) PGO instrumented compared to Release (taskset -c 0 cargo pgo bench -- --workspace --all-features):

     Running benches/parse_demo.rs (target/x86_64-unknown-linux-gnu/release/deps/parse_demo-e8f3360d7fa72eaf)
Benchmarking parse_demo
Benchmarking parse_demo: Warming up for 3.0000 s
Benchmarking parse_demo: Collecting 100 samples in estimated 5.2021 s (71k iterations)
Benchmarking parse_demo: Analyzing
parse_demo              time:   [73.345 µs 73.424 µs 73.564 µs]
                        change: [+133.51% +134.08% +134.49%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking lint_demo
Benchmarking lint_demo: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
Benchmarking lint_demo: Collecting 100 samples in estimated 6.1270 s (5050 iterations)
Benchmarking lint_demo: Analyzing
lint_demo               time:   [1.1247 ms 1.1250 ms 1.1253 ms]
                        change: [+182.46% +182.58% +182.70%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarking lint_demo_uncached
Benchmarking lint_demo_uncached: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, or reduce sample count to 90.
Benchmarking lint_demo_uncached: Collecting 100 samples in estimated 5.2008 s (100 iterations)
Benchmarking lint_demo_uncached: Analyzing
lint_demo_uncached      time:   [54.176 ms 54.230 ms 54.298 ms]
                        change: [+46.517% +46.671% +46.859%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

According to the results, PGO can improve the library's performance further: parse_demo became about 17% faster, and lint_demo stayed within the noise threshold. However, lint_demo_uncached regressed by about 22%. I think this is caused by skew in the training profile between the different workloads - more experiments can be performed in this area. Until then, maybe these PGO results will be helpful for other performance-oriented users.

After PGO, I suggest evaluating PLO (Post-Link Optimization) with LLVM BOLT as an additional optimization step. However, I recommend enabling it only after PGO, since for now PGO usually works better than PLO in practice.

Regarding priorities: I highly suggest enabling LTO now. PGO and PLO, IMHO, can wait (I guess spending this time on actual features would be a better option, since enabling PGO and PLO, plus the possible CI pipeline tweaks, can take a lot of effort).

Thank you!
