Is your feature request related to a problem? Please describe.
Not a problem - just an idea of how the compiler's performance can be improved.
Describe alternatives you've considered
Leave things as they are.
Describe the solution you'd like
I evaluated Profile-Guided Optimization (PGO) on many projects (including compilers) - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Based on those results, I decided to test PGO on the Bend compiler. Below are the numbers.
Test environment
Fedora 40
Linux kernel 6.8.10
AMD Ryzen 9 5900x
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.77.0-nightly
Bend version: the latest for now from the main branch on commit 0b6c3350a4e556df665ef146ac6d04e6b73bbae9
Turbo Boost disabled
Benchmark
I decided to perform PGO benchmarks on a simple scenario - bend check sorter.bend (sorter.bend was taken from here). For PGO optimization I use the cargo-pgo tool. The Release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, the PGO-optimized build with cargo pgo optimize build, the BOLT-instrumented build (on top of PGO optimization) with cargo pgo bolt build --with-pgo, and the BOLT-optimized build with cargo pgo bolt optimize --with-pgo. The training workload is the same for PGO and PLO - bend check sorter.bend.
taskset -c 0 is used to reduce the OS scheduler's influence on the results during the benchmarks. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).
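For reference, the full sequence of cargo-pgo commands described above looks roughly like this (a minimal sketch; the exact paths of the instrumented binaries depend on the target triple, so they are only indicated in comments):

```bash
# Plain release build (baseline)
cargo build --release

# PGO: build an instrumented binary, run the training workload on it,
# then rebuild using the collected profiles
cargo pgo build
# ... run `bend check sorter.bend` with the PGO-instrumented binary ...
cargo pgo optimize build

# BOLT on top of PGO: instrument, run the training workload again, then optimize
cargo pgo bolt build --with-pgo
# ... run `bend check sorter.bend` with the BOLT-instrumented binary ...
cargo pgo bolt optimize --with-pgo
```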
Results
I got the following results:
hyperfine --warmup 200 --min-runs 1000 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_optimized check sorter.bend' 'taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
Time (mean ± σ): 4.4 ms ± 0.1 ms [User: 3.0 ms, System: 1.3 ms]
Range (min … max): 4.3 ms … 4.7 ms 1000 runs
Benchmark 2: taskset -c 0 ./bend_optimized check sorter.bend
Time (mean ± σ): 3.8 ms ± 0.1 ms [User: 2.5 ms, System: 1.3 ms]
Range (min … max): 3.7 ms … 4.1 ms 1000 runs
Benchmark 3: taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend
Time (mean ± σ): 3.8 ms ± 0.1 ms [User: 2.4 ms, System: 1.2 ms]
Range (min … max): 3.7 ms … 4.1 ms 1000 runs
Summary
taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend ran
1.01 ± 0.02 times faster than taskset -c 0 ./bend_optimized check sorter.bend
1.18 ± 0.03 times faster than taskset -c 0 ./bend_release check sorter.bend
where:
bend_release - Release + LTO build
bend_optimized - Release + LTO + PGO optimized
bend_pgo_and_bolt_optimized - Release + LTO + PGO optimized + BOLT optimized
According to the results, PGO measurably improves the compiler's performance, at least in the simple benchmark above. However, PLO (with LLVM BOLT) doesn't bring measurable improvements in this scenario.
For reference, I also measured the slowdown from PGO and PLO instrumentation:
hyperfine --warmup 100 --min-runs 500 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_instrumented check sorter.bend' 'taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
Time (mean ± σ): 4.4 ms ± 0.1 ms [User: 3.1 ms, System: 1.2 ms]
Range (min … max): 4.3 ms … 5.3 ms 676 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: taskset -c 0 ./bend_instrumented check sorter.bend
Time (mean ± σ): 6.5 ms ± 0.1 ms [User: 4.2 ms, System: 2.1 ms]
Range (min … max): 6.3 ms … 6.8 ms 500 runs
Benchmark 3: taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend
Time (mean ± σ): 157.7 ms ± 5.1 ms [User: 30.4 ms, System: 108.6 ms]
Range (min … max): 147.7 ms … 259.6 ms 500 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./bend_release check sorter.bend ran
1.47 ± 0.03 times faster than taskset -c 0 ./bend_instrumented check sorter.bend
35.57 ± 1.30 times faster than taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend
where:
bend_release - Release + LTO build
bend_instrumented - Release + LTO + PGO instrumented
bend_optimized_bolt_instrumented - Release + LTO + PGO optimized + BOLT instrumented
Further steps
I can suggest the following action points:
Perform more PGO benchmarks with other datasets (if you are interested enough in it). If they show improvements, add a note to the documentation (the README file, I guess) about the possible compiler performance improvements with PGO.
You could also try to get some insights into how the code can be optimized further, based on the changes the compiler made with PGO. This can be done by comparing flamegraphs before and after applying PGO to understand the difference (see the sketch right after this list).
More intensive testing of Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO). However, I recommend starting with regular PGO, since it is a much more mature technology with far fewer limitations.
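Here is the sketch mentioned above for the flamegraph comparison - a minimal example assuming the flamegraph tool from the flamegraph crate (and perf) is available; the output file names are purely illustrative:

```bash
# Record a flamegraph for the plain release binary
flamegraph -o before-pgo.svg -- ./bend_release check sorter.bend

# Record one for the PGO-optimized binary and compare the two visually
flamegraph -o after-pgo.svg -- ./bend_optimized check sorter.bend
```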
I would be happy to answer your questions about PGO and PLO.
By the way, it could be an interesting idea to evaluate the applicability of PGO to Bend programs themselves, not just to the compiler and similar tools (formatters, etc.).
P.S. Please do not treat this issue as a bug report or anything like that - it's just an improvement idea. Since the "Discussions" functionality is disabled in this repo, I created an issue instead.
I had no idea that this was a builtin option in the Rust compiler, very cool.
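For context, the "builtin" support amounts to two rustc flags; a rough sketch of the raw workflow without cargo-pgo (the profile directory and binary path are only illustrative assumptions):

```bash
# Build an instrumented binary and run the training workload to collect profiles
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
./target/release/bend check sorter.bend   # binary path assumed

# Merge the raw profiles (llvm-profdata ships with the llvm-tools-preview rustup component)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Rebuild using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```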
Some thoughts about this:
Adding PGO to the Bend build looks very promising and will probably be done at some point.
Bend goes out of its way to have as few dependencies as possible, especially at its core. So I doubt link-time optimizations will improve things too much.
The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.
I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?
Bend goes out of its way to have as few dependencies as possible, especially at its core. So I doubt link-time optimizations will improve things too much.
Probably - at least it should improve the binary size. LTO is already enabled in the project, so I changed nothing for it in the build scripts.
The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.
Yep. However, even after manual optimizations are performed, PGO will most likely still be useful. All modern "mature" compilers like Clang, Rustc, GCC, and many others still see large performance benefits from enabling PGO.
I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?
It depends. If we are talking about GPU code - yes, PGO is not really a thing for it yet. If we are talking about the CPU part - PGO should work just as it does for regular non-CUDA code.
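For the CPU-side C code, a classic compiler-level PGO cycle could look roughly like this - a minimal sketch assuming Clang and purely illustrative file names (GCC has the analogous -fprofile-generate / -fprofile-use flags):

```bash
# 1. Build the C runtime with instrumentation (file and binary names are illustrative)
clang -O2 -fprofile-instr-generate runtime.c -o runtime_instrumented

# 2. Run a representative workload to collect a raw profile
LLVM_PROFILE_FILE=runtime.profraw ./runtime_instrumented

# 3. Merge the raw profile and rebuild with it
llvm-profdata merge -o runtime.profdata runtime.profraw
clang -O2 -fprofile-instr-use=runtime.profdata runtime.c -o runtime_pgo
```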