
Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #741

Closed
zamazan4ik opened this issue Sep 14, 2023 · 7 comments

@zamazan4ik

Hi!

Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. There you can find applications from many different domains that were accelerated with PGO: virtual machines (like QEMU and CrosVM), compilers, gRPC workloads, benchmark tools, databases, and much more. That's why I think it's worth trying to apply PGO to Broot as well. I ran some benchmarks and want to share my results.

Test environment

  • Hardware: MacBook M1 Pro
  • OS: macOS 13.4 Ventura
  • Rust: 1.72
  • Broot version: the latest main branch (1b5c1838b3a533cab390def547ef5cfb892c47f3 commit)

Benchmark

As both the evaluation and training set, I used the project's benchmarks (https://github.com/Canop/broot/tree/main/benches) via cargo bench. PGO was trained on the same benchmarks with cargo pgo bench (see the link to this awesome tool below). All measurements were done under the same background noise, as far as I can guarantee on this OS.
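For reference, the whole flow with cargo-pgo was roughly the following (a sketch; exact paths and behavior may differ depending on your cargo-pgo version):

```bash
# One-time setup: cargo-pgo drives the PGO build and needs the LLVM tools.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# Build and run the benchmarks with PGO instrumentation enabled;
# raw profiles are collected while the benchmarks run.
cargo pgo bench

# Rebuild with the collected profiles applied and rerun the benchmarks,
# so the results can be compared against the Release baseline.
cargo pgo optimize bench
```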

Results

The results are presented in the cargo bench format. Since I do not know a good way to copy these fancy tables as text, I attached screenshots instead (sorry for that).

Release run:
[screenshot]

Instrumented compared to Release (this shows how much slower the benchmarks are with instrumentation enabled):
[screenshot]

Then I ran cargo bench once more with the Release build to reset the benchmark state to the Release baseline.

Release + PGO optimized compared to Release:
[screenshot]

As you can see, PGO helps achieve better performance, at least in the benchmarks provided by the project.

Possible future steps

I can suggest the following things to do:

  • Evaluate PGO's applicability to the Broot binary itself (instead of the benchmarks).
  • If PGO helps achieve better performance, add a note about that to Broot's documentation (the README file?). That way, users and maintainers will be aware of another optimization opportunity for Broot.
  • Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
  • Optimize prebuilt Broot binaries with PGO.

I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.

For Rust projects, I recommend starting with cargo-pgo.
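For example, evaluating PGO on the Broot binary itself (the first step above) could look roughly like the sketch below. The workload is a hypothetical placeholder; real training runs should cover representative interactive sessions:

```bash
# Build an instrumented broot binary (cargo-pgo builds in release mode;
# depending on the cargo-pgo version, the output path may include the
# host target triple).
cargo pgo build

# Exercise the instrumented binary; every run appends raw profile data.
# Hypothetical workload -- replace with typical real-world usage.
./target/release/broot --cmd ":pt;:q" ~/some/big/tree

# Rebuild broot optimized with the gathered profiles.
cargo pgo optimize build
```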

@Canop
Owner

Canop commented Sep 14, 2023

Running PGO to optimize some specific tasks by up to 12% doesn't seem worth the potential degradation of non-optimized ones, which is inherent to PGO.

@zamazan4ik
Author

> Running PGO to optimize some specific tasks by up to 12% doesn't seem worth the potential degradation of non-optimized ones, which is inherent to PGO.

Right now there is no proof that PGO will degrade scenarios that matter to users. You can check how PGO is integrated into other projects like Clang, Rustc, Python, and others (more integrations are collected here - https://github.com/zamazan4ik/awesome-pgo#pgo-showcases). If you have good coverage of all scenarios, you can collect multiple profiles, merge them, and then PGO will optimize for all of them. Even such a generic merged profile can help optimize the program in general (e.g. Rustc does exactly the same thing in its PGO pipeline).
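To illustrate profile merging: with plain rustc flags (which cargo-pgo wraps under the hood), collecting profiles from several scenarios and merging them could look like this sketch. The scenarios and paths are made up, and llvm-profdata comes from an LLVM installation or rustup's llvm-tools-preview component:

```bash
# Build an instrumented binary; raw profiles will be written to /tmp/pgo-data.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Run several distinct scenarios (made-up examples) to cover different usage.
./target/release/broot ~/projects           # scenario 1: browsing a source tree
./target/release/broot --sizes ~/Downloads  # scenario 2: a disk-usage session

# Merge all raw profiles from all scenarios into a single profile.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Rebuild using the merged profile; PGO now optimizes for the combined workload.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```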

If you think the PGO profiles from cargo bench are not representative enough - that's fair. That's why I suggested testing PGO directly on Broot's binary instead of on the benchmarks.

If we are not able to collect generic-enough profiles - okay. We can still perform PGO benchmarks, document the results in the documentation, and integrate a PGO build mode into the build scripts. Then users and maintainers can decide on their own whether they want to optimize Broot with PGO or not.

@Stargateur
Contributor

Stargateur commented Sep 14, 2023

I don't see why broot should use an experimental technology, hard to maintain for one dev, for a small hypothetical speedup, when broot is more likely to be bottlenecked by the OS/hardware anyway. I wonder if the benchmark tests even have any IO in them.

@Canop
Owner

Canop commented Sep 14, 2023

> I don't see why broot should use an experimental technology

Same feeling. I've never seen impressive results in my tests of PGO and it never seemed worth the pain. So I'm not going to invest here unless I see new results.

@zamazan4ik
Author

zamazan4ik commented Sep 14, 2023

> I don't see why broot should use an experimental technology

It depends on your definition of "experimental" :) If "experimental" means "new to Broot" - I agree. But PGO itself is not a novel technique at all. E.g. PGO was implemented in GCC somewhere around version 4.5 (I am too young to remember such releases in practice), and Clang has supported PGO for a long time as well. I cannot quickly find when PGO was implemented in Rustc, but Rust's implementation fully relies on the LLVM one. From the usage perspective, PGO has been used as an optimization technique in major projects for years (good examples are all Chromium-based browsers, Clang/GCC/Rust themselves, and CPython). From the companies' perspective, Google and Facebook are major users of PGO. E.g. Google uses PGO in sampling mode (aka AutoFDO, but that's just an implementation detail); about Google's experience you can read here. So I do not agree that PGO is an experimental technology across the industry, but I agree that PGO adoption overall is lower compared to the "-O3" and "LTO" optimization options.

Update: I forgot to mention LLVM BOLT. This technology I do agree to consider "experimental", even though Facebook/Meta has huge experience deploying it on their servers. According to my tests, there are a lot of caveats with BOLT in practice, like bugs, enormous memory consumption, etc.
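For completeness, a typical BOLT pipeline on Linux looks roughly like the sketch below. Exact flags vary between BOLT versions, the workload path is a placeholder, and the binary generally has to be linked with relocations preserved (e.g. -Wl,--emit-relocs):

```bash
# Record a sampling profile of a representative workload with Linux perf
# (LBR sampling via -j gives the best profile quality where supported).
perf record -e cycles:u -j any,u -- ./broot /some/workload/dir

# Convert the perf data into BOLT's profile format.
perf2bolt -p perf.data -o broot.fdata ./broot

# Produce a BOLT-optimized binary (a commonly cited flag set;
# adjust for your BOLT version).
llvm-bolt ./broot -o ./broot.bolt -data=broot.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort \
  -split-functions -split-all-cold
```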

> hard to maintain for one dev

Of course, I cannot estimate from your side how hard the maintenance of this thing would be. You can see how PGO is integrated into other projects here. You have multiple options for integrating PGO into a project, each with a different maintenance cost:

  • Test and document PGO's effects on Broot's performance. This usually needs to be done once and never (or very rarely) touched again.
  • Integrate building the project with PGO as an opt-in feature. This kind of integration does not require regular maintenance either.
  • Add PGO profile generation and PGO optimization to the CI. This approach is usually a bit harder. How hard? Well, from my experience, the sample workload does not change frequently, so you don't need to touch these scripts regularly. E.g. pydantic-core uses this approach: add build-pgo make target pydantic/pydantic-core#741 (see the sketch after this list).
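As a sketch of the opt-in option above (a hypothetical script, not something Broot ships, assuming cargo-pgo is installed and the bundled benchmarks serve as the training workload):

```bash
#!/usr/bin/env bash
# build-pgo.sh -- hypothetical opt-in PGO build for broot.
set -euo pipefail

cargo pgo bench           # gather profiles by running the benchmarks instrumented
cargo pgo optimize build  # rebuild broot optimized with those profiles
```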

> for a small hypothetical speedup, when broot is more likely to be bottlenecked by the OS/hardware anyway

That's why I showed you PGO improvement results on the Broot benchmarks :) If you think these improvements are not important - okay, but in that case I do not understand why you have such benchmarks :D If you have CPU-bound benchmarks for something, that means they are important to you. Also, here you can see PGO improvements on other projects; some of them seem IO-bound at first but still get interesting improvements from PGO, like the hurl results.

However, I agree that testing PGO directly on the Broot binary itself would be more interesting to see. I haven't done it yet. This issue is just an idea for how to (possibly) improve performance - maybe someone will find it worth trying.

@zamazan4ik
Author

> I've never seen impressive results in my tests of PGO

In general, or in Broot? If we are talking in general, I have all the PGO results for real-life applications here. For every showcase you can follow the link and read about PGO's effects on the software's performance. Sometimes the effect is large (usually around 20% in compiler-like workloads), sometimes much smaller (like DragonflyDB).

If we are talking about Broot: yes, right now we only see improvements in the project benchmarks, not in Broot's performance directly - that needs to be tested as well.

@Canop
Owner

Canop commented Sep 23, 2023

I decided not to pursue this ATM. This might be revised later.

@Canop Canop closed this as completed Sep 23, 2023