Evaluate Profile-Guided Optimization (PGO) #2386

Open
zamazan4ik opened this issue Sep 25, 2023 · 10 comments

@zamazan4ik

Hi!

Here I am posting an idea for optimizing Youki with Profile-Guided Optimization (PGO). Recently I started evaluating PGO across multiple software domains - all my current results are available here: https://github.com/zamazan4ik/awesome-pgo . For Youki I did some quick benchmarks on my local Linux machine and want to share the actual performance numbers.

Test environment

  • Fedora 38
  • Linux kernel 6.4.15-200.fc38.x86_64
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Rust rustc 1.72.0 (5680fa18f 2023-08-23)
  • Youki version: the latest commit in the main branch (commit 646c1034f78454904cc3e1ccec2cd8dc270ab3fd)

Benchmark

As a benchmark, I use the workload suggested in the README file: sudo ./youki create -b tutorial a && sudo ./youki start a && sudo ./youki delete -f a

youki_release is built with just youki-release. The PGO-optimized build is done with cargo-pgo (cargo pgo build, then run the benchmark with the instrumented Youki, then cargo pgo optimize build). As a training workload, I use the benchmark itself.
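For reference, the whole cargo-pgo sequence looks roughly like this (a sketch: it assumes cargo-pgo and the llvm-tools-preview rustup component are installed and the tutorial bundle from the README is prepared; the binary path is illustrative for an x86_64 Linux host):

# build Youki with PGO instrumentation
cargo pgo build

# run the training workload with the instrumented binary;
# raw profiles are written to target/pgo-profiles/ by default
sudo ./target/x86_64-unknown-linux-gnu/release/youki create -b tutorial a
sudo ./target/x86_64-unknown-linux-gnu/release/youki start a
sudo ./target/x86_64-unknown-linux-gnu/release/youki delete -f a

# rebuild with the collected profiles applied
cargo pgo optimize build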

Results

The results are presented in hyperfine format. All benchmarks were run multiple times, in different orders, etc. - the results are reproducible.

sudo hyperfine --prepare 'sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' --warmup 100 --min-runs 500 'sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a' 'sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a'
Benchmark 1: sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a
  Time (mean ± σ):      78.6 ms ±   3.7 ms    [User: 11.2 ms, System: 43.9 ms]
  Range (min … max):    70.9 ms …  97.8 ms    500 runs

Benchmark 2: sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a
  Time (mean ± σ):      77.4 ms ±   3.6 ms    [User: 10.9 ms, System: 44.1 ms]
  Range (min … max):    70.6 ms …  90.0 ms    500 runs

Summary
  sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a ran
    1.02 ± 0.07 times faster than sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a

Just for reference, I also share the results for Instrumentation mode:

LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/youki/target/pgo-profiles/youki_%m_%p.profraw sudo hyperfine --prepare 'sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' --warmup 10 --min-runs 100 'sudo ./youki_instrumented create -b tutorial a && sudo ./youki_instrumented start a && sudo ./youki_instrumented delete -f a'
Benchmark 1: sudo ./youki_instrumented create -b tutorial a && sudo ./youki_instrumented start a && sudo ./youki_instrumented delete -f a
  Time (mean ± σ):     161.1 ms ±   3.3 ms    [User: 20.3 ms, System: 116.8 ms]
  Range (min … max):   154.8 ms … 170.7 ms    100 runs

According to the tests, PGO helps achieve somewhat better performance (1-2%). Not a great win, but it's not bad "just" for a compiler option. At scale, even 1% is a good thing to achieve.

Further steps

If you think it's worth it, we can perform more robust PGO benchmarks for Youki and then document the results in the project, so other people will be able to optimize Youki for their own workloads.

@utam0k
Member

utam0k commented Sep 25, 2023

@zamazan4ik Could you write the documentation on how to optimize with PGO in our official documentation?
https://containers.github.io/youki/youki.html

@zamazan4ik
Author

I think I can. But do you want to publish this documentation based only on the tests above? Do you think that, according to the results above, PGO is worth it?

@utam0k
Copy link
Member

utam0k commented Sep 25, 2023

@zamazan4ik
To be honest, I am too much of a beginner to know how much performance improvement I should expect. Anyway, I'd like to see the result of this.

If you think it's worth it, we can perform more robust PGO benchmarks for Youki.

@zamazan4ik
Author

To be honest, I am too much of a beginner to know how much performance improvement I should expect. Anyway, I'd like to see the result of this.

I already posted my current results in the starting post:

sudo hyperfine --prepare 'sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' --warmup 100 --min-runs 500 'sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a' 'sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a'
Benchmark 1: sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a
  Time (mean ± σ):      78.6 ms ±   3.7 ms    [User: 11.2 ms, System: 43.9 ms]
  Range (min … max):    70.9 ms …  97.8 ms    500 runs

Benchmark 2: sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a
  Time (mean ± σ):      77.4 ms ±   3.6 ms    [User: 10.9 ms, System: 44.1 ms]
  Range (min … max):    70.6 ms …  90.0 ms    500 runs

Summary
  sudo ./youki_optimized create -b tutorial a && sudo ./youki_optimized start a && sudo ./youki_optimized delete -f a ran
    1.02 ± 0.07 times faster than sudo ./youki_release create -b tutorial a && sudo ./youki_release start a && sudo ./youki_release delete -f a

youki_release is the default Release build; youki_optimized is the Release + PGO-optimized version.

@utam0k
Member

utam0k commented Sep 25, 2023

@zamazan4ik Thanks for your guide. 1 ms probably doesn't mean much in the real world. However, youki is a product that is open to many new challenges, depending on our interests. How many times can you run it and still arrive at the same result?

@zamazan4ik
Author

Thanks for your guide. 1 ms probably doesn't mean much in the real world.

It's up to you :) However, if Youki really cares about performance, it's not a bad thing to automatically gain 1-2% performance just from a compiler option.

How many times can you run it and still arrive at the same result?

As you can see, hyperfine already runs the test workload multiple times (500 runs + 100 warmup runs). As for the whole experiment - I repeated it multiple times (more than 3), in a different order between the binaries - the results are the same.

@utam0k
Member

utam0k commented Sep 25, 2023

As you can see, hyperfine already runs the test workload multiple times (500 runs + 100 warmup runs). As for the whole experiment - I repeated it multiple times (more than 3), in a different order between the binaries - the results are the same.

👍

It's up to you :) However, if Youki really cares about performance, it's not a bad thing to automatically gain 1-2% performance just from a compiler option.

I'm curious about multiple kernel versions. Is this compilation kernel-independent? It would be attractive if it didn't depend on the kernel.

@zamazan4ik
Author

I'm curious about multiple kernel versions. Is this compilation kernel-independent? It would be attractive if it didn't depend on the kernel.

Yes, it does not depend on the kernel. So it would be nice if you could reproduce the results on different setups.

@YJDoc2
Collaborator

YJDoc2 commented Sep 26, 2023

@zamazan4ik Thanks for this issue and the initial investigation! I don't have a detailed idea of how PGO is used apart from this blog post on using PGO in the Rust compiler, so correct me if I'm wrong:

If I'm correct, a "pgo-build" has extra instructions in it, similar to a code-coverage build, and it dumps some data to a file about actual function calls and usage, similar to coverage info. Then the next compilation uses this data to optimize.

If this is correct, then

  • I think it'd be more beneficial to do other kinds of runs than the benchmark, as they might represent the real use cases more accurately
  • Can we use the unit tests that we have for getting PGO data, i.e., apart from the create-start-rm command, also use data from a unit test run? Or does it need a complete, compiled binary run by external sources in order to generate data?
  • I think for getting the PGO data, we can run the create-start-rm commands a single time, instead of 100 times as in the benchmark? For the benchmark it helps to remove noise, but (if I'm correct) PGO will only analyze the code path taken, which should be the same in 99/100 runs of this (I'm accounting 1 run for spurious errors); is this right?
  • One issue I think we can have here is that once we do the namespace switch / chroot, all data for the subsequent work will not be output to the original data file, and will essentially be lost? I think we are experiencing a similar issue with systemd logging, where after the namespace/fs switch, the logs cannot be accessed on the host. One potential fix is that we manually copy over the "new" data file that (I think) will get created in the bundle dir, and combine the two somehow, and that might help in getting the data for the stuff after chroot.
  • I think the containerd or OCI integration tests we have would be much more useful for getting more "real-world" data, but I feel that might be much more difficult to get data from, as they run youki via their harness.

Please let me know if anything above is wrong; looking forward to your thoughts!

EDIT:

Also, if I understand correctly, if we do decide on using PGO, we will have two compilation steps in our release workflow, and in between, we'll have some steps to generate the PGO data? Can we use pre-generated data to do PGO instead of having to generate data right before the second compilation (ignoring that this might not be as accurate)?

@zamazan4ik
Author

I think it'd be more beneficial to do other kinds of runs than the benchmark, as they might represent the real use cases more accurately

Yes, you are right. I used only one test scenario because it was the only scenario I found in the Youki repo. If we can collect profiles from other real-life workloads - that would be awesome!

Can we use the unit tests that we have for getting PGO data, i.e., apart from the create-start-rm command, also use data from a unit test run? Or does it need a complete, compiled binary run by external sources in order to generate data?

Yes, technically it's possible to use the profiles from the unit tests for PGO. But I do not recommend it, since unit tests usually do not represent the real-life workload - unit tests tend to cover as many cases as possible (mostly cold paths of the program), so the optimizer will optimize not for real-life cases but for the unit tests.
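(For completeness: as far as I know, cargo-pgo can also gather profiles from a test run, so if someone still wants to try it, the sketch would be roughly:)

# collect profiles from the unit-test suite instead of a real workload
cargo pgo test
# rebuild using whatever profiles the tests produced
cargo pgo optimize build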

I think the containerd or OCI integration tests we have would be much more useful for getting more "real-world" data, but I feel that might be much more difficult to get data from, as they run youki via their harness.

Maybe, here I trust you as a domain expert :)

I think for getting the PGO data, we can run the create-start-rm commands a single time, instead of 100 times as in the benchmark? For the benchmark it helps to remove noise, but (if I'm correct) PGO will only analyze the code path taken, which should be the same in 99/100 runs of this (I'm accounting 1 run for spurious errors); is this right?

Yes, you can use a single run - it's completely fine. The only reason I collected 100 profiles instead of one is that I was too lazy to change the running command :) If you want to collect multiple profiles from multiple workloads - that's also fine. You will just need to merge them into one "prepared" profile with the llvm-profdata utility.
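For example, something like this (assuming the raw profiles were dumped into target/pgo-profiles/; llvm-profdata ships with the llvm-tools-preview rustup component):

# merge raw profiles from several workloads into one profile for the optimized build
llvm-profdata merge -o merged.profdata target/pgo-profiles/*.profraw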

Also, if I understand correctly, if we do decide on using PGO, we will have two compilation steps in our release workflow, and in between, we'll have some steps to generate the PGO data? Can we use pre-generated data to do PGO instead of having to generate data right before the second compilation (ignoring that this might not be as accurate)?

Yes, usually PGO (via Instrumentation) means having a 2-stage compilation process. There is another kind of PGO called Sampling PGO (or AutoFDO) - you can read about it at https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers - but let's talk only about Instrumentation PGO for now.

Yes, you can generate the PGO data directly during the build process every time (compile with Instrumentation, run on some workload, recompile once again with the just-collected profiles) or use pre-generated profiles and skip the Instrumentation stage.
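Without cargo-pgo, the same two-stage flow can be sketched with plain rustc flags (the /tmp/pgo-data path is just an example):

# stage 1: build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# run the training workload; .profraw files land in /tmp/pgo-data
sudo ./target/release/youki create -b tutorial a
sudo ./target/release/youki start a
sudo ./target/release/youki delete -f a

# merge the raw profiles into a single .profdata file
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# stage 2: rebuild using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release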

But if you use pre-generated profiles, you need to keep the following things in mind:

  • The profile format is compiler-dependent. So a compiler upgrade usually means that the internal PGO profile format can change, and you will need to regenerate the profiles once again. If you support multiple compiler versions - you need to maintain multiple profile versions.
  • There is a thing called "profile skew". It means that as you develop your program, the source code changes, and pre-generated PGO profiles become less and less efficient as time goes by. So if you want to keep the PGO profiles useful for optimization purposes - you still need to regenerate them from time to time. However, if Youki's source does not change much - there will be no need to update the profiles regularly.
