Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized codesize benchmarks do not clearly show the power of individual optimizations #5945

Open
Manishearth opened this issue Dec 24, 2024 · 3 comments

Comments

@Manishearth
Copy link
Member

Came up in #5935

ICU4X has test-c-tiny and test-js-tiny to show how far codesize can be optimized.

These are incremental, applying optimization on top of optimization to slowly reduce codesize. This shows a nice progression, but it is not helpful when understanding what the effect of each optimization is in isolation.

I think this is an important function of such a benchmark: many of these techniques are not uniformly available and impose additional constraints upon the build: some require nightly, some required paired Rust/Clang versions, some force build-std, some require a particular C compiler, some reduce debuggability, and so on.

Furthermore, a lot of these benchmarks build on top of each other: using a release build will of course help LTO be more effective (percent-wise).

Providing numbers for every combination is going to be a lot of work and likely an overwhelming amount of data. However, I think what we could do is identify a list of optimizations that are potentially relevant but not necessarily always possible, and then provide numbers for:

  • plain release build
  • release build with each of these optimizations individually applied
    • for optimizations that build on each other; e.g. -Clinker-plugin-lto needs LTO, apply its dependencies too
  • release build with all but one of these optimizations applied
    • similar setup for dependent optimizations: remove both
  • release build with all optimizations applied

This would both give us an idea of the immediate wins of individual optimizations, and how they cumulatively work together.

The list of optimizations I can identify are:

  • LTO (off, on, thin, it seems like thin gives us the best perf?)
    • -Clinker-plugin-lto
  • --gc-sections
    • --strip-all
  • panic=abort
    • panic-abort std
      • panic-immediate-abort std
  • one-step vs two-step clang
  • use of lld (?)
  • inclusion of debug symbols in the first place (same as strip? unclear)

This list might be larger than necessary, so we could merge some entries if desired. I might also be missing something. I didn't include Rust debug vs release here because I don't think debug build codesize numbers really mean much, and I can't think of a usecase for caring about those numbers.

Thoughts? @sffc

@sffc
Copy link
Member

sffc commented Dec 26, 2024

Many of these are already documented on sites such as https://github.com/johnthagen/min-sized-rust?tab=readme-ov-file and I think we should reference those documents when possible. We can mention those, but it is probably most useful for us to focus on optimizations unique to libraries and especially cross-language libraries.

The Rust-specific optimizations in the above list include:

Optimization Description Special Requirements Tradeoff
opt-level = "s" Compiles code prioritizing code size (same as c flag -Os) None May impact runtime performance
codegen-units = 1 Compiles code all at once None None
strip = true Removes all symbols from the binary (same as ld flag --strip-all) None Harder to debug runtime problems
lto = true Runs LTO on the Rust code, including dead-code elimination None None
panic = "abort" Makes panics abort instead of unwind. Panics should still print debug information to stderr. None Unable to handle panics in code (note that ICU4X itself should not panic for recoverable errors)
build-std Rebuilds the Rust standard library with the above flags Rust Nightly None
panic-immediate-abort Compiles the standard library such that panics do not print debug information, which in turn allows most core::fmt::Debug impls to be eliminated Rust Nightly Harder to debug runtime problems

If the tradeoff is "may increase compile times", I did not include it above, because almost all of these will impact compile times. If "Special Requirements" says Rust Nightly, I also did not include it as a Tradeoff since it is already listed in the table.

There are two new things in that list since the last time I looked at it which I haven't evaluated:

I think panic-immediate-abort should achieve both of these benefits, certainly the second one. (Side note: I'm not very impressed that -Zfmt-debug=shallow only impacts derive(Debug) and not custom Debug impls; that flag discourages the use of custom Debug impls, which are sometimes necessary to either work around deficiencies in derive(Debug) (such as the derive unnecessarily adding T: Debug bounds even if T is not a field) or to provide a more useful debug string.)

Now for the options that are unique to ICU4X and generally cross-language Rust libraries:

Optimization Description Special Requirements Tradeoff
-flto Runs LTO on the C code, including dead-code elimination Clang None
-flto and Rust -Clinker-plugin-lto Runs LTO on the C and Rust code at the same time, including dead-code elimination. Can inline Rust functions into C functions. Clang, compatible LLVM versions Tightly couples Rust and C toolchains
ld flag --gc-sections Runs simple garbage collection (dead-code elimination) on the binary. Note: this flag is not needed if using -flto. None None
ld flag --strip-all Removes all symbols from the binary None Harder to debug runtime problems

The first 3 rows are basically an enumeration. You shouldn't need --gc-sections if you are using -flto, and -Clinker-plugin-lto only makes sense if you are using Clang and -flto to build your C code.

Now, the first time I had to comprehend all these flags, I was a bit like, ":angry: why doesn't -Os just do everything? I want to just set my flag and move on." I am now more mature and I understand why these are all separate options, but I still think it is valuable to have a recommended starter set. My preference would be if the starter set would be "everything", including the things that require Rust Nightly and Compatible LLVM, and we have a getting started guide that says which things to disable if you encounter build problems (such as incompatible LLVM or inability to use Rust Nightly).

Does this make sense? Did I miss anything?

@Manishearth
Copy link
Member Author

Yes, this makes sense, but I don't think it fully answers the question I had here which was "which axes should our ideal benchmark be testing?" Do you think we should be testing each of these?

And yeah, the "use everything and disable the things you don't need" path was why I suggested having an "release build with all but one of these optimizations applied" test for each. Though I think some intermediate numbers are also valuable since people will quite often be in the position of wanting to get the easy wins without dealing with toolchain complexities (nightlies, etc).

@Manishearth
Copy link
Member Author

For now I'm going to stick with the benchmarks in my PR (with the tweaks I was planning on doing), and we can leave this issue open to properly fix them as needed. The PR is a good start for the principled approach proposed here but is trying not to do everything just yet since we don't quite need that for the current set of discussions.

Manishearth added a commit that referenced this issue Dec 31, 2024
Progress on #5945

This does not fully enact the vision in #5945, but it does start testing
various combinations of tactics. I didn't want to test *all*
combinations and I didn't want to spend time figuring out which
combinations we really want, so I went ahead and tested the ones most
relevant to the current investigation (#5935).


This shows the effects of LTO, linker-plugin-lto, and
stripping+gc-sections on panic-abort (with panic-immediate-abort std)
release Rust builds.


Benchmarks with just panic=abort but no
panic-immediate-abort/panic-abort std would probably be useful to
clients but I haven't added them now. It may be worth using makefile
magic to truly build a matrix of targets.



<!--
Thank you for your pull request to ICU4X!

Reminder: try to use [Conventional
Comments](https://conventionalcomments.org/) to make comments clearer.

Please see
https://github.com/unicode-org/icu4x/blob/main/CONTRIBUTING.md for
general
information on contributing to ICU4X.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants