Optimized codesize benchmarks do not clearly show the power of individual optimizations #5945

Manishearth · 2024-12-24T01:48:43Z

Came up in #5935

ICU4X has test-c-tiny and test-js-tiny to show how far codesize can be optimized.

These are incremental, applying optimization on top of optimization to slowly reduce codesize. This shows a nice progression, but it is not helpful when understanding what the effect of each optimization is in isolation.

I think this is an important function of such a benchmark: many of these techniques are not uniformly available and impose additional constraints upon the build: some require nightly, some required paired Rust/Clang versions, some force build-std, some require a particular C compiler, some reduce debuggability, and so on.

Furthermore, a lot of these benchmarks build on top of each other: using a release build will of course help LTO be more effective (percent-wise).

Providing numbers for every combination is going to be a lot of work and likely an overwhelming amount of data. However, I think what we could do is identify a list of optimizations that are potentially relevant but not necessarily always possible, and then provide numbers for:

plain release build
release build with each of these optimizations individually applied
- for optimizations that build on each other; e.g. -Clinker-plugin-lto needs LTO, apply its dependencies too
release build with all but one of these optimizations applied
- similar setup for dependent optimizations: remove both
release build with all optimizations applied

This would both give us an idea of the immediate wins of individual optimizations, and how they cumulatively work together.

The list of optimizations I can identify are:

LTO (off, on, thin, it seems like thin gives us the best perf?)
- -Clinker-plugin-lto
--gc-sections
- --strip-all
panic=abort
- panic-abort std
  - panic-immediate-abort std
one-step vs two-step clang
use of lld (?)
inclusion of debug symbols in the first place (same as strip? unclear)

This list might be larger than necessary, so we could merge some entries if desired. I might also be missing something. I didn't include Rust debug vs release here because I don't think debug build codesize numbers really mean much, and I can't think of a usecase for caring about those numbers.

Thoughts? @sffc

The text was updated successfully, but these errors were encountered:

sffc · 2024-12-26T22:11:06Z

Many of these are already documented on sites such as https://github.com/johnthagen/min-sized-rust?tab=readme-ov-file and I think we should reference those documents when possible. We can mention those, but it is probably most useful for us to focus on optimizations unique to libraries and especially cross-language libraries.

The Rust-specific optimizations in the above list include:

Optimization	Description	Special Requirements	Tradeoff
opt-level = "s"	Compiles code prioritizing code size (same as c flag `-Os`)	None	May impact runtime performance
codegen-units = 1	Compiles code all at once	None	None
strip = true	Removes all symbols from the binary (same as ld flag `--strip-all`)	None	Harder to debug runtime problems
lto = true	Runs LTO on the Rust code, including dead-code elimination	None	None
panic = "abort"	Makes panics abort instead of unwind. Panics should still print debug information to stderr.	None	Unable to handle panics in code (note that ICU4X itself should not panic for recoverable errors)
build-std	Rebuilds the Rust standard library with the above flags	Rust Nightly	None
panic-immediate-abort	Compiles the standard library such that panics do not print debug information, which in turn allows most `core::fmt::Debug` impls to be eliminated	Rust Nightly	Harder to debug runtime problems

If the tradeoff is "may increase compile times", I did not include it above, because almost all of these will impact compile times. If "Special Requirements" says Rust Nightly, I also did not include it as a Tradeoff since it is already listed in the table.

There are two new things in that list since the last time I looked at it which I haven't evaluated:

I think panic-immediate-abort should achieve both of these benefits, certainly the second one. (Side note: I'm not very impressed that -Zfmt-debug=shallow only impacts derive(Debug) and not custom Debug impls; that flag discourages the use of custom Debug impls, which are sometimes necessary to either work around deficiencies in derive(Debug) (such as the derive unnecessarily adding T: Debug bounds even if T is not a field) or to provide a more useful debug string.)

Now for the options that are unique to ICU4X and generally cross-language Rust libraries:

Optimization	Description	Special Requirements	Tradeoff
`-flto`	Runs LTO on the C code, including dead-code elimination	Clang	None
`-flto` and Rust `-Clinker-plugin-lto`	Runs LTO on the C and Rust code at the same time, including dead-code elimination. Can inline Rust functions into C functions.	Clang, compatible LLVM versions	Tightly couples Rust and C toolchains
ld flag `--gc-sections`	Runs simple garbage collection (dead-code elimination) on the binary. Note: this flag is not needed if using `-flto`.	None	None
ld flag `--strip-all`	Removes all symbols from the binary	None	Harder to debug runtime problems

The first 3 rows are basically an enumeration. You shouldn't need --gc-sections if you are using -flto, and -Clinker-plugin-lto only makes sense if you are using Clang and -flto to build your C code.

Now, the first time I had to comprehend all these flags, I was a bit like, ":angry: why doesn't -Os just do everything? I want to just set my flag and move on." I am now more mature and I understand why these are all separate options, but I still think it is valuable to have a recommended starter set. My preference would be if the starter set would be "everything", including the things that require Rust Nightly and Compatible LLVM, and we have a getting started guide that says which things to disable if you encounter build problems (such as incompatible LLVM or inability to use Rust Nightly).

Does this make sense? Did I miss anything?

Manishearth · 2024-12-27T19:40:07Z

Yes, this makes sense, but I don't think it fully answers the question I had here which was "which axes should our ideal benchmark be testing?" Do you think we should be testing each of these?

And yeah, the "use everything and disable the things you don't need" path was why I suggested having an "release build with all but one of these optimizations applied" test for each. Though I think some intermediate numbers are also valuable since people will quite often be in the position of wanting to get the easy wins without dealing with toolchain complexities (nightlies, etc).

Manishearth · 2024-12-27T19:42:09Z

For now I'm going to stick with the benchmarks in my PR (with the tweaks I was planning on doing), and we can leave this issue open to properly fix them as needed. The PR is a good start for the principled approach proposed here but is trying not to do everything just yet since we don't quite need that for the current set of discussions.

Progress on #5945 This does not fully enact the vision in #5945, but it does start testing various combinations of tactics. I didn't want to test *all* combinations and I didn't want to spend time figuring out which combinations we really want, so I went ahead and tested the ones most relevant to the current investigation (#5935). This shows the effects of LTO, linker-plugin-lto, and stripping+gc-sections on panic-abort (with panic-immediate-abort std) release Rust builds. Benchmarks with just panic=abort but no panic-immediate-abort/panic-abort std would probably be useful to clients but I haven't added them now. It may be worth using makefile magic to truly build a matrix of targets.

Manishearth mentioned this issue Dec 25, 2024

Improved c-tiny benchmark #5948

Merged

sffc mentioned this issue Dec 26, 2024

Bump MSRV to 1.80 #5934

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized codesize benchmarks do not clearly show the power of individual optimizations #5945

Optimized codesize benchmarks do not clearly show the power of individual optimizations #5945

Manishearth commented Dec 24, 2024

sffc commented Dec 26, 2024

Manishearth commented Dec 27, 2024

Manishearth commented Dec 27, 2024