Can we get smaller binary code? #488
Comments
Random ramblings
I took a look at the output of For master, I get 1431 lines of assembly code. This is surely less than 9 KiB of binary code, but I do not have randr enabled. With the patch from my last comment... uhm... nothing changed (well, the number of lines did not change). Weird. With When scrolling through this, a lot of the code looks like bad versions of
Could this come from lines like Adding the following to
turns that 6537-lines-monster into a 3442-lines-monster. And there are now 44 calls to Could it be that this big code size comes from "moving large things into an enum with many variants"? Edit: Seems like this kind of pattern does indeed generate way too much code: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ee3bd9d4445b7657ee94dfec319a88ee Edit: Yup. https://stackoverflow.com/questions/39219961/how-to-get-assembly-output-from-building-with-cargo taught me to use
Edit: Hm... This code does not look like it went through compiler optimisations. Nothing is inlined in |
Thanks for looking into this! It turns out that I wasn't using --release (I had assumed that After adding You can see the size of individual functions with |
Random data point to make me feel better: |
Actually, it's worse than that: I need |
Most TryParse implementations that we have contain a series of calls to other try_parse() implementations. Most of these are for primitive types, e.g. u32. Thus, a rough sketch of the code is this:

    let (a, remaining) = u32::try_parse(remaining)?;
    let (b, remaining) = u32::try_parse(remaining)?;
    let (c, remaining) = u32::try_parse(remaining)?;
    let (d, remaining) = u32::try_parse(remaining)?;

This is relatively readable code. We do not have to mess around with byte offsets or anything like that. However, every single call to u32::try_parse() checks whether another 32 bits of data are available. This leads to lots of repetition in the assembly code, because these length checks are not merged.

This commit changes the code generator to compute the minimum size that an object can have on the wire. This minimum size is the sum of all fixed-size data. Basically, this means that lists are assumed to be empty. Then, the generated code checks whether at least the minimum number of bytes is available. If not, it returns early. This early length check allows the compiler (in release mode) to optimise away most later length checks, thus resulting in less assembly code.

The effect of this commit was measured in two ways.

1. "cargo rustc --release -- -C opt-level=s --emit asm" produces an assembly file for the library. Before this commit, this assembly file had 159662 lines. Afterwards, it has 157796 lines. This is a reduction of about 1.2 %.

2. "cargo build --release --example xclock_utc" and calling "strip" on the resulting binary previously produced a file of 503 KiB. After this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is saved.

Related-to: #488
Signed-off-by: Uli Schlachter <psychon@znc.in>
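The idea from the commit message above can be sketched in a few lines. This is only an illustration under assumptions, not the code that the generator actually emits: `Four`, `parse_u32`, `try_parse_four` and `MIN_SIZE` are hypothetical names, and little-endian decoding is chosen here just to make the example deterministic.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    InsufficientData,
}

// Hypothetical stand-in for a primitive parser: it checks the length itself,
// just like every individual u32 parse does in the commit message.
fn parse_u32(bytes: &[u8]) -> Result<(u32, &[u8]), ParseError> {
    if bytes.len() < 4 {
        return Err(ParseError::InsufficientData);
    }
    let (head, rest) = bytes.split_at(4);
    let value = u32::from_le_bytes([head[0], head[1], head[2], head[3]]);
    Ok((value, rest))
}

// Hypothetical wire object consisting of four fixed-size fields.
#[derive(Debug, PartialEq)]
struct Four {
    a: u32,
    b: u32,
    c: u32,
    d: u32,
}

// Minimum size on the wire: the sum of all fixed-size fields (4 * 4 bytes).
const MIN_SIZE: usize = 16;

fn try_parse_four(remaining: &[u8]) -> Result<(Four, &[u8]), ParseError> {
    // Single up-front check. With optimisations enabled, the compiler may use
    // this to drop the per-field length checks inside parse_u32.
    if remaining.len() < MIN_SIZE {
        return Err(ParseError::InsufficientData);
    }
    let (a, remaining) = parse_u32(remaining)?;
    let (b, remaining) = parse_u32(remaining)?;
    let (c, remaining) = parse_u32(remaining)?;
    let (d, remaining) = parse_u32(remaining)?;
    Ok((Four { a, b, c, d }, remaining))
}

fn main() {
    let bytes = [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 0xff];
    println!("{:?}", try_parse_four(&bytes));
    println!("{:?}", try_parse_four(&bytes[..15]));
}
```

The behaviour is the same with or without the early check. Only the amount of generated code is expected to differ, and whether it does depends on the optimisation level.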
I am not sure how useful But okay, My attempts at smaller code size so far are documented in #491 and #492. Keeping each idea in its own PR perhaps leads to less of a mess than my comments above. The result is not much of a reduction. |
Another random idea: All the event parsing code "likely" is only called by Edit: Google suggests that LTO is a better idea than Edit: LTO seems to help. With
a simple

    use x11rb::connection::RequestConnection as _;
    use x11rb::rust_connection::RustConnection;

    // Reads a value from an arbitrary address, so the compiler cannot know
    // anything about it. This program is meant to be built and inspected, not run.
    fn out_of_thin_air<T>() -> T {
        unsafe {
            std::ptr::read_volatile(0x1337 as _)
        }
    }

    fn main() {
        if false {
            println!("Hello World");
        } else {
            let conn: RustConnection = out_of_thin_air();
            let event_bytes: &[u8] = out_of_thin_air();
            println!("{}", conn.parse_event(event_bytes).is_ok());
        }
    }

|
It is now unused. This should help with #488, I hope: Less code means smaller binary size. Signed-off-by: Uli Schlachter <psychon@znc.in>
Some numbers to see how the new release is doing. I built the

    [profile.release]
    opt-level = 'z'
    lto = true
    codegen-units = 1
    panic = 'abort'

Results are: I need a different benchmark (well, 4k less in one case)
|
Random data point from neXromancers/shotgun#40 (thanks @9ary ): This is some code that doesn't do any event parsing, so most of what we already looked at in this issue doesn't apply. Only two functions from x11rb show up as being large:
And Edit: Poor man's
For current master this ends with
With #838 this ends with
So... that PR gets rid of the largest function and the second-largest function doesn't get much larger. (Yes, number of lines of output is a bad proxy for binary size - I am counting the number of instructions here and not the size of the function.) |
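The kind of counting described above can be approximated with a small program. The following is a rough sketch and not the command that was actually used (that command is not shown here). `count_instructions` is a hypothetical helper, and its rules for telling function labels, local labels, directives and instructions apart are assumptions that fit typical `--emit asm` output.

```rust
use std::collections::HashMap;

// Counts instruction lines per function label in an assembly listing and
// returns the functions sorted by descending count (ties sorted by name).
// Assumed heuristics: "name:" starts a function, ".name:" is a local label,
// lines starting with '.' are directives, lines starting with '#' are comments.
fn count_instructions(asm: &str) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut current: Option<String> = None;

    for line in asm.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with('#') {
            continue;
        }
        if let Some(label) = trimmed.strip_suffix(':') {
            // Local labels such as .LBB0_1 stay part of the current function.
            if !label.starts_with('.') {
                current = Some(label.to_string());
                counts.entry(label.to_string()).or_insert(0);
            }
            continue;
        }
        if trimmed.starts_with('.') {
            continue; // assembler directive, not an instruction
        }
        if let Some(name) = &current {
            *counts.get_mut(name).unwrap() += 1;
        }
    }

    let mut result: Vec<(String, usize)> = counts.into_iter().collect();
    result.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    result
}

fn main() {
    let asm = "
foo:
    push rax
    cmp rdx, 31
.LBB0_1:
    ret
    .size foo, .-foo
bar:
    ret
";
    for (name, count) in count_instructions(asm) {
        println!("{:6} {}", count, name);
    }
}
```

As noted above, an instruction count is only a proxy: instructions differ in encoded length, so this ranks functions roughly rather than measuring their size in bytes.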
linebender/druid#1025 (comment)
I'm not sure I understand this correctly (is it "there are multiple things (e.g. `KeyPressEvent::try_parse`) that are in total 3kb in size" or is it "`KeyPressEvent::try_parse` is 3kb in size"?), but I cannot really reproduce.

I copied together some self-contained code for `KeyPressEvent::try_parse`

I cannot easily see the binary size, but here is the assembly for `KeyPressEvent::try_parse`:

Without optimisation, the output is a lot uglier, but I do not think that looking at this output makes sense.

One thing I notice: llvm managed to merge all the error handling, but it does not notice that it can simplify `if length < 4 then goto error; if length < 8 then goto error;` etc. Adding `if initial_value.len() < 32 { return Err(ParseError::ParseError); }` as a new first line to `KeyPressEvent::try_parse` helps here. The assembly now only has 56 lines. There are some simplifications that I do not immediately understand, but all of this "`cmp` with small number, then jump" was merged into a single `cmp rdx, 31`. I guess generating something like this "everywhere" in the code generator shouldn't be too hard and should help a lot.

For the timeline: Optimisation just for the sake of optimisation is hard. It makes more sense to take "size of some program" as a measurement. Thus, I suggest not merging anything on this before the release and instead proceeding carefully.
A goal for optimisation might be to take one of the examples in this repo and check its binary size. For example, `cargo build --release --example xclock_utc` results in a 7.3 MiB binary which `strip` turns into 503 KiB.

After the following patch, this turns into 7.3 MiB and 499 KiB. That's already 4 KiB less, just by adding more code to the generated code. :-)
CC @jneem I'd be happy about your input here. (And I have never worked with `cargo bloat` before.)

One quick question: Did you use `cargo build --release`? Or did I perhaps misunderstand you?