Can we get smaller binary code? #488

Open
psychon opened this issue Jun 16, 2020 · 8 comments
Labels: help wanted, P4 Priority Low

Comments

psychon (Owner) commented Jun 16, 2020

From linebender/druid#1025 (comment):

According to cargo bloat, Event::parse is more than 9kb of code (this is only with randr). Then there's also a bunch of things like KeyPressEvent::try_parse at around the 3kb mark. I bet that rustc is doing a lot of inlining, and that the error-handling code could be de-duplicated.

I'm not sure I understand this correctly (does it mean "there are multiple things (e.g. KeyPressEvent::try_parse) that are 3kb in size in total" or "KeyPressEvent::try_parse alone is 3kb in size"?), but I cannot really reproduce it.

I copied together some self-contained code for `KeyPressEvent::try_parse`:
use std::convert::TryInto;

pub enum ParseError { ParseError }

pub type Window = u32;

pub type Pixmap = u32;

pub type Cursor = u32;

pub type Font = u32;

pub type Gcontext = u32;

pub type Colormap = u32;

pub type Atom = u32;

pub type Drawable = u32;

pub type Fontable = u32;

pub type Bool32 = u32;

pub type Visualid = u32;

pub type Timestamp = u32;

pub type Keysym = u32;

pub type Keycode = u8;

pub type Keycode32 = u32;

pub type Button = u8;


/// A type implementing this trait can be parsed from some raw bytes.
pub trait TryParse: Sized {
    /// Try to parse the given values into an instance of this type.
    ///
    /// If parsing is successful, an instance of the type and a slice for the remaining data should
    /// be returned. Otherwise, an error is returned.
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError>;
}


macro_rules! implement_try_parse {
    ($t:ty) => {
        impl TryParse for $t {
            fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
                let len = std::mem::size_of::<$t>();
                let bytes = value
                    .get(..len)
                    .ok_or(ParseError::ParseError)?
                    .try_into() // TryInto<[u8; len]>
                    .unwrap();
                Ok((<$t>::from_ne_bytes(bytes), &value[len..]))
            }
        }
    };
}

impl TryParse for bool {
    fn try_parse(value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let (data, remaining) = u8::try_parse(value)?;
        Ok((data != 0, remaining))
    }
}

implement_try_parse!(u8);
implement_try_parse!(i8);
implement_try_parse!(u16);
implement_try_parse!(i16);
implement_try_parse!(u32);
implement_try_parse!(i32);
implement_try_parse!(u64);
implement_try_parse!(i64);

pub struct KeyPressEvent {
    pub response_type: u8,
    pub detail: Keycode,
    pub sequence: u16,
    pub time: Timestamp,
    pub root: Window,
    pub event: Window,
    pub child: Window,
    pub root_x: i16,
    pub root_y: i16,
    pub event_x: i16,
    pub event_y: i16,
    pub state: u16,
    pub same_screen: bool,
}
impl TryParse for KeyPressEvent {
    fn try_parse(initial_value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let remaining = initial_value;
        let (response_type, remaining) = u8::try_parse(remaining)?;
        let (detail, remaining) = Keycode::try_parse(remaining)?;
        let (sequence, remaining) = u16::try_parse(remaining)?;
        let (time, remaining) = Timestamp::try_parse(remaining)?;
        let (root, remaining) = Window::try_parse(remaining)?;
        let (event, remaining) = Window::try_parse(remaining)?;
        let (child, remaining) = Window::try_parse(remaining)?;
        let (root_x, remaining) = i16::try_parse(remaining)?;
        let (root_y, remaining) = i16::try_parse(remaining)?;
        let (event_x, remaining) = i16::try_parse(remaining)?;
        let (event_y, remaining) = i16::try_parse(remaining)?;
        let (state, remaining) = u16::try_parse(remaining)?;
        let (same_screen, remaining) = bool::try_parse(remaining)?;
        let remaining = remaining.get(1..).ok_or(ParseError::ParseError)?;
        let result = KeyPressEvent { response_type, detail, sequence, time, root, event, child, root_x, root_y, event_x, event_y, state, same_screen };
        let _ = remaining;
        let remaining = initial_value.get(32..)
            .ok_or(ParseError::ParseError)?;
        Ok((result, remaining))
    }
}
The resulting compiler output with `-Copt-level=3 --edition=2018` is 41 KiB of text (according to https://rust.godbolt.org).
I cannot easily see the binary size, but here is the assembly for `KeyPressEvent::try_parse`:
<example::KeyPressEvent as example::TryParse>::try_parse:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 1
        mov     rax, rdi
        test    rdx, rdx
        je      .LBB1_16
        cmp     rdx, 1
        je      .LBB1_16
        mov     rcx, rdx
        and     rcx, -2
        cmp     rcx, 2
        je      .LBB1_16
        mov     rdi, rdx
        and     rdi, -4
        cmp     rdi, 4
        je      .LBB1_16
        cmp     rdi, 8
        je      .LBB1_16
        cmp     rdi, 12
        je      .LBB1_16
        cmp     rdi, 16
        je      .LBB1_16
        cmp     rcx, 20
        je      .LBB1_16
        cmp     rcx, 22
        je      .LBB1_16
        cmp     rcx, 24
        je      .LBB1_16
        cmp     rcx, 26
        je      .LBB1_16
        cmp     rcx, 28
        je      .LBB1_16
        cmp     rdx, 30
        je      .LBB1_16
        cmp     byte ptr [rsi + 30], 0
        setne   cl
        cmp     rdx, 31
        je      .LBB1_16
        cmp     rdx, 32
        jae     .LBB1_15
.LBB1_16:
        mov     byte ptr [rax + 30], 2
.LBB1_17:
        add     rsp, 1
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
.LBB1_15:
        mov     r9b, byte ptr [rsi]
        mov     bl, byte ptr [rsi + 1]
        mov     byte ptr [rsp], bl
        movzx   r10d, word ptr [rsi + 2]
        mov     edi, dword ptr [rsi + 4]
        mov     r11d, dword ptr [rsi + 8]
        mov     ebx, dword ptr [rsi + 12]
        mov     ebp, dword ptr [rsi + 16]
        movzx   r14d, word ptr [rsi + 20]
        movzx   r15d, word ptr [rsi + 22]
        movzx   r12d, word ptr [rsi + 24]
        movzx   r13d, word ptr [rsi + 26]
        movzx   r8d, word ptr [rsi + 28]
        add     rsi, 32
        add     rdx, -32
        mov     dword ptr [rax], edi
        mov     dword ptr [rax + 4], r11d
        mov     dword ptr [rax + 8], ebx
        mov     dword ptr [rax + 12], ebp
        mov     word ptr [rax + 16], r10w
        mov     word ptr [rax + 18], r14w
        mov     word ptr [rax + 20], r15w
        mov     word ptr [rax + 22], r12w
        mov     word ptr [rax + 24], r13w
        mov     word ptr [rax + 26], r8w
        mov     byte ptr [rax + 28], r9b
        mov     bl, byte ptr [rsp]
        mov     byte ptr [rax + 29], bl
        mov     byte ptr [rax + 30], cl
        mov     qword ptr [rax + 32], rsi
        mov     qword ptr [rax + 40], rdx
        jmp     .LBB1_17
That's just 90 lines of assembly and it does not call any other code. This can't be 3 KiB of binary code.

Without optimisation, the output is a lot uglier, but I do not think that looking at that output makes sense.

One thing I notice: LLVM managed to merge all the error handling, but it does not notice that it can simplify `if length < 4 then goto error; if length < 8 then goto error; ...`. Adding `if initial_value.len() < 32 { return Err(ParseError::ParseError); }` as a new first line to `KeyPressEvent::try_parse` helps here. The assembly now only has 56 lines. There are some simplifications that I do not immediately understand, but all of the "cmp with a small number, then jump" sequences were merged into a single `cmp rdx, 31`. I guess generating something like this "everywhere" in the code generator shouldn't be too hard and should help a lot.
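
For reference, this is the modified `KeyPressEvent::try_parse` from the self-contained example above; only the `if` at the top is new, everything else is unchanged:

```rust
impl TryParse for KeyPressEvent {
    fn try_parse(initial_value: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        // New: a single up-front length check. The idea is that LLVM can then
        // prove that all the later per-field length checks succeed and merge
        // them away (the "cmp with a small number, then jump" chain collapses).
        if initial_value.len() < 32 {
            return Err(ParseError::ParseError);
        }
        let remaining = initial_value;
        let (response_type, remaining) = u8::try_parse(remaining)?;
        let (detail, remaining) = Keycode::try_parse(remaining)?;
        let (sequence, remaining) = u16::try_parse(remaining)?;
        let (time, remaining) = Timestamp::try_parse(remaining)?;
        let (root, remaining) = Window::try_parse(remaining)?;
        let (event, remaining) = Window::try_parse(remaining)?;
        let (child, remaining) = Window::try_parse(remaining)?;
        let (root_x, remaining) = i16::try_parse(remaining)?;
        let (root_y, remaining) = i16::try_parse(remaining)?;
        let (event_x, remaining) = i16::try_parse(remaining)?;
        let (event_y, remaining) = i16::try_parse(remaining)?;
        let (state, remaining) = u16::try_parse(remaining)?;
        let (same_screen, remaining) = bool::try_parse(remaining)?;
        let remaining = remaining.get(1..).ok_or(ParseError::ParseError)?;
        let result = KeyPressEvent { response_type, detail, sequence, time, root, event, child, root_x, root_y, event_x, event_y, state, same_screen };
        let _ = remaining;
        let remaining = initial_value.get(32..).ok_or(ParseError::ParseError)?;
        Ok((result, remaining))
    }
}
```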

For the timeline: optimisation just for the sake of optimisation is hard; it makes more sense to use the size of some actual program as the measurement. Thus, I suggest not merging anything for this before the release and instead proceeding carefully.

A goal for optimisation might be to take one of the examples in this repo and check its binary size. For example, `cargo build --release --example xclock_utc` results in a 7.3 MiB binary, which `strip` reduces to 503 KiB.
After the following patch, this turns into 7.3 MiB and 499 KiB. That's already 4 KiB less, just by adding more code to the generated code. :-)

diff --git a/generator/src/generator/namespace.rs b/generator/src/generator/namespace.rs
index c54f996..826744b 100644
--- a/generator/src/generator/namespace.rs
+++ b/generator/src/generator/namespace.rs
@@ -2560,6 +2560,9 @@ impl<'ns, 'c> NamespaceGenerator<'ns, 'c> {
                     if parse_size_constraint != StructSizeConstraint::None {
                         outln!(out, "let remaining = initial_value;");
                     }
+                    if let StructSizeConstraint::Fixed(size) = parse_size_constraint {
+                        outln!(out, "if remaining.len() < {} {{ return Err(ParseError::ParseError); }}", size);
+                    }
                     Self::emit_let_value_for_dynamic_align(fields, out);
                     for field in fields.iter() {
                         self.emit_field_parse(

CC @jneem I'd appreciate your input here. (And I have never worked with `cargo bloat` before.)
One quick question: did you use `cargo build --release`? Or did I perhaps misunderstand you?

psychon added the help wanted and P4 Priority Low labels on Jun 16, 2020
psychon (Owner, Author) commented Jun 16, 2020

Random ramblings

Event::parse is more than 9kb of code (this is only with randr)

I took a look at the output of `objdump -S target/release/examples/xclock_utc` (this is without any extensions enabled). Looks like a lot was inlined into this function.

For master, I get 1431 lines of assembly code. This is surely less than 9 KiB of binary code, but I do not have randr enabled.

With the patch from my last comment... uhm... nothing changed (well, the number of lines did not change). Weird.

With `cargo build --release --example xclock_utc --features all-extensions`, this function turns into a 6537-line monster. Hm...

When scrolling through this, a lot of the code looks like bad versions of `memcpy`, i.e. stuff like this:

   2f87e:	8b 8c 24 97 00 00 00 	mov    0x97(%rsp),%ecx
   2f885:	89 8c 24 67 02 00 00 	mov    %ecx,0x267(%rsp)
   2f88c:	48 8b 8c 24 90 00 00 	mov    0x90(%rsp),%rcx
   2f893:	00 
   2f894:	48 89 8c 24 60 02 00 	mov    %rcx,0x260(%rsp)
   2f89b:	00 
   2f89c:	8b 8c 24 67 02 00 00 	mov    0x267(%rsp),%ecx
   2f8a3:	89 8c 24 87 01 00 00 	mov    %ecx,0x187(%rsp)
   2f8aa:	48 8b 8c 24 60 02 00 	mov    0x260(%rsp),%rcx
   2f8b1:	00 
   2f8b2:	48 89 8c 24 80 01 00 	mov    %rcx,0x180(%rsp)
   2f8b9:	00 
   2f8ba:	48 8b 8c 24 80 01 00 	mov    0x180(%rsp),%rcx
   2f8c1:	00 
   2f8c2:	48 89 8c 24 d3 01 00 	mov    %rcx,0x1d3(%rsp)
   2f8c9:	00 
   2f8ca:	8b 8c 24 87 01 00 00 	mov    0x187(%rsp),%ecx
   2f8d1:	89 8c 24 da 01 00 00 	mov    %ecx,0x1da(%rsp)
   2f8d8:	48 8b 8c 24 d0 01 00 	mov    0x1d0(%rsp),%rcx
   2f8df:	00 
   2f8e0:	49 89 4e 01          	mov    %rcx,0x1(%r14)
   2f8e4:	48 8b 8c 24 d6 01 00 	mov    0x1d6(%rsp),%rcx
   2f8eb:	00 
   2f8ec:	49 89 4e 07          	mov    %rcx,0x7(%r14)
   2f8f0:	41 88 46 0f          	mov    %al,0xf(%r14)

Could this come from lines like `xproto::BUTTON_PRESS_EVENT => return Ok(Self::ButtonPress(event.try_into()?))`, where the compiler copies the result from `event.try_into()?` into the newly constructed enum?

Adding the following to `Cargo.toml`

[profile.release]
opt-level = 'z' 

turns that 6537-line monster into a 3442-line monster. And there are now 44 calls to `memcpy` (the old code had 26). No idea where 44 comes from. And the assembly still contains unrolled `memcpy`s... :-/

Could it be that this big code size comes from "moving large things into an enum with many variants"?

Edit: Seems like this kind of pattern does indeed generate way too much code: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=ee3bd9d4445b7657ee94dfec319a88ee
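
The playground snippet is not reproduced here, but the problematic shape is roughly the following (a minimal sketch with made-up structs, not the actual x11rb code): a large struct is parsed into a temporary and then moved into an enum whose variants are all similarly large.

```rust
use std::convert::TryInto;

// Hypothetical stand-ins for the generated event structs; the point is that
// every variant is "large" (32 bytes of payload here), so constructing the
// enum moves the whole struct again after it has already been built.
pub struct KeyPress { pub data: [u8; 32] }
pub struct ButtonPress { pub data: [u8; 32] }

pub enum Event {
    KeyPress(KeyPress),
    ButtonPress(ButtonPress),
    // ... dozens more variants of similar size ...
}

impl Event {
    pub fn parse(kind: u8, bytes: &[u8]) -> Option<Event> {
        // Each arm parses into a temporary and then moves it into the enum;
        // with many such arms, the copies add up to a lot of machine code.
        match kind {
            2 => Some(Event::KeyPress(KeyPress { data: bytes.get(..32)?.try_into().ok()? })),
            4 => Some(Event::ButtonPress(ButtonPress { data: bytes.get(..32)?.try_into().ok()? })),
            _ => None,
        }
    }
}
```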

Edit: Yup. https://stackoverflow.com/questions/39219961/how-to-get-assembly-output-from-building-with-cargo taught me to use `cargo rustc --release -- --emit asm`. That is a lot more readable than the objdump output. Plus, I can actually see the jump table that is used. So this code does indeed have this structure:

preparation
look up entry in jump table
jump
case 1: well, this calls `Error::parse` and is special.
case 2:
  shuffle registers around
  call `KeyPressEvent::try_from`
  lots of code to copy around memory
  check if it failed
  more memory is copied around
  done
  [This is 28 lines of assembly for doing almost nothing]
case 3:
  Basically the same as case 2, but duplicated. The same for all the following cases
[More code follows, e.g. error handling, handling generic events, handling the "Unknown" case]

Edit: Hm... This code does not look like it went through compiler optimisations. Nothing is inlined in `KeyPressEvent::try_parse`. Does `--release` not work for `cargo rustc`?
Edit: `cargo rustc --release -- -Copt-level=s --emit asm` does better.
Edit: `as -o /tmp/t.o the-file-generated-by-the-previous-command.s` can generate an object file. That way, one can look at the binary size (and run `size` on the file). Running `strip /tmp/t.o` still makes some progress: the file size goes from 2.4 MiB to 1.3 MiB.

jneem commented Jun 16, 2020

Thanks for looking into this! It turns out that I wasn't using --release (I had assumed that cargo bloat defaults to it, because that would make sense...). My test cases have been the examples in druid: clone this and then do `cargo bloat --release --crates --example flex --features=x11`.

After adding --release, x11rb is taking up 5.5% of .text (before that, it was 12.7%). There's something puzzling, though, because the total binary size went up 100kb after adding x11rb, but cargo bloat only shows x11rb taking up 50kb. I will look into it more.

You can see the size of individual functions with `cargo bloat --release --filter x11rb --example flex --features=x11`. After adding --release, the biggest is Event::parse at 6.8K. The try_parses have gone down by a lot (Setup::try_parse is the biggest, at 1.5K).

psychon (Owner, Author) commented Jun 16, 2020

There's something puzzling, though, because total binary sizes went up 100kb after adding x11rb,

Random data point to make me feel better: libxcb.so is 163 KiB here and libxcb-randr.so is 67 KiB. Thus, switching to x11rb is already a net reduction for you. ;-)
(No, I am not serious; libxcb is already installed "everywhere" and thus does not really count towards the size of your binary.)

jneem commented Jun 18, 2020

Actually, it's worse than that: I need XCBConnection for cairo interop, so I'm linking libxcb.so anyway 🤷

psychon added a commit that referenced this issue Jun 19, 2020
Most TryParse implementations that we have contain a series of calls to
other try_parse() implementations. Most of these are for primitive
types, e.g. u32. Thus, some rough sketch of the code is this:

  let (a, remaining) = u32::try_parse(remaining)?;
  let (b, remaining) = u32::try_parse(remaining)?;
  let (c, remaining) = u32::try_parse(remaining)?;
  let (d, remaining) = u32::try_parse(remaining)?;

This is relatively readable code. We do not have to mess around with byte
offsets or anything like that. However, every single call to
u32::try_parse() checks whether another 32 bits of data are available.
This leads to lots of repetition in the assembly code, because these
length checks are not merged.

This commit changes the code generator to compute the minimum size that
some object can have on the wire. This minimum size is the sum of all
fixed-size data. Basically, this means that lists are assumed to be
empty.

Then, the generated code checks whether at least the minimum number of bytes
is available. If not, an early return is done.

This early length check allows the compiler (in release mode) to
optimise away most later length checks, thus resulting in less assembly
code.

The effect of this commit was measured in two ways.

1. "cargo rustc --release -- --Copt-level=s --emit asm" allows to get an
   assembly file for the library. Before this commit, this assembly file
   has 159662 lines. Afterwards, it has 157796 lines. This is a
   reduction of about 1.2 %.
2. "cargo build --release --example xclock_utc" and calling "strip" on
   the resulting binary previously produced a file with 503 KiB. After
   this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is
   saved.

Related-to: #488
Signed-off-by: Uli Schlachter <psychon@znc.in>
psychon (Owner, Author) commented Jun 19, 2020

After adding release, the biggest is Event::parse with 6.8K. The try_parses have gone down by a lot (Setup::try_parse is the biggest, at 1.5K)

I am not sure how useful it really is for us that `cargo bloat` points at individual functions. It seems to me this always points at whatever function LLVM chose not to inline into its caller, but whose callees were inlined.

But okay, `Setup::try_parse` it is. That function has 373 lines of assembly for me (using `objdump -S` again). Most of this code is moving bytes around from somewhere in memory to somewhere on the stack. Perhaps we should just `Box` things up to avoid so much copying around on the stack? I do not really have any good ideas about this.
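
For illustration, "boxing things up" could look roughly like this (a hypothetical sketch, not the current x11rb API); the enum then only holds a pointer per variant, so moving an `Event` around becomes cheap:

```rust
use std::convert::TryInto;

// Hypothetical sketch: box the large per-event structs so that the Event enum
// itself only stores a pointer per variant.
pub struct KeyPressEvent { pub raw: [u8; 32] }
pub struct ConfigureNotifyEvent { pub raw: [u8; 32] }

pub enum Event {
    KeyPress(Box<KeyPressEvent>),
    ConfigureNotify(Box<ConfigureNotifyEvent>),
    // ... many more variants ...
}

fn parse_key_press(bytes: &[u8]) -> Option<Event> {
    let raw: [u8; 32] = bytes.get(..32)?.try_into().ok()?;
    // Moving the resulting Event around is now just a pointer move; the
    // trade-off is one heap allocation per parsed event.
    Some(Event::KeyPress(Box::new(KeyPressEvent { raw })))
}
```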

My attempts at smaller code size so far are documented in #491 and #492. Keeping each idea in its own PR perhaps leads to less of a mess than my comments above. The result is not much of a reduction.

psychon (Owner, Author) commented Jul 4, 2020

Another random idea: all the event parsing code is "likely" only called by `Event::parse` (only "likely", because it is public, but I fail to see why anyone would want to call specific bits directly). Some well-placed `#[inline]` annotations could end up inlining all the event parsing into `Event::parse`. The resulting "monster function" could then hopefully be optimised well enough that the pointless copies disappear and the events are parsed directly into their final position.
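
A minimal sketch of what such an annotation would look like on a generated parser (illustrative only; whether rustc actually inlines it is up to the optimiser):

```rust
pub enum ParseError { ParseError }

pub struct KeyPressEvent { pub response_type: u8, pub detail: u8 }

impl KeyPressEvent {
    // Hypothetical: hint that this generated parser should be inlined into
    // Event::parse, so the event can be constructed directly in its final
    // location inside the enum instead of being copied there afterwards.
    #[inline]
    pub fn try_parse(bytes: &[u8]) -> Result<(Self, &[u8]), ParseError> {
        let (&response_type, rest) = bytes.split_first().ok_or(ParseError::ParseError)?;
        let (&detail, rest) = rest.split_first().ok_or(ParseError::ParseError)?;
        Ok((KeyPressEvent { response_type, detail }, rest))
    }
}
```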

Edit: Google suggests that LTO is a better idea than `#[inline]`.

Edit: LTO seems to help. With x11rb = "0.6" and

[profile.release]
opt-level = 'z'  # Optimize for size.
lto = true

a simple `println!("Hello world");` results in a 175 KiB binary with `cargo build --release` and `strip` run on the binary. When parsing an event as follows, the binary is 203 KiB, so only 28 KiB extra (which is still a lot for "doing nothing"). Without LTO, the numbers are 195 KiB / 247 KiB, so 52 KiB extra for x11rb parsing an event.

use x11rb::connection::RequestConnection as _;
use x11rb::rust_connection::RustConnection;

fn out_of_thin_air<T>() -> T {
    unsafe {
        std::ptr::read_volatile(0x1337 as _)
    }
}

fn main() {
    if false {
        println!("Hello World");
    } else {
        let conn: RustConnection = out_of_thin_air();
        let event_bytes: &[u8] = out_of_thin_air();
        println!("{}", conn.parse_event(event_bytes).is_ok());
    }
}

psychon added a commit that referenced this issue Jul 11, 2020
It is now unused. This should help with #488, I hope: Less code means
smaller binary size.

Signed-off-by: Uli Schlachter <psychon@znc.in>
psychon (Owner, Author) commented Oct 11, 2020

Some numbers to see how the new release is doing. I built the `xclock_utc` example (let's call that build "normal"). I also did a build with the following in `Cargo.toml` (let's call that "optimised"):

[profile.release]
opt-level = 'z'
lto = true
codegen-units = 1
panic = 'abort'

Results are: I need a different benchmark (well, 4k less in one case)

$ ls -Ss | cat
total 5576
1060 xclock_utc_0.7_normal
1060 xclock_utc_0.6_normal
 840 xclock_utc_0.7_optimised
 836 xclock_utc_0.6_optimised
 512 xclock_utc.stripped_0.6_normal
 512 xclock_utc.stripped_0.7_normal
 380 xclock_utc.stripped_0.7_optimised
 376 xclock_utc.stripped_0.6_optimised

psychon (Owner, Author) commented May 20, 2023

Random data point from neXromancers/shotgun#40 (thanks @9ary): this is some code that doesn't do any event parsing, so most of what we already looked at in this issue doesn't apply. Only two functions from x11rb show up as being large:

 File  .text     Size          Crate Name
 0.1%   1.2%   6.0KiB          x11rb x11rb::rust_connection::RustConnection::connect
 0.1%   1.1%   5.8KiB x11rb_protocol x11rb_protocol::protocol::request_name

`request_name()` is used in parsing X11 errors (`x11rb_protocol::x11_utils::X11Error::try_parse()`) so that the `Debug` impl prints the name of the request and not just some random numbers. I'm actually surprised that this turns into so much binary code with just the randr feature enabled.
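
For intuition, such a lookup presumably boils down to one big match with one string literal per request, roughly along these lines (a rough sketch of the general shape, not the actual `x11rb_protocol` code):

```rust
// One match arm and one string literal per request adds up quickly once
// extension requests are included.
fn request_name(extension: Option<&str>, opcode: u8) -> Option<&'static str> {
    match (extension, opcode) {
        (None, 1) => Some("CreateWindow"),
        (None, 2) => Some("ChangeWindowAttributes"),
        (Some("RANDR"), 8) => Some("GetScreenResources"),
        // ... hundreds of further arms in the generated code ...
        _ => None,
    }
}
```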

And `connect()` is likely the result of a lot of inlining. This would then be all the code to parse the `$DISPLAY` environment variable, parse `~/.Xauthority`, send the connection setup, and receive & parse the `Setup` from the X11 server. I'm not saying that this size is fine, but at least I can understand why it has the size that it has.

Edit: Poor man's cargo bloat:

objdump -S target/release/examples/simple_window | c++filt | awk '/^0/ { print NR - last_start, last; last=$0 ; last_start = NR }' | sort -n

For current master this ends with

2217 0000000000152d50 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2317 00000000000faba0 <x11rb_protocol::protocol::get_request_name::hd6cae530618148e9>:
6627 00000000000fdc10 <x11rb_protocol::protocol::request_name::hdaabd4a928272e17>:

With #838 this ends with

2217 000000000014aff0 <miniz_oxide::inflate::core::decompress::h563e858fbd038acc>:
2592 00000000000f9af0 <x11rb_protocol::protocol::get_request_name_internal::h3d92f71a778b3242>:

So... that PR gets rid of the largest function and the second-largest function doesn't get much larger.

(Yes, number of lines of output is a bad proxy for binary size - I am counting the number of instructions here and not the size of the function.)
