Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline some Cursor calls for slices #60656

Merged
merged 1 commit into from
May 9, 2019

Conversation

petertodd
Copy link
Contributor

@petertodd petertodd commented May 9, 2019

(Partially) brings back #33921

I've noticed in some serialization code I was writing that writes to slices produce much, much, worse code than you'd expect even with optimizations turned on. For example, you'd expect something like this to be zero cost:

use std::io::{self, Cursor, Write};

pub fn serialize((a, b): (u64, u64)) -> [u8;8+8] {
    let mut r = [0u8;16];
    {
        let mut w = Cursor::new(&mut r[..]);

        w.write(&a.to_le_bytes()).unwrap();
        w.write(&b.to_le_bytes()).unwrap();
    }
    r
}

...but it compiles down to dozens of instructions because the slice_write() calls aren't inlined, which in turn means unwrap() can't be optimized away, and so on.

To be clear, this pull-req isn't sufficient by itself: if we want to go down that path we also need to add #[inline]'s to the default implementations for functions like write_all() in the Write trait and so on, or implement them separately in the Cursor impls. But I figured I'd start a conversation about what tradeoffs we're expecting here.

@rust-highfive
Copy link
Collaborator

r? @sfackler

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 9, 2019
@sfackler
Copy link
Member

sfackler commented May 9, 2019

These inlines seem plausible to me - what does the assembly look like after this change?

@petertodd
Copy link
Contributor Author

For that specific example:

$ cargo asm foo::serialize
foo::serialize:
 sub     rsp, 16
 mov     rax, rdi
 mov     qword, ptr, [rsp], rsi
 mov     qword, ptr, [rsp, +, 8], rdx
 mov     rcx, qword, ptr, [rsp]
 mov     qword, ptr, [rdi], rcx
 mov     rcx, qword, ptr, [rsp, +, 8]
 mov     qword, ptr, [rdi, +, 8], rcx
 add     rsp, 16
 ret

vs before:

$ cargo asm foo::serialize
foo::serialize:
 push    r15
 push    r14
 push    rbx
 sub     rsp, 96
 mov     r15, rdx
 mov     rbx, rdi
 xorps   xmm0, xmm0
 movaps  xmmword, ptr, [rsp, +, 80], xmm0
 lea     rdx, [rsp, +, 80]
 mov     qword, ptr, [rsp, +, 56], rdx
 lea     r14, [rsp, +, 72]
 mov     eax, 16
 movq    xmm0, rax
 movdqu  xmmword, ptr, [rsp, +, 64], xmm0
 mov     qword, ptr, [rsp, +, 8], rsi
 lea     rdi, [rsp, +, 32]
 lea     r8, [rsp, +, 8]
 mov     ecx, 16
 mov     r9d, 8
 mov     rsi, r14
 call    qword, ptr, [rip, +, _ZN3std2io6cursor11slice_write17h3193265db7206f0bE@GOTPCREL]
 cmp     qword, ptr, [rsp, +, 32], 1
 je      .LBB5_3
 mov     qword, ptr, [rsp, +, 8], r15
 mov     rdx, qword, ptr, [rsp, +, 56]
 mov     rcx, qword, ptr, [rsp, +, 64]
 lea     rdi, [rsp, +, 32]
 lea     r8, [rsp, +, 8]
 mov     r9d, 8
 mov     rsi, r14
 call    qword, ptr, [rip, +, _ZN3std2io6cursor11slice_write17h3193265db7206f0bE@GOTPCREL]
 cmp     qword, ptr, [rsp, +, 32], 1
 je      .LBB5_3
 movaps  xmm0, xmmword, ptr, [rsp, +, 80]
 movups  xmmword, ptr, [rbx], xmm0
 mov     rax, rbx
 add     rsp, 96
 pop     rbx
 pop     r14
 pop     r15
 ret
.LBB5_3:
 movups  xmm0, xmmword, ptr, [rsp, +, 40]
 movaps  xmmword, ptr, [rsp, +, 16], xmm0
 lea     rdi, [rsp, +, 16]
 call    core::result::unwrap_failed
 ud2

Like I said, if we're going to merge this we should decide what category in general should be inlined and I should go through the full set for Read, Seek, etc.

@sfackler
Copy link
Member

sfackler commented May 9, 2019

Ok awesome, that looks good.

The general policy is to inline functions in cases where we see concrete improvements from doing so.

@bors r+

@bors
Copy link
Contributor

bors commented May 9, 2019

📌 Commit b9c4301 has been approved by sfackler

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 9, 2019
Centril added a commit to Centril/rust that referenced this pull request May 9, 2019
…lice, r=sfackler

Inline some Cursor calls for slices

(Partially) brings back rust-lang#33921

I've noticed in some serialization code I was writing that writes to slices produce much, much, worse code than you'd expect even with optimizations turned on. For example, you'd expect something like this to be zero cost:

```
use std::io::{self, Cursor, Write};

pub fn serialize((a, b): (u64, u64)) -> [u8;8+8] {
    let mut r = [0u8;16];
    {
        let mut w = Cursor::new(&mut r[..]);

        w.write(&a.to_le_bytes()).unwrap();
        w.write(&b.to_le_bytes()).unwrap();
    }
    r
}
```

...but it compiles down to [dozens of instructions](https://rust.godbolt.org/z/bdwDzb) because the `slice_write()` calls aren't inlined, which in turn means `unwrap()` can't be optimized away, and so on.

To be clear, this pull-req isn't sufficient by itself: if we want to go down that path we also need to add `#[inline]`'s to the default implementations for functions like `write_all()` in the `Write` trait and so on, or implement them separately in the `Cursor` impls. But I figured I'd start a conversation about what tradeoffs we're expecting here.
bors added a commit that referenced this pull request May 9, 2019
Rollup of 5 pull requests

Successful merges:

 - #60601 (Add a `cast` method to raw pointers.)
 - #60638 (pin: make the to-module link more visible)
 - #60647 (cleanup: Remove `DefIndexAddressSpace`)
 - #60656 (Inline some Cursor calls for slices)
 - #60657 (Stabilize and re-export core::array in std)

Failed merges:

r? @ghost
@bors bors merged commit b9c4301 into rust-lang:master May 9, 2019
@petertodd
Copy link
Contributor Author

Thanks!

I reviewed the rest of the slice cursor API and it looks like it all optimizes fine actually with this change due to how it's written as generic implementations. So unless I find something else we're done re: slices.

Vec has the same issue too. But there the optimization story is more complex as you'd have to be pre-allocating and what not for optimizations to really kick in, at which point using slices probably makes more sense. Not a problem for my code at least. :)

@petertodd petertodd deleted the 2019-inline-cursor-over-slice branch May 9, 2019 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants