Generic code produces lots of no-ops compared to the monomorphic version. #8334

sebcrozet · 2013-08-06T12:37:46Z

Generic code should compile to the same machine code as its monomorphic counterpart. However, the generic version contains a lot of additional nops. For example:

#[inline(never)]
fn doit_not_generic(a: f32) -> f32 {
    let mut a = a;
    do 1000000000.times {
        a = a * a;
    }

    a
}

#[inline(never)]
fn doit<N: Mul<N, N>>(a: N) -> N {
    let mut a = a;
    do 1000000000.times {
        a = a * a;
    }

    a
}

When called with an f32, the asm produced for doit has a lot of nops before the multiplication:

00000000004025e0 <_ZN9doit_437216_38a5b7ada5228707_0$x2e0E>:
  4025e0:   64 48 3b 24 25 70 00    cmp    %fs:0x70,%rsp
  4025e7:   00 00
  4025e9:   77 1a                   ja     402605 <_ZN9doit_437216_38a5b7ada5228707_0$x2e0E+0x25>
  4025eb:   49 ba 08 00 00 00 00    movabs $0x8,%r10
  4025f2:   00 00 00
  4025f5:   49 bb 00 00 00 00 00    movabs $0x0,%r11
  4025fc:   00 00 00
  4025ff:   e8 28 00 00 00          callq  40262c <__morestack>
  402604:   c3                      retq
  402605:   55                      push   %rbp
  402606:   48 89 e5                mov    %rsp,%rbp
  402609:   f3 0f 10 05 23 01 00    movss  0x123(%rip),%xmm0        # 402734 <_IO_stdin_used+0x14>
  402610:   00
  402611:   48 c7 c0 00 36 65 c4    mov    $0xffffffffc4653600,%rax
  402618:   90                      nop
  402619:   90                      nop
  40261a:   90                      nop
  40261b:   90                      nop
  40261c:   90                      nop
  40261d:   90                      nop
  40261e:   90                      nop
  40261f:   90                      nop
  402620:   f3 0f 59 c0             mulss  %xmm0,%xmm0
  402624:   48 ff c0                inc    %rax
  402627:   75 f7                   jne    402620 <_ZN9doit_437216_38a5b7ada5228707_0$x2e0E+0x40>
  402629:   5d                      pop    %rbp
  40262a:   c3                      retq
  40262b:   90                      nop

The asm produced for doit_not_generic is nop-free before the multiplication:

0000000000401380 <_ZN16doit_not_generic16_38a5b7ada5228707_0$x2e0E>:
  401380:   64 48 3b 24 25 70 00    cmp    %fs:0x70,%rsp
  401387:   00 00
  401389:   77 1a                   ja     4013a5 <_ZN16doit_not_generic16_38a5b7ada5228707_0$x2e0E+0x25>
  40138b:   49 ba 08 00 00 00 00    movabs $0x8,%r10
  401392:   00 00 00
  401395:   49 bb 00 00 00 00 00    movabs $0x0,%r11
  40139c:   00 00 00
  40139f:   e8 88 12 00 00          callq  40262c <__morestack>
  4013a4:   c3                      retq
  4013a5:   55                      push   %rbp
  4013a6:   48 89 e5                mov    %rsp,%rbp
  4013a9:   48 c7 c0 00 36 65 c4    mov    $0xffffffffc4653600,%rax
  4013b0:   f3 0f 59 c0             mulss  %xmm0,%xmm0
  4013b4:   48 ff c0                inc    %rax
  4013b7:   75 f7                   jne    4013b0 <_ZN16doit_not_generic16_38a5b7ada5228707_0$x2e0E+0x30>
  4013b9:   5d                      pop    %rbp
  4013ba:   c3                      retq
  4013bb:   90                      nop
  4013bc:   90                      nop
  4013bd:   90                      nop
  4013be:   90                      nop
  4013bf:   90                      nop

bstrie · 2013-08-06T12:43:35Z

I also spy a movss that's present in the generic version, but not in the normal version.

Florob · 2013-08-09T21:56:58Z

I was curious about the number of NOPs appearing in some code today, so I had a look around trying to determine where they originate. It turns out that the preferred alignment (on x86_64) for loop bodies is 16 bytes, so padding is introduced before loop bodies to ensure this. That is also what is happening here. The generic version needs the padding because the additional movss pushes the loop body off a 16-byte boundary (the preceding mov ends at 0x402618), whereas in the non-generic version the loop body already lands on an aligned address (0x4013b0). With current master I don't actually see that additional movss any more, so this is likely "fixed".
I do wonder why (unlike clang) we get a lot of 1-byte NOPs instead of a single multi-byte NOP though...
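
To make the alignment arithmetic concrete, here is a minimal sketch that computes how many padding bytes are needed to reach the next 16-byte boundary, using the two addresses from the disassembly above. The helper name `padding_to_align` is made up for illustration and is not part of rustc or LLVM.

```rust
/// Hypothetical helper: number of padding bytes needed to round `addr`
/// up to the next multiple of `align` (which must be a power of two).
fn padding_to_align(addr: u64, align: u64) -> u64 {
    assert!(align.is_power_of_two());
    (align - (addr % align)) % align
}

fn main() {
    // Generic doit: the mov before the loop ends at 0x402618,
    // so the loop body has to be padded up to 0x402620.
    println!("doit:             {} bytes", padding_to_align(0x402618, 16));

    // doit_not_generic: the loop body already starts at 0x4013b0.
    println!("doit_not_generic: {} bytes", padding_to_align(0x4013b0, 16));
}
```

The first result matches the eight single-byte nops at 402618 through 40261f in the generic listing; the second matches the absence of padding in the non-generic one.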

thestinger · 2013-08-23T00:33:39Z

@Florob: I think it's because we're doing target info wrong. @alexcrichton has some work in-progress that may fix that.

I'm going to close this issue since I can't duplicate it on master.

alexcrichton · 2013-08-23T04:29:21Z

@thestinger, are you sure you can't reproduce? If so, perhaps this is an OSX-specific problem because I was able to reproduce the extra nops on master.

Additionally, #8700 doesn't fix this :(

thestinger · 2013-08-23T04:44:48Z

What about -Z no-monomorphic-collapse?

alexcrichton · 2013-08-23T06:07:27Z

I still see the nops :(

Florob · 2013-08-23T08:29:58Z

@alexcrichton What code are you using exactly (i.e. how and how often do you call doit())?
It seems to me that rustc is "clever" here and instantiates doit() not for f32 in general, but for the specific argument you call it with. This adds the additional movss to get that argument into xmm0. Once that movss is there the nops are expected for alignment.
If I call doit() twice with different arguments, or compile using --lib, the movss (and the nops) vanish for me.
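
For anyone who wants to try the "call it twice with different arguments" experiment on a current toolchain, here is a rough sketch in present-day Rust syntax. The `do ... times` form and the `Mul<N, N>` bound from the original report no longer compile, so the loop and the trait bound are rewritten. The `iters` parameter and the `Copy` bound are additions for illustration, not part of the original report, and whether a modern rustc still specializes on a constant argument is not something this sketch claims.

```rust
use std::ops::Mul;

// Monomorphic version. `iters` is an added parameter (an assumption of
// this sketch) so the loop count does not have to be one billion.
#[inline(never)]
fn doit_not_generic(a: f32, iters: u32) -> f32 {
    let mut a = a;
    for _ in 0..iters {
        a = a * a;
    }
    a
}

// Generic version. `Mul<Output = N>` is the modern spelling of the old
// `Mul<N, N>` bound; `Copy` is added so `a * a` can use `a` twice.
#[inline(never)]
fn doit<N: Mul<Output = N> + Copy>(a: N, iters: u32) -> N {
    let mut a = a;
    for _ in 0..iters {
        a = a * a;
    }
    a
}

fn main() {
    // Two call sites with different constant arguments, so the compiler
    // cannot fold a single constant into the body of doit.
    assert!(doit_not_generic(2.0f32, 3) == doit(2.0f32, 3));
    assert!(doit_not_generic(3.0f32, 2) == doit(3.0f32, 2));
}
```

To compare the two functions, build with optimizations and inspect the disassembly (for example with objdump -d, as in the listings above).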

alexcrichton · 2013-08-23T16:21:48Z

Oh interesting, I'm using this code:

#[inline(never)]
fn doit_not_generic(a: f32) -> f32 {
    let mut a = a;
    do 1000000000.times {
        a = a * a;
    }

    a
}

#[inline(never)]
fn doit<N: Mul<N, N>>(a: N) -> N {
    let mut a = a;
    do 1000000000.times {
        a = a * a;
    }

    a
}


fn main() {
    assert!(doit_not_generic(2.0f32) == doit(2.0f32));
}

You are correct though that if I later call it with a different argument, the two codegens are the same. I'm a little surprised these aren't merged via the mergefunc pass, but that's for a later day!

bstrie mentioned this issue Aug 6, 2013

Numeric operators dont inline well on generic code. #8333

Closed

thestinger closed this as completed Aug 23, 2013

alexcrichton reopened this Aug 23, 2013

alexcrichton closed this as completed Aug 23, 2013
