Possible performance loss with f32 arithmetic #91447

Yuri6037 · 2021-12-02T10:03:56Z

I've tried, out of curiosity, a floating point arithmetic test and found quite a big difference between C++ and Rust.

The code used in rust

pub struct Stats
{
    x: f32,
    y: f32,
    z: f32
}

pub fn sum(a: &Stats, b: &Stats) -> Stats
{
    Stats {
        x: a.x + b.x,
        y: a.y + b.y,
        z: a.z + b.z
    }
}

The code used in C++

struct Stats
{
    float x;
    float y;
    float z;
};

Stats sum(const Stats &a, const Stats &b)
{
    return Stats {
        a.x + b.x,
        a.y + b.y,
        a.z + b.z
    };
}

Here is a link to a godbolt for side-by-side comparision of assembly output: https://godbolt.org/z/dqc4b74rv

Rust seem to absolutely want the floats back into e* registers instead of keeping them in xmm registers, C++ leaves them into the xmm registers. In some cases it might more advantageous to leave the floats in xmm registers for future operations on them rather then passing them back into the e* registers.

Urgau · 2021-12-02T10:56:28Z

After a quick godbolt I find that clang and rustc use the same representation (%Stats = type { float, float, float }) expect for rustc <= 1.47 (%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }). Not sure if that change something because their all at 0.
Another difference between the two rustc is the return type: void for rustc <= 1.47 and i96 for rustc > 1.47 which cause some bitcast and load at the end, which might explain the difference in codegen. Also why is this i96 and not float like the rest ?
The last noticeable difference is that clang process the data in <2 x float>, float instead of float, float, float which allow more optimizations to be done.

My hypothesis is that this codegen issue is probably cause by the i96 return type, which is probably cause by a regression in codegen in rustc > 1.47 but I'm not an expert in rustc_codegen_llvm or llvm itself so this needs to be confirm by the rustc llvm workgroup.

clang (trunk with `-O0`)

Assembly with -O3

sum(Stats const&, Stats const&):                      # @sum(Stats const&, Stats const&)
        movss   xmm0, dword ptr [rdi]           # xmm0 = mem[0],zero,zero,zero
        movss   xmm1, dword ptr [rdi + 4]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm0, dword ptr [rsi]
        addss   xmm1, dword ptr [rsi + 4]
        unpcklps        xmm0, xmm1                      # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        movss   xmm1, dword ptr [rdi + 8]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret

%struct.Stats = type { float, float, float }

define dso_local { <2 x float>, float } @_Z3sumRK5StatsS1_(%struct.Stats* nonnull align 4 dereferenceable(12) %0, %struct.Stats* nonnull align 4 dereferenceable(12) %1) #0 !dbg !105 {
  %3 = alloca %struct.Stats, align 4
  %4 = alloca %struct.Stats*, align 8
  %5 = alloca %struct.Stats*, align 8
  %6 = alloca { <2 x float>, float }, align 8
  store %struct.Stats* %0, %struct.Stats** %4, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %4, metadata !111, metadata !DIExpression()), !dbg !112
  store %struct.Stats* %1, %struct.Stats** %5, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %5, metadata !113, metadata !DIExpression()), !dbg !114
  %7 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 0, !dbg !115
  %8 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !116
  %9 = getelementptr inbounds %struct.Stats, %struct.Stats* %8, i32 0, i32 0, !dbg !117
  %10 = load float, float* %9, align 4, !dbg !117
  %11 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !118
  %12 = getelementptr inbounds %struct.Stats, %struct.Stats* %11, i32 0, i32 0, !dbg !119
  %13 = load float, float* %12, align 4, !dbg !119
  %14 = fadd float %10, %13, !dbg !120
  store float %14, float* %7, align 4, !dbg !115
  %15 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 1, !dbg !115
  %16 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !121
  %17 = getelementptr inbounds %struct.Stats, %struct.Stats* %16, i32 0, i32 1, !dbg !122
  %18 = load float, float* %17, align 4, !dbg !122
  %19 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !123
  %20 = getelementptr inbounds %struct.Stats, %struct.Stats* %19, i32 0, i32 1, !dbg !124
  %21 = load float, float* %20, align 4, !dbg !124
  %22 = fadd float %18, %21, !dbg !125
  store float %22, float* %15, align 4, !dbg !115
  %23 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 2, !dbg !115
  %24 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !126
  %25 = getelementptr inbounds %struct.Stats, %struct.Stats* %24, i32 0, i32 2, !dbg !127
  %26 = load float, float* %25, align 4, !dbg !127
  %27 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !128
  %28 = getelementptr inbounds %struct.Stats, %struct.Stats* %27, i32 0, i32 2, !dbg !129
  %29 = load float, float* %28, align 4, !dbg !129
  %30 = fadd float %26, %29, !dbg !130
  store float %30, float* %23, align 4, !dbg !115
  %31 = bitcast { <2 x float>, float }* %6 to i8*, !dbg !131
  %32 = bitcast %struct.Stats* %3 to i8*, !dbg !131
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %31, i8* align 4 %32, i64 12, i1 false), !dbg !131
  %33 = load { <2 x float>, float }, { <2 x float>, float }* %6, align 8, !dbg !131
  ret { <2 x float>, float } %33, !dbg !131
}

rustc (1.56 with `-C opt-level=0 --emit=llvm-ir`)

Assembly with -O

example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 4]
        movsd   xmm2, qword ptr [rsi + 4]
        addps   xmm2, xmm1
        movd    eax, xmm0
        movd    ecx, xmm2
        shufps  xmm2, xmm2, 85
        movd    edx, xmm2
        shl     rcx, 32
        or      rax, rcx
        ret

%Stats = type { float, float, float }

define i96 @_ZN7example3sum17h2dbae3ec3349ec5dE(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %0 = alloca %Stats, align 4
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 1, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 2, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  %10 = bitcast %Stats* %0 to i96*, !dbg !17
  %11 = load i96, i96* %10, align 4, !dbg !17
  ret i96 %11, !dbg !17
}

rustc (1.47 with `-C opt-level=0 --emit=llvm-ir`)

Assembly with -O

example::sum:
        movss   xmm0, dword ptr [rsi]
        movss   xmm1, dword ptr [rsi + 4]
        addss   xmm0, dword ptr [rdx]
        addss   xmm1, dword ptr [rdx + 4]
        movss   xmm2, dword ptr [rsi + 8]
        addss   xmm2, dword ptr [rdx + 8]
        mov     rax, rdi
        movss   dword ptr [rdi], xmm0
        movss   dword ptr [rdi + 4], xmm1
        movss   dword ptr [rdi + 8], xmm2
        ret

%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }

define void @_ZN7example3sum17hdebae303c906e224E(%Stats* noalias nocapture sret dereferenceable(12) %0, %Stats* noalias readonly align 4 dereferenceable(12) %a, %Stats* noalias readonly align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 3, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 3, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 5, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 5, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 3, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 5, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  ret void, !dbg !17
}

Urgau · 2021-12-02T11:07:47Z

I can confirm that the problem here is the return type i96 in the LLVM IR because if I use an out parameter the assembly is what I would expect (without the vector optimization done by clang) (godbolt):

pub fn sum(a: &Stats, b: &Stats, res: &mut Stats)
{
    res.x = a.x + b.x;
    res.y = a.y + b.y;
    res.z = a.z + b.z;
}

example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movss   dword ptr [rdx], xmm0
        movss   xmm0, dword ptr [rdi + 4]
        addss   xmm0, dword ptr [rsi + 4]
        movss   dword ptr [rdx + 4], xmm0
        movss   xmm0, dword ptr [rdi + 8]
        addss   xmm0, dword ptr [rsi + 8]
        movss   dword ptr [rdx + 8], xmm0
        ret

LLVM IR at -C opt-level = 0

%Stats = type { float, float, float }

define void @_ZN7example3sum17h258bf76ba01aa602E(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b, %Stats* align 4 dereferenceable(12) %res) unnamed_addr #0 !dbg !6 {
  %0 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %0, align 4, !dbg !10
  %1 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %1, align 4, !dbg !11
  %2 = bitcast %Stats* %res to float*, !dbg !12
  %3 = fadd float %_4, %_5, !dbg !12
  store float %3, float* %2, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !13
  %_6 = load float, float* %4, align 4, !dbg !13
  %5 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !14
  %_7 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 1, !dbg !15
  %7 = fadd float %_6, %_7, !dbg !15
  store float %7, float* %6, align 4, !dbg !15
  %8 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !16
  %_8 = load float, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !17
  %_9 = load float, float* %9, align 4, !dbg !17
  %10 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 2, !dbg !18
  %11 = fadd float %_8, %_9, !dbg !18
  store float %11, float* %10, align 4, !dbg !18
  ret void, !dbg !19
}

Yuri6037 · 2021-12-02T13:38:33Z

The question then becomes why is rustc forcing an i96 as return type?

MSxDOS · 2021-12-03T03:00:14Z

This seems to be the same regression as #85265

Yuri6037 · 2022-01-27T08:38:53Z

I think I just got hit by that in a real app: with nalgebra to perform gaussian blur, I noticed a slowdown of cargo build --release when f32 was used instead of f64. Of course when f32 is used the return type for nalgebra is less than 128 bits however with f64 it's larger.

This really needs to be fixed otherwise using rust for real time computer graphics is a no go!

workingjubilee · 2022-03-04T18:42:07Z

This should be fixed in #94570, can we confirm that?

Yuri6037 · 2022-03-04T20:53:29Z

This should be fixed in #94570, can we confirm that?

When will this PR be available in Nightly or is it already available? What about stable?

scottmcm · 2022-03-05T00:24:43Z

What about stable?

The milestone on the PR shows the first stable release in which it's projected to be included:

I'm not certain when the nightlies are generated, but the 2022-03-06 or later nighlies should be a safe bet.

workingjubilee · 2022-03-05T01:14:46Z

It should be on nightly as of this exact moment.

Yuri6037 · 2022-03-05T10:32:44Z

I've tested on the playground version with assembly output and it attempts to return values to the stack whereas the C++ variant leaves them in xmm registers. It does however look a bit better than godbolt which uses an even older nightly.

Here is current assembly output from playground (with 2022-03-04):

playground::sum: # @playground::sum
# %bb.0:
	movss	xmm0, dword ptr [rsi]           # xmm0 = mem[0],zero,zero,zero
	movss	xmm1, dword ptr [rsi + 4]       # xmm1 = mem[0],zero,zero,zero
	addss	xmm0, dword ptr [rdx]
	addss	xmm1, dword ptr [rdx + 4]
	movss	xmm2, dword ptr [rsi + 8]       # xmm2 = mem[0],zero,zero,zero
	addss	xmm2, dword ptr [rdx + 8]
	mov	rax, rdi
	movss	dword ptr [rdi], xmm0
	movss	dword ptr [rdi + 4], xmm1
	movss	dword ptr [rdi + 8], xmm2
	ret
                                        # -- End function

Here is the C++ assembly for comparison (obtained from godbolt):

sum(Stats const&, Stats const&):                      # @sum(Stats const&, Stats const&)
        movsd   xmm1, qword ptr [rdi]           # xmm1 = mem[0],zero
        movsd   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret

workingjubilee · 2022-03-05T21:51:56Z

Ah, the memory-passing thing is a known bug which has its own issues tracking its nuances, essentially, so if this has been reduced to that, it is now a duplicate of those and I shall close this. In the meantime, it usually is a concern that is significantly reduced by inlining.

Thank you for investigating, however!

scottmcm · 2022-03-05T22:20:28Z

Here's a quick demonstration that the extra movupss go away when used, rather than when looking at the function in isolation: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=70f02a31bb33e4fce6b9de2ac29e4326

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
	movups	xmm1, xmmword ptr [rsi + rdx]
	addps	xmm0, xmm1
	add	rdx, 16
	cmp	rcx, rdx
	jne	.LBB1_4

This was referenced Dec 3, 2021

Basic vectorization performance regression from 1.48.0 onwards #85265

Closed

[Experiment] Aggregate layout optimization #91507

Closed

AngelicosPhosphoros mentioned this issue Dec 9, 2021

if Result::is_ok() followed immediately by Result::unwrap() fails to optimize out the second err check #85771

Open

Urgau mentioned this issue Jan 28, 2022

Don't over-optimize the abi layout #93405

Closed

Urgau mentioned this issue Feb 1, 2022

Don't aggregate homogeneous floats in the Rust ABI #93564

Closed

workingjubilee mentioned this issue Feb 13, 2022

Rust should pass vectors by vector register #93490

Closed

workingjubilee closed this as completed Mar 5, 2022

workingjubilee mentioned this issue Jul 1, 2022

Result that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI #97540

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Possible performance loss with f32 arithmetic #91447

Possible performance loss with f32 arithmetic #91447

Yuri6037 commented Dec 2, 2021 •

edited

Loading

Urgau commented Dec 2, 2021 •

edited

Loading

Uh oh!

Urgau commented Dec 2, 2021

Uh oh!

Yuri6037 commented Dec 2, 2021

Uh oh!

MSxDOS commented Dec 3, 2021

Uh oh!

Yuri6037 commented Jan 27, 2022

Uh oh!

workingjubilee commented Mar 4, 2022

Uh oh!

Yuri6037 commented Mar 4, 2022

Uh oh!

scottmcm commented Mar 5, 2022

Uh oh!

workingjubilee commented Mar 5, 2022

Uh oh!

Yuri6037 commented Mar 5, 2022 •

edited

Loading

Uh oh!

workingjubilee commented Mar 5, 2022

Uh oh!

scottmcm commented Mar 5, 2022

Uh oh!

Possible performance loss with f32 arithmetic #91447

Possible performance loss with f32 arithmetic #91447

Comments

Yuri6037 commented Dec 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Urgau commented Dec 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

clang (trunk with -O0)

rustc (1.56 with -C opt-level=0 --emit=llvm-ir)

rustc (1.47 with -C opt-level=0 --emit=llvm-ir)

Uh oh!

Urgau commented Dec 2, 2021

Uh oh!

Yuri6037 commented Dec 2, 2021

Uh oh!

MSxDOS commented Dec 3, 2021

Uh oh!

Yuri6037 commented Jan 27, 2022

Uh oh!

workingjubilee commented Mar 4, 2022

Uh oh!

Yuri6037 commented Mar 4, 2022

Uh oh!

scottmcm commented Mar 5, 2022

Uh oh!

workingjubilee commented Mar 5, 2022

Uh oh!

Yuri6037 commented Mar 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

workingjubilee commented Mar 5, 2022

Uh oh!

scottmcm commented Mar 5, 2022

Uh oh!

Yuri6037 commented Dec 2, 2021 •

edited

Loading

Urgau commented Dec 2, 2021 •

edited

Loading

clang (trunk with `-O0`)

rustc (1.56 with `-C opt-level=0 --emit=llvm-ir`)

rustc (1.47 with `-C opt-level=0 --emit=llvm-ir`)

Yuri6037 commented Mar 5, 2022 •

edited

Loading