Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible performance loss with f32 arithmetic #91447

Closed
Yuri6037 opened this issue Dec 2, 2021 · 12 comments
Closed

Possible performance loss with f32 arithmetic #91447

Yuri6037 opened this issue Dec 2, 2021 · 12 comments
Labels
A-codegen Area: Code generation A-floating-point Area: Floating point numbers and arithmetic A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Yuri6037
Copy link

Yuri6037 commented Dec 2, 2021

I've tried, out of curiosity, a floating point arithmetic test and found quite a big difference between C++ and Rust.

The code used in rust

pub struct Stats
{
    x: f32,
    y: f32,
    z: f32
}

pub fn sum(a: &Stats, b: &Stats) -> Stats
{
    Stats {
        x: a.x + b.x,
        y: a.y + b.y,
        z: a.z + b.z
    }
}

The code used in C++

struct Stats
{
    float x;
    float y;
    float z;
};

Stats sum(const Stats &a, const Stats &b)
{
    return Stats {
        a.x + b.x,
        a.y + b.y,
        a.z + b.z
    };
}

Here is a link to a godbolt for side-by-side comparision of assembly output: https://godbolt.org/z/dqc4b74rv

Rust seem to absolutely want the floats back into e* registers instead of keeping them in xmm registers, C++ leaves them into the xmm registers. In some cases it might more advantageous to leave the floats in xmm registers for future operations on them rather then passing them back into the e* registers.

@Urgau
Copy link
Member

Urgau commented Dec 2, 2021

After a quick godbolt I find that clang and rustc use the same representation (%Stats = type { float, float, float }) expect for rustc <= 1.47 (%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }). Not sure if that change something because their all at 0.
Another difference between the two rustc is the return type: void for rustc <= 1.47 and i96 for rustc > 1.47 which cause some bitcast and load at the end, which might explain the difference in codegen. Also why is this i96 and not float like the rest ?
The last noticeable difference is that clang process the data in <2 x float>, float instead of float, float, float which allow more optimizations to be done.

My hypothesis is that this codegen issue is probably cause by the i96 return type, which is probably cause by a regression in codegen in rustc > 1.47 but I'm not an expert in rustc_codegen_llvm or llvm itself so this needs to be confirm by the rustc llvm workgroup.

clang (trunk with -O0)

Assembly with -O3
sum(Stats const&, Stats const&):                      # @sum(Stats const&, Stats const&)
        movss   xmm0, dword ptr [rdi]           # xmm0 = mem[0],zero,zero,zero
        movss   xmm1, dword ptr [rdi + 4]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm0, dword ptr [rsi]
        addss   xmm1, dword ptr [rsi + 4]
        unpcklps        xmm0, xmm1                      # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        movss   xmm1, dword ptr [rdi + 8]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret
%struct.Stats = type { float, float, float }

define dso_local { <2 x float>, float } @_Z3sumRK5StatsS1_(%struct.Stats* nonnull align 4 dereferenceable(12) %0, %struct.Stats* nonnull align 4 dereferenceable(12) %1) #0 !dbg !105 {
  %3 = alloca %struct.Stats, align 4
  %4 = alloca %struct.Stats*, align 8
  %5 = alloca %struct.Stats*, align 8
  %6 = alloca { <2 x float>, float }, align 8
  store %struct.Stats* %0, %struct.Stats** %4, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %4, metadata !111, metadata !DIExpression()), !dbg !112
  store %struct.Stats* %1, %struct.Stats** %5, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %5, metadata !113, metadata !DIExpression()), !dbg !114
  %7 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 0, !dbg !115
  %8 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !116
  %9 = getelementptr inbounds %struct.Stats, %struct.Stats* %8, i32 0, i32 0, !dbg !117
  %10 = load float, float* %9, align 4, !dbg !117
  %11 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !118
  %12 = getelementptr inbounds %struct.Stats, %struct.Stats* %11, i32 0, i32 0, !dbg !119
  %13 = load float, float* %12, align 4, !dbg !119
  %14 = fadd float %10, %13, !dbg !120
  store float %14, float* %7, align 4, !dbg !115
  %15 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 1, !dbg !115
  %16 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !121
  %17 = getelementptr inbounds %struct.Stats, %struct.Stats* %16, i32 0, i32 1, !dbg !122
  %18 = load float, float* %17, align 4, !dbg !122
  %19 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !123
  %20 = getelementptr inbounds %struct.Stats, %struct.Stats* %19, i32 0, i32 1, !dbg !124
  %21 = load float, float* %20, align 4, !dbg !124
  %22 = fadd float %18, %21, !dbg !125
  store float %22, float* %15, align 4, !dbg !115
  %23 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 2, !dbg !115
  %24 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !126
  %25 = getelementptr inbounds %struct.Stats, %struct.Stats* %24, i32 0, i32 2, !dbg !127
  %26 = load float, float* %25, align 4, !dbg !127
  %27 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !128
  %28 = getelementptr inbounds %struct.Stats, %struct.Stats* %27, i32 0, i32 2, !dbg !129
  %29 = load float, float* %28, align 4, !dbg !129
  %30 = fadd float %26, %29, !dbg !130
  store float %30, float* %23, align 4, !dbg !115
  %31 = bitcast { <2 x float>, float }* %6 to i8*, !dbg !131
  %32 = bitcast %struct.Stats* %3 to i8*, !dbg !131
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %31, i8* align 4 %32, i64 12, i1 false), !dbg !131
  %33 = load { <2 x float>, float }, { <2 x float>, float }* %6, align 8, !dbg !131
  ret { <2 x float>, float } %33, !dbg !131
}

rustc (1.56 with -C opt-level=0 --emit=llvm-ir)

Assembly with -O
example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 4]
        movsd   xmm2, qword ptr [rsi + 4]
        addps   xmm2, xmm1
        movd    eax, xmm0
        movd    ecx, xmm2
        shufps  xmm2, xmm2, 85
        movd    edx, xmm2
        shl     rcx, 32
        or      rax, rcx
        ret
%Stats = type { float, float, float }

define i96 @_ZN7example3sum17h2dbae3ec3349ec5dE(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %0 = alloca %Stats, align 4
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 1, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 2, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  %10 = bitcast %Stats* %0 to i96*, !dbg !17
  %11 = load i96, i96* %10, align 4, !dbg !17
  ret i96 %11, !dbg !17
}

rustc (1.47 with -C opt-level=0 --emit=llvm-ir)

Assembly with -O
example::sum:
        movss   xmm0, dword ptr [rsi]
        movss   xmm1, dword ptr [rsi + 4]
        addss   xmm0, dword ptr [rdx]
        addss   xmm1, dword ptr [rdx + 4]
        movss   xmm2, dword ptr [rsi + 8]
        addss   xmm2, dword ptr [rdx + 8]
        mov     rax, rdi
        movss   dword ptr [rdi], xmm0
        movss   dword ptr [rdi + 4], xmm1
        movss   dword ptr [rdi + 8], xmm2
        ret
%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }

define void @_ZN7example3sum17hdebae303c906e224E(%Stats* noalias nocapture sret dereferenceable(12) %0, %Stats* noalias readonly align 4 dereferenceable(12) %a, %Stats* noalias readonly align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 3, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 3, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 5, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 5, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 3, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 5, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  ret void, !dbg !17
}

@Urgau
Copy link
Member

Urgau commented Dec 2, 2021

I can confirm that the problem here is the return type i96 in the LLVM IR because if I use an out parameter the assembly is what I would expect (without the vector optimization done by clang) (godbolt):

pub fn sum(a: &Stats, b: &Stats, res: &mut Stats)
{
    res.x = a.x + b.x;
    res.y = a.y + b.y;
    res.z = a.z + b.z;
}
example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movss   dword ptr [rdx], xmm0
        movss   xmm0, dword ptr [rdi + 4]
        addss   xmm0, dword ptr [rsi + 4]
        movss   dword ptr [rdx + 4], xmm0
        movss   xmm0, dword ptr [rdi + 8]
        addss   xmm0, dword ptr [rsi + 8]
        movss   dword ptr [rdx + 8], xmm0
        ret
LLVM IR at -C opt-level = 0
%Stats = type { float, float, float }

define void @_ZN7example3sum17h258bf76ba01aa602E(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b, %Stats* align 4 dereferenceable(12) %res) unnamed_addr #0 !dbg !6 {
  %0 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %0, align 4, !dbg !10
  %1 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %1, align 4, !dbg !11
  %2 = bitcast %Stats* %res to float*, !dbg !12
  %3 = fadd float %_4, %_5, !dbg !12
  store float %3, float* %2, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !13
  %_6 = load float, float* %4, align 4, !dbg !13
  %5 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !14
  %_7 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 1, !dbg !15
  %7 = fadd float %_6, %_7, !dbg !15
  store float %7, float* %6, align 4, !dbg !15
  %8 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !16
  %_8 = load float, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !17
  %_9 = load float, float* %9, align 4, !dbg !17
  %10 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 2, !dbg !18
  %11 = fadd float %_8, %_9, !dbg !18
  store float %11, float* %10, align 4, !dbg !18
  ret void, !dbg !19
}

@Yuri6037
Copy link
Author

Yuri6037 commented Dec 2, 2021

The question then becomes why is rustc forcing an i96 as return type?

@inquisitivecrystal inquisitivecrystal added A-codegen Area: Code generation A-floating-point Area: Floating point numbers and arithmetic A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 2, 2021
@MSxDOS
Copy link

MSxDOS commented Dec 3, 2021

This seems to be the same regression as #85265

@Yuri6037
Copy link
Author

I think I just got hit by that in a real app: with nalgebra to perform gaussian blur, I noticed a slowdown of cargo build --release when f32 was used instead of f64. Of course when f32 is used the return type for nalgebra is less than 128 bits however with f64 it's larger.

This really needs to be fixed otherwise using rust for real time computer graphics is a no go!

@workingjubilee
Copy link
Member

This should be fixed in #94570, can we confirm that?

@Yuri6037
Copy link
Author

Yuri6037 commented Mar 4, 2022

This should be fixed in #94570, can we confirm that?

When will this PR be available in Nightly or is it already available? What about stable?

@scottmcm
Copy link
Member

scottmcm commented Mar 5, 2022

What about stable?

The milestone on the PR shows the first stable release in which it's projected to be included:
image

I'm not certain when the nightlies are generated, but the 2022-03-06 or later nighlies should be a safe bet.

@workingjubilee
Copy link
Member

It should be on nightly as of this exact moment.

@Yuri6037
Copy link
Author

Yuri6037 commented Mar 5, 2022

I've tested on the playground version with assembly output and it attempts to return values to the stack whereas the C++ variant leaves them in xmm registers. It does however look a bit better than godbolt which uses an even older nightly.

Here is current assembly output from playground (with 2022-03-04):

playground::sum: # @playground::sum
# %bb.0:
	movss	xmm0, dword ptr [rsi]           # xmm0 = mem[0],zero,zero,zero
	movss	xmm1, dword ptr [rsi + 4]       # xmm1 = mem[0],zero,zero,zero
	addss	xmm0, dword ptr [rdx]
	addss	xmm1, dword ptr [rdx + 4]
	movss	xmm2, dword ptr [rsi + 8]       # xmm2 = mem[0],zero,zero,zero
	addss	xmm2, dword ptr [rdx + 8]
	mov	rax, rdi
	movss	dword ptr [rdi], xmm0
	movss	dword ptr [rdi + 4], xmm1
	movss	dword ptr [rdi + 8], xmm2
	ret
                                        # -- End function

Here is the C++ assembly for comparison (obtained from godbolt):

sum(Stats const&, Stats const&):                      # @sum(Stats const&, Stats const&)
        movsd   xmm1, qword ptr [rdi]           # xmm1 = mem[0],zero
        movsd   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret

@workingjubilee
Copy link
Member

Ah, the memory-passing thing is a known bug which has its own issues tracking its nuances, essentially, so if this has been reduced to that, it is now a duplicate of those and I shall close this. In the meantime, it usually is a concern that is significantly reduced by inlining.

Thank you for investigating, however!

@scottmcm
Copy link
Member

scottmcm commented Mar 5, 2022

Here's a quick demonstration that the extra movupss go away when used, rather than when looking at the function in isolation: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=70f02a31bb33e4fce6b9de2ac29e4326

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
	movups	xmm1, xmmword ptr [rsi + rdx]
	addps	xmm0, xmm1
	add	rdx, 16
	cmp	rcx, rdx
	jne	.LBB1_4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-floating-point Area: Floating point numbers and arithmetic A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
6 participants