3-arg hypot slower than 4-arg #47917

Open · sethaxen opened this issue Dec 17, 2022 · 14 comments
Labels: maths (Mathematical functions), performance (Must go faster)

Comments

@sethaxen
Contributor

sethaxen commented Dec 17, 2022

For some reason calling hypot with 3 args on Julia v1.9.0-alpha1 is ~3x slower than calling it with 4 args:

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  8.309 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  25.234 ns (0 allocations: 0 bytes)

version info:

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 8 on 8 virtual cores
@sethaxen
Contributor Author

But weirdly enough, this doesn't help:

julia> baz(x) = hypot(x, x, x, zero(x));

julia> @btime baz($(Ref(1.0))[]);
  29.569 ns (0 allocations: 0 bytes)

@giordano
Contributor

giordano commented Dec 18, 2022

Can't reproduce

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> baz(x) = hypot(x, x, x, zero(x));

julia> @btime foo($(Ref(1.0))[]);
  7.125 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  5.250 ns (0 allocations: 0 bytes)

julia> @btime baz($(Ref(1.0))[]);
  7.125 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-DEV.102
Commit 63830a6f20* (2022-12-06 19:15 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

@brenhinkeller added the performance (Must go faster) and maths (Mathematical functions) labels on Dec 18, 2022
@horst5000

Reproducible with Julia 1.8.3 (with drastically worse results than those noted above):

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  625.306 ns (13 allocations: 336 bytes)

julia> @btime bar($(Ref(1.0))[]);
  8.048 ns (0 allocations: 0 bytes)

@giordano
Contributor

giordano commented Dec 19, 2022

Reproducible with Julia 1.8.3 (with drastically worse results than those noted above)

Reproducible what? You're showing the opposite timings compared to what the OP reported (for you, foo is much slower than bar). You're likely seeing #44336, which was fixed by #44357 in v1.9.
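
A quick way to check whether a 1.8.x session is hitting that issue (just a sketch; exact numbers are machine and version dependent) is to look at allocations after a warm-up call:

julia> foo(x) = hypot(x, x, x, x);

julia> foo(1.0);  # warm up first so the measurement excludes compile-time allocations

julia> @allocated foo(1.0)  # expect nonzero on 1.8.3 if #44336 is in play, and 0 on 1.9+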

@horst5000

Right, I was just testing whether the two are equal in performance. I need to test with 1.9 / 1.10-dev again.

@sethaxen
Contributor Author

Reproducible with Julia 1.8.3 (with drastically worse results than those noted above)

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  625.306 ns (13 allocations: 336 bytes)

julia> @btime bar($(Ref(1.0))[]);
  8.048 ns (0 allocations: 0 bytes)

This will not be reproducible pre-1.9 because of #44336, which was only fixed in 1.9. To be clear, I have only tested this on the 1.9 alpha, not the nightly.

@jaakkor2
Contributor

Reproducible, but with a smaller margin. However, hypot with unequal arguments shows 4 args being slower.

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  14.414 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  15.916 ns (0 allocations: 0 bytes)

julia> @btime hypot($(Ref(1.0))[], $(Ref(2.0))[], $(Ref(3.0))[], $(Ref(4.0))[]);
  23.494 ns (0 allocations: 0 bytes)

julia> @btime hypot($(Ref(1.0))[], $(Ref(2.0))[], $(Ref(3.0))[]);
  15.415 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores

@sethaxen
Contributor Author

Reproducible, but with a smaller margin. However, hypot with unequal arguments shows 4 args being slower.

The slowdown reappears, though, if we wrap the values in a struct. I originally encountered this in JuliaGeometry/Quaternions.jl#122.

julia> using BenchmarkTools

julia> struct Quaternion{T}
           a::T
           b::T
           c::T
           d::T
       end

julia> foo(x::Quaternion) = hypot(x.a, x.b, x.c, x.d);

julia> bar(x::Quaternion) = hypot(x.b, x.c, x.d);

julia> x = Quaternion(randn(4)...);

julia> @btime foo($x)
  8.466 ns (0 allocations: 0 bytes)
0.9158524958493485

julia> @btime bar($x)
  25.262 ns (0 allocations: 0 bytes)
0.9124982984827957

julia> @btime foo($(Ref(x))[])
  8.433 ns (0 allocations: 0 bytes)
0.9158524958493485

julia> @btime bar($(Ref(x))[])
  25.248 ns (0 allocations: 0 bytes)
0.9124982984827957

@sethaxen
Contributor Author

sethaxen commented Jan 7, 2023

I just checked, and I'm still seeing the issues in #47917 (comment) and #47917 (comment) on both v1.9-beta2 and the latest nightly.

@giordano
Contributor

giordano commented Jan 8, 2023

@sethaxen do you see any particular difference between code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none) and code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)?

@sethaxen
Contributor Author

sethaxen commented Jan 9, 2023

@sethaxen do you see any particular difference between code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none) and code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)?

Here are the outputs. I'm not very good at reading these:

julia> code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none)
define double @julia__hypot_234([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0) #0 {
top:
  %1 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
  %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
  %3 = load double, double* %1, align 8
  %4 = call double @llvm.fabs.f64(double %3)
  %5 = bitcast double* %2 to <2 x double>*
  %6 = load <2 x double>, <2 x double>* %5, align 8
  %7 = extractelement <2 x double> %6, i64 0
  %8 = call double @llvm.fabs.f64(double %7)
  %9 = fsub double %4, %8
  %10 = bitcast double %9 to i64
  %11 = icmp slt i64 %10, 0
  %12 = select i1 %11, double %8, double %4
  %13 = fcmp ord double %3, %7
  %14 = select i1 %13, double %12, double %9
  %15 = extractelement <2 x double> %6, i64 1
  %16 = call double @llvm.fabs.f64(double %15)
  %17 = fsub double %14, %16
  %18 = bitcast double %17 to i64
  %19 = icmp slt i64 %18, 0
  %20 = select i1 %19, double %16, double %14
  %21 = fcmp ord double %14, %15
  %22 = select i1 %21, double %20, double %17
  %23 = fcmp ord double %22, 0.000000e+00
  br i1 %23, label %L61, label %L25

L25:                                              ; preds = %top
  %24 = fcmp uno double %3, 0.000000e+00
  %25 = fsub double %3, %3
  %26 = fcmp oeq double %25, 0.000000e+00
  %27 = or i1 %24, %26
  br i1 %27, label %L50, label %common.ret

L50:                                              ; preds = %L25
  %28 = fcmp uno double %7, 0.000000e+00
  %29 = fsub double %7, %7
  %30 = fcmp oeq double %29, 0.000000e+00
  %31 = or i1 %28, %30
  br i1 %31, label %L50.1, label %common.ret

L50.1:                                            ; preds = %L50
  %32 = fcmp uno double %15, 0.000000e+00
  %33 = fsub double %15, %15
  %34 = fcmp oeq double %33, 0.000000e+00
  %35 = or i1 %32, %34
  br i1 %35, label %L61, label %common.ret

common.ret:                                       ; preds = %L89, %L64, %L61, %L50.1, %L50, %L25
  %common.ret.op = phi double [ %52, %L89 ], [ %22, %L64 ], [ %22, %L61 ], [ 0x7FF0000000000000, %L25 ], [ 0x7FF0000000000000, %L50.1 ], [ 0x7FF0000000000000, %L50 ]
  ret double %common.ret.op

L61:                                              ; preds = %L50.1, %top
  %36 = fcmp une double %22, 0.000000e+00
  br i1 %36, label %L64, label %common.ret

L64:                                              ; preds = %L61
  %37 = fcmp uno double %22, 0.000000e+00
  %38 = fsub double %22, %22
  %39 = fcmp oeq double %38, 0.000000e+00
  %40 = or i1 %37, %39
  br i1 %40, label %L89, label %common.ret

L89:                                              ; preds = %L64
  %41 = fdiv double %3, %22
  %42 = fmul double %41, %41
  %43 = insertelement <2 x double> poison, double %22, i64 0
  %44 = shufflevector <2 x double> %43, <2 x double> poison, <2 x i32> zeroinitializer
  %45 = fdiv <2 x double> %6, %44
  %46 = fmul <2 x double> %45, %45
  %47 = extractelement <2 x double> %46, i64 0
  %48 = fadd double %42, %47
  %49 = extractelement <2 x double> %46, i64 1
  %50 = fadd double %48, %49
  %51 = call double @llvm.sqrt.f64(double %50)
  %52 = fmul double %22, %51
  br label %common.ret
}
julia> code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)
define double @julia__hypot_270([4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %0) #0 {
top:
  %1 = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 2
  %2 = bitcast [4 x double]* %0 to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8
  %4 = call <2 x double> @llvm.fabs.v2f64(<2 x double> %3)
  %5 = extractelement <2 x double> %4, i64 0
  %6 = extractelement <2 x double> %4, i64 1
  %7 = fsub double %5, %6
  %8 = bitcast double %7 to i64
  %9 = icmp slt i64 %8, 0
  %10 = select i1 %9, double %6, double %5
  %11 = extractelement <2 x double> %3, i64 0
  %12 = extractelement <2 x double> %3, i64 1
  %13 = fcmp ord double %11, %12
  %14 = select i1 %13, double %10, double %7
  ; diff
  %15 = bitcast double* %1 to <2 x double>*
  %16 = load <2 x double>, <2 x double>* %15, align 8
  %17 = extractelement <2 x double> %16, i64 0
  %18 = call double @llvm.fabs.f64(double %17)
  %19 = fsub double %14, %18
  %20 = bitcast double %19 to i64
  %21 = icmp slt i64 %20, 0
  %22 = select i1 %21, double %18, double %14
  %23 = fcmp ord double %14, %17
  %24 = select i1 %23, double %22, double %19
  %25 = extractelement <2 x double> %16, i64 1
  %26 = call double @llvm.fabs.f64(double %25)
  %27 = fsub double %24, %26
  %28 = bitcast double %27 to i64
  %29 = icmp slt i64 %28, 0
  %30 = select i1 %29, double %26, double %24
  %31 = fcmp ord double %24, %25
  %32 = select i1 %31, double %30, double %27
  %33 = fcmp ord double %32, 0.000000e+00
  br i1 %33, label %L71, label %L35

L35:                                              ; preds = %top
  %34 = fcmp uno double %11, 0.000000e+00
  %35 = fsub double %11, %11
  %36 = fcmp oeq double %35, 0.000000e+00
  %37 = or i1 %34, %36
  br i1 %37, label %L60, label %common.ret

L60:                                              ; preds = %L35
  %38 = fcmp uno double %12, 0.000000e+00
  %39 = fsub double %12, %12
  %40 = fcmp oeq double %39, 0.000000e+00
  %41 = or i1 %38, %40
  br i1 %41, label %L60.1, label %common.ret

L60.1:                                            ; preds = %L60
  %42 = fcmp uno double %17, 0.000000e+00
  %43 = fsub double %17, %17
  %44 = fcmp oeq double %43, 0.000000e+00
  %45 = or i1 %42, %44
  br i1 %45, label %L60.2, label %common.ret

L60.2:                                            ; preds = %L60.1
  %46 = fcmp uno double %25, 0.000000e+00
  %47 = fsub double %25, %25
  %48 = fcmp oeq double %47, 0.000000e+00
  %49 = or i1 %46, %48
  br i1 %49, label %L71, label %common.ret

common.ret:                                       ; preds = %L103, %L74, %L71, %L60.2, %L60.1, %L60, %L35
  %common.ret.op = phi double [ %69, %L103 ], [ %32, %L74 ], [ %32, %L71 ], [ 0x7FF0000000000000, %L35 ], [ 0x7FF0000000000000, %L60.2 ], [ 0x7FF0000000000000, %L60.1 ], [ 0x7FF0000000000000, %L60 ]
  ret double %common.ret.op

L71:                                              ; preds = %L60.2, %top
  %50 = fcmp une double %32, 0.000000e+00
  br i1 %50, label %L74, label %common.ret

L74:                                              ; preds = %L71
  %51 = fcmp uno double %32, 0.000000e+00
  %52 = fsub double %32, %32
  %53 = fcmp oeq double %52, 0.000000e+00
  %54 = or i1 %51, %53
  br i1 %54, label %L103, label %common.ret

L103:                                             ; preds = %L74
  %55 = insertelement <2 x double> poison, double %32, i64 0
  %56 = shufflevector <2 x double> %55, <2 x double> poison, <2 x i32> zeroinitializer
  %57 = fdiv <2 x double> %3, %56
  %58 = fmul <2 x double> %57, %57
  %59 = extractelement <2 x double> %58, i64 0
  %60 = extractelement <2 x double> %58, i64 1
  %61 = fadd double %59, %60
  %62 = fdiv <2 x double> %16, %56
  %63 = fmul <2 x double> %62, %62
  %64 = extractelement <2 x double> %63, i64 0
  %65 = fadd double %61, %64
  %66 = extractelement <2 x double> %63, i64 1
  %67 = fadd double %65, %66
  %68 = call double @llvm.sqrt.f64(double %67)
  %69 = fmul double %32, %68
  br label %common.ret
}

Here is perhaps a simpler case that highlights what's going on:

julia> foo(y) = Base.Math._hypot(Base.tail(y));

julia> x = (1.0, 2.0, 3.0);

julia> y = (0.0, x...);

julia> @btime $(Base.Math._hypot)($x);
  5.540 ns (0 allocations: 0 bytes)

julia> @btime foo($y);
  25.037 ns (0 allocations: 0 bytes)

@giordano
Contributor

giordano commented Jan 9, 2023

OK, I can reproduce the slowdown on x86_64 Linux, but not on aarch64 Darwin. I reduced it a bit to:

julia> function test(x::NTuple{N,<:Number}) where {N}
           maxabs = maximum(abs, x)
           return sqrt(sum(abs2, x))
       end
test (generic function with 1 method)

julia> three(x) = test((x, x, x));

julia> four(x) = test((x, x, x, x));

julia> @btime three($(Ref(1.0))[]); @btime four($(Ref(1.0))[]);
  3.810 ns (0 allocations: 0 bytes)
  2.730 ns (0 allocations: 0 bytes)

@KristofferC
Member

If this uses AVX, it actually isn't very surprising that a Tuple of length four can be faster than one of length three, since it fits exactly into a SIMD lane. See e.g. JuliaArrays/StaticArrays.jl#512.

@giordano
Contributor

Yes, the code_native output for these functions is much shorter than for the full hypot example:

julia> code_native(three, (Float64,); debuginfo=:none)
        .text
        .file   "three"
        .globl  julia_three_1227                # -- Begin function julia_three_1227
        .p2align        4, 0x90
        .type   julia_three_1227,@function
julia_three_1227:                       # @julia_three_1227
        .cfi_startproc
# %bb.0:                                # %top
        vmulsd  %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vsqrtsd %xmm0, %xmm0, %xmm0
        retq
.Lfunc_end0:
        .size   julia_three_1227, .Lfunc_end0-julia_three_1227
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
julia> code_native(four, (Float64,); debuginfo=:none)
        .text
        .file   "four"
        .globl  julia_four_1326                 # -- Begin function julia_four_1326
        .p2align        4, 0x90
        .type   julia_four_1326,@function
julia_four_1326:                        # @julia_four_1326
        .cfi_startproc
# %bb.0:                                # %top
        vmulsd  %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vsqrtsd %xmm0, %xmm0, %xmm0
        retq
.Lfunc_end0:
        .size   julia_four_1326, .Lfunc_end0-julia_four_1326
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

This is indeed probably AVX magic, rather than a Julia bug.
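
One way to check whether the SIMD-friendly tuple length really is what matters here, independent of hypot's extra NaN/Inf and scaling logic, would be to time a zero-padded variant of the simplified reduction above (a sketch only; test2 and three_padded are names made up for this comparison, and timings are machine dependent):

julia> using BenchmarkTools

julia> test2(x) = sqrt(sum(abs2, x));  # same reduction as `test` above, without the unused maxabs

julia> three(x) = test2((x, x, x));

julia> three_padded(x) = test2((x, x, x, zero(x)));

julia> @btime three($(Ref(1.0))[]); @btime three_padded($(Ref(1.0))[]);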
