3-arg hypot slower than 4-arg #47917

sethaxen · 2022-12-17T08:12:38Z

For some reason calling hypot with 3 args on Julia v1.9.0-alpha1 is ~4x slower than calling it with 4 args:

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  8.309 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  25.234 ns (0 allocations: 0 bytes)

version info:

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 8 on 8 virtual cores

The text was updated successfully, but these errors were encountered:

sethaxen · 2022-12-17T08:23:52Z

But weirdly enough, this doesn't help:

julia> baz(x) = hypot(x, x, x, zero(x));

julia> @btime baz($(Ref(1.0))[]);
  29.569 ns (0 allocations: 0 bytes)

giordano · 2022-12-18T20:57:15Z

Can't reproduce

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> baz(x) = hypot(x, x, x, zero(x));

julia> @btime foo($(Ref(1.0))[]);
  7.125 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  5.250 ns (0 allocations: 0 bytes)

julia> @btime baz($(Ref(1.0))[]);
  7.125 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-DEV.102
Commit 63830a6f20* (2022-12-06 19:15 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

horst5000 · 2022-12-19T13:59:05Z

Reproducible with Julia 1.8.3 (with drastically worse results then noted above)

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  625.306 ns (13 allocations: 336 bytes)

julia> @btime bar($(Ref(1.0))[]);
  8.048 ns (0 allocations: 0 bytes)

giordano · 2022-12-19T14:04:04Z

Reproducible with Julia 1.8.3 (with drastically worse results then noted above)

Reproducible what? You're showing opposite timings compared to what was reported by OP (for you foo is much slower than bar). You're likely seeing #44336, fixed by #44357 in v1.9

horst5000 · 2022-12-19T14:08:37Z

Right, I was just testing whether the two are equal in performance. Need to test with 1.9 / 1.10-dev again

sethaxen · 2022-12-19T14:31:53Z

Reproducible with Julia 1.8.3 (with drastically worse results then noted above)

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  625.306 ns (13 allocations: 336 bytes)

julia> @btime bar($(Ref(1.0))[]);
  8.048 ns (0 allocations: 0 bytes)

This will not be reproducible pre-1.9 due to #44336, which was only fixed on 1.9. To be clear, I have only tested this on the 1.9 alpha, not the nightly.

jaakkor2 · 2022-12-19T20:31:21Z

Reproducible, but with smaller margin. However, hypot with unequal arguments shows 4 args being slower.

julia> using BenchmarkTools

julia> foo(x) = hypot(x, x, x, x);

julia> bar(x) = hypot(x, x, x);

julia> @btime foo($(Ref(1.0))[]);
  14.414 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(1.0))[]);
  15.916 ns (0 allocations: 0 bytes)

julia> @btime hypot($(Ref(1.0))[], $(Ref(2.0))[], $(Ref(3.0))[], $(Ref(4.0))[]);
  23.494 ns (0 allocations: 0 bytes)

julia> @btime hypot($(Ref(1.0))[], $(Ref(2.0))[], $(Ref(3.0))[]);
  15.415 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores

sethaxen · 2022-12-19T21:02:55Z

Reproducible, but with smaller margin. However, hypot with unequal arguments shows 4 args being slower.

It reappears though if we wrap the values in a struct. I originally encountered this in JuliaGeometry/Quaternions.jl#122.

julia> using BenchmarkTools

julia> struct Quaternion{T}
           a::T
           b::T
           c::T
           d::T
       end

julia> foo(x::Quaternion) = hypot(x.a, x.b, x.c, x.d);

julia> bar(x::Quaternion) = hypot(x.b, x.c, x.d);

julia> x = Quaternion(randn(4)...);

julia> @btime foo($x)
  8.466 ns (0 allocations: 0 bytes)
0.9158524958493485

julia> @btime bar($x)
  25.262 ns (0 allocations: 0 bytes)
0.9124982984827957

julia> @btime foo($(Ref(x))[])
  8.433 ns (0 allocations: 0 bytes)
0.9158524958493485

julia> @btime bar($(Ref(x))[])
  25.248 ns (0 allocations: 0 bytes)
0.9124982984827957

sethaxen · 2023-01-07T21:22:38Z

I just checked, and I'm still seeing the issues in #47917 (comment) and #47917 (comment) on both v1.9-beta2 and the latest nightly.

giordano · 2023-01-08T14:51:11Z

@sethaxen do you see any particular difference between code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none) and code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)?

sethaxen · 2023-01-09T08:41:19Z

@sethaxen do you see any particular difference between code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none) and code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)?

Here are the outputs. I'm not very good at reading these:

julia> code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none)
define double @julia__hypot_234([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0) #0 {
top:
  %1 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
  %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
  %3 = load double, double* %1, align 8
  %4 = call double @llvm.fabs.f64(double %3)
  %5 = bitcast double* %2 to <2 x double>*
  %6 = load <2 x double>, <2 x double>* %5, align 8
  %7 = extractelement <2 x double> %6, i64 0
  %8 = call double @llvm.fabs.f64(double %7)
  %9 = fsub double %4, %8
  %10 = bitcast double %9 to i64
  %11 = icmp slt i64 %10, 0
  %12 = select i1 %11, double %8, double %4
  %13 = fcmp ord double %3, %7
  %14 = select i1 %13, double %12, double %9
  %15 = extractelement <2 x double> %6, i64 1
  %16 = call double @llvm.fabs.f64(double %15)
  %17 = fsub double %14, %16
  %18 = bitcast double %17 to i64
  %19 = icmp slt i64 %18, 0
  %20 = select i1 %19, double %16, double %14
  %21 = fcmp ord double %14, %15
  %22 = select i1 %21, double %20, double %17
  %23 = fcmp ord double %22, 0.000000e+00
  br i1 %23, label %L61, label %L25

L25:                                              ; preds = %top
  %24 = fcmp uno double %3, 0.000000e+00
  %25 = fsub double %3, %3
  %26 = fcmp oeq double %25, 0.000000e+00
  %27 = or i1 %24, %26
  br i1 %27, label %L50, label %common.ret

L50:                                              ; preds = %L25
  %28 = fcmp uno double %7, 0.000000e+00
  %29 = fsub double %7, %7
  %30 = fcmp oeq double %29, 0.000000e+00
  %31 = or i1 %28, %30
  br i1 %31, label %L50.1, label %common.ret

L50.1:                                            ; preds = %L50
  %32 = fcmp uno double %15, 0.000000e+00
  %33 = fsub double %15, %15
  %34 = fcmp oeq double %33, 0.000000e+00
  %35 = or i1 %32, %34
  br i1 %35, label %L61, label %common.ret

common.ret:                                       ; preds = %L89, %L64, %L61, %L50.1, %L50, %L25
  %common.ret.op = phi double [ %52, %L89 ], [ %22, %L64 ], [ %22, %L61 ], [ 0x7FF0000000000000, %L25 ], [ 0x7FF0000000000000, %L50.1 ], [ 0x7FF0000000000000, %L50 ]
  ret double %common.ret.op

L61:                                              ; preds = %L50.1, %top
  %36 = fcmp une double %22, 0.000000e+00
  br i1 %36, label %L64, label %common.ret

L64:                                              ; preds = %L61
  %37 = fcmp uno double %22, 0.000000e+00
  %38 = fsub double %22, %22
  %39 = fcmp oeq double %38, 0.000000e+00
  %40 = or i1 %37, %39
  br i1 %40, label %L89, label %common.ret

L89:                                              ; preds = %L64
  %41 = fdiv double %3, %22
  %42 = fmul double %41, %41
  %43 = insertelement <2 x double> poison, double %22, i64 0
  %44 = shufflevector <2 x double> %43, <2 x double> poison, <2 x i32> zeroinitializer
  %45 = fdiv <2 x double> %6, %44
  %46 = fmul <2 x double> %45, %45
  %47 = extractelement <2 x double> %46, i64 0
  %48 = fadd double %42, %47
  %49 = extractelement <2 x double> %46, i64 1
  %50 = fadd double %48, %49
  %51 = call double @llvm.sqrt.f64(double %50)
  %52 = fmul double %22, %51
  br label %common.ret
}

julia> code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)
define double @julia__hypot_270([4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %0) #0 {
top:
  %1 = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 2
  %2 = bitcast [4 x double]* %0 to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8
  %4 = call <2 x double> @llvm.fabs.v2f64(<2 x double> %3)
  %5 = extractelement <2 x double> %4, i64 0
  %6 = extractelement <2 x double> %4, i64 1
  %7 = fsub double %5, %6
  %8 = bitcast double %7 to i64
  %9 = icmp slt i64 %8, 0
  %10 = select i1 %9, double %6, double %5
  %11 = extractelement <2 x double> %3, i64 0
  %12 = extractelement <2 x double> %3, i64 1
  %13 = fcmp ord double %11, %12
  %14 = select i1 %13, double %10, double %7
  # diff
  %15 = bitcast double* %1 to <2 x double>*
  %16 = load <2 x double>, <2 x double>* %15, align 8
  %17 = extractelement <2 x double> %16, i64 0
  %18 = call double @llvm.fabs.f64(double %17)
  %19 = fsub double %14, %18
  %20 = bitcast double %19 to i64
  %21 = icmp slt i64 %20, 0
  %22 = select i1 %21, double %18, double %14
  %23 = fcmp ord double %14, %17
  %24 = select i1 %23, double %22, double %19
  %25 = extractelement <2 x double> %16, i64 1
  %26 = call double @llvm.fabs.f64(double %25)
  %27 = fsub double %24, %26
  %28 = bitcast double %27 to i64
  %29 = icmp slt i64 %28, 0
  %30 = select i1 %29, double %26, double %24
  %31 = fcmp ord double %24, %25
  %32 = select i1 %31, double %30, double %27
  %33 = fcmp ord double %32, 0.000000e+00
  br i1 %33, label %L71, label %L35

L35:                                              ; preds = %top
  %34 = fcmp uno double %11, 0.000000e+00
  %35 = fsub double %11, %11
  %36 = fcmp oeq double %35, 0.000000e+00
  %37 = or i1 %34, %36
  br i1 %37, label %L60, label %common.ret

L60:                                              ; preds = %L35
  %38 = fcmp uno double %12, 0.000000e+00
  %39 = fsub double %12, %12
  %40 = fcmp oeq double %39, 0.000000e+00
  %41 = or i1 %38, %40
  br i1 %41, label %L60.1, label %common.ret

L60.1:                                            ; preds = %L60
  %42 = fcmp uno double %17, 0.000000e+00
  %43 = fsub double %17, %17
  %44 = fcmp oeq double %43, 0.000000e+00
  %45 = or i1 %42, %44
  br i1 %45, label %L60.2, label %common.ret

L60.2:                                            ; preds = %L60.1
  %46 = fcmp uno double %25, 0.000000e+00
  %47 = fsub double %25, %25
  %48 = fcmp oeq double %47, 0.000000e+00
  %49 = or i1 %46, %48
  br i1 %49, label %L71, label %common.ret

common.ret:                                       ; preds = %L103, %L74, %L71, %L60.2, %L60.1, %L60, %L35
  %common.ret.op = phi double [ %69, %L103 ], [ %32, %L74 ], [ %32, %L71 ], [ 0x7FF0000000000000, %L35 ], [ 0x7FF0000000000000, %L60.2 ], [ 0x7FF0000000000000, %L60.1 ], [ 0x7FF0000000000000, %L60 ]
  ret double %common.ret.op

L71:                                              ; preds = %L60.2, %top
  %50 = fcmp une double %32, 0.000000e+00
  br i1 %50, label %L74, label %common.ret

L74:                                              ; preds = %L71
  %51 = fcmp uno double %32, 0.000000e+00
  %52 = fsub double %32, %32
  %53 = fcmp oeq double %52, 0.000000e+00
  %54 = or i1 %51, %53
  br i1 %54, label %L103, label %common.ret

L103:                                             ; preds = %L74
  %55 = insertelement <2 x double> poison, double %32, i64 0
  %56 = shufflevector <2 x double> %55, <2 x double> poison, <2 x i32> zeroinitializer
  %57 = fdiv <2 x double> %3, %56
  %58 = fmul <2 x double> %57, %57
  %59 = extractelement <2 x double> %58, i64 0
  %60 = extractelement <2 x double> %58, i64 1
  %61 = fadd double %59, %60
  %62 = fdiv <2 x double> %16, %56
  %63 = fmul <2 x double> %62, %62
  %64 = extractelement <2 x double> %63, i64 0
  %65 = fadd double %61, %64
  %66 = extractelement <2 x double> %63, i64 1
  %67 = fadd double %65, %66
  %68 = call double @llvm.sqrt.f64(double %67)
  %69 = fmul double %32, %68
  br label %common.ret
}

Here's maybe a simpler case that highlights what's going on:

julia> foo(y) = Base.Math._hypot(Base.tail(y));

julia> x = (1.0, 2.0, 3.0);

julia> y = (0.0, x...);

julia> @btime $(Base.Math._hypot)($x);
  5.540 ns (0 allocations: 0 bytes)

julia> @btime foo($y);
  25.037 ns (0 allocations: 0 bytes)

giordano · 2023-01-09T22:30:54Z

Ok, I can reproduce the slowdown on x86_64 linux, but not aarch64 darwin. I reduced it a little bit to

julia> function test(x::NTuple{N,<:Number}) where {N}
           maxabs = maximum(abs, x)
           return sqrt(sum(abs2, x))
       end
test (generic function with 1 method)

julia> three(x) = test((x, x, x));

julia> four(x) = test((x, x, x, x));

julia> @btime three($(Ref(1.0))[]); @btime four($(Ref(1.0))[]);
  3.810 ns (0 allocations: 0 bytes)
  2.730 ns (0 allocations: 0 bytes)

KristofferC · 2023-01-10T06:03:04Z

If this uses AVX it actually isn't very surprising that a Tuple of length four can be faster than three since it fits exactly into a simd lane. See eg JuliaArrays/StaticArrays.jl#512

giordano · 2023-01-10T09:40:10Z

Yes, code_native on these functions is much shorter than the full hypot example:

julia> code_native(three, (Float64,); debuginfo=:none)

        .text
        .file   "three"
        .globl  julia_three_1227                # -- Begin function julia_three_1227
        .p2align        4, 0x90
        .type   julia_three_1227,@function
julia_three_1227:                       # @julia_three_1227
        .cfi_startproc
# %bb.0:                                # %top
        vmulsd  %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vsqrtsd %xmm0, %xmm0, %xmm0
        retq
.Lfunc_end0:
        .size   julia_three_1227, .Lfunc_end0-julia_three_1227
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

julia> code_native(four, (Float64,); debuginfo=:none)

        .text
        .file   "four"
        .globl  julia_four_1326                 # -- Begin function julia_four_1326
        .p2align        4, 0x90
        .type   julia_four_1326,@function
julia_four_1326:                        # @julia_four_1326
        .cfi_startproc
# %bb.0:                                # %top
        vmulsd  %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vsqrtsd %xmm0, %xmm0, %xmm0
        retq
.Lfunc_end0:
        .size   julia_four_1326, .Lfunc_end0-julia_four_1326
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

This is indeed probably AVX magic, rather than a julia bug.

brenhinkeller added performance Must go faster maths Mathematical functions labels Dec 18, 2022

giordano mentioned this issue Jan 5, 2023

faster hypot(::IEEEFloat...) #48130

Merged

sethaxen mentioned this issue Jan 7, 2023

Make abs, abs_imag, inv, and / resistant to under/overflow JuliaGeometry/Quaternions.jl#122

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3-arg hypot slower than 4-arg #47917

3-arg hypot slower than 4-arg #47917

sethaxen commented Dec 17, 2022 •

edited

Loading

sethaxen commented Dec 17, 2022

giordano commented Dec 18, 2022 •

edited

Loading

horst5000 commented Dec 19, 2022

giordano commented Dec 19, 2022 •

edited

Loading

horst5000 commented Dec 19, 2022

sethaxen commented Dec 19, 2022

jaakkor2 commented Dec 19, 2022

sethaxen commented Dec 19, 2022

sethaxen commented Jan 7, 2023

giordano commented Jan 8, 2023

sethaxen commented Jan 9, 2023

giordano commented Jan 9, 2023

KristofferC commented Jan 10, 2023

giordano commented Jan 10, 2023

3-arg hypot slower than 4-arg #47917

3-arg hypot slower than 4-arg #47917

Comments

sethaxen commented Dec 17, 2022 • edited Loading

sethaxen commented Dec 17, 2022

giordano commented Dec 18, 2022 • edited Loading

horst5000 commented Dec 19, 2022

giordano commented Dec 19, 2022 • edited Loading

horst5000 commented Dec 19, 2022

sethaxen commented Dec 19, 2022

jaakkor2 commented Dec 19, 2022

sethaxen commented Dec 19, 2022

sethaxen commented Jan 7, 2023

giordano commented Jan 8, 2023

sethaxen commented Jan 9, 2023

giordano commented Jan 9, 2023

KristofferC commented Jan 10, 2023

giordano commented Jan 10, 2023

sethaxen commented Dec 17, 2022 •

edited

Loading

giordano commented Dec 18, 2022 •

edited

Loading

giordano commented Dec 19, 2022 •

edited

Loading