-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
3-arg hypot slower than 4-arg #47917
Comments
But weirdly enough, this doesn't help: julia> baz(x) = hypot(x, x, x, zero(x));
julia> @btime baz($(Ref(1.0))[]);
29.569 ns (0 allocations: 0 bytes) |
Can't reproduce
|
Reproducible with Julia 1.8.3 (with drastically worse results then noted above)
|
Right, I was just testing whether the two are equal in performance. Need to test with 1.9 / 1.10-dev again |
This will not be reproducible pre-1.9 due to #44336, which was only fixed on 1.9. To be clear, I have only tested this on the 1.9 alpha, not the nightly. |
Reproducible, but with smaller margin. However,
|
It reappears though if we wrap the values in a struct. I originally encountered this in JuliaGeometry/Quaternions.jl#122. julia> using BenchmarkTools
julia> struct Quaternion{T}
a::T
b::T
c::T
d::T
end
julia> foo(x::Quaternion) = hypot(x.a, x.b, x.c, x.d);
julia> bar(x::Quaternion) = hypot(x.b, x.c, x.d);
julia> x = Quaternion(randn(4)...);
julia> @btime foo($x)
8.466 ns (0 allocations: 0 bytes)
0.9158524958493485
julia> @btime bar($x)
25.262 ns (0 allocations: 0 bytes)
0.9124982984827957
julia> @btime foo($(Ref(x))[])
8.433 ns (0 allocations: 0 bytes)
0.9158524958493485
julia> @btime bar($(Ref(x))[])
25.248 ns (0 allocations: 0 bytes)
0.9124982984827957 |
I just checked, and I'm still seeing the issues in #47917 (comment) and #47917 (comment) on both v1.9-beta2 and the latest nightly. |
@sethaxen do you see any particular difference between |
Here are the outputs. I'm not very good at reading these: julia> code_llvm(Base.Math._hypot, (NTuple{3, Float64},); debuginfo=:none)
define double @julia__hypot_234([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0) #0 {
top:
%1 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
%2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
%3 = load double, double* %1, align 8
%4 = call double @llvm.fabs.f64(double %3)
%5 = bitcast double* %2 to <2 x double>*
%6 = load <2 x double>, <2 x double>* %5, align 8
%7 = extractelement <2 x double> %6, i64 0
%8 = call double @llvm.fabs.f64(double %7)
%9 = fsub double %4, %8
%10 = bitcast double %9 to i64
%11 = icmp slt i64 %10, 0
%12 = select i1 %11, double %8, double %4
%13 = fcmp ord double %3, %7
%14 = select i1 %13, double %12, double %9
%15 = extractelement <2 x double> %6, i64 1
%16 = call double @llvm.fabs.f64(double %15)
%17 = fsub double %14, %16
%18 = bitcast double %17 to i64
%19 = icmp slt i64 %18, 0
%20 = select i1 %19, double %16, double %14
%21 = fcmp ord double %14, %15
%22 = select i1 %21, double %20, double %17
%23 = fcmp ord double %22, 0.000000e+00
br i1 %23, label %L61, label %L25
L25: ; preds = %top
%24 = fcmp uno double %3, 0.000000e+00
%25 = fsub double %3, %3
%26 = fcmp oeq double %25, 0.000000e+00
%27 = or i1 %24, %26
br i1 %27, label %L50, label %common.ret
L50: ; preds = %L25
%28 = fcmp uno double %7, 0.000000e+00
%29 = fsub double %7, %7
%30 = fcmp oeq double %29, 0.000000e+00
%31 = or i1 %28, %30
br i1 %31, label %L50.1, label %common.ret
L50.1: ; preds = %L50
%32 = fcmp uno double %15, 0.000000e+00
%33 = fsub double %15, %15
%34 = fcmp oeq double %33, 0.000000e+00
%35 = or i1 %32, %34
br i1 %35, label %L61, label %common.ret
common.ret: ; preds = %L89, %L64, %L61, %L50.1, %L50, %L25
%common.ret.op = phi double [ %52, %L89 ], [ %22, %L64 ], [ %22, %L61 ], [ 0x7FF0000000000000, %L25 ], [ 0x7FF0000000000000, %L50.1 ], [ 0x7FF0000000000000, %L50 ]
ret double %common.ret.op
L61: ; preds = %L50.1, %top
%36 = fcmp une double %22, 0.000000e+00
br i1 %36, label %L64, label %common.ret
L64: ; preds = %L61
%37 = fcmp uno double %22, 0.000000e+00
%38 = fsub double %22, %22
%39 = fcmp oeq double %38, 0.000000e+00
%40 = or i1 %37, %39
br i1 %40, label %L89, label %common.ret
L89: ; preds = %L64
%41 = fdiv double %3, %22
%42 = fmul double %41, %41
%43 = insertelement <2 x double> poison, double %22, i64 0
%44 = shufflevector <2 x double> %43, <2 x double> poison, <2 x i32> zeroinitializer
%45 = fdiv <2 x double> %6, %44
%46 = fmul <2 x double> %45, %45
%47 = extractelement <2 x double> %46, i64 0
%48 = fadd double %42, %47
%49 = extractelement <2 x double> %46, i64 1
%50 = fadd double %48, %49
%51 = call double @llvm.sqrt.f64(double %50)
%52 = fmul double %22, %51
br label %common.ret
} julia> code_llvm(Base.Math._hypot, (NTuple{4, Float64},); debuginfo=:none)
define double @julia__hypot_270([4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %0) #0 {
top:
%1 = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 2
%2 = bitcast [4 x double]* %0 to <2 x double>*
%3 = load <2 x double>, <2 x double>* %2, align 8
%4 = call <2 x double> @llvm.fabs.v2f64(<2 x double> %3)
%5 = extractelement <2 x double> %4, i64 0
%6 = extractelement <2 x double> %4, i64 1
%7 = fsub double %5, %6
%8 = bitcast double %7 to i64
%9 = icmp slt i64 %8, 0
%10 = select i1 %9, double %6, double %5
%11 = extractelement <2 x double> %3, i64 0
%12 = extractelement <2 x double> %3, i64 1
%13 = fcmp ord double %11, %12
%14 = select i1 %13, double %10, double %7
# diff
%15 = bitcast double* %1 to <2 x double>*
%16 = load <2 x double>, <2 x double>* %15, align 8
%17 = extractelement <2 x double> %16, i64 0
%18 = call double @llvm.fabs.f64(double %17)
%19 = fsub double %14, %18
%20 = bitcast double %19 to i64
%21 = icmp slt i64 %20, 0
%22 = select i1 %21, double %18, double %14
%23 = fcmp ord double %14, %17
%24 = select i1 %23, double %22, double %19
%25 = extractelement <2 x double> %16, i64 1
%26 = call double @llvm.fabs.f64(double %25)
%27 = fsub double %24, %26
%28 = bitcast double %27 to i64
%29 = icmp slt i64 %28, 0
%30 = select i1 %29, double %26, double %24
%31 = fcmp ord double %24, %25
%32 = select i1 %31, double %30, double %27
%33 = fcmp ord double %32, 0.000000e+00
br i1 %33, label %L71, label %L35
L35: ; preds = %top
%34 = fcmp uno double %11, 0.000000e+00
%35 = fsub double %11, %11
%36 = fcmp oeq double %35, 0.000000e+00
%37 = or i1 %34, %36
br i1 %37, label %L60, label %common.ret
L60: ; preds = %L35
%38 = fcmp uno double %12, 0.000000e+00
%39 = fsub double %12, %12
%40 = fcmp oeq double %39, 0.000000e+00
%41 = or i1 %38, %40
br i1 %41, label %L60.1, label %common.ret
L60.1: ; preds = %L60
%42 = fcmp uno double %17, 0.000000e+00
%43 = fsub double %17, %17
%44 = fcmp oeq double %43, 0.000000e+00
%45 = or i1 %42, %44
br i1 %45, label %L60.2, label %common.ret
L60.2: ; preds = %L60.1
%46 = fcmp uno double %25, 0.000000e+00
%47 = fsub double %25, %25
%48 = fcmp oeq double %47, 0.000000e+00
%49 = or i1 %46, %48
br i1 %49, label %L71, label %common.ret
common.ret: ; preds = %L103, %L74, %L71, %L60.2, %L60.1, %L60, %L35
%common.ret.op = phi double [ %69, %L103 ], [ %32, %L74 ], [ %32, %L71 ], [ 0x7FF0000000000000, %L35 ], [ 0x7FF0000000000000, %L60.2 ], [ 0x7FF0000000000000, %L60.1 ], [ 0x7FF0000000000000, %L60 ]
ret double %common.ret.op
L71: ; preds = %L60.2, %top
%50 = fcmp une double %32, 0.000000e+00
br i1 %50, label %L74, label %common.ret
L74: ; preds = %L71
%51 = fcmp uno double %32, 0.000000e+00
%52 = fsub double %32, %32
%53 = fcmp oeq double %52, 0.000000e+00
%54 = or i1 %51, %53
br i1 %54, label %L103, label %common.ret
L103: ; preds = %L74
%55 = insertelement <2 x double> poison, double %32, i64 0
%56 = shufflevector <2 x double> %55, <2 x double> poison, <2 x i32> zeroinitializer
%57 = fdiv <2 x double> %3, %56
%58 = fmul <2 x double> %57, %57
%59 = extractelement <2 x double> %58, i64 0
%60 = extractelement <2 x double> %58, i64 1
%61 = fadd double %59, %60
%62 = fdiv <2 x double> %16, %56
%63 = fmul <2 x double> %62, %62
%64 = extractelement <2 x double> %63, i64 0
%65 = fadd double %61, %64
%66 = extractelement <2 x double> %63, i64 1
%67 = fadd double %65, %66
%68 = call double @llvm.sqrt.f64(double %67)
%69 = fmul double %32, %68
br label %common.ret
} Here's maybe a simpler case that highlights what's going on: julia> foo(y) = Base.Math._hypot(Base.tail(y));
julia> x = (1.0, 2.0, 3.0);
julia> y = (0.0, x...);
julia> @btime $(Base.Math._hypot)($x);
5.540 ns (0 allocations: 0 bytes)
julia> @btime foo($y);
25.037 ns (0 allocations: 0 bytes) |
Ok, I can reproduce the slowdown on x86_64 linux, but not aarch64 darwin. I reduced it a little bit to julia> function test(x::NTuple{N,<:Number}) where {N}
maxabs = maximum(abs, x)
return sqrt(sum(abs2, x))
end
test (generic function with 1 method)
julia> three(x) = test((x, x, x));
julia> four(x) = test((x, x, x, x));
julia> @btime three($(Ref(1.0))[]); @btime four($(Ref(1.0))[]);
3.810 ns (0 allocations: 0 bytes)
2.730 ns (0 allocations: 0 bytes) |
If this uses AVX it actually isn't very surprising that a Tuple of length four can be faster than three since it fits exactly into a simd lane. See eg JuliaArrays/StaticArrays.jl#512 |
Yes, julia> code_native(three, (Float64,); debuginfo=:none) .text
.file "three"
.globl julia_three_1227 # -- Begin function julia_three_1227
.p2align 4, 0x90
.type julia_three_1227,@function
julia_three_1227: # @julia_three_1227
.cfi_startproc
# %bb.0: # %top
vmulsd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm0, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
retq
.Lfunc_end0:
.size julia_three_1227, .Lfunc_end0-julia_three_1227
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits julia> code_native(four, (Float64,); debuginfo=:none) .text
.file "four"
.globl julia_four_1326 # -- Begin function julia_four_1326
.p2align 4, 0x90
.type julia_four_1326,@function
julia_four_1326: # @julia_four_1326
.cfi_startproc
# %bb.0: # %top
vmulsd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm0, %xmm1
vaddsd %xmm1, %xmm0, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
retq
.Lfunc_end0:
.size julia_four_1326, .Lfunc_end0-julia_four_1326
.cfi_endproc
# -- End function
.section ".note.GNU-stack","",@progbits This is indeed probably AVX magic, rather than a julia bug. |
For some reason calling
hypot
with 3 args on Julia v1.9.0-alpha1 is ~4x slower than calling it with 4 args:version info:
The text was updated successfully, but these errors were encountered: