performance regression in sum(a) #16185
Generated code can be seen below (0.4 produces much simpler generated code than 0.5).

Smaller repro:

```julia
function g(f, op, A, i)
    if i == 0
        return 0.0
    else
        return g(f, op, A, i-1)
    end
end
```

0.4:

```
julia> @code_llvm g(IdFun(), AddFun(), a, 5)

define double @julia_g_21672(%jl_value_t*, i64) {
top:
  %2 = icmp eq i64 %1, 0
  br i1 %2, label %if, label %L

if:                                               ; preds = %top
  ret double 0.000000e+00

L:                                                ; preds = %top
  %3 = add i64 %1, -1
  %4 = call double @julia_g_21672(%jl_value_t* %0, i64 %3)
  ret double %4
}
```

0.5:

```
julia> @code_llvm g(identity, +, a, 5)

define double @julia_g_54844(%jl_value_t*, %jl_value_t*, %jl_value_t*, i64) #0 {
top:
  %4 = call %jl_value_t*** @jl_get_ptls_states()
  %5 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 0
  %6 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 3
  %7 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %7, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %8 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = bitcast [8 x %jl_value_t*]* %5 to i64*
  store i64 12, i64* %12, align 8
  %13 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 1
  %14 = bitcast %jl_value_t*** %4 to i64*
  %15 = load i64, i64* %14, align 8
  %16 = bitcast %jl_value_t** %13 to i64*
  store i64 %15, i64* %16, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %4, align 8
  %17 = icmp eq i64 %3, 0
  br i1 %17, label %if, label %L

L:                                                ; preds = %top
  %18 = add i64 %3, -1
  store %jl_value_t* inttoptr (i64 140694599406320 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* %0, %jl_value_t** %8, align 8
  store %jl_value_t* %1, %jl_value_t** %9, align 8
  store %jl_value_t* %2, %jl_value_t** %10, align 8
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  store %jl_value_t* %19, %jl_value_t** %11, align 8
  %20 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 5)
  store %jl_value_t* %20, %jl_value_t** %7, align 8
  %21 = bitcast %jl_value_t* %20 to double*
  %22 = load double, double* %21, align 16
  %23 = load i64, i64* %16, align 8
  store i64 %23, i64* %14, align 8
  ret double %22

if:                                               ; preds = %top
  %24 = load i64, i64* %16, align 8
  store i64 %24, i64* %14, align 8
  ret double 0.000000e+00
}
```
This seems to be the function specialization heuristic kicking in.
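As a side note, a common workaround for this specialization heuristic (a sketch, not taken from this thread; the name `g2` is hypothetical, and the `where` syntax is modern Julia rather than 0.5-era) is to constrain the function-typed arguments with type parameters, which forces the compiler to specialize on each concrete function type:

```julia
# Sketch: the type parameters F and O force specialization on the
# function-typed arguments, so each (f, op) pair compiles to direct
# calls instead of the dynamic jl_apply_generic dispatch in the 0.5 IR.
function g2(f::F, op::O, A, i) where {F,O}
    if i == 0
        return 0.0
    else
        return g2(f, op, A, i - 1)
    end
end
```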
this is possible now that Functors are not needed for performance fixes #16185
I see an even faster speedup (2.4x) for:

```julia
function simd_sum(a)
    s = zero(eltype(a))
    @simd for x in a
        @inbounds s += x
    end
    return s
end
```
#16217 changes the summation order from pairwise to linear, avoiding the underlying performance regression, which involves calling out to BLAS. However, we chose pairwise summation for its accuracy, so merging #16217 as-is would negate that. We should probably not bother with calling BLAS and just do the summation in pure Julia, probably using
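For reference, pairwise summation itself is straightforward to write in pure Julia. A minimal sketch (the name `pairwise_sum` and the base-case block size are my own choices, not Base's actual implementation):

```julia
# Minimal pairwise summation sketch: recurse down to small blocks, sum
# each block linearly, and combine the halves pairwise.  The rounding
# error grows as O(log n) rather than O(n) for naive left-to-right
# summation, which is why pairwise order was chosen for accuracy.
function pairwise_sum(a, lo::Int=1, hi::Int=length(a))
    if hi - lo < 128        # base-case block size (an arbitrary choice here)
        s = zero(eltype(a))
        @inbounds for i in lo:hi
            s += a[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1   # split the range and recurse on each half
        return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid + 1, hi)
    end
end
```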
Actually, #16217 does still use pairwise summation, it mostly just simplifies the structure (and avoids one level of function calls). |
OK, that was not clear. Jameson said that it changed the summation order, however. @vtjnash, can you clarify how it changes the order?
The BLAS calls are only for
OK, looks like the BLAS calls were not buying us any performance, so I've just used the native implementation for everything, which now gives us the same answers, just with a different implementation.
@lindahua's unrolling of the loop is no longer useful? |
I think the unrolling is handled by |
@vtjnash I don't care too much, but did you test with

@stevengj I got rid of the manually unrolled loop a long time ago when
@simonster I forgot to test that, thanks. It seems like we also beat dotc, because we can avoid the overhead of calling out to BLAS.
It seems like `sum` has gotten a lot slower. Back when I first implemented pairwise summation in #4039, it was essentially as fast (within a few %) as naive summation. In #5205, @lindahua unrolled the loop to make it significantly faster than a naive summation loop. Now, it seems to be significantly slower than naive code: it gives a ratio of about 1.6 on my machine with the latest Julia 0.5, whereas the ratio is 0.8 with Julia 0.4.