performance regression in sum(a) #16185

Closed · stevengj opened this issue May 3, 2016 · 13 comments · Fixed by #16217

Labels: performance (Must go faster) · regression (Regression in behavior compared to a previous version)
Milestone: 0.5.0
@stevengj (Member) commented May 3, 2016

It seems like sum has gotten a lot slower. Back when I first implemented pairwise summation in #4039, it was essentially as fast (within a few %) as naive summation. In #5205, @lindahua unrolled the loop to make it significantly faster than a naive summation loop. Now, it seems to be significantly slower than naive code:

function mysum(a)
    s = a[1]
    for i = 2:length(a)
        @inbounds s += a[i]
    end
    return s
end

a = rand(10^7);
@elapsed(sum(a)) / @elapsed(mysum(a))

gives a ratio of about 1.6 on my machine with the latest Julia 0.5, whereas the ratio is 0.8 with Julia 0.4.
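For reference, pairwise summation recursively splits the index range, sums each half, and falls back to a naive loop below a block-size cutoff. A minimal sketch with hypothetical helper names (not Base's actual mapreduce_pairwise_impl):

```julia
# Sketch of pairwise summation: recurse on halves of the range and
# switch to a plain accumulation loop once the range is small. The
# recursion depth is O(log n), which bounds the rounding error growth.
function pairwise_sum(a, lo=1, hi=length(a), blocksize=1024)
    if hi - lo < blocksize
        s = zero(eltype(a))
        @inbounds for i = lo:hi
            s += a[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1   # midpoint without overflow
        return pairwise_sum(a, lo, mid, blocksize) +
               pairwise_sum(a, mid + 1, hi, blocksize)
    end
end
```

The blocksize cutoff is what keeps the recursion overhead negligible relative to a naive loop; the regression discussed here is about what happens inside and around those recursive calls, not the algorithm itself.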

@stevengj added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels on May 3, 2016
@KristofferC (Member) commented May 3, 2016

The generated code can be inspected with:

0.5:
@code_llvm Base.mapreduce_pairwise_impl(identity, +, a, 1, length(a), Base.sum_pairwise_blocksize(+))

0.4 (much simpler generated code):
@code_llvm Base.mapreduce_pairwise_impl(IdFun(), AddFun(), a, 1, length(a), Base.sum_pairwise_blocksize(IdFun()))

@KristofferC (Member) commented:

Smaller repro:

# note: f and op are only passed through — they are never called
function g(f, op, A, i)
    if i == 0
        return 0.0
    else
        return g(f, op, A, i-1)
    end
end

0.4:

julia> @code_llvm g(IdFun(), AddFun(), a, 5)

define double @julia_g_21672(%jl_value_t*, i64) {
top:
  %2 = icmp eq i64 %1, 0
  br i1 %2, label %if, label %L

if:                                               ; preds = %top
  ret double 0.000000e+00

L:                                                ; preds = %top
  %3 = add i64 %1, -1
  %4 = call double @julia_g_21672(%jl_value_t* %0, i64 %3)
  ret double %4
}

0.5:

julia> @code_llvm g(identity, +, a, 5)

define double @julia_g_54844(%jl_value_t*, %jl_value_t*, %jl_value_t*, i64) #0 {
top:
  %4 = call %jl_value_t*** @jl_get_ptls_states()
  %5 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 0
  %6 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 3
  %7 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %7, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %8 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = bitcast [8 x %jl_value_t*]* %5 to i64*
  store i64 12, i64* %12, align 8
  %13 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 1
  %14 = bitcast %jl_value_t*** %4 to i64*
  %15 = load i64, i64* %14, align 8
  %16 = bitcast %jl_value_t** %13 to i64*
  store i64 %15, i64* %16, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %4, align 8
  %17 = icmp eq i64 %3, 0
  br i1 %17, label %if, label %L

L:                                                ; preds = %top
  %18 = add i64 %3, -1
  store %jl_value_t* inttoptr (i64 140694599406320 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* %0, %jl_value_t** %8, align 8
  store %jl_value_t* %1, %jl_value_t** %9, align 8
  store %jl_value_t* %2, %jl_value_t** %10, align 8
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  store %jl_value_t* %19, %jl_value_t** %11, align 8
  %20 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 5)
  store %jl_value_t* %20, %jl_value_t** %7, align 8
  %21 = bitcast %jl_value_t* %20 to double*
  %22 = load double, double* %21, align 16
  %23 = load i64, i64* %16, align 8
  store i64 %23, i64* %14, align 8
  ret double %22

if:                                               ; preds = %top
  %24 = load i64, i64* %16, align 8
  store i64 %24, i64* %14, align 8
  ret double 0.000000e+00
}

@yuyichao (Contributor) commented May 3, 2016

This seems to be the function specialization heuristic kicking in.
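A hedged sketch of that heuristic (illustrative names, not Base code): when a function-typed argument is only forwarded and never called in the method body, Julia may decline to specialize the method on it, so the recursive call goes through jl_apply_generic with boxed arguments, as in the IR above. Annotating the argument with a type parameter forces a specialization per function type:

```julia
# `f` is only forwarded, never called — the specialization heuristic can
# leave this method un-specialized on f, producing the slow boxed path.
passthrough(f, n) = n == 0 ? 0.0 : passthrough(f, n - 1)

# The type parameter F forces one specialization per function type,
# recovering statically dispatched recursive calls.
passthrough_spec(f::F, n) where {F<:Function} = n == 0 ? 0.0 : passthrough_spec(f, n - 1)
```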

@ViralBShah ViralBShah added this to the 0.5.0 milestone May 4, 2016
vtjnash added a commit that referenced this issue May 5, 2016
this is possible now that Functors are not needed for performance
fixes #16185
@simonbyrne (Contributor) commented:

I see an even larger speedup (2.4x for Float64, 3x for Float32) just by using @simd:

function simd_sum(a)
    s = zero(eltype(a))
    @simd for x in a
        @inbounds s += x
    end
    return s
end

@StefanKarpinski (Member) commented:

#16217 changes the summation order from pairwise to linear, avoiding the underlying performance regression, which involves calling out to BLAS. However, we chose pairwise summation for its accuracy, so merging #16217 as-is would negate that. We should probably not bother with calling BLAS and just do the summation in pure Julia, probably using @simd.
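As an illustration of the accuracy trade-off (an assumed example, not from the thread): a naive left-to-right Float32 accumulator stops changing once it reaches 2^24, because adding 1.0f0 no longer perturbs the significand, whereas pairwise summation keeps the rounding error growing like O(log n) rather than O(n):

```julia
# Naive left-to-right accumulation in Float32: once s reaches 2^24
# (16777216.0f0), s + 1.0f0 rounds back to s, so the sum stalls there
# no matter how many more terms are added.
function naive_ones(n)
    s = 0.0f0
    for _ in 1:n
        s += 1.0f0
    end
    return s
end

naive_ones(10^8)  # 1.6777216f7, far from the true sum 1.0f8
```

A pairwise sum of the same data stays close to 1.0f8, since each element is only involved in O(log n) additions.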

@simonbyrne (Contributor) commented:
Actually, #16217 does still use pairwise summation, it mostly just simplifies the structure (and avoids one level of function calls).

@StefanKarpinski (Member) commented:
OK, that was not clear. Jameson said that it changed the summation order, however. @vtjnash, can you clarify how it changes the order?

@simonbyrne (Contributor) commented:
The BLAS calls are only for abs and abs2, it would seem.

vtjnash added a commit that referenced this issue Jun 2, 2016
this is possible now that Functors are not needed for performance
fixes #16185
@vtjnash (Member) commented Jun 2, 2016

OK, it looks like the BLAS calls were not buying us any performance, so I've just used the native implementation for everything; it now gives the same answers, just with a different implementation.

@stevengj (Member, Author) commented Jun 2, 2016

@lindahua's unrolling of the loop is no longer useful?

@simonbyrne (Contributor) commented:
I think the unrolling is handled by @simd, when the compiler decides it's useful.

@simonster (Member) commented:
@vtjnash I don't care too much, but did you test with Complex? I'm not surprised we can match dot (I think we were close before, and LLVM 3.7 presumably generates better code) but I would be pleasantly surprised if we're actually matching dotc.

@stevengj I got rid of the manually unrolled loop a long time ago when @simd became faster.

@vtjnash (Member) commented Jun 3, 2016

@simonster I forgot to test that, thanks. It seems we beat dotc as well, because we can avoid the overhead of calling out to BLAS.
