performance regression in sum(a) #16185

Closed · stevengj opened this issue May 3, 2016 · 13 comments · Fixed by #16217

Labels: performance (Must go faster) · regression (Regression in behavior compared to a previous version)
Milestone: 0.5.0
@stevengj (Member) commented May 3, 2016

It seems like sum has gotten a lot slower. Back when I first implemented pairwise summation in #4039, it was essentially as fast (within a few %) as naive summation. In #5205, @lindahua unrolled the loop to make it significantly faster than a naive summation loop. Now, it seems to be significantly slower than naive code:

function mysum(a)
    s = a[1]
    for i = 2:length(a)
        @inbounds s += a[i]
    end
    return s
end

a = rand(10^7);
@elapsed(sum(a)) / @elapsed(mysum(a))

gives a ratio of about 1.6 on my machine with the latest Julia 0.5, whereas the ratio is 0.8 with Julia 0.4.
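For reference, pairwise summation recursively splits the index range, sums each half, and falls back to a naive loop below a block-size cutoff. A minimal sketch with hypothetical helper names (not Base's actual mapreduce_pairwise_impl):

```julia
# Sketch of pairwise summation: recurse on halves of the range and
# switch to a plain accumulation loop once the range is small. The
# recursion depth is O(log n), which bounds the rounding error growth.
function pairwise_sum(a, lo=1, hi=length(a), blocksize=1024)
    if hi - lo < blocksize
        s = zero(eltype(a))
        @inbounds for i = lo:hi
            s += a[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1   # midpoint without overflow
        return pairwise_sum(a, lo, mid, blocksize) +
               pairwise_sum(a, mid + 1, hi, blocksize)
    end
end
```

The blocksize cutoff is what keeps the recursion overhead negligible relative to a naive loop; the regression discussed here is about what happens inside and around those recursive calls, not the algorithm itself.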

@stevengj added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels on May 3, 2016
@KristofferC (Member) commented May 3, 2016

The generated code can be inspected with:

0.5:
@code_llvm Base.mapreduce_pairwise_impl(identity, +, a, 1, length(a), Base.sum_pairwise_blocksize(+))

0.4 (much simpler generated code):
@code_llvm Base.mapreduce_pairwise_impl(IdFun(), AddFun(), a, 1, length(a), Base.sum_pairwise_blocksize(IdFun()))

@KristofferC (Member) commented:

Smaller repro:

# note: f and op are only passed through — they are never called
function g(f, op, A, i)
    if i == 0
        return 0.0
    else
        return g(f, op, A, i-1)
    end
end

0.4:

julia> @code_llvm g(IdFun(), AddFun(), a, 5)

define double @julia_g_21672(%jl_value_t*, i64) {
top:
  %2 = icmp eq i64 %1, 0
  br i1 %2, label %if, label %L

if:                                               ; preds = %top
  ret double 0.000000e+00

L:                                                ; preds = %top
  %3 = add i64 %1, -1
  %4 = call double @julia_g_21672(%jl_value_t* %0, i64 %3)
  ret double %4
}

0.5:

julia> @code_llvm g(identity, +, a, 5)

define double @julia_g_54844(%jl_value_t*, %jl_value_t*, %jl_value_t*, i64) #0 {
top:
  %4 = call %jl_value_t*** @jl_get_ptls_states()
  %5 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 0
  %6 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 3
  %7 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %7, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %8 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %10, align 8
  %11 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = bitcast [8 x %jl_value_t*]* %5 to i64*
  store i64 12, i64* %12, align 8
  %13 = getelementptr [8 x %jl_value_t*], [8 x %jl_value_t*]* %5, i64 0, i64 1
  %14 = bitcast %jl_value_t*** %4 to i64*
  %15 = load i64, i64* %14, align 8
  %16 = bitcast %jl_value_t** %13 to i64*
  store i64 %15, i64* %16, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %4, align 8
  %17 = icmp eq i64 %3, 0
  br i1 %17, label %if, label %L

L:                                                ; preds = %top
  %18 = add i64 %3, -1
  store %jl_value_t* inttoptr (i64 140694599406320 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* %0, %jl_value_t** %8, align 8
  store %jl_value_t* %1, %jl_value_t** %9, align 8
  store %jl_value_t* %2, %jl_value_t** %10, align 8
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  store %jl_value_t* %19, %jl_value_t** %11, align 8
  %20 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 5)
  store %jl_value_t* %20, %jl_value_t** %7, align 8
  %21 = bitcast %jl_value_t* %20 to double*
  %22 = load double, double* %21, align 16
  %23 = load i64, i64* %16, align 8
  store i64 %23, i64* %14, align 8
  ret double %22

if:                                               ; preds = %top
  %24 = load i64, i64* %16, align 8
  store i64 %24, i64* %14, align 8
  ret double 0.000000e+00
}

@yuyichao (Contributor) commented May 3, 2016

This seems to be the function specialization heuristic kicking in.
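A hedged sketch of that heuristic (illustrative names, not Base code): when a function-typed argument is only forwarded and never called in the method body, Julia may decline to specialize the method on it, so the recursive call goes through jl_apply_generic with boxed arguments, as in the IR above. Annotating the argument with a type parameter forces a specialization per function type:

```julia
# `f` is only forwarded, never called — the specialization heuristic can
# leave this method un-specialized on f, producing the slow boxed path.
passthrough(f, n) = n == 0 ? 0.0 : passthrough(f, n - 1)

# The type parameter F forces one specialization per function type,
# recovering statically dispatched recursive calls.
passthrough_spec(f::F, n) where {F<:Function} = n == 0 ? 0.0 : passthrough_spec(f, n - 1)
```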

@ViralBShah ViralBShah added this to the 0.5.0 milestone May 4, 2016
vtjnash added a commit that referenced this issue May 5, 2016
this is possible now that Functors are not needed for performance
fixes #16185
@simonbyrne (Contributor) commented:

I see an even larger speedup (2.4x for Float64, 3x for Float32) just by using @simd:

function simd_sum(a)
    s = zero(eltype(a))
    @simd for x in a
        @inbounds s += x
    end
    return s
end

@StefanKarpinski (Member) commented:

#16217 changes the summation order from pairwise to linear, avoiding the underlying performance regression, which involves calling out to BLAS. However, we chose pairwise summation for its accuracy, so merging #16217 as-is would negate that. We should probably not bother with calling BLAS and just do the summation in pure Julia, probably using @simd.
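As an illustration of the accuracy trade-off (an assumed example, not from the thread): a naive left-to-right Float32 accumulator stops changing once it reaches 2^24, because adding 1.0f0 no longer perturbs the significand, whereas pairwise summation keeps the rounding error growing like O(log n) rather than O(n):

```julia
# Naive left-to-right accumulation in Float32: once s reaches 2^24
# (16777216.0f0), s + 1.0f0 rounds back to s, so the sum stalls there
# no matter how many more terms are added.
function naive_ones(n)
    s = 0.0f0
    for _ in 1:n
        s += 1.0f0
    end
    return s
end

naive_ones(10^8)  # 1.6777216f7, far from the true sum 1.0f8
```

A pairwise sum of the same data stays close to 1.0f8, since each element is only involved in O(log n) additions.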

@simonbyrne (Contributor) commented:
Actually, #16217 does still use pairwise summation, it mostly just simplifies the structure (and avoids one level of function calls).

@StefanKarpinski (Member) commented:
OK, that was not clear. Jameson said that it changed the summation order, however. @vtjnash, can you clarify how it changes the order?

@simonbyrne (Contributor) commented:
The BLAS calls are only for abs and abs2, it would seem.

vtjnash added a commit that referenced this issue Jun 2, 2016
this is possible now that Functors are not needed for performance
fixes #16185
@vtjnash (Member) commented Jun 2, 2016

OK, it looks like the BLAS calls were not buying us any performance, so I've just used the native implementation for everything; it now gives the same answers, just with a different implementation.

@stevengj (Member, Author) commented Jun 2, 2016

@lindahua's unrolling of the loop is no longer useful?

@simonbyrne (Contributor) commented:
I think the unrolling is handled by @simd, when the compiler decides it's useful.

@simonster (Member) commented:
@vtjnash I don't care too much, but did you test with Complex? I'm not surprised we can match dot (I think we were close before, and LLVM 3.7 presumably generates better code) but I would be pleasantly surprised if we're actually matching dotc.

@stevengj I got rid of the manually unrolled loop a long time ago when @simd became faster.

@vtjnash (Member) commented Jun 3, 2016

@simonster I forgot to test that, thanks. It seems we beat dotc as well, because we can avoid the overhead of calling out to BLAS.
