`mapreduce` can be much slower than the equivalent `for`-loop. A small (real-life) example:
```julia
function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
    end
    B
end

function dfor(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j in 1:n
        d = z
        for k in 1:j
            d += A[j,k]*A[k,j]
        end
        B[j] = d
    end
    B
end
```
```julia
using BenchmarkTools
A = randn(127,127)
time(median(@benchmark dsum(A))) / time(median(@benchmark dfor(A)))
```
gives me a performance ratio of about x50 on Julia 0.5 (juliabox.com). I think this could be because the `for`-loop can be automatically SIMD-vectorized and the `mapreduce` isn't? When `A = randn(N,N)` with `N = 16`, the gap is around x75, and for `N = 10000` the gap is around x25. Replacing the array access `A[j,k]` with `A[rand(1:size(A,1)),rand(1:size(A,2))]` destroys the performance of both, but the ratio becomes x1.
- Is SIMD the reason why one is x50 faster? (A quick way to check is sketched after this list.)
- Should this be described in Performance Tips? `mapreduce` underlies `sum`, so this could be a popular trap that isn't currently mentioned.
- Would this be a useful benchmark on nanosoldier?
- Could the performance gap be smaller?
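One way to probe the SIMD question, as a sketch of my own rather than part of the original report: `dfor_simd` below is a hypothetical variant of `dfor` whose inner reduction is explicitly annotated with `@simd`, allowing the compiler to reorder the floating-point additions and vectorize them.

```julia
# Hypothetical variant of dfor with the inner reduction explicitly marked @simd.
function dfor_simd(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j in 1:n
        d = z
        @simd for k in 1:j
            d += A[j,k]*A[k,j]
        end
        B[j] = d
    end
    B
end
```

Comparing `@benchmark dfor(A)` against `@benchmark dfor_simd(A)`, and inspecting `@code_llvm dfor(A)` for vector instructions like `<4 x double>`, should show whether the plain loop is in fact being auto-vectorized.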
(Benchmarking `mapreduce` versus `for`-loops without array access, I still see a x2 performance gap, e.g. `mapreduce(identity, +, 0, i for i in 1:n)` versus the equivalent integer-summing `for` loop. It looks like this gap used to be smaller? Worth another benchmark in CI?)
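For reference, a minimal version of that no-array-access comparison could look like the following (my own sketch, using the Julia 0.5 `mapreduce(f, op, v0, itr)` signature; the helper names and the choice of `n` are arbitrary):

```julia
using BenchmarkTools

# mapreduce over a generator, as described above
mr(n) = mapreduce(identity, +, 0, (i for i in 1:n))

# the equivalent hand-written integer-summing loop
function loopsum(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end

n = 10^4
time(median(@benchmark mr($n))) / time(median(@benchmark loopsum($n)))
```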
```julia
# dsum rewritten so that the mapreduce call sits behind a separate helper function (_help).
function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j::Int in 1:n
        B[j] = _help(A, j, z)
    end
    B
end

_help(A, j, z) = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
```