
Performance hit with broadcasting over tuples #20802

Closed · KristofferC opened this issue Feb 25, 2017 · 10 comments
Labels: broadcast (Applying a function over a collection), performance (Must go faster)

@KristofferC (Member) commented Feb 25, 2017

Consider:

using BenchmarkTools

f(v) = round.(Int, v)                                            # broadcast version
fv2(v) = [round(Int, v[1]), round(Int, v[2]), round(Int, v[3])]  # hand-written array version
ft2(v) = round(Int, v[1]), round(Int, v[2]), round(Int, v[3])    # hand-written tuple version

const t = (1.0, 2.0, 3.0)
const v = [1.0, 2.0, 3.0]

# Tuples
ft = @belapsed f(t);
f2t = @belapsed ft2(t);
println("Slowdown tuples: ", ft / f2t)

# Arrays
fv = @belapsed f(v);
f2v = @belapsed fv2(v);
println("Slowdown array: ", fv / f2v)

This gives:

Slowdown tuples: 437.08484801155777
Slowdown array: 0.6573552863698431

Is there any way to make broadcasting over tuples faster? Note the type instability below, perhaps caused by the type argument Int:

julia> @code_warntype f(t)
Variables:
  #self#::#f
  v::Tuple{Float64,Float64,Float64}
  #1::##1#2

Body:
  begin 
      #1::##1#2 = $(Expr(:new, :(Main.##1#2)))
      SSAValue(0) = #1::##1#2
      return $(Expr(:invoke, MethodInstance for broadcast_c(::Function, ::Type{Tuple}, ::Type{T} where T, ::Tuple{Float64,Float64,Float64}, ::Vararg{Tuple{Float64,Float64,Float64},N} where N), :(Base.Broadcast.broadcast_c), SSAValue(0), :($(QuoteNode(Tuple))), :(Main.Int), :(v)))
  end::Tuple
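
(For comparison, a minimal workaround sketch not taken from this thread: map over a tuple returns a tuple and specializes on the tuple's length and element types, so it should infer cleanly and avoid the broadcast_c instability shown above. The name g is made up for illustration.)

# hypothetical workaround: map specializes on tuple length and infers cleanly
g(v) = map(x -> round(Int, x), v)
g(t)  # (1, 2, 3)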
KristofferC added the performance (Must go faster) and broadcast (Applying a function over a collection) labels on Feb 25, 2017
@Sacha0 (Member) commented Feb 25, 2017

Though not nearly enough to address this issue, the following small tweak to broadcast_c for tuples cuts allocation count and volume roughly in half and reduces runtime by about 20%:

julia> using Base.Broadcast: broadcast_indices, _broadcast_getindex, broadcast_c

julia> begin
           function tuplebc(f, As...)
               shape = broadcast_indices(As...)
               n = length(shape[1])
               return ntuple(k -> f(_gatherargs(As, k)...), n)
           end
           Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
           Base.@propagate_inbounds _gatherargs(As, k) =
               (_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
       end
_gatherargs (generic function with 2 methods)

julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.20 μs (0.00% GC)
  median time:      6.42 μs (0.00% GC)
  mean time:        6.96 μs (2.13% GC)
  maximum time:     845.40 μs (97.07% GC)

julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     6
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  448.00 bytes
  allocs estimate:  16
  minimum time:     5.07 μs (0.00% GC)
  median time:      5.21 μs (0.00% GC)
  mean time:        5.56 μs (1.24% GC)
  maximum time:     717.70 μs (96.35% GC)

@Sacha0 (Member) commented Feb 25, 2017

An order-of-magnitude faster solution:

julia> begin
           # assumes the import from the previous snippet:
           # using Base.Broadcast: _broadcast_getindex
           tuplebc(f, As...) = ntuple(k -> f(_gatherargs(As, k)...), max(tuplens(As)...))
           @inline tuplens(As) = tuplens((), first(As), Base.tail(As))
           @inline tuplens(lens, A, ::Tuple{}) = lens
           @inline tuplens(lens, A::Tuple, ::Tuple{}) = (lens..., length(A))
           @inline tuplens(lens, A, Bs) = tuplens(lens, first(Bs), Base.tail(Bs))
           @inline tuplens(lens, A::Tuple, Bs) = tuplens((lens..., length(A)), first(Bs), Base.tail(Bs))
           Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
           Base.@propagate_inbounds _gatherargs(As, k) =
               (_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
       end
_gatherargs (generic function with 2 methods)

julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.13 μs (0.00% GC)
  median time:      6.34 μs (0.00% GC)
  mean time:        6.95 μs (2.43% GC)
  maximum time:     973.96 μs (97.34% GC)

julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     188
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     545.00 ns (0.00% GC)
  median time:      564.00 ns (0.00% GC)
  mean time:        595.34 ns (1.42% GC)
  maximum time:     17.02 μs (94.05% GC)
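
(A usage sketch, assuming Broadcast's _broadcast_getindex falls back to returning non-indexable scalars unchanged, as the 0.6 internals do; the calls below are illustrative:)

tuplebc(round, Int, (1.0, 2.0, 3.0))  # (1, 2, 3)
tuplebc(+, (1, 2, 3), 10)             # (11, 12, 13): scalars broadcast against tuples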

Will prepare a PR. Best!

@Sacha0 (Member) commented Feb 26, 2017

On #20817, the OP test case yields

Slowdown tuples: 76.57872034398932
Slowdown array: 0.8105171823957056

versus the local baseline (master):

Slowdown tuples: 954.1730769230769
Slowdown array: 0.7802518143399944

See #20817 for more regarding the remaining performance gap. Best!

Edit: after adding to #20817 a method explicitly specialized for a Type first argument:

Slowdown tuples: 2.571969364314141

@KristofferC (Member, Author) commented

The tuple improvement is amazing (enough to close the issue), but I am seeing:

Slowdown array: 6.948580419361614

on master now. Can anyone confirm?

@Sacha0 (Member) commented Mar 6, 2017

Can confirm, seeing

Slowdown array: 7.296097732573789

locally. Having a brief look now. Best!

@Sacha0 (Member) commented Mar 6, 2017

The regression is related to the _broadcast_eltype call in broadcast_c. Evidence:

julia> @inline function broadcast_c2(f, ::Type{Array}, A, Bs...)
           T = _broadcast_eltype(f, A, Bs...)
           shape = broadcast_indices(A, Bs...)
           iter = CartesianRange(shape)
           if isleaftype(T)
               return broadcast_t(f, T, shape, iter, A, Bs...)
           end
           if isempty(iter)
               return similar(Array{T}, shape)
           end
           return broadcast_t(f, Any, shape, iter, A, Bs...)
       end
broadcast_c2 (generic function with 1 method)

julia> @inline function broadcast_c3(f, ::Type{Array}, A, Bs...)
           T = Int64
           shape = broadcast_indices(A, Bs...)
           iter = CartesianRange(shape)
           if isleaftype(T)
               return broadcast_t(f, T, shape, iter, A, Bs...)
           end
           if isempty(iter)
               return similar(Array{T}, shape)
           end
           return broadcast_t(f, Any, shape, iter, A, Bs...)
       end
broadcast_c3 (generic function with 1 method)

julia> @benchmark broadcast_c2(round, Array, Int, $v)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     465.777 ns (0.00% GC)
  median time:      484.152 ns (0.00% GC)
  mean time:        531.657 ns (1.36% GC)
  maximum time:     10.513 μs (89.39% GC)
  --------------
  samples:          10000
  evals/sample:     197
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark broadcast_c3(round, Array, Int, $v)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     44.370 ns (0.00% GC)
  median time:      47.942 ns (0.00% GC)
  mean time:        57.643 ns (10.04% GC)
  maximum time:     1.906 μs (73.53% GC)
  --------------
  samples:          10000
  evals/sample:     992
  time tolerance:   5.00%
  memory tolerance: 1.00%

IIUC _broadcast_eltype might've been mentioned recently in a compiler-related thread... searching.

@Sacha0 (Member) commented Mar 6, 2017

The comment I recalled is #20726 (comment). The issue discussed in the surrounding comments seems likely to be the cause of this regression. Best!

@KristofferC (Member, Author) commented

Fixed now.

@tkelman (Contributor) commented Mar 7, 2017

Is there a perf-tracking benchmark for this?

tkelman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Mar 7, 2017
KristofferC removed the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Oct 8, 2018