
Performance hit with broadcasting over tuples #20802

Closed · KristofferC opened this issue Feb 25, 2017 · 10 comments
Labels: broadcast (Applying a function over a collection), performance (Must go faster)

@KristofferC (Member) commented Feb 25, 2017

Consider:

using BenchmarkTools

f(v) = round.(Int, v)                                            # broadcast version
fv2(v) = [round(Int, v[1]), round(Int, v[2]), round(Int, v[3])]  # hand-written array version
ft2(v) = round(Int, v[1]), round(Int, v[2]), round(Int, v[3])    # hand-written tuple version

const t = (1.0, 2.0, 3.0)
const v = [1.0, 2.0, 3.0]

# Tuples
ft = @belapsed f(t);
f2t = @belapsed ft2(t);
println("Slowdown tuples: ", ft / f2t)

# Arrays
fv = @belapsed f(v);
f2v = @belapsed fv2(v);
println("Slowdown array: ", fv / f2v)

This gives:

Slowdown tuples: 437.08484801155777
Slowdown array: 0.6573552863698431

Is there any way to make broadcasting over tuples faster? Note the type instability below, perhaps caused by the type argument Int:

julia> @code_warntype f(t)
Variables:
  #self#::#f
  v::Tuple{Float64,Float64,Float64}
  #1::##1#2

Body:
  begin 
      #1::##1#2 = $(Expr(:new, :(Main.##1#2)))
      SSAValue(0) = #1::##1#2
      return $(Expr(:invoke, MethodInstance for broadcast_c(::Function, ::Type{Tuple}, ::Type{T} where T, ::Tuple{Float64,Float64,Float64}, ::Vararg{Tuple{Float64,Float64,Float64},N} where N), :(Base.Broadcast.broadcast_c), SSAValue(0), :($(QuoteNode(Tuple))), :(Main.Int), :(v)))
  end::Tuple
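
(For comparison, a minimal workaround sketch not taken from this thread: map over a tuple returns a tuple and specializes on the tuple's length and element types, so it should infer cleanly and avoid the broadcast_c instability shown above. The name g is made up for illustration.)

# hypothetical workaround: map specializes on tuple length and infers cleanly
g(v) = map(x -> round(Int, x), v)
g(t)  # (1, 2, 3)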
KristofferC added the performance (Must go faster) and broadcast (Applying a function over a collection) labels on Feb 25, 2017
@Sacha0 (Member) commented Feb 25, 2017

Though not nearly enough to address this issue, the following small tweak to broadcast_c for tuples cuts allocation count and volume roughly in half and reduces runtime by about 20%:

julia> using Base.Broadcast: broadcast_indices, _broadcast_getindex, broadcast_c

julia> begin
           function tuplebc(f, As...)
               shape = broadcast_indices(As...)
               n = length(shape[1])
               return ntuple(k -> f(_gatherargs(As, k)...), n)
           end
           Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
           Base.@propagate_inbounds _gatherargs(As, k) =
               (_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
       end
_gatherargs (generic function with 2 methods)

julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.20 μs (0.00% GC)
  median time:      6.42 μs (0.00% GC)
  mean time:        6.96 μs (2.13% GC)
  maximum time:     845.40 μs (97.07% GC)

julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     6
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  448.00 bytes
  allocs estimate:  16
  minimum time:     5.07 μs (0.00% GC)
  median time:      5.21 μs (0.00% GC)
  mean time:        5.56 μs (1.24% GC)
  maximum time:     717.70 μs (96.35% GC)

@Sacha0 (Member) commented Feb 25, 2017

An order-of-magnitude faster solution:

julia> begin
           # assumes the import from the previous snippet:
           # using Base.Broadcast: _broadcast_getindex
           tuplebc(f, As...) = ntuple(k -> f(_gatherargs(As, k)...), max(tuplens(As)...))
           @inline tuplens(As) = tuplens((), first(As), Base.tail(As))
           @inline tuplens(lens, A, ::Tuple{}) = lens
           @inline tuplens(lens, A::Tuple, ::Tuple{}) = (lens..., length(A))
           @inline tuplens(lens, A, Bs) = tuplens(lens, first(Bs), Base.tail(Bs))
           @inline tuplens(lens, A::Tuple, Bs) = tuplens((lens..., length(A)), first(Bs), Base.tail(Bs))
           Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
           Base.@propagate_inbounds _gatherargs(As, k) =
               (_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
       end
_gatherargs (generic function with 2 methods)

julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.13 μs (0.00% GC)
  median time:      6.34 μs (0.00% GC)
  mean time:        6.95 μs (2.43% GC)
  maximum time:     973.96 μs (97.34% GC)

julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     188
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     545.00 ns (0.00% GC)
  median time:      564.00 ns (0.00% GC)
  mean time:        595.34 ns (1.42% GC)
  maximum time:     17.02 μs (94.05% GC)
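
(A usage sketch, assuming Broadcast's _broadcast_getindex falls back to returning non-indexable scalars unchanged, as the 0.6 internals do; the calls below are illustrative:)

tuplebc(round, Int, (1.0, 2.0, 3.0))  # (1, 2, 3)
tuplebc(+, (1, 2, 3), 10)             # (11, 12, 13): scalars broadcast against tuples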

Will prepare a PR. Best!

@Sacha0 (Member) commented Feb 26, 2017

On #20817, the OP test case yields

Slowdown tuples: 76.57872034398932
Slowdown array: 0.8105171823957056

versus the local baseline (master):

Slowdown tuples: 954.1730769230769
Slowdown array: 0.7802518143399944

See #20817 for more regarding the remaining performance gap. Best!

Edit: after adding to #20817 a method explicitly specialized for a Type first argument:

Slowdown tuples: 2.571969364314141

@KristofferC (Member, Author) commented

The tuple improvement is amazing (enough to close the issue), but I am seeing:

Slowdown array: 6.948580419361614

on master now. Can anyone confirm?

@Sacha0 (Member) commented Mar 6, 2017

Can confirm, seeing

Slowdown array: 7.296097732573789

locally. Having a brief look now. Best!

@Sacha0 (Member) commented Mar 6, 2017

The regression is related to the _broadcast_eltype call in broadcast_c. Evidence:

julia> @inline function broadcast_c2(f, ::Type{Array}, A, Bs...)
           T = _broadcast_eltype(f, A, Bs...)
           shape = broadcast_indices(A, Bs...)
           iter = CartesianRange(shape)
           if isleaftype(T)
               return broadcast_t(f, T, shape, iter, A, Bs...)
           end
           if isempty(iter)
               return similar(Array{T}, shape)
           end
           return broadcast_t(f, Any, shape, iter, A, Bs...)
       end
broadcast_c2 (generic function with 1 method)

julia> @inline function broadcast_c3(f, ::Type{Array}, A, Bs...)
           T = Int64
           shape = broadcast_indices(A, Bs...)
           iter = CartesianRange(shape)
           if isleaftype(T)
               return broadcast_t(f, T, shape, iter, A, Bs...)
           end
           if isempty(iter)
               return similar(Array{T}, shape)
           end
           return broadcast_t(f, Any, shape, iter, A, Bs...)
       end
broadcast_c3 (generic function with 1 method)

julia> @benchmark broadcast_c2(round, Array, Int, $v)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     465.777 ns (0.00% GC)
  median time:      484.152 ns (0.00% GC)
  mean time:        531.657 ns (1.36% GC)
  maximum time:     10.513 μs (89.39% GC)
  --------------
  samples:          10000
  evals/sample:     197
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark broadcast_c3(round, Array, Int, $v)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     44.370 ns (0.00% GC)
  median time:      47.942 ns (0.00% GC)
  mean time:        57.643 ns (10.04% GC)
  maximum time:     1.906 μs (73.53% GC)
  --------------
  samples:          10000
  evals/sample:     992
  time tolerance:   5.00%
  memory tolerance: 1.00%

IIUC _broadcast_eltype might've been mentioned recently in a compiler-related thread... searching.

@Sacha0 (Member) commented Mar 6, 2017

The comment I recalled is #20726 (comment). The issue discussed in the surrounding comments seems likely to be the cause of this regression. Best!

@KristofferC (Member, Author) commented

Fixed now.

@tkelman (Contributor) commented Mar 7, 2017

Is there a perf-tracking benchmark for this?

tkelman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Mar 7, 2017
KristofferC removed the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Oct 8, 2018