Performance hit with broadcasting over tuples #20802
Though not nearly enough to address this issue, the following small tweak helps:

julia> using Base.Broadcast: broadcast_indices, _broadcast_getindex, broadcast_c
julia> begin
function tuplebc(f, As...)
shape = broadcast_indices(As...)
n = length(shape[1])
return ntuple(k -> f(_gatherargs(As, k)...), n)
end
Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
Base.@propagate_inbounds _gatherargs(As, k) =
(_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
end
_gatherargs (generic function with 2 methods)
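For context, the snippet above can be approximated in a self-contained form that runs without Base internals. This is a sketch under stated assumptions: plain tuple indexing stands in for _broadcast_getindex, non-tuple arguments (including types like Int) are treated as scalars, and firsttuple is a hypothetical helper replacing broadcast_indices for the tuple-only case.

```julia
# Hypothetical self-contained sketch of the recursive argument-gathering idea.
# Plain indexing replaces the internal _broadcast_getindex; non-tuples
# (scalars, types) broadcast as themselves.
_getarg(A::Tuple, k) = A[k]
_getarg(A, k) = A

# Recursively collect the k-th "column" of arguments.
_gatherargs(::Tuple{}, k) = ()
_gatherargs(As, k) = (_getarg(first(As), k), _gatherargs(Base.tail(As), k)...)

# For simplicity, the first tuple argument determines the output length here.
firsttuple(As) = first(As) isa Tuple ? first(As) : firsttuple(Base.tail(As))

tuplebc(f, As...) = ntuple(k -> f(_gatherargs(As, k)...), length(firsttuple(As)))
```

With this, tuplebc(round, Int, (1.0, 2.0, 3.0)) yields (1, 2, 3), mirroring the benchmarked call above.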
julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
samples: 10000
evals/sample: 5
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.05 kb
allocs estimate: 31
minimum time: 6.20 μs (0.00% GC)
median time: 6.42 μs (0.00% GC)
mean time: 6.96 μs (2.13% GC)
maximum time: 845.40 μs (97.07% GC)
julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
samples: 10000
evals/sample: 6
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 448.00 bytes
allocs estimate: 16
minimum time: 5.07 μs (0.00% GC)
median time: 5.21 μs (0.00% GC)
mean time: 5.56 μs (1.24% GC)
maximum time: 717.70 μs (96.35% GC)
An order of magnitude faster solution:

julia> begin
tuplebc(f, As...) = ntuple(k -> f(_gatherargs(As, k)...), max(tuplens(As)...))
@inline tuplens(As) = tuplens((), first(As), Base.tail(As))
@inline tuplens(lens, A, ::Tuple{}) = lens
@inline tuplens(lens, A::Tuple, ::Tuple{}) = (lens..., length(A))
@inline tuplens(lens, A, Bs) = tuplens(lens, first(Bs), Base.tail(Bs))
@inline tuplens(lens, A::Tuple, Bs) = tuplens((lens..., length(A)), first(Bs), Base.tail(Bs))
Base.@propagate_inbounds _gatherargs(::Tuple{}, k) = ()
Base.@propagate_inbounds _gatherargs(As, k) =
(_broadcast_getindex(first(As), k), _gatherargs(Base.tail(As), k)...)
end
_gatherargs (generic function with 2 methods)
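The shape computation above can be exercised on its own. The following sketch reproduces the tuplens recursion in a self-contained form (with Base.tail written out in place of the unexported tail); it walks the argument list once, collecting the length of every tuple argument into a tuple of lengths, so that max over it gives the broadcast length without allocating.

```julia
# Self-contained sketch of the "collect tuple lengths" shape computation.
# Non-tuple arguments (scalars, types) contribute no length.
@inline tuplens(As) = tuplens((), first(As), Base.tail(As))
@inline tuplens(lens, A, ::Tuple{}) = lens
@inline tuplens(lens, A::Tuple, ::Tuple{}) = (lens..., length(A))
@inline tuplens(lens, A, Bs) = tuplens(lens, first(Bs), Base.tail(Bs))
@inline tuplens(lens, A::Tuple, Bs) = tuplens((lens..., length(A)), first(Bs), Base.tail(Bs))
```

For example, tuplens((Int, (1.0, 2.0, 3.0))) gives (3,), and max(tuplens(As)...) recovers the output length used by ntuple.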
julia> @benchmark broadcast_c(round, Tuple, Int, (1., 2., 3.))
BenchmarkTools.Trial:
samples: 10000
evals/sample: 5
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.05 kb
allocs estimate: 31
minimum time: 6.13 μs (0.00% GC)
median time: 6.34 μs (0.00% GC)
mean time: 6.95 μs (2.43% GC)
maximum time: 973.96 μs (97.34% GC)
julia> @benchmark tuplebc(round, Int, (1., 2., 3.))
BenchmarkTools.Trial:
samples: 10000
evals/sample: 188
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 96.00 bytes
allocs estimate: 5
minimum time: 545.00 ns (0.00% GC)
median time: 564.00 ns (0.00% GC)
mean time: 595.34 ns (1.42% GC)
maximum time: 17.02 μs (94.05% GC)

Will prepare a PR. Best!
On #20817, the OP test case yields

Slowdown tuples: 76.57872034398932
Slowdown array: 0.8105171823957056

versus, from the local baseline (master),

Slowdown tuples: 954.1730769230769
Slowdown array: 0.7802518143399944

See #20817 for more regarding the remaining performance gap. Best!

Edit: After adding the method explicitly specialized on type first arguments to #20817:

Slowdown tuples: 2.571969364314141
The tuple improvement is amazing (enough to close the issue), but I am seeing a slowdown in the array case on master now. Can anyone confirm?
Can confirm, seeing Slowdown array: 7.296097732573789 locally. Having a brief look now. Best!
The regression appears related to the eltype computation: broadcast_c2 below computes T via _broadcast_eltype as broadcast_c does, while broadcast_c3 hardcodes T = Int64, and the latter is roughly ten times faster.

julia> @inline function broadcast_c2(f, ::Type{Array}, A, Bs...)
T = _broadcast_eltype(f, A, Bs...)
shape = broadcast_indices(A, Bs...)
iter = CartesianRange(shape)
if isleaftype(T)
return broadcast_t(f, T, shape, iter, A, Bs...)
end
if isempty(iter)
return similar(Array{T}, shape)
end
return broadcast_t(f, Any, shape, iter, A, Bs...)
end
broadcast_c2 (generic function with 1 method)
julia> @inline function broadcast_c3(f, ::Type{Array}, A, Bs...)
T = Int64
shape = broadcast_indices(A, Bs...)
iter = CartesianRange(shape)
if isleaftype(T)
return broadcast_t(f, T, shape, iter, A, Bs...)
end
if isempty(iter)
return similar(Array{T}, shape)
end
return broadcast_t(f, Any, shape, iter, A, Bs...)
end
broadcast_c3 (generic function with 1 method)
julia> @benchmark broadcast_c2(round, Array, Int, $v)
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 1
--------------
minimum time: 465.777 ns (0.00% GC)
median time: 484.152 ns (0.00% GC)
mean time: 531.657 ns (1.36% GC)
maximum time: 10.513 μs (89.39% GC)
--------------
samples: 10000
evals/sample: 197
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark broadcast_c3(round, Array, Int, $v)
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 1
--------------
minimum time: 44.370 ns (0.00% GC)
median time: 47.942 ns (0.00% GC)
mean time: 57.643 ns (10.04% GC)
maximum time: 1.906 μs (73.53% GC)
--------------
samples: 10000
evals/sample: 992
time tolerance: 5.00%
memory tolerance: 1.00%

IIUC, the only difference between broadcast_c2 and broadcast_c3 is how T is obtained, so the _broadcast_eltype call accounts for the gap.
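The pattern above (runtime eltype computation versus a hardcoded T) can be reproduced in a self-contained sketch. This is an illustration, not the Base implementation: Base.promote_op stands in for the internal _broadcast_eltype, and the function names are hypothetical.

```julia
# Sketch: deriving the output eltype at runtime vs. fixing it a priori.
# When the eltype computation does not constant-fold, T is only known at
# runtime and downstream uses of it become dynamically dispatched.
computed_eltype(f, args...) = Base.promote_op(f, map(typeof, args)...)

function fill_computed(f, x, n)
    T = computed_eltype(f, x)   # eltype derived via inference (broadcast_c2 analogue)
    return T[f(x) for _ in 1:n]
end

function fill_hardcoded(f, x, n)
    T = Float64                 # eltype fixed in advance (broadcast_c3 analogue)
    return T[f(x) for _ in 1:n]
end
```

Both return the same result; the point is where T comes from, which is exactly the difference between broadcast_c2 and broadcast_c3 above.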
The comment I recalled is #20726 (comment). The issue discussed in the surrounding comments seems likely to be the cause of this regression. Best!
Fixed now. |
Is there a perf tracking benchmark for this?
Original issue description:

Consider:

(original code snippet elided)

This gives:

(original output elided)

Is there any way to make the tuple broadcasting faster? Note the type instability, maybe from the type argument Int?
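On the suspected type instability from the Int argument: the fix mentioned earlier in the thread (a method explicitly specialized on type first arguments) can be sketched as follows. The names bc_generic and bc_typed are hypothetical; the point is the ::Type{T} annotation.

```julia
# When a type is taken as an ordinary argument, the compiler sees only
# DataType, so the closure is not specialized on Int. Declaring ::Type{T}
# makes the target type a static parameter, which restores specialization.
bc_generic(f, t, xs) = map(x -> f(t, x), xs)
bc_typed(f, ::Type{T}, xs) where {T} = map(x -> f(T, x), xs)
```

Both compute the same values, e.g. bc_typed(round, Int, (1.0, 2.0, 3.0)) gives (1, 2, 3), but the ::Type{T} form lets inference see the concrete result type.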