# slow broadcast copy in 2D #41
The following code evaluates the performance of copying between two square 2D `MtlArray`s `a` and `b`. It achieves good performance (360 GB/s) using the kernel version (the commented-out line), but poor performance (46.2 GB/s) using the broadcast expression (`a .= b`). Note that the broadcast expression is fine (equivalent to the kernel copy) for 1D arrays since the last bug fix.
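The original script was not captured in this thread, so here is a rough reconstruction of the measurement it describes; the size `n`, the placeholder kernel name, and the bandwidth formula are my assumptions:

```julia
using Metal, BenchmarkTools

n = 4096
a = MtlArray(zeros(Float32, n, n))
b = MtlArray(rand(Float32, n, n))

# copy_kernel!(a, b)  # the handwritten kernel version would go here (commented out)

t = @belapsed Metal.@sync ($a .= $b)        # broadcast copy, timed end-to-end
gbps = 2 * sizeof(Float32) * n^2 / t / 1e9  # bytes read + written, in GB/s
@show gbps
```

---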
I suspect this is because converting the linear thread index into a Cartesian one performs an integer division. Here is the relevant LLVM IR:

```llvm
; @ /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:55 within `broadcast_kernel`
; ┌ @ /Users/kichi/.julia/dev/GPUArrays/src/device/indexing.jl:81 within `macro expansion`
; │┌ @ multidimensional.jl:262 within `CartesianIndices`
; ││┌ @ abstractarray.jl:95 within `axes`
; │││┌ @ tuple.jl:222 within `map`
; ││││┌ @ range.jl:455 within `oneto`
; │││││┌ @ range.jl:453 within `OneTo` @ range.jl:440
; ││││││┌ @ promotion.jl:488 within `max`
; │││││││┌ @ essentials.jl:489 within `ifelse`
%8 = icmp sgt i64 %.unpack5.unpack, 0
; │└└└└└└└
; │┌ @ abstractarray.jl:1241 within `getindex`
; ││┌ @ abstractarray.jl:1286 within `_getindex`
; │││┌ @ abstractarray.jl:1293 within `_to_subscript_indices`
; ││││┌ @ abstractarray.jl:1315 within `_unsafe_ind2sub`
; │││││┌ @ abstractarray.jl:2639 within `_ind2sub` @ abstractarray.jl:2677
; ││││││┌ @ int.jl:86 within `-`
%9 = add nsw i64 %6, -1
; ││││││└
; ││││││┌ @ abstractarray.jl:2690 within `_ind2sub_recurse`
; │││││││┌ @ abstractarray.jl:2697 within `_div`
; ││││││││┌ @ int.jl:288 within `div`
br i1 %8, label %pass, label %fail
```

If I change the broadcast kernel to use `@linearidx` instead of `@cartesianidx`, the copy is fast.

---
Yes, it does an integer division in order to go from a linear index (which is what the kernel's threads iterate over) to a Cartesian one. And the reason we use Cartesian indices and not linear ones is that you can pass non-contiguous GPU inputs to the broadcast kernel (say, a non-contiguous view), in which case there's no linear index that would work.
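For illustration, here is a plain-Julia sketch (not GPUArrays' actual `_ind2sub` code) of what that conversion costs: one integer division per dimension, which is the `div` visible in the IR above.

```julia
# Convert a 1-based linear index into a Cartesian index for a column-major
# array of size `dims`. Each dimension pays for an integer division.
function lin2cart(i::Integer, dims::NTuple{N,Int}) where {N}
    i0 = i - 1
    return CartesianIndex(ntuple(N) do d
        stride = prod(dims[1:d-1])   # 1, dims[1], dims[1]*dims[2], ...
        (i0 ÷ stride) % dims[d] + 1  # one div and one rem per dimension
    end)
end

lin2cart(11, (10, 10))  # CartesianIndex(1, 2)
```

---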
I assumed the non-contiguous view would still take care of translating linear indices to whatever indexing lies underneath, or is that not true in general?

```julia
julia> a = MtlArray(Float32.(reshape(1:100, 10, 10)))
10×10 MtlMatrix{Float32}:
1.0 11.0 21.0 31.0 41.0 51.0 61.0 71.0 81.0 91.0
2.0 12.0 22.0 32.0 42.0 52.0 62.0 72.0 82.0 92.0
3.0 13.0 23.0 33.0 43.0 53.0 63.0 73.0 83.0 93.0
4.0 14.0 24.0 34.0 44.0 54.0 64.0 74.0 84.0 94.0
5.0 15.0 25.0 35.0 45.0 55.0 65.0 75.0 85.0 95.0
6.0 16.0 26.0 36.0 46.0 56.0 66.0 76.0 86.0 96.0
7.0 17.0 27.0 37.0 47.0 57.0 67.0 77.0 87.0 97.0
8.0 18.0 28.0 38.0 48.0 58.0 68.0 78.0 88.0 98.0
9.0 19.0 29.0 39.0 49.0 59.0 69.0 79.0 89.0 99.0
10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0
julia> a1 = @view a[[1,3,5], [2,4,6]]
3×3 view(::MtlMatrix{Float32}, [1, 3, 5], [2, 4, 6]) with eltype Float32:
11.0 31.0 51.0
13.0 33.0 53.0
15.0 35.0 55.0
julia> a1[1:9]'
1×9 adjoint(::MtlVector{Float32}) with eltype Float32:
11.0 13.0 15.0 31.0 33.0 35.0 51.0 53.0 55.0
```

Or can we specialize the kernel when the `IndexStyle` of everything involved is `IndexLinear`? Just curious, does CUDA have this slowdown as well?

---
CUDA suffers from this as well, but recent generations of the hardware have been much, much better at dealing with more complex operations like this. That's also probably why we haven't reimplemented broadcast ourselves yet. IIRC, another potential slowdown is how the …

---
Can we do something like "if the `IndexStyle` is `IndexLinear`, use linear indexing"? For …

---
That would make for a good fast path, yes. Alternatively, or in addition, we have the power of a JIT, of course: Hacker's Delight has two chapters on integer division, including one on integer division by known constants. We could probably use that to generate a kernel which performs the indexing without any actual division (see the sketch after this paragraph). We'd need to generate a new kernel for each size of input, but maybe that isn't too bad for GPU applications.
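A rough sketch of that idea in plain Julia (hypothetical helper names, not an actual GPUArrays API): for a divisor `d` fixed at kernel-compile time, `x ÷ d` becomes a widening multiply plus a shift, which GPUs execute far faster than a hardware divide.

```julia
# For 32-bit indices and a constant divisor 2 ≤ d < 2^32, the multiplier
# m = ceil(2^64 / d) fits in 64 bits, and x ÷ d == (x * m) >> 64 holds for
# all 0 ≤ x < 2^32. On a GPU the UInt128 multiply lowers to a mul-hi.
divmagic(d::Integer) = cld(UInt128(1) << 64, UInt128(d)) % UInt64

fastdiv(x::UInt32, m::UInt64) = ((UInt128(x) * m) >> 64) % UInt32
fastrem(x::UInt32, d::UInt32, m::UInt64) = x - d * fastdiv(x, m)

# e.g. splitting a 0-based linear index for an nrows × ncols array:
m = divmagic(UInt32(10))                                    # nrows = 10, fixed per kernel
fastdiv(UInt32(57), m), fastrem(UInt32(57), UInt32(10), m)  # (0x00000005, 0x00000007)
```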
The different kinds of views, for reference:

```julia
julia> A = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.427609  0.264936
 0.841486  0.732059

# contiguous
julia> view(A, :, 1)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.42760944
 0.84148616

# strided
julia> view(A, 1, :)
2-element view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1, :) with eltype Float32:
 0.42760944
 0.2649361

# arbitrary
julia> view(A, cu([1, 2, 4]))
3-element view(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, [1, 2, 4]) with eltype Float32:
 0.42760944
 0.84148616
 0.73205924
```

Similar to the above, IIRC …

---
@vchuravy mentioned that LLVM is actually smart enough to optimize the `CartesianIndices` `getindex` away (which causes the integer division) as long as the `CartesianIndex` is constant. KernelAbstractions does this by making the group-size getters (`blockDim`, etc.) constant and constructing a `CartesianIndices` around those. However, I'm not sure that's what we want (I'm not sure how that deals with N-dimensional input sizes, and IIUC it causes idle threads when the shape of the inputs doesn't match the shape of the launch). However, that also means it might just be as simple as making sure that the input iterators are passed to the kernel as a constant, i.e. as a type parameter (see the sketch below). Lots of things to try here 🙂
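A minimal sketch of that (a hypothetical standalone kernel, runnable as plain Julia; GPUArrays' real kernel is shaped differently): moving the size into the type via `Val` makes `CartesianIndices` a compile-time constant, so the divisions it implies have known denominators that LLVM can strength-reduce.

```julia
# `dims` lives in the type, so a fresh method instance is compiled per size.
function copy_kernel!(dest, src, ::Val{dims}) where {dims}
    inds = CartesianIndices(dims)   # built from compile-time constants
    for i in eachindex(inds)
        I = inds[i]                 # ind2sub with constant divisors
        @inbounds dest[I] = src[I]
    end
    return nothing
end

a = zeros(10, 10); b = rand(10, 10)
copy_kernel!(a, b, Val(size(a)))    # one compilation per distinct size
```

---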
Thanks Tim for the pointers! I'll experiment a bit.

---
I tried passing the iterator dimensions as a type parameter … Here are my logs: https://gist.github.com/maxwindiff/8d530dd43b3d626bf0aac3fe4f943ee9

I also have another question about indexing non-contiguous views: if I change `@cartesianidx` to `@linearidx`,

```diff
22:21:25 GPUArrays (master) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 9a44fb8..1cba162 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -55,7 +55,7 @@ end
         i = 0
         while i < nelem
             i += 1
-            I = @cartesianidx(dest, i)
+            I = @linearidx(dest, i)
             @inbounds dest[I] = bc′[I]
         end
         return
```

non-contiguous views still seem to produce correct results:

```julia
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, :, 1) .*= 10; A
3×3 MtlMatrix{Int64}:
10 2 3
40 5 6
70 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, 1, :) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 30
4 5 6
7 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, [1,3,4,5]) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 3
4 50 6
70 8 9
```

And broadcasting on contiguous arrays is now fast (the `a .= b` copy below):

```julia
julia> @btime @Metal.sync begin
global a .= b
end
11.805 ms (196 allocations: 5.95 KiB)
```

So it seems this lets us pay for the expensive calculations only when needed. Is there any reason we can't do this?

---
Yeah, the sizes probably need to be constant too. And yes, it might be a little tricky to pass the necessary information to the kernel, but in principle I think it should work (or at least, if LLVM knows how to optimize given constant denominators, which @vchuravy suggested it does).

---
Huh, yes, that appears to be the case. If we're using non-contiguous arrays, the SubArray can still be indexed linearly, so I guess we shouldn't be using Cartesian indices here. That would make the solution much easier :-) I'll create a PR to run CI on this.
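A quick CPU illustration of that point (my own check, not from the thread): even a view with arbitrary indices supports linear `getindex`; it just converts the linear index to subscripts internally, which is exactly the division under discussion.

```julia
julia> A = collect(reshape(1:16, 4, 4));

julia> v = view(A, [1,3], [2,4]);   # arbitrary indices, no constant stride

julia> IndexStyle(v)                # hence no fast linear indexing advertised
IndexCartesian()

julia> v[3]                         # but linear getindex still works
13
```

---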
Actually, arbitrary broadcastable objects don't appear to be linearly indexable:

```julia
julia> bc = Base.Broadcast.broadcasted(identity, rand(2,2))
Base.Broadcast.Broadcasted(identity, ([0.7110622323877549 0.5733414791299075; 0.620167447659585 0.017047464877951835],))
julia> IndexStyle(bc)
IndexCartesian()
julia> bc[4]
ERROR: BoundsError: attempt to access Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}} at index [4]
Stacktrace:
[1] throw_boundserror(A::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Tuple{Int64})
@ Base ./abstractarray.jl:744
[2] checkbounds
@ ./broadcast.jl:621 [inlined]
[3] getindex(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Int64)
@ Base.Broadcast ./broadcast.jl:609
[4] top-level scope
@ REPL[17]:1
```

---
JuliaGPU/GPUArrays.jl#454 should help a bit, but most broadcast objects apparently have `IndexStyle == IndexCartesian()`.
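For what it's worth (my reading of `Base.Broadcast`, worth double-checking): one-dimensional instantiated `Broadcasted` objects do advertise linear indexing, so a linear fast path should at least cover 1D broadcasts:

```julia
julia> bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(+, [1, 2, 3], 10));

julia> IndexStyle(bc)
IndexLinear()

julia> bc[3]
13
```

---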
I'm curious, what makes the broadcast object `IndexCartesian()`? I added some debugging output:

```diff
0:18:34 GPUArrays (tb/avoid_cartesian) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 5a8da77..64b46f3 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -58,15 +58,14 @@ end
     # HACK: cartesian iteration is slow, so avoid it if possible
     # TODO: generalize this, much like how `eachindex` picks an appropriate iterator
     #       (::AnyGPUArray methods shouldn't be hardcoded to use linear indexing)
-    I = if isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc′), IndexLinear)
-        @linearidx(dest, i)
-    else
-        @cartesianidx(dest, i)
-    end
+    I = @linearidx(dest, i)
     @inbounds dest[I] = bc′[I]
 end
 return
 end
+@info "IndexStyle" IndexStyle(dest) IndexStyle(bc) IndexStyle(bc′)
+@info "typeof" typeof(dest) typeof(bc) typeof(bc′)
+@info "bc′" bc′
+@show @inbounds bc′[10]
 elements = length(dest)
 elements_per_thread = typemax(Int)
 heuristic = launch_heuristic(backend(dest), broadcast_kernel, dest, bc′, 1;
```

```julia
julia> a .= b;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{MtlMatrix{Float32}}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}(Float32[-0.45997995 -0.9323782 … -0.78047013 -0.10763502; -0.41987753 -0.958211 … -0.11566621 -0.01383841; … ; -0.21164703 -0.6332897 … -0.13726139 -0.76514477; -0.71918046 -0.14217287 … -0.42983288 -0.99601007], (true, true), (1, 1)),))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = -0.624596f0
julia> a .= 10;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (10,))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = 10
```

---
I think it may start to matter when adding singleton dimensions into the mix.
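An illustration of why (CPU, my own example): with a singleton dimension, the destination's linear index no longer maps one-to-one onto the source's memory, so something (here the `Extruded` wrapper's `keeps`/`defaults`) has to pin the singleton axis back to 1 for every output element.

```julia
julia> b = reshape(1.0:3.0, 3, 1);  # 3×1 source

julia> c = zeros(3, 4);

julia> c .= b;                      # b's column is repeated across all 4 columns

julia> c[4] == b[1]                 # destination index 4 reads source index 1
true
```

---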
Did some more experimentation, and with JuliaGPU/GPUArrays.jl#454 the kernel is pretty fast now, getting 120 GB/s out of 180 GB/s (previously it only reached 8 GB/s). I guess that getting rid of the …

---
Just a note here: the GPUArrays.jl change got reverted (JuliaGPU/GPUArrays.jl#463), so performance is bad again.