slow broadcast copy in 2D #41

Closed
LaurentPlagne opened this issue Jul 11, 2022 · 16 comments · Fixed by #304
Labels: arrays (Things about the array abstraction.) · performance (Gotta go fast.)

Comments

@LaurentPlagne commented Jul 11, 2022

The following code evaluates the performance of copying between two 2D square MtlArrays, a and b.
It achieves good bandwidth (360 GB/s) with the kernel version (the commented line) but poor bandwidth (46.2 GB/s) with the broadcast expression (a .= b)...

Note that since the last bug fix, the broadcast expression is OK (equivalent to the kernel copy) for 1D arrays.

using Metal

function kernel_copy!(a, b)
    (i,j) = thread_position_in_grid_2d()
    @inbounds a[i,j] = b[i,j]
    return
end

function device_copy(n=2^14,nsample=10)

    a = MtlArray(rand(Float32, n,n))
    b = MtlArray(rand(Float32, n,n))

    threads = (32,32)
    grid_size = cld.(n, threads)
    @show threads,grid_size

    ts=zeros(nsample)
    for i in 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            # @metal threads=threads grid=grid_size kernel_copy!(a, b)
            a .= b
        end
    end

    @assert Array(a)==Array(b)

    @show ts
    tmin = minimum(ts)

    size_in_bytes = 2*length(a)*sizeof(Float32) #1R+1W
    byte_per_ns = size_in_bytes / (tmin*1.e9)

    println("GBs: $(round(byte_per_ns; digits=3))")

    # Cleanup memory (is it necessary?)
    finalize(a)
    finalize(b)
end
device_copy()
@maxwindiff (Contributor)

I suspect this is because @cartesianidx(dest, i) in the broadcast kernel, which calls CartesianIndices(x)[i], ends up doing some division?

;  @ /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:55 within `broadcast_kernel`
; ┌ @ /Users/kichi/.julia/dev/GPUArrays/src/device/indexing.jl:81 within `macro expansion`
; │┌ @ multidimensional.jl:262 within `CartesianIndices`
; ││┌ @ abstractarray.jl:95 within `axes`
; │││┌ @ tuple.jl:222 within `map`
; ││││┌ @ range.jl:455 within `oneto`
; │││││┌ @ range.jl:453 within `OneTo` @ range.jl:440
; ││││││┌ @ promotion.jl:488 within `max`
; │││││││┌ @ essentials.jl:489 within `ifelse`
          %8 = icmp sgt i64 %.unpack5.unpack, 0
; │└└└└└└└
; │┌ @ abstractarray.jl:1241 within `getindex`
; ││┌ @ abstractarray.jl:1286 within `_getindex`
; │││┌ @ abstractarray.jl:1293 within `_to_subscript_indices`
; ││││┌ @ abstractarray.jl:1315 within `_unsafe_ind2sub`
; │││││┌ @ abstractarray.jl:2639 within `_ind2sub` @ abstractarray.jl:2677
; ││││││┌ @ int.jl:86 within `-`
         %9 = add nsw i64 %6, -1
; ││││││└
; ││││││┌ @ abstractarray.jl:2690 within `_ind2sub_recurse`
; │││││││┌ @ abstractarray.jl:2697 within `_div`
; ││││││││┌ @ int.jl:288 within `div`
           br i1 %8, label %pass, label %fail

If I change the broadcast kernel to use @linearidx instead of @cartesianidx, it runs almost as fast as the handwritten kernel and the result seems to be correct. But surely there must be a reason @cartesianidx was used?

@maleadt (Member) commented Jan 27, 2023

Yes, it does an integer division in order to go from a linear index (which @cartesianidx uses) to, well, a Cartesian (N-dimensional) one. That's generally pretty slow, but there's no easy way around it. Some platforms offer multi- or N-dimensional hardware indices (e.g., CUDA's threadIdx.xyz, or OpenCL's get_global_id(dim)), but those typically aren't fully general (either not supporting arbitrary rank, or being limited in size), so there's no straightforward path to using them without special-casing kernels per platform and input.
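To make the cost concrete, here is a minimal CPU sketch (illustrative only, not the GPUArrays implementation) of the linear-to-Cartesian conversion the kernel has to perform per element, assuming a 1-based, column-major n×m array; every trailing dimension costs an integer division:

function lin2cart(i, n)
    j, r = divrem(i - 1, n)  # the integer division on the hot path
    return (r + 1, j + 1)    # (row, column)
end

lin2cart(5, 3)  # linear index 5 in a 3×m array -> (2, 2)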

And the reason we use Cartesian indices and not linear ones is that you can pass non-contiguous GPU inputs to the broadcast kernel (say, a non-contiguous view) in which case there's no linear index that would work.

@maxwindiff (Contributor)

I assumed the non-contiguous view would still take care of translating linear indices to whatever indexing it uses underneath, or is that not true in general?

julia> a = MtlArray(Float32.(reshape(1:100, 10, 10)))
10×10 MtlMatrix{Float32}:
  1.0  11.0  21.0  31.0  41.0  51.0  61.0  71.0  81.0   91.0
  2.0  12.0  22.0  32.0  42.0  52.0  62.0  72.0  82.0   92.0
  3.0  13.0  23.0  33.0  43.0  53.0  63.0  73.0  83.0   93.0
  4.0  14.0  24.0  34.0  44.0  54.0  64.0  74.0  84.0   94.0
  5.0  15.0  25.0  35.0  45.0  55.0  65.0  75.0  85.0   95.0
  6.0  16.0  26.0  36.0  46.0  56.0  66.0  76.0  86.0   96.0
  7.0  17.0  27.0  37.0  47.0  57.0  67.0  77.0  87.0   97.0
  8.0  18.0  28.0  38.0  48.0  58.0  68.0  78.0  88.0   98.0
  9.0  19.0  29.0  39.0  49.0  59.0  69.0  79.0  89.0   99.0
 10.0  20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100.0

julia> a1 = @view a[[1,3,5], [2,4,6]]
3×3 view(::MtlMatrix{Float32}, [1, 3, 5], [2, 4, 6]) with eltype Float32:
 11.0  31.0  51.0
 13.0  33.0  53.0
 15.0  35.0  55.0

julia> a1[1:9]'
1×9 adjoint(::MtlVector{Float32}) with eltype Float32:
 11.0  13.0  15.0  31.0  33.0  35.0  51.0  53.0  55.0

Or can we specialize the kernel if the Broadcasted is one of the known styles? (not sure if this works, I'm not super familiar with the broadcasting machinery)

Just curious, does CUDA have this slowdown as well?

@maleadt (Member) commented Feb 9, 2023

CUDA suffers from this as well, but recent hardware generations have become much better at dealing with more complex operations like this. That's probably also why we haven't reimplemented broadcast ourselves yet.

IIRC, another potential slowdown is how the keeps tuple argument is passed (it's something that could be specialized on). Again, it used to be a problem on older NVIDIA GPUs, but I don't think it matters that much anymore.

@maxwindiff (Contributor)

Can we do something like "if dest and bc are contiguous, use linearidx"?
For dest I'm thinking of checking lastindex(dest) - firstindex(dest) + 1 == length(dest), and for bc checking whether the axes are unit ranges. I'm not confident about this, however; it'd be good to add tests covering different types of inputs. Do you have an example of making a non-contiguous view that breaks linear indexing?
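A purely illustrative sketch of the checks described above (the helper names are made up, not existing GPUArrays API):

# dest is linearly addressable over its whole length:
is_linear_dest(dest) = lastindex(dest) - firstindex(dest) + 1 == length(dest)

# every axis of the Broadcasted object is a 1-based unit range:
has_unit_axes(bc) = all(ax -> first(ax) == 1 && step(ax) == 1, axes(bc))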

For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?

@maleadt (Member) commented Feb 13, 2023

Can we do something like "if dest and bc are contiguous, use linearidx"?

That would make for a good fast path, yes.

Alternatively, or in addition, we have the power of a JIT, of course. Hacker's Delight has two chapters on integer division, including one on integer division by known constants. We could probably use that to generate a kernel that performs the indexing without any actual division. We'd need to generate a new kernel for each input size, but maybe that isn't too bad for GPU applications.
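For illustration, this is the kind of strength reduction Hacker's Delight describes, hardcoded here for UInt32 division by 3 (a JIT-generated kernel would derive the magic constant from the actual input size):

# x ÷ 3 as a multiply-and-shift; 0xAAAAAAAB == cld(2^33, 3)
div3(x::UInt32) = UInt32((UInt64(x) * 0xAAAAAAAB) >> 33)

div3(UInt32(100))  # 0x00000021 == 33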

Do you have an example of making a non-contiguous view that breaks linear indexing?

julia> A = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.427609  0.264936
 0.841486  0.732059

# contiguous
julia> view(A, :, 1)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.42760944
 0.84148616

# strided
julia> view(A, 1, :)
2-element view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1, :) with eltype Float32:
 0.42760944
 0.2649361

# arbitrary
julia> view(A, cu([1, 2, 4]))
3-element view(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, [1, 2, 4]) with eltype Float32:
 0.42760944
 0.84148616
 0.73205924

For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?

Similar to the above: IIRC, keeps is used to determine which parts of the multidimensional index to keep. Since keeps is a parameter to the kernel, that results in code with branches that depend on a parameter load. We should consider specializing the kernel instead, as for GPU execution the compilation overhead is probably worth it.
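A rough sketch of what such a specialization might look like (hypothetical, not current GPUArrays code): lifting keeps into the type domain turns the per-dimension branch into a compile-time constant instead of a load from the kernel arguments.

@inline function newindex_static(I::CartesianIndex, ::Val{keeps}, defaults) where {keeps}
    # keeps is part of the kernel's type, so the ternary folds at compile time
    CartesianIndex(ntuple(d -> keeps[d] ? I[d] : defaults[d], Val(length(keeps))))
end

# e.g. a 1×3 argument broadcast into a 3×3 destination:
newindex_static(CartesianIndex(2, 3), Val((false, true)), (1, 1))  # CartesianIndex(1, 3)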

@maleadt (Member) commented Feb 13, 2023

@vchuravy mentioned that LLVM is actually smart enough to optimize the CartesianIndices getindex away (which causes the integer division) as long as the CartesianIndices is constant. KernelAbstractions.jl does this by making the group-size getters (blockDim etc.) constant and constructing a CartesianIndices around those; however, I'm not sure that's what we want (I don't know how that deals with N-dimensional input sizes, and IIUC it causes idle threads when the shape of the inputs doesn't match the shape of the launch). It also means it might be as simple as making sure the input iterators are passed to the kernel as a constant, i.e., as a Val. And if we're specializing the kernel anyway, it might make sense to make the size queries constant as well, e.g., by wrapping inputs in StaticArrays.SizedArray.
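A rough sketch of the Val idea on a hypothetical hand-written Metal copy kernel (the broadcast kernel would need the same treatment):

function copy_kernel!(dest, src, ::Val{dims}) where {dims}
    is = CartesianIndices(dims)  # dims is a compile-time constant tuple
    i = thread_position_in_grid_1d()
    if i <= length(is)
        @inbounds dest[is[i]] = src[is[i]]
    end
    return
end

# size(a) is baked into the kernel's type, so a kernel is compiled per size:
# @metal threads=512 grid=cld(length(a), 512) copy_kernel!(a, b, Val(size(a)))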

Lots of things to try here 🙂

@maxwindiff (Contributor)

Thanks Tim for the pointers! I'll experiment a bit.

@maxwindiff (Contributor) commented Feb 14, 2023

I tried passing the iterator dimensions as a Val. It does eliminate the CartesianIndices getindex, but the division is simply moved to where the 2D array is indexed, so it's still slow. Then I tried wrapping the arrays in a SizedArray, but Metal complained that they are not isbits. Actually, even if we can make the array sizes static, would LLVM be able to turn the divisions into bit-twiddling code? If not, we'll still have a division to deal with.

Here are my logs: https://gist.github.com/maxwindiff/8d530dd43b3d626bf0aac3fe4f943ee9

I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:

 22:21:25 GPUArrays (master) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 9a44fb8..1cba162 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -55,7 +55,7 @@ end
         i = 0
         while i < nelem
             i += 1
-            I = @cartesianidx(dest, i)
+            I = @linearidx(dest, i)
             @inbounds dest[I] = bc′[I]
         end
         return
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, :, 1) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  2  3
 40  5  6
 70  8  9

julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, 1, :) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  20  30
  4   5   6
  7   8   9

julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, [1,3,4,5]) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  20  3
  4  50  6
 70   8  9

And broadcasting on contiguous arrays is now fast (the @cartesianidx version took 110 ms):

julia> @btime @Metal.sync begin
           global a .= b
         end
  11.805 ms (196 allocations: 5.95 KiB)

So it seems this lets us pay for the expensive index calculation only when needed. Is there any reason we can't do this?

@maleadt (Member) commented Feb 14, 2023

the division is simply moved to where the 2D array is indexed,

Yeah, the sizes probably need to be constant too. And yes, it might be a little tricky to pass the necessary information to the kernel, but in principle I think it should work (at least if LLVM knows how to optimize given constant denominators, which @vchuravy suggested it does).

I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:

Huh, yes, that appears to be the case. Even with non-contiguous arrays, the SubArray can still be indexed linearly, so I guess we shouldn't be using Cartesian indices here. That would make the solution much easier :-) I'll create a PR to run CI on this.
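A CPU analogue of that observation: even a strided, IndexCartesian SubArray accepts a linear index, since Base translates it to indices into the parent:

julia> A = collect(reshape(1.0:9.0, 3, 3));

julia> v = view(A, 1, :)   # strided, not contiguous in memory
3-element view(::Matrix{Float64}, 1, :) with eltype Float64:
 1.0
 4.0
 7.0

julia> IndexStyle(v)
IndexCartesian()

julia> v[2]                # linear indexing still works
4.0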

@maleadt (Member) commented Feb 14, 2023

Actually, arbitrary broadcastable objects don't appear to be linearly indexable:

julia> bc = Base.Broadcast.broadcasted(identity, rand(2,2))
Base.Broadcast.Broadcasted(identity, ([0.7110622323877549 0.5733414791299075; 0.620167447659585 0.017047464877951835],))

julia> IndexStyle(bc)
IndexCartesian()

julia> bc[4]
ERROR: BoundsError: attempt to access Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}} at index [4]
Stacktrace:
 [1] throw_boundserror(A::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Tuple{Int64})
   @ Base ./abstractarray.jl:744
 [2] checkbounds
   @ ./broadcast.jl:621 [inlined]
 [3] getindex(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Int64)
   @ Base.Broadcast ./broadcast.jl:609
 [4] top-level scope
   @ REPL[17]:1

@maleadt (Member) commented Feb 14, 2023

JuliaGPU/GPUArrays.jl#454 should help a bit, but most broadcast objects apparently have IndexStyle == IndexCartesian().

@maxwindiff (Contributor)

I'm curious: what makes the bc′ in _copyto! support linear indexing when forced? At first I thought it was because the index is passed as-is to the MtlArray inside the Broadcasted object, but then I tried something like a .= 10 and it still worked. Or is this all undefined behavior and I just happened to get lucky?

 0:18:34 GPUArrays (tb/avoid_cartesian) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 5a8da77..64b46f3 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -58,15 +58,14 @@ end
             # HACK: cartesian iteration is slow, so avoid it if possible
             # TODO: generalize this, much like how `eachindex` picks an appropriate iterator
             #       (::AnyGPUArray methods shouldn't be hardcoded to use linear indexing)
-            I = if isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc′), IndexLinear)
-                @linearidx(dest, i)
-            else
-                @cartesianidx(dest, i)
-            end
+            I = @linearidx(dest, i)
             @inbounds dest[I] = bc′[I]
         end
         return
     end
+    @info "IndexStyle" IndexStyle(dest) IndexStyle(bc) IndexStyle(bc′)
+    @info "typeof" typeof(dest) typeof(bc) typeof(bc′)
+    @info "bc′" bc′
+    @show @inbounds bc′[10]
     elements = length(dest)
     elements_per_thread = typemax(Int)
     heuristic = launch_heuristic(backend(dest), broadcast_kernel, dest, bc′, 1;
julia> a .= b;
┌ Info: IndexStyle
│   IndexStyle(dest) = IndexLinear()
│   IndexStyle(bc) = IndexCartesian()
└   IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│   typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│   typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{MtlMatrix{Float32}}}
└   typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}
┌ Info: bc′
└   bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}(Float32[-0.45997995 -0.9323782 … -0.78047013 -0.10763502; -0.41987753 -0.958211 … -0.11566621 -0.01383841; … ; -0.21164703 -0.6332897 … -0.13726139 -0.76514477; -0.71918046 -0.14217287 … -0.42983288 -0.99601007], (true, true), (1, 1)),))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = -0.624596f0

julia> a .= 10;
┌ Info: IndexStyle
│   IndexStyle(dest) = IndexLinear()
│   IndexStyle(bc) = IndexCartesian()
└   IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│   typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│   typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
└   typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
┌ Info: bc′
└   bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (10,))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = 10

@maleadt (Member) commented Feb 15, 2023

I think it may start to matter when adding singleton dimensions into the mix.
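A small CPU illustration of that concern, using Base.Broadcast internals: a 1×3 argument is extruded with keeps = (false, true), and its singleton dimension is only clamped when a Cartesian index passes through newindex. A raw linear index carries no per-dimension structure for that clamping to act on.

julia> using Base.Broadcast: extrude, newindex

julia> ex = extrude([10.0 20.0 30.0]);  # 1×3, singleton first dimension

julia> ex.keeps, ex.defaults
((false, true), (1, 1))

julia> newindex(CartesianIndex(2, 3), ex.keeps, ex.defaults)  # row clamped to 1
CartesianIndex(1, 3)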

@maleadt (Member) commented Feb 15, 2023

Did some more experimentation, and with JuliaGPU/GPUArrays.jl#454 the kernel is pretty fast now, getting 120 GB/s out of 180 GB/s (previously it only reached 8 GB/s). I guess that getting rid of the keeps branches would further improve performance, but this is a very good start already.

maleadt added the arrays label on May 22, 2023
@maleadt (Member) commented May 22, 2023

Just a note here: the GPUArrays.jl change got reverted (JuliaGPU/GPUArrays.jl#463), so performance is bad again.
