slow broadcast copy in 2D #41

Closed
LaurentPlagne opened this issue Jul 11, 2022 · 16 comments · Fixed by #304
Labels: arrays (Things about the array abstraction.) · performance (Gotta go fast.)

Comments

@LaurentPlagne commented Jul 11, 2022

The following code evaluates the performance of copying between two 2D square MtlArrays, a and b.
It achieves good bandwidth (360 GB/s) with the kernel version (the commented line) but poor bandwidth (46.2 GB/s) with the broadcast expression (a .= b)...

Note that since the last bug fix, the broadcast expression is OK (equivalent to the kernel copy) for 1D arrays.

using Metal

function kernel_copy!(a, b)
    (i,j) = thread_position_in_grid_2d()
    @inbounds a[i,j] = b[i,j]
    return
end

function device_copy(n=2^14,nsample=10)

    a = MtlArray(rand(Float32, n,n))
    b = MtlArray(rand(Float32, n,n))

    threads = (32,32)
    grid_size = cld.(n, threads)
    @show threads,grid_size

    ts=zeros(nsample)
    for i in 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            # @metal threads=threads grid=grid_size kernel_copy!(a, b)
            a .= b
        end
    end

    @assert Array(a)==Array(b)

    @show ts
    tmin = minimum(ts)

    size_in_bytes = 2*length(a)*sizeof(Float32) #1R+1W
    byte_per_ns = size_in_bytes / (tmin*1.e9)

    println("GBs: $(round(byte_per_ns; digits=3))")

    # Cleanup memory (is it necessary?)
    finalize(a)
    finalize(b)
end
device_copy()
@maxwindiff (Contributor)

I suspect this is because @cartesianidx(dest, i) in the broadcast kernel, which calls CartesianIndices(x)[i], ends up doing some division?

;  @ /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:55 within `broadcast_kernel`
; ┌ @ /Users/kichi/.julia/dev/GPUArrays/src/device/indexing.jl:81 within `macro expansion`
; │┌ @ multidimensional.jl:262 within `CartesianIndices`
; ││┌ @ abstractarray.jl:95 within `axes`
; │││┌ @ tuple.jl:222 within `map`
; ││││┌ @ range.jl:455 within `oneto`
; │││││┌ @ range.jl:453 within `OneTo` @ range.jl:440
; ││││││┌ @ promotion.jl:488 within `max`
; │││││││┌ @ essentials.jl:489 within `ifelse`
          %8 = icmp sgt i64 %.unpack5.unpack, 0
; │└└└└└└└
; │┌ @ abstractarray.jl:1241 within `getindex`
; ││┌ @ abstractarray.jl:1286 within `_getindex`
; │││┌ @ abstractarray.jl:1293 within `_to_subscript_indices`
; ││││┌ @ abstractarray.jl:1315 within `_unsafe_ind2sub`
; │││││┌ @ abstractarray.jl:2639 within `_ind2sub` @ abstractarray.jl:2677
; ││││││┌ @ int.jl:86 within `-`
         %9 = add nsw i64 %6, -1
; ││││││└
; ││││││┌ @ abstractarray.jl:2690 within `_ind2sub_recurse`
; │││││││┌ @ abstractarray.jl:2697 within `_div`
; ││││││││┌ @ int.jl:288 within `div`
           br i1 %8, label %pass, label %fail

If I change the broadcast kernel to use @linearidx instead of @cartesianidx, it runs almost as fast as the handwritten kernel and the result seems to be correct. But surely there must be a reason @cartesianidx was used?

@maleadt (Member) commented Jan 27, 2023

Yes, it does an integer division in order to go from a linear index (which @cartesianidx uses) to, well, a Cartesian (N-dimensional) one. That's generally pretty slow, but there's no easy way around it. Some platforms offer multi- or N-dimensional hardware indices (e.g., CUDA's threadIdx.xyz, or OpenCL's get_global_id(dim)), but those typically aren't fully general (either not supporting arbitrary rank, or being limited in size), so there's no straightforward path to using them without special-casing kernels per platform and input.
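To make the cost concrete, here is a minimal CPU sketch (illustrative only, not the GPUArrays implementation) of the linear-to-Cartesian conversion the kernel has to perform per element, assuming a 1-based, column-major n×m array; every trailing dimension costs an integer division:

function lin2cart(i, n)
    j, r = divrem(i - 1, n)  # the integer division on the hot path
    return (r + 1, j + 1)    # (row, column)
end

lin2cart(5, 3)  # linear index 5 in a 3×m array -> (2, 2)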

And the reason we use Cartesian indices and not linear ones is that you can pass non-contiguous GPU inputs to the broadcast kernel (say, a non-contiguous view) in which case there's no linear index that would work.

@maxwindiff (Contributor)

I assumed the non-contiguous view would still take care of translating linear indices to whatever indexing it uses underneath, or is that not true in general?

julia> a = MtlArray(Float32.(reshape(1:100, 10, 10)))
10×10 MtlMatrix{Float32}:
  1.0  11.0  21.0  31.0  41.0  51.0  61.0  71.0  81.0   91.0
  2.0  12.0  22.0  32.0  42.0  52.0  62.0  72.0  82.0   92.0
  3.0  13.0  23.0  33.0  43.0  53.0  63.0  73.0  83.0   93.0
  4.0  14.0  24.0  34.0  44.0  54.0  64.0  74.0  84.0   94.0
  5.0  15.0  25.0  35.0  45.0  55.0  65.0  75.0  85.0   95.0
  6.0  16.0  26.0  36.0  46.0  56.0  66.0  76.0  86.0   96.0
  7.0  17.0  27.0  37.0  47.0  57.0  67.0  77.0  87.0   97.0
  8.0  18.0  28.0  38.0  48.0  58.0  68.0  78.0  88.0   98.0
  9.0  19.0  29.0  39.0  49.0  59.0  69.0  79.0  89.0   99.0
 10.0  20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100.0

julia> a1 = @view a[[1,3,5], [2,4,6]]
3×3 view(::MtlMatrix{Float32}, [1, 3, 5], [2, 4, 6]) with eltype Float32:
 11.0  31.0  51.0
 13.0  33.0  53.0
 15.0  35.0  55.0

julia> a1[1:9]'
1×9 adjoint(::MtlVector{Float32}) with eltype Float32:
 11.0  13.0  15.0  31.0  33.0  35.0  51.0  53.0  55.0

Or can we specialize the kernel if the Broadcasted is one of the known styles? (not sure if this works, I'm not super familiar with the broadcasting machinery)

Just curious, does CUDA have this slowdown as well?

@maleadt (Member) commented Feb 9, 2023

CUDA suffers from this as well, but recent hardware generations have become much better at dealing with more complex operations like this. That's probably also why we haven't reimplemented broadcast ourselves yet.

IIRC, another potential slowdown is how the keeps tuple argument is passed (it's something that could be specialized on). Again, it used to be a problem on older NVIDIA GPUs, but I don't think it matters that much anymore.

@maxwindiff (Contributor)

Can we do something like "if dest and bc are contiguous, use linearidx"?
For dest I'm thinking of checking lastindex(dest) - firstindex(dest) + 1 == length(dest), and for bc checking whether the axes are unit ranges. I'm not confident about this, however; it'd be good to add tests covering different types of inputs. Do you have an example of making a non-contiguous view that breaks linear indexing?
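A purely illustrative sketch of the checks described above (the helper names are made up, not existing GPUArrays API):

# dest is linearly addressable over its whole length:
is_linear_dest(dest) = lastindex(dest) - firstindex(dest) + 1 == length(dest)

# every axis of the Broadcasted object is a 1-based unit range:
has_unit_axes(bc) = all(ax -> first(ax) == 1 && step(ax) == 1, axes(bc))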

For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?

@maleadt (Member) commented Feb 13, 2023

Can we do something like "if dest and bc are contiguous, use linearidx"?

That would make for a good fast path, yes.

Alternatively, or in addition, we have the power of a JIT, of course. Hacker's Delight has two chapters on integer division, including one on integer division by known constants. We could probably use that to generate a kernel that performs the indexing without any actual division. We'd need to generate a new kernel for each input size, but maybe that isn't too bad for GPU applications.
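For illustration, this is the kind of strength reduction Hacker's Delight describes, hardcoded here for UInt32 division by 3 (a JIT-generated kernel would derive the magic constant from the actual input size):

# x ÷ 3 as a multiply-and-shift; 0xAAAAAAAB == cld(2^33, 3)
div3(x::UInt32) = UInt32((UInt64(x) * 0xAAAAAAAB) >> 33)

div3(UInt32(100))  # 0x00000021 == 33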

Do you have an example of making a non-contiguous view that breaks linear indexing?

julia> A = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.427609  0.264936
 0.841486  0.732059

# contiguous
julia> view(A, :, 1)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.42760944
 0.84148616

# strided
julia> view(A, 1, :)
2-element view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1, :) with eltype Float32:
 0.42760944
 0.2649361

# arbitrary
julia> view(A, cu([1, 2, 4]))
3-element view(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, [1, 2, 4]) with eltype Float32:
 0.42760944
 0.84148616
 0.73205924

For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?

Similar to the above: IIRC, keeps is used to determine which parts of the multidimensional index to keep. Since keeps is a parameter to the kernel, that results in code with branches that depend on a parameter load. We should consider specializing the kernel instead, as for GPU execution the compilation overhead is probably worth it.
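A rough sketch of what such a specialization might look like (hypothetical, not current GPUArrays code): lifting keeps into the type domain turns the per-dimension branch into a compile-time constant instead of a load from the kernel arguments.

@inline function newindex_static(I::CartesianIndex, ::Val{keeps}, defaults) where {keeps}
    # keeps is part of the kernel's type, so the ternary folds at compile time
    CartesianIndex(ntuple(d -> keeps[d] ? I[d] : defaults[d], Val(length(keeps))))
end

# e.g. a 1×3 argument broadcast into a 3×3 destination:
newindex_static(CartesianIndex(2, 3), Val((false, true)), (1, 1))  # CartesianIndex(1, 3)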

@maleadt (Member) commented Feb 13, 2023

@vchuravy mentioned that LLVM is actually smart enough to optimize the CartesianIndices getindex away (which causes the integer division) as long as the CartesianIndices is constant. KernelAbstractions.jl does this by making the group-size getters (blockDim etc.) constant and constructing a CartesianIndices around those; however, I'm not sure that's what we want (I don't know how that deals with N-dimensional input sizes, and IIUC it causes idle threads when the shape of the inputs doesn't match the shape of the launch). It also means it might be as simple as making sure the input iterators are passed to the kernel as a constant, i.e., as a Val. And if we're specializing the kernel anyway, it might make sense to make the size queries constant as well, e.g., by wrapping inputs in StaticArrays.SizedArray.
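A rough sketch of the Val idea on a hypothetical hand-written Metal copy kernel (the broadcast kernel would need the same treatment):

function copy_kernel!(dest, src, ::Val{dims}) where {dims}
    is = CartesianIndices(dims)  # dims is a compile-time constant tuple
    i = thread_position_in_grid_1d()
    if i <= length(is)
        @inbounds dest[is[i]] = src[is[i]]
    end
    return
end

# size(a) is baked into the kernel's type, so a kernel is compiled per size:
# @metal threads=512 grid=cld(length(a), 512) copy_kernel!(a, b, Val(size(a)))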

Lots of things to try here 🙂

@maxwindiff (Contributor)

Thanks Tim for the pointers! I'll experiment a bit.

@maxwindiff (Contributor) commented Feb 14, 2023

I tried passing the iterator dimensions as a Val. It does eliminate the CartesianIndices getindex, but the division is simply moved to where the 2D array is indexed, so it's still slow. Then I tried wrapping the arrays in a SizedArray, but Metal complained that they are not isbits. Actually, even if we can make the array sizes static, would LLVM be able to turn the divisions into bit-twiddling code? If not, we'll still have a division to deal with.

Here are my logs: https://gist.github.com/maxwindiff/8d530dd43b3d626bf0aac3fe4f943ee9

I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:

 22:21:25 GPUArrays (master) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 9a44fb8..1cba162 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -55,7 +55,7 @@ end
         i = 0
         while i < nelem
             i += 1
-            I = @cartesianidx(dest, i)
+            I = @linearidx(dest, i)
             @inbounds dest[I] = bc′[I]
         end
         return
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, :, 1) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  2  3
 40  5  6
 70  8  9

julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, 1, :) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  20  30
  4   5   6
  7   8   9

julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, [1,3,4,5]) .*= 10; A
3×3 MtlMatrix{Int64}:
 10  20  3
  4  50  6
 70   8  9

And broadcasting on contiguous arrays is now fast (the @cartesianidx version took 110 ms):

julia> @btime @Metal.sync begin
           global a .= b
         end
  11.805 ms (196 allocations: 5.95 KiB)

So it seems this lets us pay for the expensive index calculation only when needed. Is there any reason we can't do this?

@maleadt (Member) commented Feb 14, 2023

the division is simply moved to where the 2D array is indexed,

Yeah, the sizes probably need to be constant too. And yes, it might be a little tricky to pass the necessary information to the kernel, but in principle I think it should work (at least if LLVM knows how to optimize given constant denominators, which @vchuravy suggested it does).

I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:

Huh, yes, that appears to be the case. Even with non-contiguous arrays, the SubArray can still be indexed linearly, so I guess we shouldn't be using Cartesian indices here. That would make the solution much easier :-) I'll create a PR to run CI on this.
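A CPU analogue of that observation: even a strided, IndexCartesian SubArray accepts a linear index, since Base translates it to indices into the parent:

julia> A = collect(reshape(1.0:9.0, 3, 3));

julia> v = view(A, 1, :)   # strided, not contiguous in memory
3-element view(::Matrix{Float64}, 1, :) with eltype Float64:
 1.0
 4.0
 7.0

julia> IndexStyle(v)
IndexCartesian()

julia> v[2]                # linear indexing still works
4.0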

@maleadt (Member) commented Feb 14, 2023

Actually, arbitrary broadcastable objects don't appear to be linearly indexable:

julia> bc = Base.Broadcast.broadcasted(identity, rand(2,2))
Base.Broadcast.Broadcasted(identity, ([0.7110622323877549 0.5733414791299075; 0.620167447659585 0.017047464877951835],))

julia> IndexStyle(bc)
IndexCartesian()

julia> bc[4]
ERROR: BoundsError: attempt to access Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}} at index [4]
Stacktrace:
 [1] throw_boundserror(A::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Tuple{Int64})
   @ Base ./abstractarray.jl:744
 [2] checkbounds
   @ ./broadcast.jl:621 [inlined]
 [3] getindex(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Int64)
   @ Base.Broadcast ./broadcast.jl:609
 [4] top-level scope
   @ REPL[17]:1

@maleadt (Member) commented Feb 14, 2023

JuliaGPU/GPUArrays.jl#454 should help a bit, but most broadcast objects apparently have IndexStyle == IndexCartesian().

@maxwindiff (Contributor)

I'm curious: what makes the bc′ in _copyto! support linear indexing when forced? At first I thought it was because the index is passed as-is to the MtlArray inside the Broadcasted object, but then I tried something like a .= 10 and it still worked. Or is this all undefined behavior and I just happened to get lucky?

 0:18:34 GPUArrays (tb/avoid_cartesian) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 5a8da77..64b46f3 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -58,15 +58,14 @@ end
             # HACK: cartesian iteration is slow, so avoid it if possible
             # TODO: generalize this, much like how `eachindex` picks an appropriate iterator
             #       (::AnyGPUArray methods shouldn't be hardcoded to use linear indexing)
-            I = if isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc′), IndexLinear)
-                @linearidx(dest, i)
-            else
-                @cartesianidx(dest, i)
-            end
+            I = @linearidx(dest, i)
             @inbounds dest[I] = bc′[I]
         end
         return
     end
+    @info "IndexStyle" IndexStyle(dest) IndexStyle(bc) IndexStyle(bc′)
+    @info "typeof" typeof(dest) typeof(bc) typeof(bc′)
+    @info "bc′" bc′
+    @show @inbounds bc′[10]
     elements = length(dest)
     elements_per_thread = typemax(Int)
     heuristic = launch_heuristic(backend(dest), broadcast_kernel, dest, bc′, 1;
julia> a .= b;
┌ Info: IndexStyle
│   IndexStyle(dest) = IndexLinear()
│   IndexStyle(bc) = IndexCartesian()
└   IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│   typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│   typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{MtlMatrix{Float32}}}
└   typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}
┌ Info: bc′
└   bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}(Float32[-0.45997995 -0.9323782 … -0.78047013 -0.10763502; -0.41987753 -0.958211 … -0.11566621 -0.01383841; … ; -0.21164703 -0.6332897 … -0.13726139 -0.76514477; -0.71918046 -0.14217287 … -0.42983288 -0.99601007], (true, true), (1, 1)),))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = -0.624596f0

julia> a .= 10;
┌ Info: IndexStyle
│   IndexStyle(dest) = IndexLinear()
│   IndexStyle(bc) = IndexCartesian()
└   IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│   typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│   typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
└   typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
┌ Info: bc′
└   bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (10,))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = 10

@maleadt (Member) commented Feb 15, 2023

I think it may start to matter when adding singleton dimensions into the mix.
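A small CPU illustration of that concern, using Base.Broadcast internals: a 1×3 argument is extruded with keeps = (false, true), and its singleton dimension is only clamped when a Cartesian index passes through newindex. A raw linear index carries no per-dimension structure for that clamping to act on.

julia> using Base.Broadcast: extrude, newindex

julia> ex = extrude([10.0 20.0 30.0]);  # 1×3, singleton first dimension

julia> ex.keeps, ex.defaults
((false, true), (1, 1))

julia> newindex(CartesianIndex(2, 3), ex.keeps, ex.defaults)  # row clamped to 1
CartesianIndex(1, 3)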

@maleadt (Member) commented Feb 15, 2023

Did some more experimentation, and with JuliaGPU/GPUArrays.jl#454 the kernel is pretty fast now, getting 120 GB/s out of 180 GB/s (previously it only reached 8 GB/s). I guess that getting rid of the keeps branches would further improve performance, but this is a very good start already.

maleadt added the arrays label on May 22, 2023
@maleadt (Member) commented May 22, 2023

Just a note here: the GPUArrays.jl change got reverted (JuliaGPU/GPUArrays.jl#463), so performance is bad again.
