Fast path onehotbatch(::Vector{Int}, ::UnitRange)
#27
Conversation
Codecov Report: Base 95.96% // Head 96.21% // Increases project coverage by +0.24%.
Additional details and impacted files:
@@            Coverage Diff             @@
##             main      #27      +/-   ##
==========================================
+ Coverage   95.96%   96.21%   +0.24%
==========================================
  Files           3        4       +1
  Lines         124      132       +8
==========================================
+ Hits          119      127       +8
  Misses          5        5
☔ View full report at Codecov.
function onehotbatch(data::AbstractArray{<:Integer}, labels::AbstractUnitRange{<:Integer})
    # lo, hi = extrema(data)  # fails on Julia 1.6
    lo, hi = minimum(data), maximum(data)
    lo < first(labels) && error("Value $lo not found in labels")
    hi > last(labels) && error("Value $hi not found in labels")
    offset = 1 - first(labels)
    indices = UInt32.(data .+ offset)
    return OneHotArray(indices, length(labels))
end
Unfortunately the bounds checking here is quite expensive, especially on GPU arrays where I think each of minimum & maximum forces synchronisation:
julia> let ci = cu(rand(1:99, 100))
@btime CUDA.@sync onehotbatch($ci, 1:99)
@btime CUDA.@sync OneHotMatrix($ci, 99)
end;
100.993 μs (86 allocations: 4.02 KiB)
2.803 μs (0 allocations: 0 bytes)
julia> let ci = cu(rand(1:99, 100))
@btime CUDA.@sync maximum($ci), minimum($ci)
@btime CUDA.@sync extrema($ci)
@btime CUDA.@sync map($ci) do i
0<i<100 || error("bad index")
UInt32(i+0)
end
end;
71.448 μs (58 allocations: 2.91 KiB)
38.094 μs (29 allocations: 1.47 KiB)
18.543 μs (30 allocations: 1.14 KiB)
julia> let ci = cu(rand(1:99, 100)) # without explicit CUDA.@sync
@btime extrema($ci)
@btime OneHotMatrix($ci, 99) # async, which is good
@btime OneHotMatrix(map($ci) do i # unfortunately not?
0<i<100 || error("bad index")
UInt32(i+0)
end, 99)
end;
35.544 μs (29 allocations: 1.47 KiB)
6.527 ns (0 allocations: 0 bytes)
10.619 μs (30 allocations: 1.14 KiB)
Moving the check inside the broadcast is faster, at the cost of more obscure errors. Maybe that's ok? Still not fully async.
julia> map(cu(rand(1:199, 100))) do i
0<i<100 || error("bad index")
UInt32(i+0)
end
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
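For concreteness, here is a sketch of what that fused version could look like, adapted from the map benchmarked above. It is not necessarily the code this PR ends up with; the constant error string is deliberate, since runtime string interpolation does not work inside GPU kernels.
function onehotbatch(data::AbstractArray{<:Integer}, labels::AbstractUnitRange{<:Integer})
    lo, hi = first(labels), last(labels)
    offset = 1 - lo
    # Fuse the range check into the same map that builds the indices, so the
    # GPU launches one kernel and never synchronises for minimum/maximum.
    indices = map(data) do i
        lo <= i <= hi || error("Value outside of labels")  # surfaces as an opaque device error on GPU
        UInt32(i + offset)
    end
    return OneHotArray(indices, length(labels))
end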
It is rather obscure indeed. What if we wrap the map inside a try-catch and raise a proper error?
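A rough sketch of that idea follows; the helper name _checked_indices is made up here. Note the caveat in the comments: for CUDA arrays the device-side error may only surface at the next synchronisation, so the catch is not guaranteed to fire without giving up some asynchrony.
# Hypothetical helper (name made up): convert to UInt32 indices with the check
# fused into the map, and turn the opaque device failure into a readable error.
function _checked_indices(data, labels)
    lo, hi = first(labels), last(labels)
    offset = 1 - lo
    try
        return map(data) do i
            lo <= i <= hi || error("bad index")
            UInt32(i + offset)
        end
    catch
        # NB: for CuArrays the kernel error may only appear later, at the next
        # synchronisation, in which case this catch never fires -- see the
        # follow-up comments about keeping things asynchronous.
        throw(ArgumentError("some value in data lies outside the labels $labels"))
    end
end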
I wonder if we can tell the GPU not to wait? This doesn't work but perhaps something similar does:
julia> let i = rand(1:99, 100)
@btime maximum($i)<100 || error("outside")
@btime @async maximum($i)<100 || error("outside")
end;
58.407 ns (0 allocations: 0 bytes)
759.747 ns (5 allocations: 496 bytes)
julia> let ci = cu(rand(1:99, 100))
@btime maximum($ci)<100 || error("outside")
@btime @async maximum($ci)<100 || error("outside")
end;
35.134 μs (29 allocations: 1.45 KiB)
# hangs?
I think the ideal solution would be something like JuliaGPU/CUDA.jl#1140. If we had a way to write kernels, another idea would be to create an ad-hoc "in" kernel which flips a one-element bool array to true if it finds a matching element.
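A minimal sketch of that one-element-flag idea, adapted to the range check needed here rather than a literal "in": the kernel name, the all_in_range wrapper, and the launch configuration are made up for illustration and are not code from this PR.
using CUDA

# Each thread inspects one element and flips flag[1] when it sees a value
# outside lo:hi; the host reads the flag afterwards.
function _out_of_range_kernel!(flag, data, lo, hi)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(data)
        @inbounds if !(lo <= data[i] <= hi)
            flag[1] = true   # benign data race: every writer stores `true`
        end
    end
    return nothing
end

function all_in_range(data::CuArray{<:Integer}, lo::Integer, hi::Integer)
    flag = CUDA.zeros(Bool, 1)
    threads = 256
    @cuda threads=threads blocks=cld(length(data), threads) _out_of_range_kernel!(flag, data, lo, hi)
    return !Array(flag)[1]   # copying the flag back still synchronises, but the kernel itself is cheap
end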
That sounds like the right thing. Perhaps rather than owning a kernel, this package could call checkbounds(out, inds, 1)
or whatever -- that's essentially the same operation.
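For reference, Base's checkbounds already expresses both flavours of that check on CPU arrays; out below is just a hypothetical stand-in for the one-hot output:
julia> out = zeros(Float32, 99, 100);   # stand-in for the one-hot output

julia> checkbounds(Bool, out, [1, 50, 99], 1)    # every index lands inside axes(out, 1)
true

julia> checkbounds(Bool, out, [1, 50, 120], 1)   # 120 does not
false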
I wondered what gather
did, and it turns out there is no check:
julia> NNlib.gather([1,20,300,4000] |> cu, [2,4,2,99] |> cu)
4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
20
4000
20
0
julia> NNlib.gather([1,20,300,4000], [2,4,2,99])
ERROR: BoundsError: attempt to access 4-element Vector{Int64} at index [99]
The PR to add one, FluxML/NNlibCUDA.jl#51, has many benchmarks... perhaps also tens of μs.
This adds the obvious shortcut when the data is already indices. It's a bit quicker, but also a partial solution to #16, as this will work with GPU arrays too.
Needs tests, and probably an error check.
Done.
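A quick usage sketch of the shortcut: integer data with a UnitRange of labels dispatches to the new method and the stored indices are reused directly (exact printing may vary by version).
julia> using OneHotArrays

julia> onehotbatch([3, 1, 2], 1:3)   # hits the fast path for integer data
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  1  ⋅
 ⋅  ⋅  1
 1  ⋅  ⋅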