
Fast path onehotbatch(::Vector{Int}, ::UnitRange) #27

Merged
merged 4 commits on Dec 27, 2022

Conversation

@mcabbott mcabbott (Member) commented Dec 26, 2022

This adds the obvious shortcut when the data is already indices. It's a bit quicker, but also a partial solution to #16, as this will work with GPU arrays too.

julia> let x = rand(0:99, 100)
         @btime onehotbatch($x, 0:99)
       end;
  min 231.052 ns, mean 245.482 ns (1 allocation, 496 bytes)  # before
  min 97.912 ns, mean 106.604 ns (1 allocation, 496 bytes)   # after

Needs tests, and probably an error check. Done.
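
For context, a quick usage sketch of the path this PR targets (the display shown is illustrative, written from memory):

julia> onehotbatch([1, 3, 2], 1:3)   # integer data, UnitRange labels: the new fast path
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  ⋅  ⋅
 ⋅  ⋅  1
 ⋅  1  ⋅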

@codecov-commenter commented Dec 26, 2022

Codecov Report

Base: 95.96% // Head: 96.21% // This PR increases project coverage by +0.24% 🎉

Coverage data is based on head (7c1238f) compared to base (d27d037).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #27      +/-   ##
==========================================
+ Coverage   95.96%   96.21%   +0.24%     
==========================================
  Files           3        4       +1     
  Lines         124      132       +8     
==========================================
+ Hits          119      127       +8     
  Misses          5        5              
Impacted Files         Coverage            Δ
src/onehot.jl          96.15% <100.00%>    (+0.59%) ⬆️
src/OneHotArrays.jl    100.00% <0.00%>     (ø)


☔ View full report at Codecov.

Comment on lines +103 to +110
function onehotbatch(data::AbstractArray{<:Integer}, labels::AbstractUnitRange{<:Integer})
# lo, hi = extrema(data) # fails on Julia 1.6
lo, hi = minimum(data), maximum(data)
lo < first(labels) && error("Value $lo not found in labels")
hi > last(labels) && error("Value $hi not found in labels")
offset = 1 - first(labels)
indices = UInt32.(data .+ offset)
return OneHotArray(indices, length(labels))
@mcabbott mcabbott (Member, Author) commented Dec 28, 2022
Unfortunately the bounds checking here is quite expensive, especially on GPU arrays where I think each of minimum & maximum forces synchronisation:

julia> let ci = cu(rand(1:99, 100))
         @btime CUDA.@sync onehotbatch($ci, 1:99)
         @btime CUDA.@sync OneHotMatrix($ci, 99)
       end;
  100.993 μs (86 allocations: 4.02 KiB)
  2.803 μs (0 allocations: 0 bytes)

julia> let ci = cu(rand(1:99, 100))
         @btime CUDA.@sync maximum($ci), minimum($ci)
         @btime CUDA.@sync extrema($ci)
         @btime CUDA.@sync map($ci) do i
            0<i<100 || error("bad index")
            UInt32(i+0)
           end
        end;
  71.448 μs (58 allocations: 2.91 KiB)
  38.094 μs (29 allocations: 1.47 KiB)
  18.543 μs (30 allocations: 1.14 KiB)
  
julia> let ci = cu(rand(1:99, 100))  # without explicit CUDA.@sync 
         @btime extrema($ci)
         @btime OneHotMatrix($ci, 99)       # async, which is good
         @btime OneHotMatrix(map($ci) do i  # unfortunately not?
            0<i<100 || error("bad index")
            UInt32(i+0)
           end, 99)
        end;
  35.544 μs (29 allocations: 1.47 KiB)
  6.527 ns (0 allocations: 0 bytes)
  10.619 μs (30 allocations: 1.14 KiB)

Moving the check inside the broadcast is faster, at the cost of more obscure errors. Maybe that's ok? Still not fully async.

julia> map(cu(rand(1:199, 100))) do i
                   0<i<100 || error("bad index")
                   UInt32(i+0)
                  end
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
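
For concreteness, the fused version would look something like this (a sketch reusing the names from the diff above; the constant error string is deliberate, since GPU kernels cannot build dynamic messages):

function onehotbatch(data::AbstractArray{<:Integer}, labels::AbstractUnitRange{<:Integer})
    offset = 1 - first(labels)
    n = length(labels)
    indices = map(data) do i
        # range check folded into the same pass that converts to UInt32
        0 < i + offset <= n || error("bad index")
        UInt32(i + offset)
    end
    return OneHotArray(indices, n)
end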

Member commented:

It is rather obscure indeed. What if we wrap the map inside a try/catch and raise a proper error? Something like the sketch below, perhaps.
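
An untested sketch, reusing offset and n from the fused version above. With CUDA the device-side error may only surface at the next synchronization point, so the explicit synchronize inside the try seems necessary, which gives back up the asynchrony:

indices = try
    tmp = map(data) do i
        0 < i + offset <= n || error("bad index")
        UInt32(i + offset)
    end
    CUDA.synchronize()  # force any pending kernel exception to surface here
    tmp
catch
    throw(ArgumentError("data contains values outside of labels"))
end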

@mcabbott mcabbott (Member, Author) commented:

I wonder if we can tell the GPU not to wait? This doesn't work but perhaps something similar does:

julia> let i = rand(1:99, 100)
         @btime maximum($i)<100 || error("outside")
         @btime @async maximum($i)<100 || error("outside")
       end;
  58.407 ns (0 allocations: 0 bytes)
  759.747 ns (5 allocations: 496 bytes)

julia> let ci = cu(rand(1:99, 100))
         @btime maximum($ci)<100 || error("outside")
         @btime @async maximum($ci)<100 || error("outside")
       end;
  35.134 μs (29 allocations: 1.45 KiB)
  # hangs?

Member commented:

I think the ideal solution would be something like JuliaGPU/CUDA.jl#1140. If we had a way to write kernels here, another idea would be to create an ad-hoc kernel which flips a one-element Bool array to true if it finds an out-of-range element.
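
For illustration, such a kernel might look like the following (a hypothetical sketch, not part of this PR; _outofrange_kernel! and assert_inrange are made-up names):

using CUDA

function _outofrange_kernel!(flag, data, lo, hi)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(data)
        v = @inbounds data[i]
        if v < lo || v > hi
            @inbounds flag[1] = true  # benign race: any thread may set it
        end
    end
    return nothing
end

function assert_inrange(data::CuArray{<:Integer}, lo::Integer, hi::Integer)
    flag = CUDA.zeros(Bool, 1)  # one-element Bool array, starts false
    threads = 256
    blocks = cld(length(data), threads)
    @cuda threads=threads blocks=blocks _outofrange_kernel!(flag, data, lo, hi)
    Array(flag)[1] && error("Values outside $lo:$hi found in data")
    return nothing
end

Reading the flag back still synchronizes, of course, so by itself this mainly replaces the two reductions with a single cheap pass.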

@mcabbott mcabbott (Member, Author) commented:

That sounds like the right thing. Perhaps rather than owning a kernel, this package could call checkbounds(out, inds, 1) or whatever -- that's essentially the same operation.
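
i.e. something along these lines (a sketch; whether this fuses into a single GPU pass on CuArrays is an assumption):

# one bounds check over the whole index array, reusing Base's machinery
indices = UInt32.(data .+ offset)
checkbounds(Bool, Base.OneTo(length(labels)), indices) ||
    throw(ArgumentError("data contains values outside of labels"))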

I wondered what gather did, and it turns out there is no check:

julia> NNlib.gather([1,20,300,4000] |> cu, [2,4,2,99] |> cu)
4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
   20
 4000
   20
    0

julia> NNlib.gather([1,20,300,4000], [2,4,2,99])
ERROR: BoundsError: attempt to access 4-element Vector{Int64} at index [99]

The PR to add one, FluxML/NNlibCUDA.jl#51, has many benchmarks... perhaps also tens of μs.
