-
-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
accept integer labels in (logit)crossentropy #2141
base: master
Are you sure you want to change the base?
Conversation
mhmhm, I don't think so,
Yes, but I think it is convenient to have this method for the notable case of 1:10 labels (as pytorch does). |
@mcabbott ops, I erroneously edited your comment instead of replying |
Haha no problem. For crossentropy this path does Whether these optimisations are worth doing I don't know. The loss function is seldom much of the time. And of course it's the gradient time which counts most, and Zygote un-fuses things, and you should time it on the GPU, and make sure all paths are tested to death... If they aren't worth doing, then it's probably not worth having two implementations. It just seems like a bug magnet. The proposed API would be provided by a one-line definition API-wise this is a bit like #2090. It lets you simplify slightly a common case: train!(model, [(x1,y1), (x2,y2), ...], opt) do m,x,y # today
crossentropy(m(x), onehotbatch(y, 1:10))
end
train!(model, [(x1,y1), (x2,y2), ...], opt) do m,x,y # with this PR
crossentropy(m(x), y)
end
train!(crossentropy, model, [(x1,y1), (x2,y2), ...], opt) # with this and 2090 But unlike 2090 it only applies to a few loss functions (why not all?). It's one more thing to know about & to document, not just "two same-size arrays" always. |
One could argue that |
I benchmarked crossv2(ŷ, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, Flux.onehotbatch(y, 1:size(ŷ, 1))) and on CPU the performance are the same. On GPU we have this odd thing for which I think there should be a corresponding issue but i could not find it: julia> y = [1,2,3] |> gpu
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
2
3
julia> Flux.onehotbatch(y, 1:3) # back to cpu !!!??!!
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
1 ⋅ ⋅
⋅ 1 ⋅
⋅ ⋅ 1
julia> Flux.onehotbatch(y, 1:3) |> gpu
3×3 OneHotMatrix(::CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}) with eltype Bool:
1 ⋅ ⋅
⋅ 1 ⋅
⋅ ⋅ 1 |
On GPU I tried the workaround crossv2(ŷ::AbstractMatrix, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, Flux.onehotbatch(y, 1:size(ŷ, 1)))
crossv2(ŷ::AbstractMatrix, y::CuVector{<:Integer}) = logitcrossentropy(ŷ, Flux.onehotbatch(y, 1:size(ŷ, 1)) |> gpu) but performance are not acceptable: perf(10, 128)
# with ŷ
# 14.648 μs (10 allocations: 13.17 KiB)
# 27.381 μs (19 allocations: 35.39 KiB)
# with labels
# 13.716 μs (16 allocations: 9.88 KiB)
# 41.338 μs (119 allocations: 25.22 KiB)
# crossv2
# 15.860 μs (14 allocations: 14.00 KiB)
# 28.804 μs (23 allocations: 36.22 KiB)
# with ŷ - gpu
# 46.107 μs (163 allocations: 8.52 KiB)
# 109.656 μs (414 allocations: 24.17 KiB)
# with labels - gpu
# 42.620 μs (125 allocations: 6.23 KiB)
# 117.972 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 950.374 μs (564 allocations: 69.98 KiB)
# 1.134 ms (1117 allocations: 101.03 KiB)
perf(100, 128)
# with ŷ
# 121.148 μs (12 allocations: 103.02 KiB)
# 212.059 μs (25 allocations: 304.92 KiB)
# with labels
# 113.914 μs (17 allocations: 54.80 KiB)
# 215.665 μs (122 allocations: 159.98 KiB)
# crossv2
# 121.167 μs (13 allocations: 103.58 KiB)
# 213.441 μs (26 allocations: 305.48 KiB)
# with ŷ - gpu
# 47.880 μs (163 allocations: 8.52 KiB)
# 110.307 μs (414 allocations: 24.17 KiB)
# with labels - gpu
# 40.567 μs (125 allocations: 6.23 KiB)
# 122.961 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 936.448 μs (561 allocations: 69.72 KiB)
# 1.145 ms (1114 allocations: 100.77 KiB)
perf(100, 1280)
# with ŷ
# 1.378 ms (12 allocations: 1.00 MiB)
# 2.320 ms (25 allocations: 2.97 MiB)
# with labels
# 1.321 ms (18 allocations: 540.97 KiB)
# 2.169 ms (123 allocations: 1.52 MiB)
# crossv2
# 1.393 ms (13 allocations: 1.01 MiB)
# 2.309 ms (26 allocations: 2.98 MiB)
# with ŷ - gpu
# 60.885 μs (210 allocations: 10.77 KiB)
# 121.919 μs (464 allocations: 26.47 KiB)
# with labels - gpu
# 52.679 μs (174 allocations: 8.52 KiB)
# 128.602 μs (426 allocations: 21.92 KiB)
# crossv2 - gpu
# 9.136 ms (4064 allocations: 616.53 KiB)
# 9.197 ms (4620 allocations: 647.62 KiB) |
FluxML/OneHotArrays.jl#27 mostly fixes the gpu issue with onehot construction. crossv2(ŷ::AbstractMatrix, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, Flux.onehotbatch(y, 1:size(ŷ, 1)))
cross_unsafe(ŷ::AbstractMatrix, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, OneHotMatrix(y, size(ŷ, 1)))
perf(10, 128)
# with labels - gpu
# 42.620 μs (125 allocations: 6.23 KiB)
# 117.972 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 107.913 μs (284 allocations: 14.45 KiB)
# 177.093 μs (535 allocations: 30.11 KiB)
# cross_unsafe - gpu
# 46.647 μs (163 allocations: 8.52 KiB)
# 110.759 μs (414 allocations: 24.17 KiB)
perf(100, 128)
# with labels - gpu
# 40.567 μs (125 allocations: 6.23 KiB)
# 122.961 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 104.917 μs (284 allocations: 14.45 KiB)
# 171.141 μs (535 allocations: 30.11 KiB)
# cross_unsafe - gpu
# 46.137 μs (163 allocations: 8.52 KiB)
# 109.084 μs (414 allocations: 24.17 KiB)
perf(100, 1280)
# with labels - gpu
# 52.679 μs (174 allocations: 8.52 KiB)
# 128.602 μs (426 allocations: 21.92 KiB)
# crossv2 - gpu
# 137.448 μs (422 allocations: 20.91 KiB)
# 208.361 μs (676 allocations: 36.61 KiB)
# cross_unsafe - gpu
# 58.479 μs (210 allocations: 10.77 KiB)
# 121.839 μs (464 allocations: 26.47 KiB) Unfortunately, the unsafe version is really unsafe, out-of-range indexes are not caught anywhere since the following pattern doesn't error: julia> oh = OneHotMatrix([1,2,2,5], 3)
3×4 OneHotMatrix(::Vector{Int64}) with eltype Bool:
1 ⋅ ⋅ ⋅
⋅ 1 1 ⋅
⋅ ⋅ ⋅ ⋅
julia> oh .* ones(3, 4)
3×4 Matrix{Float64}:
1.0 0.0 0.0 0.0
0.0 1.0 1.0 0.0
0.0 0.0 0.0 0.0 |
The recent PR FluxML/OneHotArrays.jl#29 brings down the gpu gap of Since the gap is still appreciable I suggest we go with the fastest implementation. It is true, that we will have two different implementations for the crossentropy but they are both very simple, so I don't think we should be worried about this. crossv2(ŷ::AbstractMatrix, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, Flux.onehotbatch(y, 1:size(ŷ, 1)))
cross_unsafe(ŷ::AbstractMatrix, y::AbstractVector{<:Integer}) = logitcrossentropy(ŷ, OneHotMatrix(y, size(ŷ, 1)))
perf(10, 128)
# with ŷ - gpu
# 43.481 μs (163 allocations: 8.52 KiB)
# 101.580 μs (414 allocations: 24.17 KiB)
# with labels - gpu
# 38.242 μs (125 allocations: 6.23 KiB)
# 104.415 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 55.354 μs (197 allocations: 10.30 KiB)
# 114.214 μs (448 allocations: 25.95 KiB)
# cross_unsafe - gpu
# 46.076 μs (163 allocations: 8.52 KiB)
# 104.907 μs (414 allocations: 24.17 KiB)
perf(100, 128)
# with ŷ - gpu
# 47.519 μs (163 allocations: 8.52 KiB)
# 104.055 μs (414 allocations: 24.17 KiB)
# with labels - gpu
# 38.973 μs (125 allocations: 6.23 KiB)
# 106.269 μs (375 allocations: 19.61 KiB)
# crossv2 - gpu
# 57.748 μs (197 allocations: 10.30 KiB)
# 113.713 μs (448 allocations: 25.95 KiB)
# cross_unsafe - gpu
# 46.637 μs (163 allocations: 8.52 KiB)
# 105.106 μs (414 allocations: 24.17 KiB)
perf(100, 1280)
# with ŷ - gpu
# 58.510 μs (210 allocations: 10.77 KiB)
# 118.973 μs (464 allocations: 26.47 KiB)
# with labels - gpu
# 51.075 μs (174 allocations: 8.52 KiB)
# 120.616 μs (426 allocations: 21.92 KiB)
# crossv2 - gpu
# 67.346 μs (245 allocations: 12.56 KiB)
# 129.272 μs (499 allocations: 28.27 KiB)
# cross_unsafe - gpu
# 59.422 μs (210 allocations: 10.77 KiB)
# 121.117 μs (464 allocations: 26.47 KiB) |
Re this fast path, is it just that gather has no bounds checks? That's what FluxML/NNlibCUDA.jl#51 was going to fix, I think. API-wise, the question above is: Why should a method accepting indices (equivalent to making onehot, regardless of the implementation) apply to only to some loss functions, not to all? |
What other losses could accept labels? From a quick check from the phone crossentropy seems to be the only one, the other being for regression or binary classification |
There are a few which do accept labels, like binarycrossentropy. Almost all the others expect two arrays the same size, and could plausibly take a OneHotMatrix as the truth. They could all take this shortcut of taking its indices instead of the matrix. What's the argument that crossentropy is different? |
binarycrossentropy already takes labels and labels only.
That is unrealistic, they are meant for regression, i.e. continuous target. Yes, you can use |
You're right, I forgot about that, gather is unsafe on GPU: julia> labelsbad = [1,2,5]
3-element Vector{Int64}:
1
2
5
julia> ŷ = rand(2, 3)
2×3 Matrix{Float64}:
0.172578 0.307518 0.804586
0.117243 0.964788 0.685842
julia> logitcrossentropy(ŷ, labelsbad) # using gather, on CPU it is OK
ERROR: BoundsError: attempt to access 2×3 Matrix{Float64} at index [5, 3]
Stacktrace:
[1] throw_boundserror(A::Matrix{Float64}, I::Tuple{Int64, Int64})
@ Base ./abstractarray.jl:703
[2] checkbounds
@ ./abstractarray.jl:668 [inlined]
[3] view
@ ./subarray.jl:177 [inlined]
[4] _view
@ ~/.julia/packages/NNlib/57X3s/src/scatter.jl:38 [inlined]
[5] gather!(dst::Vector{Float64}, src::Matrix{Float64}, idx::Vector{CartesianIndex{2}})
@ NNlib ~/.julia/packages/NNlib/57X3s/src/gather.jl:27
[6] gather
@ ~/.julia/packages/NNlib/57X3s/src/gather.jl:78 [inlined]
[7] gather
@ ~/.julia/packages/NNlib/57X3s/src/gather.jl:116 [inlined]
[8] logitcrossentropy(ŷ::Matrix{Float64}, y::Vector{Int64}; dims::Int64, agg::typeof(mean))
@ Flux.Losses ~/.julia/dev/Flux/src/losses/functions.jl:299
[9] logitcrossentropy(ŷ::Matrix{Float64}, y::Vector{Int64})
@ Flux.Losses ~/.julia/dev/Flux/src/losses/functions.jl:293
[10] top-level scope
@ REPL[22]:1
julia> logitcrossentropy(ŷ |> gpu, labelsbad |> gpu) # using gather, on GPU returns garbage
-0.22993934f0
julia> crossv2(ŷ |> gpu, labelsbad |> gpu) # crossv2 is safe
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: KernelException: exception thrown during kernel execution on device NVIDIA GeForce RTX 2080 Ti
Stacktrace:
[1] check_exceptions()
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/exceptions.jl:34
[2] nonblocking_synchronize
@ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:331 [inlined]
[3] device_synchronize()
@ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:319
[4] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
@ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:41
[5] CuModule
@ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/module.jl:23 [inlined]
[6] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:479
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/cache.jl:95
[8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{OneHotMatrix{UInt32, CuDeviceVector{UInt32, 1}}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
[9] cufunction
@ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292 [inlined]
[10] macro expansion
@ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
[11] #launch_heuristic#248
@ ~/.julia/packages/CUDA/DfvRa/src/gpuarrays.jl:17 [inlined]
[12] _copyto!
@ ~/.julia/packages/GPUArrays/fqD8z/src/host/broadcast.jl:63 [inlined]
[13] copyto!
@ ~/.julia/packages/GPUArrays/fqD8z/src/host/broadcast.jl:46 [inlined]
[14] copy
@ ~/.julia/packages/GPUArrays/fqD8z/src/host/broadcast.jl:37 [inlined]
[15] materialize
@ ./broadcast.jl:860 [inlined]
[16] logitcrossentropy(ŷ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y::OneHotMatrix{UInt32, CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}; dims::Int64, agg::typeof(mean))
@ Flux.Losses ~/.julia/dev/Flux/src/losses/functions.jl:290
[17] logitcrossentropy
@ ~/.julia/dev/Flux/src/losses/functions.jl:288 [inlined]
[18] crossv2(ŷ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Main ~/.julia/dev/Flux/perf.jl:4
[19] top-level scope
@ REPL[24]:1 I'm going to remove the implementation based on gather in favor of |
Sure, that's what I meant about binarycrossentropy. Labels and only labels. But note that it's a separate function, rather than being a method It is possible that a 3rd version of crossentropy which takes integers starting from 1 instead of Bool should also be a 3rd function, not a 2nd method of the one which otherwise wants two arrays the same size? But IDK if the extra names are worth it. Either way, I think the 3rd version does deserve a 3rd docstring, not just an aside in the existing one. Agree about |
A convenience method that avoids the onehot transformation of the labels.
Pytorch supports this as well.
Performance seems to be approximately the same on both CPU and GPU:
We should wait for FluxML/NNlib.jl#449 before merging this.
PR Checklist