
onecold is very slow #556

Closed
afternone opened this issue Jan 15, 2019 · 9 comments
@afternone

```jl
julia> valY |> size
(3755, 128)

julia> @time onecold(valY)
 21.304120 seconds (2.40 M allocations: 95.372 MiB, 0.08% gc time)
128-element Array{Int64,1}:
   1
   2
   3
   ⋮
 126
 127
 128
```
@colbec

colbec commented Jan 15, 2019

On my machine with Julia 1.0.3 I get:

```jl
julia> using Flux: onecold

julia> valY = rand(3755, 128);

julia> valY |> size
(3755, 128)

julia> @time onecold(valY)
  0.838053 seconds (2.47 M allocations: 126.871 MiB, 2.98% gc time)
128-element Array{Int64,1}:
 2885
 2932
  755
    ⋮
 3122
 3128
 1780
```

and with Julia master I get:

```jl
  0.001299 seconds (1.06 k allocations: 58.750 KiB)
128-element Array{Int64,1}:
 3659
 1292
    ⋮
```

so maybe there is something else happening on your machine?

@afternone (Author)

What happens when valY is changed to a CuArray?

@colbec

colbec commented Jan 16, 2019

No GPU here. Please specify your issue more completely.

@KristofferC (Contributor)

KristofferC commented Jan 18, 2019

The problem occurs when the input is on the GPU and is a TrackedArray, cf.:

```jl
using Flux: onecold
valY = rand(1000, 128);
@time onecold(valY)
#   0.000369 seconds (1.00 k allocations: 36.375 KiB)
@time onecold(gpu(valY))
#   0.025580 seconds (12.65 k allocations: 1.499 MiB)
@time onecold(Flux.param(valY));
#  0.007966 seconds (641.86 k allocations: 22.523 MiB, 49.17% gc time)
@time onecold(Flux.param(gpu(valY)));
#  4.803005 seconds (1.16 M allocations: 42.439 MiB, 0.15% gc time)
```

@KristofferC (Contributor)

And when running `CuArrays.allowscalar(false)` we get the error `scalar setindex! is disallowed`, which is why it is slow: `onecold` is falling back to scalar indexing on the GPU array.
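As an aside, a minimal sketch of how to surface this fallback instead of silently hitting it (assuming the CuArrays-era API of the time; newer CUDA.jl spells this `CUDA.allowscalar(false)`):

```jl
using Flux, CuArrays           # assumes a working GPU setup
CuArrays.allowscalar(false)    # make scalar getindex/setindex! throw instead of crawl

valY = gpu(rand(1000, 128))
Flux.onecold(valY)             # with the flag set, this errors with
                               # "scalar setindex! is disallowed" rather than running slowly
```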

MikeInnes added a commit that referenced this issue Jan 24, 2019
@MikeInnes (Member)

I've fixed the basic issue for vectors, but there's still the problem that we don't support argmax across a dimension in CuArrays. It's probably straightforward to implement that with mapreduce. We'll also need mapslices, I'm not sure if that works already.

@appleparan
Bumping this issue. We could now rewrite this using `findmax`, thanks to CuArrays' #500.
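For the record, a hedged sketch of what a `findmax`-based `onecold` for matrices could look like (the names here are hypothetical, not Flux's API); it avoids `mapslices` so the whole operation is a single `dims`-reduction:

```jl
# Hypothetical column-wise onecold via findmax(...; dims = 1)
function onecold_findmax(y::AbstractMatrix, labels = 1:size(y, 1))
    _, inds = findmax(y; dims = 1)      # 1×n matrix of CartesianIndex
    map(I -> labels[I[1]], vec(inds))   # keep only the row (label) index
end

onecold_findmax([0.1 0.9; 0.8 0.05; 0.1 0.05])  # -> [2, 1]
```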

@h-spiess
h-spiess commented Dec 4, 2019

> The problem occurs when the input is on the GPU and is a TrackedArray, cf.:
>
> ```jl
> using Flux: onecold
> valY = rand(1000, 128);
> @time onecold(valY)
> #   0.000369 seconds (1.00 k allocations: 36.375 KiB)
> @time onecold(gpu(valY))
> #   0.025580 seconds (12.65 k allocations: 1.499 MiB)
> @time onecold(Flux.param(valY));
> #  0.007966 seconds (641.86 k allocations: 22.523 MiB, 49.17% gc time)
> @time onecold(Flux.param(gpu(valY)));
> #  4.803005 seconds (1.16 M allocations: 42.439 MiB, 0.15% gc time)
> ```

Hey, I have the same performance problems with the new Flux release using Zygote, so I assume it's not due to being tracked by Tracker.

@piever (Contributor)

piever commented Jan 2, 2020

Not sure if this is a good general solution, but the following hack removes the need for `onecold` when computing the accuracy of a softmax output on one-hot encoded data, and is much faster when the model is run on the GPU with CuArrays.

```jl
using Statistics: mean
using Flux: OneHotMatrix

compare(y::OneHotMatrix, y′) = maximum(y′, dims = 1) .== maximum(y .* y′, dims = 1)
accuracy(x, y::OneHotMatrix) = mean(compare(y, model(x)))
```
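To make the trick concrete, a self-contained example (repeating the `compare` definition above; `scores` is a hypothetical stand-in for `model(x)`):

```jl
using Statistics: mean
using Flux: OneHotMatrix, onehotbatch

compare(y::OneHotMatrix, y′) = maximum(y′, dims = 1) .== maximum(y .* y′, dims = 1)

y = onehotbatch([3, 1, 2], 1:3)                  # true labels, one column per sample
scores = [0.1 0.8 0.2; 0.2 0.1 0.7; 0.7 0.1 0.1] # softmax-like scores, column per sample
compare(y, scores)        # true per column whenever the hot row attains the column max
mean(compare(y, scores))  # -> 1.0 here: every prediction matches its label
```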

bors bot added a commit that referenced this issue Jan 8, 2021
1448: Arbitrary dimension one-hot arrays r=DhairyaLGandhi a=darsnack

This supersedes #1447. It should address the same issues:
- fix #1445, #1229
- probably fix also #864, #556, #189

This PR introduces a new one-hot N-dimensional array type, `OneHotArray`. Like #1447, this approach avoids the pointer allocations associated with `OneHotMatrix` being an array of `OneHotVector`s. It also lifts the "height" into the type parameter to avoid unnecessary allocation. Unlike #1447, this approach does not introduce a new primitive type. Instead, a "one-hot vector" is represented with a single subtype of `Integer` that is configurable by the user. By default, the exposed API will use `UInt32`.

Fundamentally, the primitive type is necessary because wrapping a `UInt32` as a `OneHotVector` will suffer memory penalties when you create an `Array{<:OneHotVector}`. But if we begin by designing for N-dimensions, then `OneHotVector` is just the specialized 1D case (similar to how `Vector{T} = Array{T, 1}`).
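As an illustration of that design, here is a simplified, hypothetical sketch (not the PR's actual code) of the 2-D case: storing one integer per column is enough to behave like a `Bool` matrix.

```jl
# Simplified sketch of the OneHotArray idea for matrices:
# store only the hot row index per column; materialize Bools on demand.
struct TinyOneHotMatrix{A<:AbstractVector{UInt32}} <: AbstractMatrix{Bool}
    indices::A   # hot row per column
    height::Int  # number of labels
end
Base.size(o::TinyOneHotMatrix) = (o.height, length(o.indices))
Base.getindex(o::TinyOneHotMatrix, i::Int, j::Int) = o.indices[j] == i

x = TinyOneHotMatrix(UInt32[3, 1, 2], 3)
# x now acts as a 3×3 Bool matrix, and W * x reads as column selection
```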

## Performance

I compared against the same tests mentioned in #1447. Please suggest more if you want to.

1. #189
```jl
#master
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.095 μs (13 allocations: 50.86 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  24.948 μs (86 allocations: 3.11 KiB)

#1447
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.312 μs (3 allocations: 50.36 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.466 μs (61 allocations: 1.69 KiB)

# this PR
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  4.708 μs (3 allocations: 50.56 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.576 μs (63 allocations: 1.73 KiB)
```

2. #556
```jl
#master
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  365.712 μs (1131 allocations: 38.16 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:43
  1.330 s (781248 allocations: 31.59 MiB)

#1447
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  524.767 μs (8 allocations: 4.00 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  27.563 μs (169 allocations: 5.56 KiB)

# this PR
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  493.017 μs (8 allocations: 4.53 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  26.702 μs (171 allocations: 5.61 KiB)
```

## Summary

This should basically be #1447 but simpler to maintain w/ fewer changes. Tests are passing, though I think we should add more tests for one-hot data (currently our test set seems pretty sparse). Performance matches #1447 where I have tested, but please suggest more performance tests. In theory, any performance difference between #1447 and this PR should be recoverable.

### PR Checklist

- [ ] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello @chengchingwen 

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>