
onecold is very slow #556

Closed
afternone opened this issue Jan 15, 2019 · 9 comments
@afternone

```jl
julia> valY |> size
(3755, 128)

julia> @time onecold(valY)
 21.304120 seconds (2.40 M allocations: 95.372 MiB, 0.08% gc time)
128-element Array{Int64,1}:
   1
   2
   3
   ⋮
 126
 127
 128
```
@colbec

colbec commented Jan 15, 2019

On my machine with Julia 1.0.3 I get:

```jl
julia> using Flux: onecold

julia> valY = rand(3755, 128);

julia> valY |> size
(3755, 128)

julia> @time onecold(valY)
  0.838053 seconds (2.47 M allocations: 126.871 MiB, 2.98% gc time)
128-element Array{Int64,1}:
 2885
 2932
  755
    ⋮
 3122
 3128
 1780
```

and with Julia master I get:

```jl
  0.001299 seconds (1.06 k allocations: 58.750 KiB)
128-element Array{Int64,1}:
 3659
 1292
    ⋮
```

so maybe there is something else happening on your machine?

@afternone (Author)

What happens when valY is changed to a CuArray?

@colbec

colbec commented Jan 16, 2019

No GPU here. Please specify your issue more completely.

@KristofferC (Contributor)

KristofferC commented Jan 18, 2019

The problem occurs when the input is on the GPU and is a TrackedArray, cf.:

```jl
using Flux: onecold
valY = rand(1000, 128);
@time onecold(valY)
#   0.000369 seconds (1.00 k allocations: 36.375 KiB)
@time onecold(gpu(valY))
#   0.025580 seconds (12.65 k allocations: 1.499 MiB)
@time onecold(Flux.param(valY));
#  0.007966 seconds (641.86 k allocations: 22.523 MiB, 49.17% gc time)
@time onecold(Flux.param(gpu(valY)));
#  4.803005 seconds (1.16 M allocations: 42.439 MiB, 0.15% gc time)
```

@KristofferC (Contributor)

And when running `CuArrays.allowscalar(false)` we get the error `scalar setindex! is disallowed`, which is why it is slow: `onecold` is falling back to scalar indexing on the GPU array.
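As an aside, a minimal sketch of how to surface this fallback instead of silently hitting it (assuming the CuArrays-era API of the time; newer CUDA.jl spells this `CUDA.allowscalar(false)`):

```jl
using Flux, CuArrays           # assumes a working GPU setup
CuArrays.allowscalar(false)    # make scalar getindex/setindex! throw instead of crawl

valY = gpu(rand(1000, 128))
Flux.onecold(valY)             # with the flag set, this errors with
                               # "scalar setindex! is disallowed" rather than running slowly
```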

MikeInnes added a commit that referenced this issue Jan 24, 2019
@MikeInnes (Member)

I've fixed the basic issue for vectors, but there's still the problem that we don't support argmax across a dimension in CuArrays. It's probably straightforward to implement that with mapreduce. We'll also need mapslices, I'm not sure if that works already.

@appleparan
Bumping this issue. We could now rewrite this using `findmax`, thanks to CuArrays' #500.
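For the record, a hedged sketch of what a `findmax`-based `onecold` for matrices could look like (the names here are hypothetical, not Flux's API); it avoids `mapslices` so the whole operation is a single `dims`-reduction:

```jl
# Hypothetical column-wise onecold via findmax(...; dims = 1)
function onecold_findmax(y::AbstractMatrix, labels = 1:size(y, 1))
    _, inds = findmax(y; dims = 1)      # 1×n matrix of CartesianIndex
    map(I -> labels[I[1]], vec(inds))   # keep only the row (label) index
end

onecold_findmax([0.1 0.9; 0.8 0.05; 0.1 0.05])  # -> [2, 1]
```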

@h-spiess
h-spiess commented Dec 4, 2019

> The problem occurs when the input is on the GPU and is a TrackedArray, cf.:
>
> ```jl
> using Flux: onecold
> valY = rand(1000, 128);
> @time onecold(valY)
> #   0.000369 seconds (1.00 k allocations: 36.375 KiB)
> @time onecold(gpu(valY))
> #   0.025580 seconds (12.65 k allocations: 1.499 MiB)
> @time onecold(Flux.param(valY));
> #  0.007966 seconds (641.86 k allocations: 22.523 MiB, 49.17% gc time)
> @time onecold(Flux.param(gpu(valY)));
> #  4.803005 seconds (1.16 M allocations: 42.439 MiB, 0.15% gc time)
> ```

Hey, I have the same performance problems with the new Flux release using Zygote, so I assume it's not due to being tracked by Tracker.

@piever (Contributor)

piever commented Jan 2, 2020

Not sure if this is a good general solution, but the following hack removes the need for `onecold` when computing the accuracy of a softmax output on one-hot encoded data, and is much faster when the model is run on the GPU with CuArrays.

```jl
using Statistics: mean
using Flux: OneHotMatrix

compare(y::OneHotMatrix, y′) = maximum(y′, dims = 1) .== maximum(y .* y′, dims = 1)
accuracy(x, y::OneHotMatrix) = mean(compare(y, model(x)))
```
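To make the trick concrete, a self-contained example (repeating the `compare` definition above; `scores` is a hypothetical stand-in for `model(x)`):

```jl
using Statistics: mean
using Flux: OneHotMatrix, onehotbatch

compare(y::OneHotMatrix, y′) = maximum(y′, dims = 1) .== maximum(y .* y′, dims = 1)

y = onehotbatch([3, 1, 2], 1:3)                  # true labels, one column per sample
scores = [0.1 0.8 0.2; 0.2 0.1 0.7; 0.7 0.1 0.1] # softmax-like scores, column per sample
compare(y, scores)        # true per column whenever the hot row attains the column max
mean(compare(y, scores))  # -> 1.0 here: every prediction matches its label
```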

bors bot added a commit that referenced this issue Jan 8, 2021
1448: Arbitrary dimension one-hot arrays r=DhairyaLGandhi a=darsnack

This supersedes #1447. It should address the same issues:
- fix #1445, #1229
- probably fix also #864, #556, #189

This PR introduces a new one-hot N-dimensional array type, `OneHotArray`. Like #1447, this approach avoids the pointer allocations associated with `OneHotMatrix` being an array of `OneHotVector`s. It also lifts the "height" into the type parameter to avoid unnecessary allocation. Unlike #1447, this approach does not introduce a new primitive type. Instead, a "one-hot vector" is represented with a single subtype of `Integer` that is configurable by the user. By default, the exposed API will use `UInt32`.

Fundamentally, the primitive type is necessary because wrapping a `UInt32` as a `OneHotVector` will suffer memory penalties when you create an `Array{<:OneHotVector}`. But if we begin by designing for N-dimensions, then `OneHotVector` is just the specialized 1D case (similar to how `Vector{T} = Array{T, 1}`).
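As an illustration of that design, here is a simplified, hypothetical sketch (not the PR's actual code) of the 2-D case: storing one integer per column is enough to behave like a `Bool` matrix.

```jl
# Simplified sketch of the OneHotArray idea for matrices:
# store only the hot row index per column; materialize Bools on demand.
struct TinyOneHotMatrix{A<:AbstractVector{UInt32}} <: AbstractMatrix{Bool}
    indices::A   # hot row per column
    height::Int  # number of labels
end
Base.size(o::TinyOneHotMatrix) = (o.height, length(o.indices))
Base.getindex(o::TinyOneHotMatrix, i::Int, j::Int) = o.indices[j] == i

x = TinyOneHotMatrix(UInt32[3, 1, 2], 3)
# x now acts as a 3×3 Bool matrix, and W * x reads as column selection
```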

## Performance

I compared against the same tests mentioned in #1447. Please suggest more if you want to.

1. #189
```jl
#master
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.095 μs (13 allocations: 50.86 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  24.948 μs (86 allocations: 3.11 KiB)

#1447
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.312 μs (3 allocations: 50.36 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.466 μs (61 allocations: 1.69 KiB)

# this PR
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  4.708 μs (3 allocations: 50.56 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.576 μs (63 allocations: 1.73 KiB)
```

2. #556
```jl
#master
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  365.712 μs (1131 allocations: 38.16 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:43
  1.330 s (781248 allocations: 31.59 MiB)

#1447
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  524.767 μs (8 allocations: 4.00 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  27.563 μs (169 allocations: 5.56 KiB)

# this PR
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  493.017 μs (8 allocations: 4.53 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  26.702 μs (171 allocations: 5.61 KiB)
```

## Summary

This should basically be #1447 but simpler to maintain w/ fewer changes. Tests are passing, though I think we should add more tests for one-hot data (currently our test set seems pretty sparse). Performance matches #1447 where I have tested, but please suggest more performance tests. In theory, any performance difference between #1447 and this PR should be recoverable.

### PR Checklist

- [ ] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello @chengchingwen 

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>