new onehot implementation #1447
Conversation
Base.convert(::Type{UInt32}, o::OneHot) = UInt32(o)
Base.convert(::Type{To}, o::OneHot) where {To} = convert(To, convert(UInt32, o))

Base.convert(ot::Type{OneHot{K}}, x::Core.BuiltinInts) where {K} = toOneHot(ot, x)
better not rely on julia internals
Do you mean the `Core.BuiltinInts`?
yes
`Core.BuiltinInts` is just an alias for `Union{Bool, Int32, Int64, UInt32, UInt64, UInt8, Int128, Int16, Int8, UInt128, UInt16}`. I can replace it with the `Union` if you think that would be better.
Sorry for jumping in, but it seems better to take Carlo's suggestion below and directly define `convert` for each member of the union instead of `toOneHot`.
`Core.BuiltinInts` is undocumented, so it could be changed or moved in future Julia releases (although that seems highly unlikely). Better to define the union ourselves.
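For reference, a minimal sketch of what defining the union ourselves could look like (the constant name `PrimitiveInts` is our own, not from the PR):

```julia
# A locally defined replacement for the undocumented Core.BuiltinInts,
# covering the same primitive integer types without touching Julia internals.
const PrimitiveInts = Union{Bool, Int8, Int16, Int32, Int64, Int128,
                            UInt8, UInt16, UInt32, UInt64, UInt128}

Base.convert(ot::Type{OneHot{K}}, x::PrimitiveInts) where {K} = toOneHot(ot, x)
```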
toOneHot(ot::Type{OneHot{K}}, x::UInt32) where {K} = check_onehot_encode(ot, x)
toOneHot(ot::Type{OneHot{K}}, x::UInt64) where {K} = checked_onehot_trunc_uint(ot, x)
toOneHot(ot::Type{OneHot{K}}, x::UInt128) where {K} = checked_onehot_trunc_uint(ot, x)
toOneHot(ot::Type{OneHot{K}}, x::Bool) where {K} = and_int(zext_int(ot, x), toOneHot(ot, 0x1))
why dispatch `convert` to `toOneHot` instead of replacing these with `convert` methods?
`toOneHot` converts the primitive types into `OneHot` with specific intrinsic bit operations. `convert` then uses `toOneHot` to do the conversion.
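For readers unfamiliar with the intrinsics involved, here is a rough reconstruction of what a checked truncation such as `checked_onehot_trunc_uint` might do. This is our own sketch of the pattern, not the PR's exact code:

```julia
using Core.Intrinsics: trunc_int, zext_int, eq_int

# Assumes the PR's primitive one-hot type:
abstract type AbstractOneHotArray{N} <: AbstractArray{Bool, N} end
primitive type OneHot{K} <: AbstractOneHotArray{1} 32 end

# Hypothetical reconstruction: truncate a wider unsigned integer to the
# 32-bit OneHot representation, erroring if any high bits would be lost.
function checked_trunc(ot::Type{OneHot{K}}, x::Union{UInt64, UInt128}) where {K}
    y = trunc_int(ot, x)            # keep the low 32 bits
    back = zext_int(typeof(x), y)   # widen back to the original type
    eq_int(back, x) || throw(InexactError(:convert, ot, x))
    return y
end
```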
yes, I was wondering why you define `toOneHot` at all, instead of

```julia
Base.convert(ot::Type{OneHot{K}}, x::UInt32) where {K} = check_onehot_encode(ot, x)
Base.convert(ot::Type{OneHot{K}}, x::UInt64) where {K} = checked_onehot_trunc_uint(ot, x)
....
```
I can't really recall why they look the same as `convert` but live in separate methods. Originally I tried to follow the code in `Base`. I'll check whether they can be changed without any issues; if so, the `Core.BuiltinInts` issue could also be fixed.
This looks great. I see that the new …
The only breaking change I could spot is in the aliases:

```julia
const OneHotVector = OneHot
const OneHotMatrix = OneHotArray{K, 1, 2} where K
```

The parametrization of OneHotArray is a bit of a problem. In any case, given the impact of the change and since the PR is almost ready, I'll mark this as needed for v0.12.
fix doc printing typo Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>
We do have …
@CarloLucibello Could you elaborate more on this?
never mind, you solved the issue
# remove workaround when https://github.com/JuliaGPU/CuArrays.jl/issues/676 is fixed
A::AbstractMatrix * B::OneHotMatrix = A[:, cpu(map(x->x.ix, B.data))]

const OneHotVector{K} = OneHot{K}
const OneHotMatrix{K} = OneHotArray{K, 1}
Suggested change:

```diff
- const OneHotMatrix{K} = OneHotArray{K, 1}
+ const OneHotMatrix{K, A} = OneHotArray{K, 1, 2, A}
```
Why `A`?
why should we exclude it?
Because that's not usually something people have to handle. The value of `A` is handled by the `OneHotArray`'s outer constructor; excluding `A` allows the constructor to be dispatched correctly (avoiding the default constructor).
I may be wrong, but I think that adding `A` is not going to change the constructor dispatch at all, and has the advantage of more convenient function dispatch:

```julia
f(x::OneHotMatrix{K, <:CuArray}) where K = ...
```
The type var `K` is handled as a `UInt32`, but currently Julia cannot restrict the type of a type var. Therefore, I use the constructor of a partially specialized type (like `OneHotMatrix{K, 1}`) to check/convert the value. Calling `OneHotArray{K, 1, 2, A}` will invoke the default constructor, which bypasses the conversion and results in a type error. If we are going to use this kind of expression, then I will try to replace the default constructor.
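To illustrate the concern, here is a minimal, self-contained sketch (the `Demo` type is hypothetical, not the PR's code): the partially specified alias routes through a validating outer constructor, while the fully parameterized call hits the default constructor and skips the check:

```julia
struct Demo{K, A<:AbstractVector}
    data::A
end

# Outer constructor on the partially specified type: validates entries against K.
function Demo{K}(data::A) where {K, A<:AbstractVector}
    all(x -> 1 <= x <= K, data) || throw(ArgumentError("entries must be in 1:$K"))
    return Demo{K, A}(data)
end

Demo{3}([1, 2, 5])              # throws ArgumentError: the check catches the bad index
Demo{3, Vector{Int}}([1, 2, 5]) # constructs silently: default constructor skips the check
```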
I'm curious why not generalise the current implementation. It seems the performance was gained by avoiding scalar indexing on the internal vector in OneHotMatrix; could we just rewrite that?
Base.getindex(xs::OneHotVector, ::Colon) = OneHotVector(xs.ix, xs.of)

# onehot encode
primitive type OneHot{K} <: AbstractOneHotArray{1} 32 end
Why do we need to define a primitive here instead of making OneHotArray general across dimensions?
I had the same question. My guess is that the primitive type avoids the pointer allocation of a wrapper type.
But I think the need for a primitive type/array of one-hots comes from the design choice of having a one-hot vector as a unique type that arbitrary-dimension arrays are built off of. Instead of designing a type for the 1D case first, having an N-D one-hot array type as the base seems like an alternate design choice. I am currently prototyping this.
Avoiding pointer allocation is part of the reason. See my comment on this PR for the thorough explanation.
ret = cat(xidss...; dims=sdims)
OneHotArray(ret)
end
end
not sure, but probably we can remove all the `Val`s and just use `Int`s; I think the compiler is now smart enough to perform the same optimizations.
Also, since this doesn't support multiple dims, we should restrict the signature to

```julia
Base.cat(xss::OneHotArray{K}...; dims::Int)
```
`Val` is also used in Julia's `cat`-related functions (see base/abstractarray.jl). Those definitions are there to stay compatible with the Julia abstract array interface.
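As a small illustration of the `Val`-versus-`Int` trade-off being discussed (our own toy example, not the PR's code): with a literal argument, constant propagation usually lets the plain-`Int` version specialize just as well, while `Val` encodes the dimension in the type and guarantees it:

```julia
# Plain-Int version: relies on the compiler constant-propagating `dims`.
pickdim(x::AbstractArray, dims::Int) = size(x, dims)

# Val version: the dimension is a type parameter, so specialization is guaranteed.
pickdim(x::AbstractArray, ::Val{D}) where {D} = size(x, D)

pickdim(rand(2, 3), 2)        # 3
pickdim(rand(2, 3), Val(2))   # 3
```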
import Base: *
using Base: @_noinline_meta, @_inline_meta
using Core: is_top_bit_set
using Core.Intrinsics: bitcast, trunc_int, sext_int, zext_int, sle_int, eq_int, and_int
we should avoid using internal functions
isequal(prod(dims), length(parent)) || throw(DimensionMismatch("new dimensions $(dims) must be consistent with array size $(length(parent))"))
return isequal(K, first(dims)) ?
    ohreshape(parent, Base.tail(dims)) :
    Base._reshape(parent, dims)
also here we shouldn't use julia's internals
`Base._reshape` is used to avoid redefining all the reshape dim patterns and to handle the reshape of `AbstractArray{Bool}`. It's quite common when defining custom arrays (like FillArrays).
it's enough to use

```diff
- Base._reshape(parent, dims)
+ reshape(parent, dims)
```

then Base will handle that:
https://github.com/JuliaLang/julia/blob/0f0bc294f7b18d685c4998ada9c2c33841a74f2d/base/reshapedarray.jl#L112
we are overloading it; that would result in infinite recursion.
That line is exactly the reason I use `Base._reshape`.
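A minimal sketch of the recursion point (the `MyArray` type here is hypothetical, not the PR's code): if the overloaded `reshape` falls back to `reshape` itself for the dim patterns it doesn't special-case, the linked Base line dispatches right back into the overload; falling back to `Base._reshape` breaks the cycle:

```julia
struct MyArray{T, N} <: AbstractArray{T, N}
    data::Array{T, N}
end
Base.size(x::MyArray) = size(x.data)
Base.getindex(x::MyArray, i::Int...) = x.data[i...]

# Bad: for dims this method doesn't special-case, it calls itself forever.
# Base.reshape(x::MyArray, dims::Tuple{Vararg{Int}}) = reshape(x, dims)

# What the PR does instead: hand off to the internal fallback that
# materializes a lazy ReshapedArray, so dispatch never loops.
Base.reshape(x::MyArray, dims::Tuple{Vararg{Int}}) = Base._reshape(x, dims)
```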
Here is a slightly different approach (which I think @DhairyaLGandhi was alluding to). Fundamentally, both this PR and the old Flux code created one-hot arrays by first designing a one-hot vector, then wrapping that type in a multi-dimensional array (or matrix only, for the old code). Instead, I design first for the notion of an N-dimensional one-hot array, then I make:

```julia
const OneHotIndex{T, N} = Union{T, AbstractArray{T, N}}

struct OneHotArray{T<:Integer, L, N, var"N+1", I<:OneHotIndex{T, N}} <: AbstractArray{Bool, var"N+1"}
  indices::I
end

const OneHotVector{T, L} = OneHotArray{T, L, 0, 1, T}
const OneHotMatrix{T, L, I} = OneHotArray{T, L, 1, 2, I}

Base.size(x::OneHotArray{<:Any, L}) where L = (L, size(x.indices)...)

_onehotindex(x, i) = (x == i)

Base.getindex(x::OneHotArray, i, I...) = _onehotindex.(x.indices[I...], i)
Base.getindex(x::OneHotVector, i) = _onehotindex(x.indices, i)
```

Like this PR, the approach above should enjoy all the same advantages:
But using this approach means we don't need to define a new primitive type or use any … A demo:

```julia
julia> is = [1, 3, 2, 1]
4-element Array{Int64,1}:
 1
 3
 2
 1

julia> x = OneHotMatrix{eltype(is), 3, typeof(is)}(is)
3×4 OneHotArray{Int64,3,1,2,Array{Int64,1}}:
 1  0  0  1
 0  0  1  0
 0  1  0  0

julia> x[2, 3]
true

julia> is = [1 2; 3 1; 2 1]
3×2 Array{Int64,2}:
 1  2
 3  1
 2  1

julia> x = OneHotArray{eltype(is), 3, ndims(is), ndims(is) + 1, typeof(is)}(is)
3×3×2 OneHotArray{Int64,3,2,3,Array{Int64,2}}:
[:, :, 1] =
 1  0  0
 0  0  1
 0  1  0

[:, :, 2] =
 0  1  1
 1  0  0
 0  0  0

julia> x[1, 3, 2]
true
```
There are a few thoughts about some decisions, so I'll make some explanation here. The core idea was to develop a general representation that's both semantically meaningful and GPU friendly. So why use a primitive type instead of extending the original …? However, the primitive type does introduce lots of Julia's internal functions into the scope. Currently we need to use …
Thanks for the explanation!

I don't particularly understand this point, so do you mind elaborating? Why would CUDA handle an array of …

I'm also a little confused here. The difference between … Also, what do you mean by a 0-dimensional array? For one-hot arrays, there is always at least one dimension (the first). There's nothing like a scalar one-hot vector. Unless you are referring to the dimension of the internal integer array? Do you mind suggesting some more performance tests that illustrate the issues you had to solve with CUDA? I'd like to add them to #1448 for comparison. It's entirely possible that the GPUCompiler does something better with the primitive type, but fundamentally there doesn't seem to be a reason for it.
I'm talking about the old …

In Julia you can define a 0-dimensional array with something like …

I was referring to the …
Ah, I understand your point. Yeah, the corner case of a single one-hot vector will incur a wrapper overhead, but I'm curious where that is a problem in practice. If you have a single one-hot vector, then there is a total of two integers of memory being used, so it doesn't seem like a problem. If you have many one-hot vectors stored in an array, then it makes more sense to batch them. Is there a use-case where you don't want to batch, but you do need many one-hot vectors?

The …
A 0-dim array is allocated on the heap, and a struct would depend on the compiler. A primitive type is special because it can always simply be passed by value after compilation, so there's no potential edge case. I agree that it makes more sense to batch them, but sometimes you can only iterate through that sequence of one-hots (for some really dynamic and un-batch-able models). They are rare but exist, so I tend to eliminate the potential performance issue if possible.
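To make the layout point concrete, a small sketch (illustrative names, not the PR's): both a 32-bit primitive type and an `isbits` wrapper struct store inline in an `Array`, but a 0-dimensional array box always lives on the heap, which is the edge case a primitive value sidesteps:

```julia
primitive type P32 32 end     # a flat 4-byte value, always passed by value

struct Wrapped                # isbits wrapper; layout is up to the compiler
    ix::UInt32
end

isbitstype(P32), isbitstype(Wrapped)   # (true, true): both store inline in arrays

# A 0-dimensional array, by contrast, is a heap-allocated box around one value.
x0 = fill(UInt32(1))   # 0-dimensional Array{UInt32,0}
x0[]                   # 0x00000001
```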
1448: Arbitrary dimension one-hot arrays r=DhairyaLGandhi a=darsnack

This supersedes #1447. It should address the same issues:
- fix #1445, #1229
- probably fix also #864, #556, #189

This PR introduces a new one-hot N-dimensional array type, `OneHotArray`. Like #1447, this approach avoids the pointer allocations associated with `OneHotMatrix` being an array of `OneHotVector`s. It also lifts the "height" into the type parameter to avoid unnecessary allocation. Unlike #1447, this approach does not introduce a new primitive type. Instead, a "one-hot vector" is represented with a single subtype of `Integer` that is configurable by the user. By default, the exposed API will use `UInt32`. Fundamentally, the primitive type is necessary because wrapping a `UInt32` as a `OneHotVector` will suffer memory penalties when you create an `Array{<:OneHotVector}`. But if we begin by designing for N-dimensions, then `OneHotVector` is just the specialized 1D case (similar to how `Vector{T} = Array{T, 1}`).

## Performance

I compared against the same tests mentioned in #1447. Please suggest more if you want to.

1. #189
```jl
#master
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.095 μs (13 allocations: 50.86 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  24.948 μs (86 allocations: 3.11 KiB)

#1447
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  5.312 μs (3 allocations: 50.36 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.466 μs (61 allocations: 1.69 KiB)

# this PR
julia> x = Flux.onehotbatch(rand(1:100, 50), 1:100);

julia> W = rand(128, 100);

julia> @btime $W * $x;
  4.708 μs (3 allocations: 50.56 KiB)

julia> cW, cx = cu(W), cu(x);

julia> @btime $cW * $cx;
  8.576 μs (63 allocations: 1.73 KiB)
```

2. #556
```jl
#master
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  365.712 μs (1131 allocations: 38.16 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:43
  1.330 s (781248 allocations: 31.59 MiB)

#1447
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  524.767 μs (8 allocations: 4.00 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  27.563 μs (169 allocations: 5.56 KiB)

# this PR
julia> valY = randn(1000, 128);

julia> @btime Flux.onecold($valY);
  493.017 μs (8 allocations: 4.53 KiB)

julia> @btime Flux.onecold($(gpu(valY)));
  26.702 μs (171 allocations: 5.61 KiB)
```

## Summary

This should basically be #1447 but simpler to maintain w/ fewer changes. Tests are passing, though I think we should add more tests for one-hot data (currently our test set seems pretty sparse). Performance matches #1447 where I have tested, but please suggest more performance tests. In theory, any performance difference between #1447 and this PR should be recoverable.

### PR Checklist

- [ ] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello @chengchingwen

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
fix #1445, #1229
probably also #864, #556, #189
The new implementation supports a multi-dimensional one-hot representation (#1229, #1445) with the new types `OneHotArray{K}` and `OneHot{K}`. Since the `height` of the previous `OneHotMatrix` is now stored in the type parameter, we can reduce the memory consumption to almost half of the original (#1445). Most of the operations are now GPU compatible (#864) without scalar operations, potentially solving the performance issues (#189, #556). The API should be the same, as I implemented them with the new type.
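A rough sketch of the memory claim (our own illustration, not the PR's code): when the height `K` is lifted into the type, each one-hot value needs only its 4-byte index, instead of storing both index and height per element as the old wrapper did:

```julia
# Old-style one-hot vector: index plus height per value (8 bytes).
struct OldOneHotVector <: AbstractVector{Bool}
    ix::UInt32
    of::UInt32
end
Base.size(v::OldOneHotVector) = (Int(v.of),)
Base.getindex(v::OldOneHotVector, i::Integer) = i == v.ix

# New-style: the height K lives in the type, so the value is just 4 bytes.
abstract type AbstractOneHotArray{N} <: AbstractArray{Bool, N} end
primitive type OneHot{K} <: AbstractOneHotArray{1} 32 end

sizeof(OldOneHotVector)  # 8
sizeof(OneHot{10})       # 4
```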
Remaining problem:
The backward pass of the "embedding lookup" on GPU still requires scalar operations. This can be fixed with the scatter operator.
some performance comparison:
PR Checklist

- [ ] Final review from @dhairyagandhi96 (for API changes).