Upsampling GPU Kernel for Flux #293
Conversation
Okay, so I had a crack at this today. Firstly, CUDAnative itself now supports atomics, so we can make a direct call like `@inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])`. For the sake of interest, the call to `pointer` (I don't know where it's actually defined, I guess CUDAnative) probably only does

```julia
ptr = convert(CuArrays.CuPtr{T}, CuArrays.buffer(x, x_idx))
CUDAnative.DevicePtr{T,AS.Global}(ptr)
```

However, the `∇upsample` code itself has two bugs: firstly, after the call to `similar` we need to zero out the tensor, or we add to its initial random values; and secondly, the height/width need to be recalculated from `dx`, or we end up with shifted indices (unless I fundamentally misunderstood how the gradients will propagate). tl;dr:

```julia
function ∇upsample_kernel(state, y, x, height, width, channels, batch, stride, scale)
    i = @linearidx y state
    y_idx = i
    # decompose the linear index into (h, w, c, b) coordinates of y (column-major)
    y_h = (i - 1) % (height * stride[1]) + 1
    i = (i - 1) ÷ (height * stride[1])
    y_w = i % (width * stride[2]) + 1
    i = i ÷ (width * stride[2])
    y_c = i % channels + 1
    i = i ÷ channels
    y_b = i % batch + 1
    # map each output-gradient pixel back to the input pixel it was upsampled from
    x_h = (y_h - 1) ÷ stride[1] + 1
    x_w = (y_w - 1) ÷ stride[2] + 1
    x_c = y_c
    x_idx = (y_b - 1) * width * height * channels + (x_c - 1) * width * height + (x_w - 1) * height + x_h
    # several output pixels map to the same input pixel, hence the atomic add
    @inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])
    return nothing
end
```
```julia
function ∇upsample(dy::CuArray, stride, scale = 1)
    (height, width, channels, batch) = size(dy)
    @assert height % 2 == 0
    @assert width % 2 == 0
    dx = similar(dy, (height ÷ stride[1], width ÷ stride[2], channels, batch))
    # similar() leaves the memory uninitialised, so zero it before accumulating into it
    dx .= 0.0
    # recompute the dims from dx so the kernel indexes the smaller array correctly
    (height, width, channels, batch) = size(dx)
    gpu_call(∇upsample_kernel, dy, (dy, dx, height, width, channels, batch, stride, scale))
    return dx
end
```

Maybe the call to CUDAnative can be made smarter; that's simply how I got it to work locally quickly. Maybe @maleadt can give some advice. I guess the correct usage would be via `@atomic`? I did see a recent bug, https://github.com/JuliaGPU/CUDAnative.jl/issues/421, so I didn't test it for now. One problem I can imagine with the current code: if `scale`, e.g., is not of the same `T` as the CuArray elements, then the call to CUDAnative will fail (whether it should be caught beforehand or made impossible by changing the function signature of `upsample`, I don't know). For tests, I didn't write any yet. I manually verified which `x_idx`/`y_idx` are generated for a simple (2,2,1,1) -> (4,4,1,1) upsampling case, and empirically, in my model where I use it (with all dims > 1), it seems to work (comparing to the alternative of abusing a ConvTranspose as an upsampling layer). However, I am happy to contribute some test cases soon; a rough sketch of what I have in mind is below. Let's crack down on some stale PRs :)
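Something like the following is the kind of test I'd add (just a sketch, assuming the nearest-neighbour semantics above; `∇upsample_cpu` is a throwaway reference helper for the comparison, not something this PR defines):

```julia
using Test, CuArrays

# Naive CPU reference: each output-gradient pixel is accumulated into the
# input pixel it was upsampled from.
function ∇upsample_cpu(dy::Array, stride)
    h, w, c, b = size(dy)
    dx = zeros(eltype(dy), h ÷ stride[1], w ÷ stride[2], c, b)
    for l in 1:b, k in 1:c, j in 1:w, i in 1:h
        dx[(i - 1) ÷ stride[1] + 1, (j - 1) ÷ stride[2] + 1, k, l] += dy[i, j, k, l]
    end
    return dx
end

@testset "∇upsample" begin
    dy = rand(Float32, 4, 4, 3, 2)
    @test Array(∇upsample(cu(dy), (2, 2))) ≈ ∇upsample_cpu(dy, (2, 2))
end
```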
Great! Please do open a PR and also take a look at
Okay, will do so soon (currently semi-homeless, so a tad difficult to get things done these days). Also, I just took a look at the code in FluxML/NNlib.jl#95 by pshask, which seems to work as well, including on CuArrays. A quick test didn't yield big speed differences, but I need to do some proper benchmarking. I guess, based on the results, one solution should be picked. Although his works on both CuArrays and plain Arrays, which is neat and clean.
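For the benchmarking I'd run something along these lines (just a sketch; `∇upsample_nnlib` is a placeholder for whatever gradient function FluxML/NNlib.jl#95 ends up exposing, and the sizes are arbitrary):

```julia
using BenchmarkTools, CuArrays

dy = cu(rand(Float32, 128, 128, 64, 16))

# Array(...) copies the result back to the host, which also ensures the
# kernel has finished before the timing stops.
@btime Array(∇upsample($dy, (2, 2)))
# @btime Array(∇upsample_nnlib($dy, (2, 2)))   # hypothetical comparison target
```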
To clarify,
Thanks for working on this! A fully functional upsampling layer will be a very important addition to Flux.
Current Issues:
- @maleadt suggested on Slack to use CUDAatomics for now, but I could not get it working. Concerns regarding that

Stuff to handle before it is ready for merging:
- I am not entirely sure whether this kernel belongs at this location (inside CUDNN). I can shift it to any other location if needed.
cc @MikeInnes