This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Upsampling GPU Kernel for Flux #293

Draft
wants to merge 1 commit into master

Conversation

avik-pal
Contributor

@avik-pal avik-pal commented Feb 24, 2019

Current Issues:

  • The addition on line 47 of upsample.jl is not safe without atomic addition; it results in incorrect and inconsistent answers in the backward pass.

@maleadt suggested on Slack to use CUDAatomics for now, but I could not get it working. Concerns regarding that:

  1. It needs a CuDeviceArray, and I am not entirely sure how to pass a CuArray to it (see the sketch after this list).
  2. Should we be adding a dependency on a package that is not recommended?
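
For point 1, a minimal sketch of what I understand should happen (hedged; this assumes the CUDAnative/CuArrays APIs of this era and uses a plain @cuda launch rather than the gpu_call path used elsewhere in this PR): @cuda runs cudaconvert on its arguments, so a CuArray reaches the kernel as a CuDeviceArray without any manual conversion.

    # Hedged sketch, not part of this PR: @cuda converts a host-side CuArray
    # into the CuDeviceArray that device code sees.
    using CuArrays, CUDAnative

    function add_one_kernel(x)               # x is a CuDeviceArray here
        i = threadIdx().x
        @inbounds x[i] += 1f0
        return nothing
    end

    x = CuArrays.zeros(Float32, 32)          # CuArray on the host side
    @cuda threads=length(x) add_one_kernel(x)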

Stuff to handle before it is ready for merging:

  • Open the corresponding pull requests in Flux.jl and NNlib.jl to test the complete pipeline.
  • Add tests to verify the correctness of the code.

I am not entirely sure if this kernel belongs at this location (inside CUDNN). I can shift it to any other location if needed.

cc @MikeInnes

@maleadt maleadt changed the title Upsampling GPU Kernel for Flux WIP: Upsampling GPU Kernel for Flux Apr 15, 2019
@dbadrian

dbadrian commented Sep 2, 2019

Okay, so I had a crack at this today.

Firstly, CUDAnative itself now supports atomics, so we can make a direct call like

    @inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])

For the sake of interest, the call to pointer (I don't know where it's actually defined, I guess in CUDAnative) probably just does

    ptr = convert(CuArrays.CuPtr{T}, CuArrays.buffer(x, x_idx))
    CUDAnative.DevicePtr{T,AS.Global}(ptr)

However, the ∇upsample code itself has two bugs: firstly, after the call to similar we need to zero out the tensor, or we add to its initial random values; secondly, the height/width need to be recalculated, or we end up with shifted indices (unless I fundamentally misunderstood how the gradients will propagate).

tl;dr:

function ∇upsample_kernel(state, y, x, height, width, channels, batch, stride, scale)
    i = @linearidx y state

    # decompose the linear index of y into (h, w, c, b) coordinates (column-major)
    y_idx = i
    y_h = (i - 1) % (height * stride[1]) + 1
    i = (i - 1) ÷ (height * stride[1])
    y_w = i % (width * stride[2]) + 1
    i = i ÷ (width * stride[2])
    y_c = i % channels + 1
    i = i ÷ channels
    y_b = i % batch + 1

    # map each y pixel back to the x pixel it was upsampled from
    x_h = (y_h - 1) ÷ stride[1] + 1
    x_w = (y_w - 1) ÷ stride[2] + 1
    x_c = y_c
    x_idx = (y_b - 1) * width * height * channels + (x_c - 1) * width * height + (x_w - 1) * height + x_h

    # several y pixels write to the same x pixel, hence the atomic accumulation
    @inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])

    return nothing
end

function ∇upsample(dy::CuArray, stride, scale = 1)
    (height, width, channels, batch) = size(dy)
    @assert height % stride[1] == 0
    @assert width % stride[2] == 0
    dx = similar(dy, (height ÷ stride[1], width ÷ stride[2], channels, batch))
    dx .= 0.0   # similar leaves dx uninitialised; zero it before accumulating into it
    (height, width, channels, batch) = size(dx)   # the kernel expects the dimensions of dx
    gpu_call(∇upsample_kernel, dy, (dy, dx, height, width, channels, batch, stride, scale))
    return dx
end

Maybe the call to CUDAnative can be made smarter; that's simply how I got it to work locally quickly. Maybe @maleadt can give some advice. I guess the correct usage would be @atomic? I did see a recent bug, https://github.com/JuliaGPU/CUDAnative.jl/issues/421, so I didn't test it for now.

One problem I can imagine with the current code: if scale, for example, is not of the same element type T as the CuArray, the call to CUDAnative will fail (whether that should be caught beforehand or made impossible by changing the function signature of upsample, I don't know).
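
As a hedged sketch of the signature option (illustration only, not what the PR currently does; the body would be unchanged from above):

    # Hypothetical variant: restrict `scale` to dy's element type T so a
    # mismatched scale can never reach the device call.
    function ∇upsample(dy::CuArray{T}, stride, scale::T = one(T)) where {T}
        # ... same body as above ...
    end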

For tests, I didn't write any yet. I manually verified which x_idx/y_idx are generated for a simple (2,2,1,1) -> (4,4,1,1) upsampling case, and empirically, in my model where I use it (with all dims > 1), it seems to work (comparing against the alternative of abusing a ConvTranspose as an upsampling layer).

However, I am happy to contribute some test cases soon as well. Let's crack down on this stale PR :)
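
A possible starting point for such a test, as a hedged sketch (assuming the ∇upsample above and nearest-neighbour semantics, where each dx entry is the sum of the stride×stride block of dy it maps to):

    using Test, CuArrays

    @testset "∇upsample, stride (2, 2)" begin
        dy = CuArrays.rand(Float32, 4, 4, 1, 1)
        dx = ∇upsample(dy, (2, 2))

        # CPU reference: dx[i, j] is the sum over the 2×2 block of dy it feeds
        dy_h = Array(dy)
        expected = [sum(dy_h[2i-1:2i, 2j-1:2j, 1, 1]) for i in 1:2, j in 1:2]

        @test Array(dx)[:, :, 1, 1] ≈ expected
    end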

@vchuravy
Member

vchuravy commented Sep 2, 2019

Great! Please do open a PR and also take a look at @atomic from CUDAnative.

@dbadrian

dbadrian commented Sep 2, 2019

Okay, will do so soon (currently semi-homeless, so a tad bit difficult to get things done these days).

I also just took a look at the code in FluxML/NNlib.jl#95 by pshask, which seems to work as well, including on CuArrays. A quick test didn't show big speed differences, but I need to do some proper benchmarking. Based on the results, one solution should be picked, although his works on both CuArrays and plain Arrays, which is neat and clean.
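
A hedged sketch of how that comparison could be benchmarked (nnlib_∇upsample is a placeholder name for the NNlib.jl#95 candidate; the sizes are arbitrary):

    using BenchmarkTools, CuArrays

    dy = CuArrays.rand(Float32, 128, 128, 32, 16)

    # CuArrays.@sync forces the GPU work to finish so the timing is meaningful
    @btime CuArrays.@sync ∇upsample($dy, (2, 2))
    @btime CuArrays.@sync nnlib_∇upsample($dy, (2, 2))   # placeholder name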

@maleadt
Member

maleadt commented Sep 3, 2019

> Maybe the call to CUDAnative can be made smarter; that's simply how I got it to work locally quickly. Maybe @maleadt can give some advice. I guess the correct usage would be @atomic? I did see a recent bug JuliaGPU/CUDAnative.jl#421 so I didn't test it for now.

To clarify, @atomic disables bounds checking, so that's not an issue in your case. It's just syntactic sugar though, so it's perfectly fine to just call atomic_add! explicitly.
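
For illustration, a hedged sketch of what the sugar would look like (assuming @atomic accepts this pattern), as a drop-in for the atomic line inside ∇upsample_kernel:

    # equivalent to CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])
    CUDAnative.@atomic x[x_idx] += y[y_idx]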

@sdewaele

Thanks for working on this! A fully functional upsampling layer will be a very important addition to Flux.

@maleadt maleadt marked this pull request as draft May 8, 2020 10:35
@maleadt maleadt changed the title WIP: Upsampling GPU Kernel for Flux Upsampling GPU Kernel for Flux May 8, 2020