Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bilinear upsample layer #1180

Closed
wants to merge 14 commits into from

Conversation

ltjkoomen
Copy link

Hi,

I've implemented a bilinear upsampling layer that upsamples the first 2 dimensions of a 4-dimensional array with an integer factor. I have seen the implementation in #1136, but:

  • This one is about 10 times faster on CPU on my laptop
  • This one uses broadcasting instead of scalar loop indexing, which makes it usable on the GPU
  • This one is directly compatible with Zygote

Some possible TODOs:

  • Currently the utility functions in src/layers/upsampling.jl used by the implementation are all added to the Flux namespace when doing using Flux, which seems like pollution and is unnecessary. However, I am not quite sure what the correct way is in Julia to shield these "private" functions.
  • With this implementation it is no problem to define an output size, instead of the quite limiting integer upscaling factors. Although usually upscaling is used to, for instance, increase the number of samples in each dimension by 2, maybe we should still expose this functionality somehow? This could be implemented through some specialized constructor perhaps.

ltjkoomen added 2 commits May 13, 2020 15:37
Added BilinearUpsample2d layer
@kczimm
Copy link

kczimm commented May 13, 2020

@ltjkoomen this is awesome. I hope others can help review this PR so it can be merged quickly. Thank you for this contribution. I'm happy to close #1136 in favor of this one.

@DhairyaLGandhi
Copy link
Member

Could we add GPU tests, I'd like to ensure that it doesn't cause trouble there

@ltjkoomen
Copy link
Author

I have noticed a problem when using gradient when the input is a CuArray. Although the upsampled output array is correct (equal to the CPU output), the output of gradient is wrong.

using Flux: gradient, BilinearUpsample2d
using CuArrays

x = Float32.([1 2; 3 4])[:,:,:,:]
x_c = CuArray(x)

c = BilinearUpsample2d((2,2))

o = c(x)
o_c = c(x_c)

g = gradient(x -> sum(c(x)), x)
g_c = gradient(x -> sum(c(x)), x_c)

The result is that o and o_c are equal, but g and g_c are not.
Anyone have experience getting Zygote to work with CuArrays?

Also removed a faulty doctest
@mcabbott
Copy link
Member

I haven't looked hard but I suspect that the problem is aliasing: won't you end up writing to two views of the same object, with shifted indices? When this gets done in parallel, I think that CuArrays is not careful about the order.

@ltjkoomen
Copy link
Author

ltjkoomen commented May 14, 2020

Even when I remove the @views and do it like this:

v1 = img[ilow1,ilow2,:,:]
v2 = img[ihigh1,ilow2,:,:]
imgupsampled1 = v1 .* (1 .- wdiff1) .+ v2 .* wdiff1
v3 = imgupsampled1[:,ihigh2_r,:,:]
imgupsampled2 = imgupsampled1 .* (1 .- wdiff2) .+ v3 .* wdiff2

It still doesn't work.

Wondering if it happens because the indices themselves are no CuArrays, which I did because it leads to a scalar indexing warning and slowdown. That's mentioned here.

I think it's because of the way I implemented the second interpolation step (in the 2nd dimension imgupsampled1[:,ihigh2_r,:,:], this will give the wrong gradients as ihigh2_r is copying a single column of imgupsampled1 multiple times. Will need a re-implementation.

I've tried it like this:

v1 = img[ilow1,:,:,:]
v2 = img[ihigh1,:,:,:]

imgupsampled1 = v1 .* (1 .- wdiff1) .+ v2 .* wdiff1

v3 = imgupsampled1[:,ilow2,:,:]
v4 = imgupsampled1[:,ihigh2,:,:]

imgupsampled2 = v3 .* (1 .- wdiff2) .+ v4 .* wdiff2

And with a CuArray it still fails to find the correct gradient, while the CPU code is doing fine all this time.

@ltjkoomen
Copy link
Author

I have located the problem, it seems gradient doesn't work with a CuArray generated through indexing with an array like idx=[1, 1, 2], like so: out = x[idx]. I have included a minimal example:

using Flux
using CuArrays

function min_example(x, idx)
    out = x[idx]
    return out
end

x = Float32.([1, 2])
idx = [1, 1, 2]
x_c = CuArray(x)

# CPU version
fex(x) = min_example(x, idx)
f(x) = sum(fex(x))
df(x) = gradient(f, x)[1]

# CuArray version
fex_c(x_c) = min_example(x_c, idx)
f_c(x_c) = sum(fex_c(x_c))
df_c(x_c) = gradient(f, x_c)[1]

@assert fex(x) == cpu(fex_c(x_c)) # pass, forward is the same
@assert df(x) == cpu(df_c(x_c)) # fail, gradient is wrong for the CuArray version

Unsure whether this is a bug or expected behavior. Or is this actually the possible issue brought up by @mcabbott ?

ltjkoomen added 2 commits May 15, 2020 12:42
Gradient doesn't currently work when using CuArrays
Currently gradient does not work with BilinearUpsampling2d when using CuArrays
@mcabbott
Copy link
Member

The issue I should have linked is https://github.com/JuliaGPU/CuArrays.jl/issues/684 . Was a bit confused by views etc, but the core idea is that you want to accumulate at indices like idx=[1, 1, 2], which can easily go wrong. (I think the same happens for @strided, threaded broadcasting on CPU.)

Surely there is some way around this, though. If you don't want to up/down-sample by factors larger than 2, can you divide up the work even/odd & be sure that there are no collisions?

@ltjkoomen
Copy link
Author

What about defining a custom adjoint function instead? The adjoint of upsampling is a downsampling operation, which could be implemented efficiently using the existing convolutional layer code. The forward part seems to work fine even with this current implementation, both the normal- and CUDA-array versions.

ltjkoomen added 7 commits May 16, 2020 16:38
Since Zygote isn't able to properly handle the current implementation, and moving to an iterative approach I believe would mean a very significant performance reduction, I have added a custom adjoint.

Since the adjoint of upsampling is a downsampling operation, I have used Flux.Conv in combination with a downsample kernel and some manual edge-effect correction.
Effort to reduce memory footprint slightly
type instability occurred due to a variable dimension number, fixed by hard-coding everything to 4 dimensions
Only type-instability left is the fact that the weights of Flux.Conv are not type stable.
@ltjkoomen
Copy link
Author

I have fixed some type-instabilities, mostly by hardcoding the fact that there are 4 dimensions to the input array and the first 2 are upsampled. This has fixed all type-instability issues in the forward pass. However, the backward pass is still not type-stable because of the fact that I have used the Flux.Conv layer to implement the downsampling operation in the custom adjoint. The type of the weights of Flux.Conv cannot be inferred, an issue which I have already brought up in #1178.

@andevellicus
Copy link

Is there any way to make this work for more than 2D? I work with medical imaging so a lot of that is in 3 dimensions....

@ltjkoomen
Copy link
Author

That would be trilinear upsampling. I guess the same method I use for bilinear upsampling could be used, but I haven't tried it yet.

@kczimm
Copy link

kczimm commented May 29, 2020

What is pending for this merge? If it just the type instability due to Flux.Conv can it be merged and then it will presumably be fixed when/if Flux.Conv fixes the type instability?

@@ -30,6 +30,7 @@ Random.seed!(0)
include("layers/normalisation.jl")
include("layers/stateless.jl")
include("layers/conv.jl")
include("layers/upsample.jl")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

@CarloLucibello
Copy link
Member

I agree this should be merged and type instabilities issues addressed separately

Copy link
Member

@DhairyaLGandhi DhairyaLGandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, I had the review there but damn I guess I forgot to put it up

Typically we also want to match the indentation generally across the additions

wdiff1 = eltype(img).(wdiff1)
wdiff2 = eltype(img).(wdiff2)

if typeof(img) <: CuArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? The kernel should handle this case generically.

"""
@nograd function construct_xq(n::T, m::T) where T<:Integer
typed1 = one(n)
typed2 = 2typed1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use T and broadcasting perhaps.

ilow = floor.(Int, xq)
ihigh = ceil.(Int, xq)

wdiff = xq[:,:,:,:] .- ilow[:,:,:,:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the colons there if it's just everything?

Upsamples the first two dimensions of the 4-dimensional array `img` by the two upsample factors stored in `k_upsample`,
using bilinear interpolation. The interpolation grid is identical to the one used by `imresize` from `Images.jl`.
"""
function bilinear_upsample2d(img::AbstractArray{T,4}, k_upsample::NTuple{2,<:Real}) where T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be needed to get us to be able to do it with 3d convs as well?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this....

The above holds as long as `idx` contains every index in `x`.
"""
@nograd function adjoint_of_idx(idx::Vector{T}) where T<:Integer
d = trues(size(idx))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be type unstable to use bools with floats and that might stop julia from using blas calls more efficiently

ihigh2_r = adjoint_of_idx(ilow2)[ihigh2]

wdiff1 = eltype(img).(wdiff1)
wdiff2 = eltype(img).(wdiff2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use T from the method signature

Copy link
Member

@MikeInnes MikeInnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really nice, thanks! I think the API needs a bit of discussion but there are no major blockers.

Should the core kernels for this perhaps be in NNlib? Doesn't matter too much since we can move them later, but just thinking about things like adding specialised CUDA implementations and such down the line; this is what we've done with most other kernels like conv.

@@ -0,0 +1,325 @@
"""
BilinearUpsample2d(factors::Tuple{Integer,Integer})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for this to be specific to 2D? Could it infer its dimension from factors, like the Conv layer, and be generic across dimension numbers? Also, could it not be considered BilinearInterpolate in general (e.g. with fractional factors)?

(It is fine if dimensions other than 2, or fractional factors, are not currently implemented and throw an error, just nice if we can add them in future.)

[If we do keep 2D it should be capitalised that way.]

end

@adjoint function (c::T where T<:BilinearUpsample2d)(x::AbstractArray)
(c::T where T<:BilinearUpsample2d)(x), c̄ -> (nothing, bilinear_upsample_adjoint(c̄, c.factors))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The adjoint should probably be applied to the bilinear_upsample function, not the layer, so that the function can be used directly where appropriate. As with Conv it would be nice to support a non-layer version of this with a nice API.

The type on the second c::T where ... seems unnecessary.

@AlfTetzlaff
Copy link

AlfTetzlaff commented Jun 17, 2020

Hi everybody, just recently I blindly translated the pytorch bilinear upsampling to julia & cuda. Unfortunately I didn't have the time to finish it. The array indexing is still messed up and honestly I don't really know how the gradient calculation works. I've put everything in a gist, so that you can just copy it, if interested :)

@CarloLucibello CarloLucibello mentioned this pull request Dec 19, 2020
92 tasks
@maxfreu
Copy link

maxfreu commented Dec 21, 2020

Hi! I would really like to give this a push. I could really need this (or nearest neighbor) for U-Net implementations. Have the reviews been addressed? If not, what would I have to do to work on them? In the mean time I have finished porting and testing the pytorch implementation core of bilinear upsampling here. It now produces the same output as pytorch for forward and backward (verified via nyan cat). Fractional upscaling factors are allowed, but still only 4D tensors - but in my opinion we dont need more for a start.

@DhairyaLGandhi
Copy link
Member

I think the pr is in a good shape. Have you been able to benchmark the run time against pytorch as well? Also, there is UNet.jl if that helps.

@maxfreu
Copy link

maxfreu commented Dec 22, 2020

Well, benchmarking is a science for itself... It seems that timings are on par in many cases, but there is some caching going on in pytorch. This is how I benchmark:

data = rand(Float32, 256,256,16,32) |> cu
@benchmark upsample_bilinear($data, 2,2) # min: 8ms, mean: 44ms
up_op = nn.UpsamplingBilinear2d(scale_factor=2)
data = torch.rand((32,16,256,256)).to("cuda")
times = []
with torch.no_grad():
    for i in range(100):
        t0 = time()
        up_op(data)
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
        times.append(time()-t0)
print(min(times)*1000) # 19ms with empty_cache()  -  7.2ms without
print(np.mean(times)*1000) # 37ms with empty_cache()  -  7.4 without

CUDA 2.3
pytorch 1.5
GTX 980

@maxfreu
Copy link

maxfreu commented Dec 22, 2020

I just checked the changes in this PR, looks good indeed. Forward and backward pass work. However for the image size given in the above post the pytorch gpu kernel is another 3x faster. So maybe it makes sense to add it as dispatch for CuArrays. The pullback should also be straightforward to implement. The question is where this should happen... merge first, then add dispatch? Or the other way round? The kernel I ported is by the way from Caffee2, there are also the newer ATen kernels. As soon as I have figured out how to port them, I could output NearestNeighbour2d/3d & trilinear as well. Oh, and I dont know where to put the kernels - CUDA.jl? Here? NNlib?

@CarloLucibello
Copy link
Member

CarloLucibello commented Dec 23, 2020

@maxfreu cpu implementations shoud go to NNlib, kernel's should go to CUDA.jl if @maleadt agrees, otherwise, we can add them to Flux. A simple layer wrapper should live in Flux.

@maxfreu
Copy link

maxfreu commented Dec 28, 2020

I just noted that there is an older, good attempt by @avik-pal here for CPU and here for GPU, but I don't know which method is used. He implemented UpsamplingDims along the lines of ConvDims to produce specialized code. Should we consider that or is it over the top?

@ltjkoomen are you still enganged? Some time early next year might be sufficient 😉

@CarloLucibello
Copy link
Member

@ltjkoomen has not been responsive for some time. If the cpu implementation here is fine I will move it to NNlib. Any performance optimization can come later, the important thing is to agree on the interface

@CarloLucibello
Copy link
Member

CarloLucibello commented Dec 30, 2020

This should be fine:

NNlib.bilinear_upsample(x::AbstractArray{<:Real,4}, k::NTuple{2,<:Real})

CUDA.jl will have to overload it.

@maleadt
Copy link
Collaborator

maleadt commented Jan 4, 2021

@maxfreu cpu implementations shoud go to NNlib, kernel's should go to CUDA.jl if @maleadt agrees, otherwise, we can add them to Flux. A simple layer wrapper should live in Flux.

GPU implementations of NNlib interfaces can go in CUDA.jl, we already have a couple.

@qin-yu
Copy link

qin-yu commented Jan 4, 2021

Indeed we need to finish this layer! I'm also writing a UNet (with repeat(), see #270) right now 😆 I guess the GPU support for repeat() has not even been done (see JuliaGPU/GPUArrays.jl#126)!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.