
improve bilinear upsampling #266

Merged
CarloLucibello merged 23 commits into FluxML:master on Feb 5, 2021

Conversation

@maxfreu (Contributor) commented Jan 11, 2021

Sorry to bother you again, but this stuff didn't let me sleep, so here is a CPU implementation. See the companion GPU PR here.

  • parallelized
  • faster than the current implementation
  • OpenCV- and PyTorch-compliant
  • tests pass (the failing tests seem unrelated?)

I reviewed the tests of the current implementation and found them a bit strange, actually, so this PR comes with somewhat different tests.

Mean benchmark times in ms, upsampling by a factor of 2, tested with 12 threads @ 3.7 GHz, Julia 1.7.

12 threads:

  32x32x1024x1     before   after
  forward              53     2.7
  backward             94     0.8

  196x196x128x1    before   after
  forward             318    21
  backward             64     3.6

Single-threaded:

  32x32x1024x1     before   after
  forward              50    27
  backward             71     7.5

  196x196x128x1    before   after
  forward             300   150
  backward             56    35

Single-core performance would benefit a lot from a CWHN tensor layout, but parallelized the two layouts are more or less the same.

Kind regards! :)

@CarloLucibello (Member)

Well, that's an awesome improvement!
Can you benchmark single-threaded performance against current master as well?
Maybe we should also start adding benchmark scripts to some perf/ folder, even if they are not very polished.
Can you annotate with a comment the tests whose outputs match PyTorch's, so that we avoid deleting them in the future?
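For reference, a minimal sketch of what such a perf/ script could look like (hypothetical file, assuming BenchmarkTools and the two-argument upsample_bilinear(x, scale) method added in this PR; run with e.g. julia -t 12 for the multithreaded numbers):

using NNlib, BenchmarkTools

# Forward-pass timings for the two sizes benchmarked above, scale factor 2.
for sz in ((32, 32, 1024, 1), (196, 196, 128, 1))
    x = rand(Float32, sz...)
    t = @belapsed upsample_bilinear($x, (2, 2))
    println(join(sz, "x"), "  forward: ", round(t * 1e3, digits = 2), " ms")
end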

@maxfreu (Contributor, Author) commented Jan 14, 2021

In case you didn't see it: I posted the single-core benchmarks in the original post.

@CarloLucibello (Member)

Ufff, the maxpool tests are passing again; it's incredible how unstable they are. Can you change @test_broken to @test there?

@maxfreu (Contributor, Author) commented Jan 16, 2021

I couldn't find any occurrences of @test_broken in my pooling.jl test file. Did you mean I should remove broken=xy in gradtest? I just got a glimpse of what you mean by unstable; sometimes the AutoDiff: spatial_rank test passes locally, sometimes not.

@CarloLucibello (Member)

Yes, maybe change

gradtest(x -> maxpool(x, pdims), x; broken=spatial_rank <= 2)

to

gradtest(x -> maxpool(x, pdims), x; broken=spatial_rank <= 0)

since I have the impression we will have to change it back soon.

@CarloLucibello (Member)

Can you add a test with non-integer scale?

It's a bit sad we lose CuArray support until JuliaGPU/CUDA.jl#636 is merged, but I cannot think of a way around it.

Now that you have thought a bit more about this, would you be able to extend the code to support the 1D and 3D cases (in a later PR)?

Following the discussion in FluxML/Flux.jl#1468, can you add support for integer scale?
You are welcome to chime in on the interface discussion, by the way.
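A minimal sketch of such a test (assuming the upsample_bilinear(x, scale) method from this PR and that the output size is computed as floor.(scale .* size)):

using NNlib, Test

x = rand(Float32, 4, 4, 1, 1)
@test size(upsample_bilinear(x, (2, 2)))     == (8, 8, 1, 1)   # integer scale
@test size(upsample_bilinear(x, (1.5, 1.5))) == (6, 6, 1, 1)   # non-integer scale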

@maxfreu (Contributor, Author) commented Jan 18, 2021

Supporting 1D and 3D is easy; it's just one fewer or one more for-loop. Nearest neighbour is also easy to implement this way, as only the source-index calculation changes.
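To illustrate the structure being described, here is a rough WHCN nearest-neighbour skeleton (illustrative only, not the PR's actual kernel; the bilinear version differs only in how the source indices and interpolation weights are computed, and the 1D/3D variants simply drop or add one of the inner loops):

function upsample_nearest_whcn!(y::AbstractArray{T,4}, x::AbstractArray{T,4}) where T
    w_in, h_in, channels, batches = size(x)
    w_out, h_out = size(y, 1), size(y, 2)
    Threads.@threads for n in 1:batches
        for c in 1:channels, j in 1:h_out, i in 1:w_out
            # only this source-index calculation changes between nearest and bilinear
            src_i = 1 + (i - 1) * w_in ÷ w_out
            src_j = 1 + (j - 1) * h_in ÷ h_out
            @inbounds y[i, j, c, n] = x[src_i, src_j, c, n]
        end
    end
    return y
end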

@maxfreu mentioned this pull request Jan 18, 2021
@CarloLucibello (Member)

Supporting 1D and 3D is easy; it's just one fewer or one more for-loop.

A PR would be much appreciated.

@CarloLucibello (Member)

You can rebase and remove the changes to the pooling tests now.

@maxfreu (Contributor, Author) commented Jan 20, 2021

You can rebase and remove the changes to the pooling tests now.

Sorry, I don't know what a rebase is. Do you mean I should simply undo the changes to the pooling tests? I probably shouldn't have committed to the master branch of my fork. (?)

@mcabbott (Member)

You might be able to merge with this website; maybe the button says "resolve conflicts". On master there is now an alternative fix to the problem solved by "broken=spatial_rank == 0) # was == 2 before" etc.

@mcabbott (Member)

Again, the way this API was set was for syntactic consistency
I would keep the consistency here.

Which other functions do you have in mind? These place the mutated dx first:

methods(NNlib.∇maxpool!)
methods(NNlib.∇softmax!)
methods(NNlib.∇conv_data!) # clearer in comments than in variable names

@DhairyaLGandhi (Member)

I'm pretty certain that is context-dependent? And maybe some of those functions need to be updated accordingly, then.

@maxfreu (Contributor, Author) commented Jan 28, 2021

I think the if size(dx) == size(Δ) check (trivial case) can move earlier, to be done before allocating dx at all.

Right, I was only thinking about the overhead of the comparison itself, not the allocation.
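For illustration, the early exit could sit in the gradient entry point roughly like this (a sketch with an assumed signature and worker name, not this PR's actual code):

function ∇upsample_bilinear(Δ::AbstractArray{T,4}, insize::NTuple{2,Int}) where T
    # trivial case: same spatial size, so there is nothing to do and nothing to allocate
    (size(Δ, 1), size(Δ, 2)) == insize && return Δ
    dx = zeros(T, insize..., size(Δ, 3), size(Δ, 4))
    return ∇upsample_bilinear_whcn!(dx, Δ)   # accumulate the gradient in the worker kernel
end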

Do ∇upsample_bilinear_whcn_kernel! and ∇upsample_bilinear! need to be distinct functions at all? Even without this step, the other PR could overload ∇upsample_bilinear!(::CuArray, ...) and be sure to get the answer.

The arguments to the GPU kernel and this one are a bit different. I could bring both in line, but this would require hoisting some of the logic out of the CPU kernel, which might make things a bit less clear, but it works. My thoughts about the API go like this:

const NDA = NamedDimsArray
upsample_bilinear!(y::AbstractArray..., x) = upsample_bilinear_whcn_kernel!(y, x) # backwards compatibility
# these two could be fused into one, yes. The parent() call would have to go into the kernel then.
upsample_bilinear!(y::NDA{(:w,:h,:c,:n)}, x::...) = upsample_bilinear_whcn_kernel!(parent(y), parent(x))
upsample_bilinear!(y::NDA{(:c,:w,:h,:n)}, x)      = upsample_bilinear_cwhn_kernel!(parent(y), parent(x))

function upsample_bilinear!(y::NDA{(:w,:h,:c,:n),T,N,A}, x::...) where {T, N, A<:CuArray}
  a,b,c = ...
  threads = ...
  blocks = ...
  @cuda threads blocks upsample_bilinear_whcn_kernel!(a,b,c, parent(x), parent(y))  # <- the GPU kernel args are a bit different
  return y
end

upsample_bilinear!(y::NDA{(:c,:w,:h,:n)}, x) where {T, N, A<:CuArray} = ...

# gradient analogously

Edit: I basically don't care about the argument order. Should we vote or do you have a dictator? 🤣

@mcabbott (Member) commented Jan 28, 2021

OK. My suggestion here would look more like this:

upsample_bilinear!(y, x) = upsample_bilinear_whcn!(y, x)  # maybe?
# Not sure this function needs to exist; `upsample_bilinear(x)` can call `upsample_bilinear_whcn!(y, x)` directly?

function upsample_bilinear_whcn!(y::AbstractArray, x::AbstractArray)
  # direct implementation as in this PR
end

# This worker has one job, very simple dispatch, and will never change:
function upsample_bilinear_whcn!(y::CuArray, x::...)
  a,b,c = ...
  threads = ...
  blocks = ...
  @cuda threads blocks upsample_bilinear_whcn_kernel!(a,b,c, parent(x), parent(y))  # the real GPU kernel
  return y
end

# These two workers can be added later, without breaking anything:
upsample_bilinear_cwhn!(y::AbstractArray, x::AbstractArray)
upsample_bilinear_cwhn!(y::CuArray, x::...) = ...

const NDA = NamedDimsArray
# Only one function dispatches on NDA, and it does not need to load CUDA:
upsample_bilinear(x::NDA{(:w,:h,:c,:n)}, scale) = begin ... upsample_bilinear_whcn!(parent(y), parent(x))
upsample_bilinear(x::NDA{(:c,:w,:h,:n)}, scale) = begin ...  upsample_bilinear_cwhn!(parent(y), parent(x))

Re argument order: it looks fine, I think. Dhairya got me worried that we were all over the map in this package, but the examples I can find seem pretty consistent. So this PR should match those, and it does.

@maxfreu (Contributor, Author) commented Jan 28, 2021

OK. My suggestion here would look more like this:

Ah yes, very good :) I'll massage it tomorrow. The rest should maybe be discussed on Slack or Zulip or so; where are you?

@CarloLucibello (Member)

I tried to commit some of the suggestions, but I don't have the rights.

maxfreu and others added 2 commits February 1, 2021 11:35
@CarloLucibello (Member)

@DhairyaLGandhi I lost write access

@DhairyaLGandhi (Member)

I haven't changed anything; what does it say?

@CarloLucibello (Member)

I don't know what's happening. I lost write access yesterday and couldn't see the Merge Pull Request button, but it has reappeared just now.

@DhairyaLGandhi (Member) commented Feb 2, 2021

Please address the comment on the API before merging.

@maxfreu (Contributor, Author) commented Feb 2, 2021

Hi, I'm a bit late to the party: there have been many comments on the API; which changes do you refer to? This one, #266 (comment)?

@CarloLucibello (Member)

Hi, I'm a bit late to the party: there have been many comments on the API; which changes do you refer to? This one, #266 (comment)?

The comment is in this thread, #266 (comment), but I don't think there is anything to address. I'll merge in one day if no objections arise.

@DhairyaLGandhi (Member)

Well, since it's an API change, I'd be careful not to merge without proper checks.

@CarloLucibello merged commit 3623541 into FluxML:master on Feb 5, 2021
@maxfreu (Contributor, Author) commented Feb 5, 2021

Thanks, everybody, for your time and effort in making this better! :) I'll try to finish the GPU PR next week, depending on Tim's availability.

@maxfreu (Contributor, Author) commented Feb 11, 2021

JuliaGPU/CUDA.jl#636 has been merged. After the next release tag we'll be at warp speed :)

A quick test with (32, 32, 1024, 1) on my GTX 980 shows 3.3 µs for bilinear upsampling vs 4.4 µs for nearest, so I recommend the former for now (some day nearest will be faster). On a single CPU thread they are about the same, but bilinear can take advantage of more cores.
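That comparison could be reproduced roughly like this (a sketch assuming CUDA.jl, BenchmarkTools, and the CUDA-enabled upsample methods from the companion GPU PR):

using CUDA, NNlib, BenchmarkTools

x = CUDA.rand(Float32, 32, 32, 1024, 1)
# CUDA.@sync ensures the kernel has finished before timing stops
@btime CUDA.@sync upsample_bilinear($x, (2, 2))
@btime CUDA.@sync upsample_nearest($x, (2, 2))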

@CarloLucibello (Member)

That was some great work!

@roflmaostc (Contributor)

Reading all these issues/PRs about the bilinear upsampling layer is like reading a good book. It was quite exciting to follow the whole discussion and see the final performance and outcome 😆
Great work!
