improve bilinear upsampling #266
Conversation
Well, that's an awesome improvement!
In case you didn't see: I posted the single core benchmark in the original post.
Ufff, the maxpool tests are passing again, it's incredible how unstable they are. Can you change ...
I couldn't find any occurrences of ...
Yes, maybe change `gradtest(x -> maxpool(x, pdims), x; broken=spatial_rank <= 2)` to `gradtest(x -> maxpool(x, pdims), x; broken=spatial_rank <= 0)`, since I have the impression we will have to change it back soon.
Can you add a test with a non-integer scale? It's a bit sad we lose CuArray support until JuliaGPU/CUDA.jl#636 is merged, but I cannot think of a way around it. Now that you have thought a bit more about this, would you be able to extend the code to support the 1d and 3d cases (in a later PR)? Following the discussion in FluxML/Flux.jl#1468, can you add support for integer scale?
Supporting 1D and 3D is easy: it's just one for-loop fewer or more. Nearest neighbour is also easy to implement this way, since only the source-index calculation changes.
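For illustration, here is a rough, self-contained sketch of what such a whcn (width, height, channel, batch) CPU kernel can look like. This is not the PR's actual code; the function name, the align-corners convention, and all details are assumptions. The point is that nearest neighbour would only change how the source indices and weights are computed, and 1d/3d variants add or drop one spatial loop.

```julia
# Hypothetical sketch of a bilinear upsampling kernel in whcn layout (align-corners style).
function upsample_bilinear_whcn_sketch!(y::AbstractArray{T,4}, x::AbstractArray{T,4}) where T
    w_in, h_in, c, n = size(x)
    w_out, h_out = size(y, 1), size(y, 2)
    # source steps per output step (guarding the size-1 case)
    rw = w_out > 1 ? (w_in - 1) / (w_out - 1) : 0.0
    rh = h_out > 1 ? (h_in - 1) / (h_out - 1) : 0.0
    for b in 1:n, ch in 1:c, j in 1:h_out
        sy  = 1 + (j - 1) * rh                          # fractional source row
        iy0 = floor(Int, sy); iy1 = min(iy0 + 1, h_in)
        wy  = sy - iy0                                  # interpolation weight along h
        for i in 1:w_out
            sx  = 1 + (i - 1) * rw                      # fractional source column
            ix0 = floor(Int, sx); ix1 = min(ix0 + 1, w_in)
            wx  = sx - ix0
            # nearest neighbour would instead just round sx, sy to the closest index
            y[i, j, ch, b] =
                (1 - wy) * ((1 - wx) * x[ix0, iy0, ch, b] + wx * x[ix1, iy0, ch, b]) +
                     wy  * ((1 - wx) * x[ix0, iy1, ch, b] + wx * x[ix1, iy1, ch, b])
        end
    end
    return y
end
```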
A PR would be much appreciated.
You can rebase and remove the changes to the pooling tests now.
Sorry, I don't know what a rebase is. Do you mean I should simply undo the changes to the pooling tests? Probably I shouldn't have committed to the master branch of my fork. (?)
You might be able to merge through this website; the button probably says "Resolve conflicts". On master there is now an alternative fix to the problem solved by `broken=spatial_rank == 0) # was == 2 before` etc.
Which other functions do you have in mind? These place the mutated array first: ...
I'm pretty certain that is context dependent? And maybe some of those functions need to be updated accordingly then.
I was only thinking about the overhead of the comparison itself, not the allocation, right.
The arguments to the GPU kernel and this one are a bit different. I could bring both in line, but that would require hoisting some of the logic out of the CPU kernel, which would maybe make things less clear, but it works. My thoughts about the API go like this:

```julia
const NDA = NamedDimsArray

upsample_bilinear!(y::AbstractArray..., x) = upsample_bilinear_whcn_kernel!(y, x)  # backwards compatibility

# these two could be fused into one, yes. The parent() call would have to go into the kernel then.
upsample_bilinear!(y::NDA{(:w,:h,:c,:n)}, x::...) = upsample_bilinear_whcn_kernel!(parent(y), parent(x))
upsample_bilinear!(y::NDA{(:c,:w,:h,:n)}, x) = upsample_bilinear_cwhn_kernel!(parent(y), parent(x))

function upsample_bilinear!(y::NDA{(:w,:h,:c,:n),T,N,A}, x::...) where {T, N, A<:CuArray}
    a, b, c = ...
    threads = ...
    blocks = ...
    @cuda threads blocks upsample_bilinear_whcn_kernel!(a, b, c, parent(x), parent(y))  # <- the GPU kernel args are a bit different
    return y
end

upsample_bilinear!(y::NDA{(:c,:w,:h,:n)}, x) where {T, N, A<:CuArray} = ...

# gradient analogously
```

Edit: I basically don't care about the argument order. Should we vote, or do you have a dictator? 🤣
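As a side note on the mechanics this sketch relies on: NamedDims.jl lets you dispatch on the tuple of dimension names and unwrap the plain array with `parent`. A tiny runnable illustration (the `layout` helper is made up for this example):

```julia
using NamedDims            # provides NamedDimsArray

const NDA = NamedDimsArray

# dispatch on the tuple of dimension names, as in the sketch above
layout(::NDA{L}) where {L} = L

x_whcn = NDA{(:w, :h, :c, :n)}(rand(Float32, 8, 8, 3, 1))
x_cwhn = NDA{(:c, :w, :h, :n)}(rand(Float32, 3, 8, 8, 1))

layout(x_whcn)             # (:w, :h, :c, :n)
layout(x_cwhn)             # (:c, :w, :h, :n)
parent(x_whcn)             # the plain Array the kernels would operate on
```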
OK. My suggestion here would look more like this:

```julia
upsample_bilinear!(y, x) = upsample_bilinear_whcn!(y, x)  # maybe?
# Not sure this function needs to exist, `upsample_bilinear(x)` can call `upsample_bilinear_whcn!(y, x)` directly?

function upsample_bilinear_whcn!(y::AbstractArray, x::AbstractArray)
    # direct implementation as in this PR
end

# This worker has one job, very simple dispatch, will never change:
function upsample_bilinear_whcn!(y::CuArray, x::...)
    a, b, c = ...
    threads = ...
    blocks = ...
    @cuda threads blocks upsample_bilinear_whcn_kernel!(a, b, c, parent(x), parent(y))  # the real GPU kernel
    return y
end

# These two workers can be added later, without breaking anything:
upsample_bilinear_cwhn!(y::AbstractArray, x::AbstractArray)
upsample_bilinear_cwhn!(y::CuArray, x::...) = ...

const NDA = NamedDimsArray
# Only one function dispatches on NDA, and it does not need to load CUDA:
upsample_bilinear(x::NDA{(:w,:h,:c,:n)}, scale) = begin ... upsample_bilinear_whcn!(parent(y), parent(x)) end
upsample_bilinear(x::NDA{(:c,:w,:h,:n)}, scale) = begin ... upsample_bilinear_cwhn!(parent(y), parent(x)) end
```

Re argument order, it looks fine I think; Dhairya got me worried that we were all over the map in this package, but the examples I can find seem pretty consistent. So this PR should match those, and it does.
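To make the last two lines concrete, a hypothetical expansion of one of those `begin ... end` bodies could look like the following; the output-size rounding and the worker name are assumptions for illustration, not necessarily what the PR does:

```julia
# Hypothetical sketch: allocate the output, run the plain-array worker, rewrap the names.
function upsample_bilinear(x::NDA{(:w, :h, :c, :n)}, scale::Real)
    w, h, c, n = size(x)
    ydata = similar(parent(x), floor(Int, w * scale), floor(Int, h * scale), c, n)
    upsample_bilinear_whcn!(ydata, parent(x))      # the worker from the sketch above
    return NDA{(:w, :h, :c, :n)}(ydata)            # keep the dimension names on the result
end
```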
Aah yes, very nice :) I'll massage it tomorrow. The rest should maybe be discussed on Slack or Zulip or so - where are you?
Tried to commit some of the suggestions, but I don't have the rights.
Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>
@DhairyaLGandhi I lost write access.
Haven't changed anything, what does it say?
I don't know what's happening. I lost write access yesterday and couldn't see the "Merge pull request" button, but it has reappeared just now.
Please address the comment on the API before merging.
Hi, I'm a bit late to the party: there have been many comments on the API - which changes do you refer to? This one, #266 (comment)?
The comment is in this thread, #266 (comment), but I don't think there is anything to address. I'll merge in 1 day if no objections arise.
Well, since it's an API change, I'd be careful not to merge without proper checks.
Thanks everybody for your time and effort in making this better! :) I'll try to finish the GPU PR next week, depending on Tim's availability.
JuliaGPU/CUDA.jl#636 has been merged. After the next release tag we'll be at warp speed :) A quick test with (32,32,1024,1) on my GTX 980 shows 3.3 µs for bilinear upsampling vs 4.4 µs for nearest, so I recommend the former for now (some day nearest will be faster). On the CPU, single-threaded, they are about the same, but bilinear can take advantage of more cores.
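For anyone wanting to reproduce that kind of comparison, a micro-benchmark along these lines is a reasonable starting point; the exact signatures depend on the NNlib and CUDA.jl versions in use, so treat this as a sketch under those assumptions:

```julia
using BenchmarkTools, NNlib

x = rand(Float32, 32, 32, 1024, 1)

@btime upsample_bilinear($x, (2, 2));   # bilinear on the CPU
@btime upsample_nearest($x, (2, 2));    # nearest on the CPU

# For GPU timings, move the data over first (requires CUDA.jl and a GPU):
# using CUDA
# xg = cu(x)
# @btime CUDA.@sync upsample_bilinear($xg, (2, 2));
```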
That was some great work!
Reading all these issues and PRs about the bilinear upsampling layer is like reading a good book. It was quite exciting to follow the whole discussion and see the final performance and outcome 😆
Sorry to bother you again! But this stuff didn't let me sleep, so here is a CPU implementation. See the GPU PR here.
I reviewed the tests of the current implementation and found them a bit strange, actually, so this one comes with somewhat different tests.
[Benchmark table: mean times in ms, upsampling by a factor of 2, tested on 12 threads @ 3.7 GHz, Julia 1.7, for input sizes 32x32x1024x1 and 196x196x128x1, multi-threaded and single-threaded.]
Single-core performance would shine with the cwhn tensor layout, but with multiple threads the two layouts are more or less the same.
Kind regards! :)