Major overhaul of NNlib #94
Conversation
This is pretty amazing!
src/conv.jl (outdated):
    # Annoyingly, we must allocate with `zeros()` because if we were to use
    # the faster `similar()`, it may have NaNs within it, which will poison
    # the output because we support accumulation (even with `beta = 0` the
    # NaNs poison us as NaN * 0 == NaN). This is a bit of a shame, but it's
Can you do `beta = false`?
In case it's not clear why: `NaN*false == 0.0`. That's a neat trick that I had no idea about.
Okay, this is neat; looks like I can use `beta=T(0)` for the BLAS calls (they do not get `NaN`-poisoned for some reason; perhaps they have some special handling for this case) and `beta=false` for the pure Julia implementations. Thanks for the tip, @jekbradbury!
Yes, `false` is a "strong zero" in the sense that `false*x` is `zero(x)`, and `true` is a "strong unit" in the sense that `true*x` is always `x`.
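To make the trick concrete, here is a tiny, hedged illustration in plain Base Julia (the array names are invented for the example; this is not NNlib code):

```julia
# `false` is a strong zero: it wipes out NaN, whereas an ordinary 0.0 does not.
NaN * 0.0      # NaN
NaN * false    # 0.0
2.5 * true     # 2.5  -- `true` is a strong unit

# Why this matters for accumulating into possibly-uninitialized memory:
# with `beta = false`, garbage (even NaN) left in `out` cannot leak into the result.
out  = Vector{Float64}(undef, 3)   # may contain arbitrary bits, possibly NaN
acc  = [1.0, 2.0, 3.0]
beta = false
out .= out .* beta .+ acc          # == acc, regardless of what was in `out`
```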
softmax:

    softmax(xs) = softmax!(similar(xs), xs)

    function softmax!(out::AbstractVecOrMat{T}, xs::AbstractVecOrMat{T}) where {T}
Is it possible (without hurting performance) to make this generic over the dimension index? (Related: it would be pretty nice if using the clean broadcast versions in the comments didn't hurt performance either; is that because the broadcast infrastructure doesn't insert the equivalent of `@inbounds`?)
In my experience, broadcasting creates small objects, and so if you're doing lots of small broadcasts, you can easily get drowned in small allocations. It's almost always faster to just write the loop yourself, unfortunately.
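For illustration only, here is a rough sketch of the trade-off being discussed (not the implementation in this PR; the names `softmax_bcast` and `softmax_loop!` are invented for the example):

```julia
# Broadcast version: concise, but allocates temporaries on every call.
function softmax_bcast(xs::AbstractVecOrMat)
    e = exp.(xs .- maximum(xs, dims=1))
    return e ./ sum(e, dims=1)
end

# Loop version: writes into a preallocated output, no intermediate arrays.
function softmax_loop!(out::AbstractVecOrMat{T}, xs::AbstractVecOrMat{T}) where {T<:AbstractFloat}
    @inbounds for j in axes(xs, 2)
        m = typemin(T)
        for i in axes(xs, 1)
            m = max(m, xs[i, j])
        end
        s = zero(T)
        for i in axes(xs, 1)
            out[i, j] = exp(xs[i, j] - m)
            s += out[i, j]
        end
        for i in axes(xs, 1)
            out[i, j] /= s
        end
    end
    return out
end

xs = randn(Float32, 10, 32)
softmax_loop!(similar(xs), xs) ≈ softmax_bcast(xs)   # true
```

Making the loop version generic over the reduction dimension would mostly be a matter of choosing which axis the outer loop runs over, at the cost of some indexing bookkeeping.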
I'm going to punt any improvements here off onto #77; this PR is big enough as it is. ;)
This is great; awesome work, Elliot. It'd be great to have a once-over from others who have looked at this code, but otherwise I'm mostly only concerned about the API, which should be simple to review. Presumably we'll also need to update CuArrays to add the GPU wrappers; might be worth setting that up on a branch so that we can see how it looks.
Alright; a little bit of progress, and some performance measurements. Looks like it's not as much of a win as I had originally measured; we tend to win on large image sizes, but not necessarily on small image sizes. This is hardly surprising, but still a little disappointing. Here are minimum benchmark results for […]. Note that the […]

As expected, the biggest perf gains are when there is no padding to deal with, because one of the big pieces of this PR was to separate padding considerations from the inner loop body. There are some nasty inference gotchas that can arise seemingly out of the woodwork, suddenly killing performance (e.g. when applying […]).
Status? Since this is quite disruptive in terms of other people's PRs, it'd be great to get it landed -- once you're happy with it, of course.
Alright, I'm calling this good. Future performance work can happen later, and since I've got a […]
Great, please merge at your leisure then.
Co-Authored-By: staticfloat <staticfloat@gmail.com>
…formance monitoring

This uses the new zero-overhead instrumentation capabilities of `TimerOutputs.jl` to embed instrumentation that gets compiled out by default, but can be trivially enabled (triggering recompilation of all instrumented methods) by running `TimerOutputs.enable_debug_timings(NNlib)`.
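For context, here is a minimal sketch of the `TimerOutputs.jl` pattern this commit describes (not NNlib's actual instrumentation; `expensive_kernel` and the timer label are made up):

```julia
using TimerOutputs

const to = TimerOutput()

function expensive_kernel(x)
    # Compiled to a no-op unless debug timings are enabled for this module.
    @timeit_debug to "expensive_kernel" begin
        sum(abs2, x)
    end
end

# Turning instrumentation on triggers recompilation of the instrumented methods:
TimerOutputs.enable_debug_timings(@__MODULE__)
expensive_kernel(randn(1024))
print_timer(to)
```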
Also specialize on kernel size, as that turns out to be helpful for performance.
Wait for a parallelized test harness to take advantage of multiple cores
63dbfa2 to 936e71a
Codecov Report

    @@            Coverage Diff             @@
    ##           master      #94       +/-  ##
    ===========================================
    + Coverage   68.46%   84.94%   +16.47%
    ===========================================
      Files           9       17        +8
      Lines         666      611       -55
    ===========================================
    + Hits          456      519       +63
    + Misses        210       92      -118

Continue to review full report at Codecov.
As I was on vacation last week, I tried to write a convolutional autoencoder, but was frustrated by the slow performance of our `im2col` implementation. While I alleviated that somewhat, I also got frustrated by what I saw as unnecessary complexity in `NNlib`, and inadequate testing coverage. To that end, I have performed a major overhaul of `NNlib`, arriving at what I believe to be a much more "coherent" API, with significantly improved inline documentation/comments, and a test set that approaches the border of exhaustive. Many thanks to @MikeInnes, @tejank10, @avik-pal and everyone else who has braved this codebase before me. This kind of code is gnarly, and I owe the previous authors for giving me a foundation to massage, because there's no way I would have wanted to write this from scratch.

This is a giant monolithic commit because so many of the changes are interleaved that I couldn't really find a good way to break it up. I realize this will be a nightmare to review, however my hope is that things will be "clean" enough, fast enough, and well-tested enough that it will be an obvious win to merge it without too many further changes. Here's the list of highlight changes:
- `DenseConvDims`, `DepthwiseConvDims`, `PoolDims`, etc... allow the type system to know things about stride, padding, dilation, flipped kernels, input shape, output shape, etc... This is beneficial to things like `im2col`, and will also be beneficial to tangential projects like XLA.jl. (A usage sketch of the new API appears at the end of this description.)
- `conv2d()` is gone; instead you just invoke `conv()` with a 4-dimensional tensor.
- Removed `@threads` pending future threading improvements, as in my tests it either generated invalid results or was slower than the serialized versions.
- Removed `log_fast()` as it doesn't seem to be used anywhere.

Things I still need to do before this can be merged:
- Test the `Flux` APIs in a branch over there to make sure that it can coexist with these changes easily (this should be pretty easy). (Exists here: https://github.com/FluxML/Flux.jl/tree/sf/nnlib_overhaul)

Things this work should enable/make easier in the future:
- Additional backends following the pattern of `conv_nnpack!()` should be trivially plug-in-able. We can then benchmark the various implementations against each other, and choose the right kernel using a lookup table.
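To make the new API shape concrete, here is a hedged usage sketch of the `conv()`/`DenseConvDims` entry points described above (the keyword names reflect my reading of the NNlib API and may differ slightly from the final merged version):

```julia
using NNlib

x = randn(Float32, 28, 28, 3, 8)   # width × height × channels × batch
w = randn(Float32, 3, 3, 3, 16)    # kernel_w × kernel_h × channels_in × channels_out

# The *Dims types carry stride/padding/dilation/shape information for the kernels:
cdims = DenseConvDims(x, w; stride=(1, 1), padding=(1, 1, 1, 1), dilation=(1, 1))

y = conv(x, w, cdims)              # 28 × 28 × 16 × 8 output
size(y)
```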