
Major overhaul of NNlib #94

Merged · 21 commits · Mar 28, 2019

Conversation

@staticfloat (Contributor) commented Feb 21, 2019

While on vacation last week, I tried to write a convolutional autoencoder, but was frustrated by the slow performance of our im2col implementation. Although I alleviated that somewhat, I also got frustrated by what I saw as unnecessary complexity in NNlib and inadequate test coverage. To that end, I have performed a major overhaul of NNlib, arriving at what I believe to be a much more "coherent" API, with significantly improved inline documentation/comments and a test set that borders on exhaustive. Many thanks to @MikeInnes, @tejank10, @avik-pal and everyone else who has braved this codebase before me. This kind of code is gnarly, and I owe the previous authors for giving me a foundation to massage, because there's no way I would have wanted to write this from scratch.

This is a giant monolithic commit because so many of the changes are interleaved that I couldn't really find a good way to break it up. I realize this will be a nightmare to review; however, my hope is that things will be "clean" enough, fast enough, and well-tested enough that it will be an obvious win to merge it without too many further changes. Here's the list of highlight changes:

  • I added a new type-system-level shape system for operations. Types like DenseConvDims, DepthwiseConvDims, PoolDims, etc. allow the type system to know about stride, padding, dilation, flipped kernels, input shape, output shape, and so on. This is beneficial to things like im2col, and will also be beneficial to tangential projects like XLA.jl.
  • I changed the naming of all spatial operations to be completely shape-based, e.g. no more conv2d(); instead you just invoke conv() with a 4-dimensional tensor (see the sketch after this list).
  • I added "direct" pure-Julia implementations of all convolution operations. They are not intended to be highly performant; they're supposed to be used as last-ditch fallbacks and for testing, so that we can check all our implementations against each other.
  • Beefed up testing significantly, adding "fuzzing" tests that take many minutes to run but which have caught a LOT of bugs. I now have much higher confidence that NaNs aren't going to sneak into the output because of funny dilation/stride combinations.
  • I removed all @threads pending future threading improvements, as in my tests they either generated invalid results or were slower than the serial versions.
  • Removed log_fast() as it doesn't seem to be used anywhere.
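
To make the new calling convention concrete, here is a minimal sketch; the exact DenseConvDims keyword names are my reading of the new API and may differ slightly from the merged code:

using NNlib

# NNlib's spatial layout is (width, height, channels, batch).
x = rand(Float32, 28, 28, 3, 8)   # a batch of 8 three-channel images
w = rand(Float32, 5, 5, 3, 16)    # a 5x5 kernel mapping 3 -> 16 channels

# Stride/padding/dilation (and the input/kernel shapes) live in a dims object...
cdims = DenseConvDims(x, w; stride=2, padding=1)

# ...and the entry point is rank-generic: no conv2d()/conv3d() variants.
y = conv(x, w, cdims)
@show size(y)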

Things I still need to do before this can be merged:

  • Performance testing. I've made the implicit assumption that the reshape-and-use-generic-3d-versions-of-things strategy isn't going to be any slower than what we had before; I need to benchmark this and prove it.
  • Update any broken Flux APIs in a branch over there to make sure that Flux can coexist with these changes easily (this should be pretty easy). (Exists here: https://github.com/FluxML/Flux.jl/tree/sf/nnlib_overhaul)
  • Update CuArrays/other dependent GPU packages.

Things this work should enable/make easier in the future:

  • This should make it much easier to integrate things like NNPACK, MAGMA, etc. These acceleration packages should be able to define things like conv_nnpack!() and should be trivially pluggable. We can then benchmark the various implementations against each other, and choose the right kernel using a lookup table (see the sketch after this list).
  • I've taken pains to make sure that the non-allocating versions of all function calls are truly as non-allocating as possible. This should enable us, in Flux, to track memory allocations on the first call, then pass in preallocated buffers for everything and decrease our large allocations significantly.
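
A rough sketch of both ideas, using the conv_direct! backend added in this PR; the choose_impl lookup below is purely hypothetical and only illustrates the shape of the design:

using NNlib

x = rand(Float32, 32, 32, 3, 4)
w = rand(Float32, 3, 3, 3, 8)
cdims = DenseConvDims(x, w; padding=1)

# Preallocate the output once (zeros, because accumulation can NaN-poison a
# `similar()` buffer; see the review discussion below) and reuse it across calls.
y = zeros(Float32, 32, 32, 8, 4)   # padding=1 with a 3x3 kernel keeps the spatial size

# Purely hypothetical backend selection: a real table would be keyed on
# problem size and filled in from benchmarks.
choose_impl(xsize, wsize) = NNlib.conv_direct!

conv!(y, x, w, cdims)                          # default backend
choose_impl(size(x), size(w))(y, x, w, cdims)  # explicitly-chosen kernel, same signature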

@jekbradbury (Contributor) left a comment

This is pretty amazing!

src/conv.jl Outdated
# Annoyingly, we must allocate with `zeros()` because if we were to use
# the faster `similar()`, it may have NaNs within it, which will poison
# the output because we support accumulation (even with `beta = 0` the
# NaNs poison us as NaN * 0 == NaN). This is a bit of a shame, but it's
Contributor

Can you do beta = false?

Member

In case it's not clear why: NaN*false == 0.0. That's a neat trick that I had no idea about.

@staticfloat (Contributor, Author) commented Feb 23, 2019

Okay this is neat; looks like I can use beta=T(0) for the BLAS calls (they do not get NaN poisoned for some reason; perhaps they have some special handling for this case) and beta=false for the pure Julia implementations. Thanks for the tip, @jekbradbury!


Yes, false is a "strong zero" in the sense that false*x is zero(x) and true is a "strong unit" in the sense that true*x is always x.
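
For concreteness, a quick plain-Julia illustration of the strong-zero/strong-unit behaviour (nothing NNlib-specific):

julia> NaN * 0.0      # an ordinary zero does not save you
NaN

julia> NaN * false    # false is a "strong zero"
0.0

julia> Inf * true     # true is a "strong unit": true * x is always x
Inf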

(Resolved review threads on src/conv.jl, src/dim_helpers.jl, src/impl/pooling_direct.jl, and src/pooling.jl.)
src/softmax.jl
softmax(xs) = softmax!(similar(xs), xs)

function softmax!(out::AbstractVecOrMat{T}, xs::AbstractVecOrMat{T}) where {T}
Contributor

Is it possible (without hurting performance) to make this generic over the dimension index? (related: it would be pretty nice if using the clean broadcast versions in the comments didn't hurt performance either—is that because the broadcast infrastructure doesn't insert the equivalent of @inbounds?)

@staticfloat (Contributor, Author)

In my experience, broadcasting creates small objects, and so if you're doing lots of small broadcasts, you can easily get drowned in small allocations. It's almost always faster to just write the loop yourself, unfortunately.
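
To illustrate the trade-off being described, here is a sketch of the two styles for a column-wise softmax; it is illustrative only and not NNlib's actual softmax! implementation:

# Broadcast form: concise, but the reductions and fused broadcasts allocate
# intermediate arrays on every call.
function softmax_bcast(xs::AbstractMatrix)
    m = maximum(xs; dims=1)
    e = exp.(xs .- m)
    return e ./ sum(e; dims=1)
end

# Hand-written column-wise loops: no temporaries beyond the output buffer.
function softmax_loop!(out::AbstractMatrix{T}, xs::AbstractMatrix{T}) where {T}
    @inbounds for j in axes(xs, 2)
        m = typemin(T)
        for i in axes(xs, 1)
            m = max(m, xs[i, j])
        end
        s = zero(T)
        for i in axes(xs, 1)
            out[i, j] = exp(xs[i, j] - m)
            s += out[i, j]
        end
        for i in axes(xs, 1)
            out[i, j] /= s
        end
    end
    return out
end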

@staticfloat (Contributor, Author)

I'm going to punt any improvements here off onto #77; this PR is big enough as it is. ;)

@MikeInnes (Member)

This is great; awesome work, Elliot. It'd be great to have a once-over from others who have looked at this code, but otherwise I'm mostly only concerned about the API, which should be simple to review.

Presumably we'll also need to update CuArrays to add the GPU wrappers; might be worth setting that up on a branch so that we can see how it looks.

@avik-pal mentioned this pull request on Feb 24, 2019.
(Resolved review threads on src/pooling.jl and Manifest.toml.)
@staticfloat (Contributor, Author)

Alright; a little bit of progress, and some performance measurements. Looks like it's not as much of a win as I had originally measured; we tend to win on large image sizes, but not necessarily on small image sizes. This is hardly surprising, but still a little disappointing. Here are minimum benchmark results for conv2d across different image sizes showing this effect:

[Benchmark plots: conv2d, conv2d_data, conv2d_filter]
Note that the conv2d_data() results are more or less expected, as I know that our col2im() implementation is slow: it's not as memory-friendly as it should be and some loops need to be reordered, but I'll need to sit down and think about how to do that sometime in the future.

As expected, the biggest perf gains are when there is no padding to deal with, because one of the big pieces of this PR was to separate padding considerations from the inner loop body.

There are some nasty inference gotchas that can come out of the woodwork and suddenly kill performance (e.g. when applying @timeit_debug annotations to methods). This codebase will be a really good test suite for a future static inference proving tool, as it's the kind of code that we really should be able to prove everything about, modulo compiler limitations.

@MikeInnes (Member)

Status? Since this is quite disruptive in terms of other people's PRs, it'd be great to get it landed -- once you're happy with it of course.

@staticfloat (Contributor, Author)

Alright, I'm calling this good. Future performance work can happen later, and since I've got a Flux branch that works with this, I think we can merge now and patch up e.g. the GPU ecosystem before the next release.

@MikeInnes (Member)

Great, please merge at your leisure then.

staticfloat and others added 20 commits March 28, 2019 12:15
…formance monitoring

This uses the new zero-overhead instrumentation capabilities of
`TimerOutputs.jl` to embed instrumentation that gets compiled out by
default, but can be trivially enabled (triggering recompilation of all
instrumented methods) by running `TimerOutputs.enable_debug_timings(NNlib)`.
Also specialize on kernel size, as that turns out to be helpful for performance.
Wait for a parallelized test harness to take advantage of multiple cores
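
For context, the TimerOutputs.jl debug-timing pattern referenced in that commit message looks roughly like this; it is a sketch with a made-up kernel, not the actual NNlib instrumentation:

using TimerOutputs

const to = TimerOutput()

# @timeit_debug expands to a no-op unless debug timings are enabled for the
# enclosing module, so instrumented hot paths pay no cost by default.
function my_kernel!(y, x)   # hypothetical kernel, for illustration only
    @timeit_debug to "inner loop" begin
        y .= 2 .* x
    end
    return y
end

# Opt in later (this recompiles the instrumented methods):
# TimerOutputs.enable_debug_timings(@__MODULE__)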
@codecov-io commented Mar 28, 2019

Codecov Report

Merging #94 into master will increase coverage by 16.47%.
The diff coverage is 85.16%.


@@             Coverage Diff             @@
##           master      #94       +/-   ##
===========================================
+ Coverage   68.46%   84.94%   +16.47%     
===========================================
  Files           9       17        +8     
  Lines         666      611       -55     
===========================================
+ Hits          456      519       +63     
+ Misses        210       92      -118
Impacted Files                Coverage          Δ
src/NNlib.jl                  100% <ø>          (ø) ⬆️
src/gemm.jl                   10% <10%>         (ø)
src/pooling.jl                100% <100%>       (ø)
src/activation.jl             80% <100%>        (ø) ⬆️
src/dim_helpers.jl            73.33% <73.33%>   (ø)
src/dim_helpers/ConvDims.jl   74.5% <74.5%>     (ø)
src/impl/padding_edges.jl     78.26% <78.26%>   (ø)
src/impl/pooling_direct.jl    79.09% <79.09%>   (ø)
src/conv.jl                   84.61% <84.61%>   (+22.54%) ⬆️
src/softmax.jl                87.09% <87.09%>   (+8.14%) ⬆️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
