
Major overhaul of NNlib #94

Merged · 21 commits · Mar 28, 2019

Conversation

@staticfloat (Contributor) commented Feb 21, 2019

While on vacation last week, I tried to write a convolutional autoencoder, but was frustrated by the slow performance of our im2col implementation. Although I alleviated that somewhat, I also got frustrated by what I saw as unnecessary complexity in NNlib and inadequate test coverage. To that end, I have performed a major overhaul of NNlib, arriving at what I believe to be a much more "coherent" API, with significantly improved inline documentation/comments and a test set that borders on exhaustive. Many thanks to @MikeInnes, @tejank10, @avik-pal and everyone else who has braved this codebase before me. This kind of code is gnarly, and I owe the previous authors for giving me a foundation to massage, because there's no way I would have wanted to write this from scratch.

This is a giant monolithic commit because so many of the changes are interleaved that I couldn't really find a good way to break it up. I realize this will be a nightmare to review; however, my hope is that things will be "clean" enough, fast enough, and well-tested enough that it will be an obvious win to merge it without too many further changes. Here's the list of highlight changes:

  • I added a new type-system-level shape system for operations. Types like DenseConvDims, DepthwiseConvDims, PoolDims, etc. allow the type system to know about stride, padding, dilation, flipped kernels, input shape, output shape, and so on. This is beneficial to things like im2col, and will also be beneficial to tangential projects like XLA.jl.
  • I changed the naming of all spatial operations to be completely shape-based, e.g. no more conv2d(); instead you just invoke conv() with a 4-dimensional tensor (see the sketch after this list).
  • I added "direct" pure-Julia implementations of all convolution operations. They are not intended to be highly performant; they're supposed to be used as last-ditch fallbacks and for testing, so that we can check all our implementations against each other.
  • Beefed up testing significantly, adding "fuzzing" tests that take many minutes to run but which have caught a LOT of bugs. I now have much higher confidence that NaNs aren't going to sneak into the output because of funny dilation/stride combinations.
  • I removed all @threads pending future threading improvements, as in my tests they either generated invalid results or were slower than the serial versions.
  • Removed log_fast() as it doesn't seem to be used anywhere.
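
To make the new calling convention concrete, here is a minimal sketch; the exact DenseConvDims keyword names are my reading of the new API and may differ slightly from the merged code:

using NNlib

# NNlib's spatial layout is (width, height, channels, batch).
x = rand(Float32, 28, 28, 3, 8)   # a batch of 8 three-channel images
w = rand(Float32, 5, 5, 3, 16)    # a 5x5 kernel mapping 3 -> 16 channels

# Stride/padding/dilation (and the input/kernel shapes) live in a dims object...
cdims = DenseConvDims(x, w; stride=2, padding=1)

# ...and the entry point is rank-generic: no conv2d()/conv3d() variants.
y = conv(x, w, cdims)
@show size(y)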

Things I still need to do before this can be merged:

  • Performance testing. I've made the implicit assumption that the reshape-and-use-generic-3d-versions-of-things strategy isn't going to be any slower than what we had before; I need to benchmark this and prove it.
  • Update any broken Flux APIs in a branch over there to make sure that Flux can coexist with these changes easily (this should be pretty easy). (Exists here: https://github.com/FluxML/Flux.jl/tree/sf/nnlib_overhaul)
  • Update CuArrays/other dependent GPU packages.

Things this work should enable/make easier in the future:

  • This should make it much easier to integrate things like NNPACK, MAGMA, etc. These acceleration packages should be able to define things like conv_nnpack!() and should be trivially pluggable. We can then benchmark the various implementations against each other, and choose the right kernel using a lookup table (see the sketch after this list).
  • I've taken pains to make sure that the non-allocating versions of all function calls are truly as non-allocating as possible. This should enable us, in Flux, to track memory allocations on the first call, then pass in preallocated buffers for everything and decrease our large allocations significantly.
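
A rough sketch of both ideas, using the conv_direct! backend added in this PR; the choose_impl lookup below is purely hypothetical and only illustrates the shape of the design:

using NNlib

x = rand(Float32, 32, 32, 3, 4)
w = rand(Float32, 3, 3, 3, 8)
cdims = DenseConvDims(x, w; padding=1)

# Preallocate the output once (zeros, because accumulation can NaN-poison a
# `similar()` buffer; see the review discussion below) and reuse it across calls.
y = zeros(Float32, 32, 32, 8, 4)   # padding=1 with a 3x3 kernel keeps the spatial size

# Purely hypothetical backend selection: a real table would be keyed on
# problem size and filled in from benchmarks.
choose_impl(xsize, wsize) = NNlib.conv_direct!

conv!(y, x, w, cdims)                          # default backend
choose_impl(size(x), size(w))(y, x, w, cdims)  # explicitly-chosen kernel, same signature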

@jekbradbury (Contributor) left a comment

This is pretty amazing!

src/conv.jl Outdated
# Annoyingly, we must allocate with `zeros()` because if we were to use
# the faster `similar()`, it may have NaNs within it, which will poison
# the output because we support accumulation (even with `beta = 0` the
# NaNs poison us as NaN * 0 == NaN). This is a bit of a shame, but it's
Contributor

Can you do beta = false?

Member

In case it's not clear why: NaN*false == 0.0. That's a neat trick that I had no idea about.

@staticfloat (Contributor, Author) commented Feb 23, 2019

Okay this is neat; looks like I can use beta=T(0) for the BLAS calls (they do not get NaN poisoned for some reason; perhaps they have some special handling for this case) and beta=false for the pure Julia implementations. Thanks for the tip, @jekbradbury!


Yes, false is a "strong zero" in the sense that false*x is zero(x) and true is a "strong unit" in the sense that true*x is always x.
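
For concreteness, a quick plain-Julia illustration of the strong-zero/strong-unit behaviour (nothing NNlib-specific):

julia> NaN * 0.0      # an ordinary zero does not save you
NaN

julia> NaN * false    # false is a "strong zero"
0.0

julia> Inf * true     # true is a "strong unit": true * x is always x
Inf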

(Resolved review threads on src/conv.jl, src/dim_helpers.jl, src/impl/pooling_direct.jl, and src/pooling.jl.)
src/softmax.jl
softmax(xs) = softmax!(similar(xs), xs)

function softmax!(out::AbstractVecOrMat{T}, xs::AbstractVecOrMat{T}) where {T}
Contributor

Is it possible (without hurting performance) to make this generic over the dimension index? (related: it would be pretty nice if using the clean broadcast versions in the comments didn't hurt performance either—is that because the broadcast infrastructure doesn't insert the equivalent of @inbounds?)

@staticfloat (Contributor, Author)

In my experience, broadcasting creates small objects, and so if you're doing lots of small broadcasts, you can easily get drowned in small allocations. It's almost always faster to just write the loop yourself, unfortunately.
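
To illustrate the trade-off being described, here is a sketch of the two styles for a column-wise softmax; it is illustrative only and not NNlib's actual softmax! implementation:

# Broadcast form: concise, but the reductions and fused broadcasts allocate
# intermediate arrays on every call.
function softmax_bcast(xs::AbstractMatrix)
    m = maximum(xs; dims=1)
    e = exp.(xs .- m)
    return e ./ sum(e; dims=1)
end

# Hand-written column-wise loops: no temporaries beyond the output buffer.
function softmax_loop!(out::AbstractMatrix{T}, xs::AbstractMatrix{T}) where {T}
    @inbounds for j in axes(xs, 2)
        m = typemin(T)
        for i in axes(xs, 1)
            m = max(m, xs[i, j])
        end
        s = zero(T)
        for i in axes(xs, 1)
            out[i, j] = exp(xs[i, j] - m)
            s += out[i, j]
        end
        for i in axes(xs, 1)
            out[i, j] /= s
        end
    end
    return out
end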

@staticfloat (Contributor, Author)

I'm going to punt any improvements here off onto #77; this PR is big enough as it is. ;)

@MikeInnes (Member)

This is great; awesome work, Elliot. It'd be great to have a once-over from others who have looked at this code, but otherwise I'm mostly only concerned about the API, which should be simple to review.

Presumably we'll also need to update CuArrays to add the GPU wrappers; might be worth setting that up on a branch so that we can see how it looks.

@avik-pal mentioned this pull request on Feb 24, 2019.
(Resolved review threads on src/pooling.jl and Manifest.toml.)
@staticfloat (Contributor, Author)

Alright; a little bit of progress, and some performance measurements. Looks like it's not as much of a win as I had originally measured; we tend to win on large image sizes, but not necessarily on small image sizes. This is hardly surprising, but still a little disappointing. Here are minimum benchmark results for conv2d across different image sizes showing this effect:

[Benchmark plots: conv2d, conv2d_data, conv2d_filter]
Note that the conv2d_data() results are more or less expected, as I know that our col2im() implementation is slow: it's not as memory-friendly as it should be and some loops need to be reordered, but I'll need to sit down and think about how to do that sometime in the future.

As expected, the biggest perf gains are when there is no padding to deal with, because one of the big pieces of this PR was to separate padding considerations from the inner loop body.

There are some nasty inference gotchas that can come out of the woodwork and suddenly kill performance (e.g. when applying @timeit_debug annotations to methods). This codebase will be a really good test suite for a future static inference proving tool, as it's the kind of code that we really should be able to prove everything about, modulo compiler limitations.

@MikeInnes (Member)

Status? Since this is quite disruptive in terms of other people's PRs, it'd be great to get it landed -- once you're happy with it of course.

@staticfloat (Contributor, Author)

Alright, I'm calling this good. Future performance work can happen later, and since I've got a Flux branch that works with this, I think we can merge now and patch up e.g. the GPU ecosystem before the next release.

@MikeInnes (Member)

Great, please merge at your leisure then.

staticfloat and others added 20 commits March 28, 2019 12:15
…formance monitoring

This uses the new zero-overhead instrumentation capabilities of
`TimerOutputs.jl` to embed instrumentation that gets compiled out by
default, but can be trivially enabled (triggering recompilation of all
instrumented methods) by running `TimerOutputs.enable_debug_timings(NNlib)`.
Also specialize on kernel size, as that turns out to be helpful for performance.
Wait for a parallelized test harness to take advantage of multiple cores
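
For context, the TimerOutputs.jl debug-timing pattern referenced in that commit message looks roughly like this; it is a sketch with a made-up kernel, not the actual NNlib instrumentation:

using TimerOutputs

const to = TimerOutput()

# @timeit_debug expands to a no-op unless debug timings are enabled for the
# enclosing module, so instrumented hot paths pay no cost by default.
function my_kernel!(y, x)   # hypothetical kernel, for illustration only
    @timeit_debug to "inner loop" begin
        y .= 2 .* x
    end
    return y
end

# Opt in later (this recompiles the instrumented methods):
# TimerOutputs.enable_debug_timings(@__MODULE__)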
@codecov-io commented Mar 28, 2019

Codecov Report

Merging #94 into master will increase coverage by 16.47%.
The diff coverage is 85.16%.


@@             Coverage Diff             @@
##           master      #94       +/-   ##
===========================================
+ Coverage   68.46%   84.94%   +16.47%     
===========================================
  Files           9       17        +8     
  Lines         666      611       -55     
===========================================
+ Hits          456      519       +63     
+ Misses        210       92      -118
Impacted Files                Coverage          Δ
src/NNlib.jl                  100% <ø>          (ø) ⬆️
src/gemm.jl                   10% <10%>         (ø)
src/pooling.jl                100% <100%>       (ø)
src/activation.jl             80% <100%>        (ø) ⬆️
src/dim_helpers.jl            73.33% <73.33%>   (ø)
src/dim_helpers/ConvDims.jl   74.5% <74.5%>     (ø)
src/impl/padding_edges.jl     78.26% <78.26%>   (ø)
src/impl/pooling_direct.jl    79.09% <79.09%>   (ø)
src/conv.jl                   84.61% <84.61%>   (+22.54%) ⬆️
src/softmax.jl                87.09% <87.09%>   (+8.14%) ⬆️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
