Use var to speed up normalisation #1973

Merged: mcabbott merged 1 commit into FluxML:master from use_var on May 27, 2022
Conversation

@mcabbott (Member) commented May 27, 2022

In this comparison, I think that one line accounts for Flux's extra memory use compared to Lux's version, which uses var. (Perhaps that wasn't supported earlier?) This PR fixes it:

julia> x = rand(rng, Float32, 128, 1000);

julia> @btime gradient(model -> sum(model(x)), model);
  4.906 ms (1305 allocations: 29.19 MiB)  # before
  3.136 ms (1219 allocations: 23.32 MiB)  # after

julia> v, re = destructure(model);

julia> @btime gradient(v -> sum(re(v)(x)), v);
  5.044 ms (1562 allocations: 29.47 MiB)  # before
  3.306 ms (1476 allocations: 23.60 MiB)  # after

Compared to Lux, on the same machine with the same sizes:

julia> @btime gradient(p -> sum(Lux.apply(model, x, p, st)[1]), ps);
  4.069 ms (2679 allocations: 23.41 MiB)

julia> ca = ComponentArray(ps);  # to store a flat vector

julia> @btime gradient(p -> sum(Lux.apply(model, x, p, st)[1]), ca);
  4.327 ms (3061 allocations: 24.62 MiB)

julia> v, re = Optimisers.destructure(ps);  # this works too

julia> @btime gradient(v -> sum(Lux.apply(model, x, re(v), st)[1]), v);
  4.093 ms (2847 allocations: 23.69 MiB)
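
For context, the kind of one-line change this makes inside src/layers/normalise.jl looks roughly like the sketch below; reduce_dims stands in for whatever dims the layer reduces over, and this is an illustration rather than the exact diff:

# before: materialises (x .- μ) .^ 2 as a temporary array before reducing it
σ² = mean((x .- μ) .^ 2; dims=reduce_dims)

# after: Statistics.var with a precomputed mean skips that temporary
σ² = var(x; dims=reduce_dims, mean=μ, corrected=false)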

@codecov-commenter commented

Codecov Report

Merging #1973 (a9d5f44) into master (28ee7b4) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1973   +/-   ##
=======================================
  Coverage   87.94%   87.94%           
=======================================
  Files          19       19           
  Lines        1485     1485           
=======================================
  Hits         1306     1306           
  Misses        179      179           
Impacted Files            Coverage Δ
src/layers/normalise.jl   88.81% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ToucheSir (Member) left a comment


I think this was even discussed at some point but fell through the cracks. LGTM.

@mcabbott mcabbott merged commit e4f8678 into FluxML:master May 27, 2022
@mcabbott mcabbott deleted the use_var branch May 27, 2022 04:35
@CarloLucibello (Member) commented:

Does the gradient propagate to the keyword argument? I think that was the problem.

@cossio (Contributor) commented May 27, 2022

There don't seem to be any tests checking the gradient of BatchNorm?

@mcabbott (Member, Author) commented:

> Does the gradient propagate to the keyword argument? I think that was the problem.

The gradient of the keyword is correctly zero, I think:

julia> gradient(x3, m3) do x, m
         var(x; mean=m)
       end
([-0.07914716919668174, 0.04399098883554592, 0.035156180361135936], nothing)

julia> ForwardDiff.gradient(x3) do x
         var(x; mean=m3)
       end
3-element Vector{Float64}:
 -0.07914716919668174
  0.04399098883554592
  0.035156180361135936

julia> ForwardDiff.derivative(m3) do m
         var(x3; mean=m)
       end
-1.1102230246251565e-16
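
(For completeness, a hypothetical setup under which a check like the one above runs; the original x3 and m3 aren't shown in the thread, and any small vector with m3 = mean(x3) gives the same near-zero derivative for the mean keyword:)

julia> using Statistics, Zygote, ForwardDiff

julia> x3 = randn(3);  # hypothetical stand-in for the x3 used above

julia> m3 = mean(x3);  # derivative of var(x3; mean=m3) w.r.t. m3 should be ≈ 0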

Xref FluxML/Zygote.jl#478

@mcabbott (Member, Author) commented:

You could save a few more copies here by making some μ, σ² = mean_var(x) whose gradient rule computes both at once. Or a rule for _norm_layer_forward(x, μ, σ², ϵ), or better, fuse these two.
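
A rough sketch of what such a fused mean_var could look like, with a hand-written reverse rule so that both gradients come out of one pass. The name mean_var, the dims handling, and the ChainRulesCore rule are assumptions for illustration, not code from this PR:

using Statistics, ChainRulesCore

# Fused mean and (uncorrected) variance, as the norm layers use them.
function mean_var(x; dims=:)
    μ = mean(x; dims=dims)
    σ² = mean(abs2.(x .- μ); dims=dims)
    return μ, σ²
end

function ChainRulesCore.rrule(::typeof(mean_var), x; dims=:)
    μ = mean(x; dims=dims)
    diff = x .- μ
    σ² = mean(abs2.(diff); dims=dims)
    n = length(x) ÷ length(μ)  # elements reduced per statistic
    function mean_var_pullback(ȳ)
        dμ, dσ² = ȳ[1], ȳ[2]
        # ∂μ/∂x = 1/n, and since Σ(x - μ) = 0, ∂σ²/∂x = 2(x - μ)/n.
        # (Real code would also handle ZeroTangent cotangents here.)
        dx = @. dμ / n + 2 * diff * dσ² / n
        return (NoTangent(), dx)
    end
    return (μ, σ²), mean_var_pullback
end

This reuses diff = x .- μ for both the variance and its pullback, which is where the extra copies would be saved.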

@cossio (Contributor) commented May 27, 2022

The centered second moment <(x - m)^2> has a minimum when m coincides with the mean, m = <x>, so the gradient is correctly zero in this case.
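
Spelled out in the same notation: d/dm <(x - m)^2> = -2<x - m> = -2(<x> - m), which vanishes exactly at m = <x>. So the zero gradient for the mean keyword, and the ≈ 0 ForwardDiff.derivative above, are expected rather than a bug.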

@CarloLucibello (Member) commented:

I didn't check, but maybe we have wrong second derivatives?

@mcabbott (Member, Author) commented:

Here's a gist: https://gist.github.com/mcabbott/57befcf926b839e5e528ace38f018a66

tl;dr: 2nd derivatives are fine when you compute the mean one line above. If you supply the mean from completely outside, then my head hurts; it's some overparameterised 2nd-order tangent story.

@CarloLucibello (Member) commented:

OK, thanks. I don't understand why, but we are good.
