Use `var` to speed up normalisation #1973
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##           master    #1973   +/-  ##
=======================================
  Coverage   87.94%   87.94%
=======================================
  Files          19       19
  Lines        1485     1485
=======================================
  Hits         1306     1306
  Misses        179      179
```
Continue to review full report at Codecov.
I think this was even discussed at some point but fell through the cracks. LGTM.
Does the gradient propagate to the keyword argument? I think that was the problem.
There don't seem to be any tests checking the gradient of the `mean` keyword.
The gradient of the keyword is correctly zero, I think:

```julia
julia> gradient(x3, m3) do x, m
           var(x; mean=m)
       end
([-0.07914716919668174, 0.04399098883554592, 0.035156180361135936], nothing)

julia> ForwardDiff.gradient(x3) do x
           var(x; mean=m3)
       end
3-element Vector{Float64}:
 -0.07914716919668174
  0.04399098883554592
  0.035156180361135936

julia> ForwardDiff.derivative(m3) do m
           var(x3; mean=m)
       end
-1.1102230246251565e-16
```

Xref FluxML/Zygote.jl#478
You could save a few more copies here by making some …
The centered second moment …
I didn't check, but maybe we have wrong second derivatives?
Here's a gist: https://gist.github.com/mcabbott/57befcf926b839e5e528ace38f018a66. The tl;dr is that second derivatives are fine where you compute the mean one line above. If you supply the mean from completely outside, then my head hurts; it's some overparameterised second-order tangent story.
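One way to check this numerically is forward-over-reverse against plain ForwardDiff (a minimal sketch of the two cases from the gist; `Zygote.hessian` runs ForwardDiff over Zygote's gradient):

```julia
using Statistics, Zygote, ForwardDiff

x = randn(3)

# Case 1: mean computed one line above, inside the differentiated function.
f(y) = var(y; mean = mean(y))
Zygote.hessian(f, x) ≈ ForwardDiff.hessian(f, x)  # fine, per the gist

# Case 2: mean supplied from completely outside.
m = mean(x)
g(y) = var(y; mean = m)
Zygote.hessian(g, x) ≈ ForwardDiff.hessian(g, x)  # the murky case
```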
Ok, thanks. So I don't understand why, but we are good.
In this comparison, I think that one line accounts for Flux's extra memory use, compared to Lux's version with `var`. (Perhaps that wasn't supported earlier?) This PR fixes it; compared to Lux, same machine, same size: