
Improved time to first gradient #151

Merged · 3 commits · May 1, 2022
Conversation

theabhirath
Member

@theabhirath theabhirath commented Apr 27, 2022

Edit: this initially had some benchmarks that weren't completely accurate, because I'd left the REPL running and so they weren't measuring the first Zygote.gradient call. The DenseNet benchmark is accurate in this regard.

This PR (building on the work done by @DhairyaLGandhi in #150) uses a Flux v0.13 feature (namely, the fact that Chain(::Vector) is valid syntax), along with returning a Chain as the output of conv_bn, to halve compilation time for most models (and better still for some). From a cold start (first Zygote.gradient call):

julia> model = DenseNet();

julia> ip = rand(Float32, 224, 224, 3, 1);

julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
 78.400696 seconds (124.71 M allocations: 11.321 GiB, 1.71% gc time, 96.65% compilation time)
julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
 28.161918 seconds (88.19 M allocations: 8.970 GiB, 3.66% gc time, 89.48% compilation time)
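The two changes can be sketched roughly like this (a simplified illustration, not the exact Metalhead code; the `conv_bn` signature here is abbreviated):

```julia
using Flux

# Flux v0.13 allows building a Chain from a Vector. The Vector erases the
# per-layer types from the Chain's type parameters, so Zygote has far less
# code to specialize on, at a small runtime cost versus a Tuple-backed Chain.
layers = []
push!(layers, Conv((3, 3), 3 => 16; pad = 1))
push!(layers, BatchNorm(16, relu))
push!(layers, MaxPool((2, 2)))
model = Chain(layers)   # Vector-backed Chain

# Sketch of conv_bn returning a Chain, instead of an array of layers to be
# splatted by the caller (argument names are illustrative):
function conv_bn(kernelsize, inplanes, outplanes; stride = 1, pad = 0)
    return Chain(Conv(kernelsize, inplanes => outplanes; stride, pad, bias = false),
                 BatchNorm(outplanes, relu))
end
```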

@theabhirath
Member Author

Seems to have eased up some memory pressure as well - Ubuntu tests on nightly are now passing 🎉

@ToucheSir
Member

This is the first time that compilation time for a first gradient has gone under 90%. I can't believe my eyes. Is it safe to say that DenseNet TTFG is no longer a concern either?

@theabhirath
Member Author

theabhirath commented Apr 27, 2022

Well, there's probably a way to get it down even further but for now, this improvement looks pretty surreal (Chain(::Vector) is an absolute beast 😳).

Before:

julia> model = DenseNet();

julia> ip = rand(Float32, 224, 224, 3, 1);

julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
 78.400696 seconds (124.71 M allocations: 11.321 GiB, 1.71% gc time, 96.65% compilation time)

This PR:

julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
 28.161918 seconds (88.19 M allocations: 8.970 GiB, 3.66% gc time, 89.48% compilation time)

@theabhirath
Member Author

theabhirath commented Apr 27, 2022

This is the first time that compilation time for a first gradient has gone under 90%

This might be slightly misleading; I think I left the REPL running 😅 The exact numbers vary between runs, but one thing is clear: in every case (including a completely fresh REPL), there's at least a 2x improvement. There is also a consistent 17-18 seconds that Zygote itself takes to compile the first gradient call. I'm not sure whether that can go down, but it would help, because I think the models themselves are currently doing all they can.

@theabhirath
Member Author

More benchmarks. This is how long it took to run the tests in February:

[screenshot: test run times, February]

And this is today, this PR:

[screenshot: test run times with this PR, April 27, 2022]

Definitely a step in the right direction 😁

@DhairyaLGandhi
Member

DhairyaLGandhi commented Apr 27, 2022

there's at least 2x improvement

That's right, and something I showed in #150 as well. The difference is that I see an order of magnitude more on first compile, which is very curious.

17-18 seconds that Zygote itself takes to compile the first gradient call - not sure if there's some way that can go down

Yes, there is. I have experimented with precompile statements; in fact, we used to call gradient in Zygote during precompilation for exactly this reason. We were able to shave the "compile Zygote" time by almost an order of magnitude, IIRC.
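A minimal sketch of that idea (hypothetical; the exact statements Zygote used during precompilation may have differed):

```julia
module MyPackage

using Flux, Zygote

# Running a small gradient at precompile time forces much of the generic AD
# machinery to compile ahead of time, so the first user-facing
# Zygote.gradient call is cheaper. Sizes and layers here are illustrative.
let m = Chain(Dense(2 => 2, relu), Dense(2 => 1)), x = rand(Float32, 2, 1)
    Zygote.gradient((m, x) -> sum(m(x)), m, x)
end

end # module
```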

I think the models are currently doing all they can

That's not strictly the case 😅 If you were to try an older Metalhead + Flux + Zygote, you would see ~4x faster TTFGs in some cases. There are still some tricks we can apply to ease the compilation pressure, mostly to do with caching and stability.

@ToucheSir
Member

There are still some tricks we can apply to get compilation pressure eased off, mostly to do with caching and stability.

I'm only aware of the switch from custom layer types to Chain, do you have any pointers to example code from back in the day that shows this?

@darsnack
Member

Can you test this without returning a Chain for conv_bn but still using a Chain(::Vector) everywhere? Just to separate how much benefit is coming from each technique. DNNs are...deep, so having a "fix" for the long chains issue will be important outside of Metalhead.

This is really a sad state of affairs for Zygote that Julia 1.5 -> 1.6 caused such major performance regressions for the most basic operation in ML.

@darsnack
Member

Even the "old" Metalhead used a flat chain for VGG, and DenseNet used a flat chain, replacing only SkipConnection with a custom struct. Focusing on Metalhead as the source of regressions is a red herring, IMO. There are issues on Flux/Zygote pointing out TTFG regressions with Zygote v0.6 for models not from Metalhead.

@darsnack
Member

@ToucheSir I tried testing with FluxML/Zygote.jl#1195 and an afoldl implementation of Chain to see if it helps. Sadly, it doesn't appear to make a difference though I'm not sure that I set everything up correctly.

Another curiosity is that the CI test times have not gone down for this PR. @theabhirath are both the screenshots of the tests above with the same Julia version? I know you like to run nightly so I'd be curious if Julia versions are making a big difference here.

@theabhirath
Member Author

theabhirath commented Apr 27, 2022

Can you test this without returning a Chain for conv_bn but still using a Chain(::Vector) everywhere

There doesn't seem to be much difference in gradient times, but it shaves some time off the forward pass (returning a Chain from conv_bn, that is).

@theabhirath
Member Author

theabhirath commented Apr 27, 2022

are both the screenshots of the tests above with the same Julia version?

Well, they're both nightly 😅 But there have been no major PRs to master that I think could have changed things this drastically, and nothing else has changed between the runs.

Another curiosity is that the CI test times have not gone down for this PR

That may be limited by memory? I'm on a 16 GB machine, while IIRC the runners have less to work with (7 GB, I think?). I'm not sure whether the difference should still show up in some fashion, though.

@ToucheSir
Member

@darsnack that PR won't help TTFG much since Zygote + IRTools still has to churn through all of the control flow in afoldl. IIRC it should be strictly worse than using Chain{Vector}. The main benefits come at runtime.

What might help is optimizing the AD compilation pass itself. I have a local IRTools branch that shaves ~10s off TTFG for ViT through a combination of precompilation and reducing memory allocations in one particularly time-consuming function. However, it's unclear how much mileage is left for this approach, as profiling suggests a lot of time is spent in inference or LLVM. Perhaps 1.8/9 will help with those?

@darsnack
Member

There doesn't seem to be much difference in gradient times but it shaves some time off the forward pass (returning Chain for conv_bn that is)

Just for clarity: you're saying that Chain(::Vector) contributes most of the TTFG improvement or nested Chains (as a result of returning Chain from conv_bn)?

@theabhirath
Member Author

theabhirath commented Apr 27, 2022

Just for clarity: you're saying that Chain(::Vector) contributes most of the TTFG improvement or nested Chains (as a result of returning Chain from conv_bn)?

Chain(::Vector) primarily contributes to reducing TTFG. The nested Chains are helping reduce inference time a bit: about 20-100 ms knocked off the forward pass, depending on the model you check.
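One rough way to see the Chain(::Vector) effect in isolation (a sketch only; each @time should really be run in a fresh Julia session to measure a true first gradient):

```julia
using Flux, Zygote

x = rand(Float32, 8, 1)

# Tuple-backed Chain: every layer's type appears in the Chain's type,
# so the whole pullback is specialized per layer position.
tuple_chain = Chain((Dense(8 => 8, relu) for _ in 1:50)...)

# Vector-backed Chain: one erased element type, much less to specialize.
vector_chain = Chain([Dense(8 => 8, relu) for _ in 1:50])

@time Zygote.gradient(m -> sum(m(x)), tuple_chain);
@time Zygote.gradient(m -> sum(m(x)), vector_chain);
```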

@DhairyaLGandhi
Member

Great, thanks @theabhirath ! I think this is good to go since there is plenty of improvement in there already and we can move ahead with the compilation tirade.

Returning nested Chains and returning Chain out of conv_bn actually contribute a lot.

One final thing would be to re-enable testing gradients of the models; those are currently skipped.

@ToucheSir ToucheSir merged commit 792076f into FluxML:master May 1, 2022
@theabhirath
Member Author

One final thing would be to reenable testing gradients out of the models. Those are skipped currently.

The memory issues on GitHub Actions prevent this; testing locally does take a lot of memory (I've had to intervene to make sure it doesn't write too much into my swap).

@theabhirath theabhirath deleted the conv_bn branch May 1, 2022 02:41
@darsnack
Member

darsnack commented May 1, 2022

This is my fault for not commenting earlier, but I would actually prefer a follow-up PR to remove the nesting. Not just because arbitrary nesting makes iteration and indexing inconvenient, but more practically because nesting is a breaking change. And it doesn't seem necessary when using Chain(::Vector), which makes sense, since the long-chains issue should only affect inference for tuples.

@theabhirath
Member Author

theabhirath commented May 1, 2022

I have a ready followup PR, but here's something I noted while testing with the gradtests on.

With conv_bn returning a Chain, time taken to run the tests:
[screenshot: test run times with conv_bn returning a Chain]

Without (i.e. manually splatting conv_bn everywhere):
[screenshot: test run times with conv_bn splatted manually]

Now, I tested TTFG for some models and it stayed the same; but clearly, this shows that having conv_bn return a Chain helps subsequent gradients. I'm not sure what the way forward is. Like I said, I have a follow-up PR ready in case we choose to revert, but this approach seems to be yielding better results (the breaking-change part is definitely annoying, though; I'm not sure how to get around that).

@darsnack
Member

darsnack commented May 1, 2022

No, a breaking change is okay if it is actually making a difference.

@darsnack
Member

darsnack commented May 1, 2022

Why is the test time so different for AlexNet? It contains no conv_bns, and the merged code doesn't attempt to use Chain(::Vector) for it either; i.e. shouldn't it be exactly the same in those two screenshots?

I feel like we need a more rigorous benchmarking environment beyond ] test to make these decisions.

@theabhirath
Member Author

True, this is completely local and it's not really much of a benchmark because I've just run the tests twice 😅
