For now, all aggregation is done at the model level: before optimization, we aggregate the model weights or gradients.
However, some optimizers keep internal state, and distributed training with such optimizers can require that this additional information be shared. This is why we should aggregate the optimizer's parameters/state dict instead of the model's gradients/weights.
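To make this concrete, here is a minimal sketch of what averaging the optimizer's state across workers could look like, assuming a PyTorch setup with `torch.distributed` already initialized; the function name is made up for the example and is not MLBench's existing API:

```python
# Minimal sketch (not MLBench's actual API): average every floating-point
# buffer in the optimizer's state (e.g. Adam's exp_avg / exp_avg_sq) across
# all workers. Assumes torch.distributed has been initialized.
import torch
import torch.distributed as dist


def average_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    world_size = dist.get_world_size()
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            # Skip non-tensor entries and step counters; only average the
            # floating-point state buffers.
            if key == "step" or not torch.is_tensor(value):
                continue
            dist.all_reduce(value, op=dist.ReduceOp.SUM)
            value.div_(world_size)
```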
This is an interesting point. Are you mainly talking about Adam? There are quite a few choices to be made that are somehow related to this:
Do you sync the batch norm state (non-optimized running averages) or not?
Do you run Adam or momentum before or after synchronization? If you average the gradients and then run Adam on the averaged gradient, you don't need to communicate Adam's state (see the sketch after this list). I think both options could be valid; I would say that averaging the raw gradients before applying any special optimization algorithm is the most 'theoretically sound'.
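For illustration, a minimal sketch of the "average raw gradients first" option, assuming PyTorch with `torch.distributed` initialized (the function name is only illustrative):

```python
# Sketch: each worker all-reduces the raw gradients, then runs its optimizer
# (e.g. Adam) locally on the averaged gradient, so the optimizer's own state
# never has to be communicated.
import torch
import torch.distributed as dist


def step_with_averaged_gradients(model: torch.nn.Module,
                                 optimizer: torch.optim.Optimizer) -> None:
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
    optimizer.step()  # local Adam/momentum step on the averaged gradient
```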
Let me know if you would be interested in having a quick chat.
@tvogels Not necessarily only Adam. It would be nice to have the option to communicate other optimizer information in general.
For now, all optimization steps are performed after synchronization with all nodes, and the only information communicated is the weight gradients or the weights themselves (no optimizer state at all).
In practice everything works fine, but as we (or anyone using MLBench) add more complex optimizers, having an easily integrated option to communicate optimizer state, rather than relying solely on the model, can be useful. Optimizers could then share information beyond the weights, which opens the door to more complex distributed optimizers. A hypothetical sketch of such an interface is shown below.
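Here is a purely hypothetical sketch of what that could look like, assuming PyTorch and an initialized `torch.distributed` backend; the mixin and function names are illustrative, not an existing MLBench interface:

```python
# Hypothetical interface: an optimizer declares which state tensors it wants
# synchronized, so aggregation is no longer tied to weights/gradients only.
import torch
import torch.distributed as dist


class SyncableOptimizerMixin:
    """Mixin for torch.optim.Optimizer subclasses that expose state to sync."""

    def tensors_to_sync(self):
        # Default: yield every floating-point tensor in the optimizer state.
        for param_state in self.state.values():
            for value in param_state.values():
                if torch.is_tensor(value) and value.is_floating_point():
                    yield value


def aggregate_optimizer(optimizer) -> None:
    """Average all tensors the optimizer asks to have synchronized."""
    world_size = dist.get_world_size()
    for tensor in optimizer.tensors_to_sync():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor.div_(world_size)
```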
For sure, let's maybe discuss with the team tomorrow at 15:30 (usual MLBench meeting)?