For now, all aggregation is done at the model level: before optimization, we aggregate the model weights or gradients.
However, some optimizers keep internal state, and distributed training with such optimizers can require that this additional information be shared. This is why we should aggregate the optimizer's parameters/state dict instead of the model's gradients/weights.
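To make this concrete, here is a minimal sketch of what averaging the optimizer's state across workers could look like, assuming a PyTorch setup with `torch.distributed` already initialized; the function name is made up for the example and is not MLBench's existing API:

```python
# Minimal sketch (not MLBench's actual API): average every floating-point
# buffer in the optimizer's state (e.g. Adam's exp_avg / exp_avg_sq) across
# all workers. Assumes torch.distributed has been initialized.
import torch
import torch.distributed as dist


def average_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    world_size = dist.get_world_size()
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            # Skip non-tensor entries and step counters; only average the
            # floating-point state buffers.
            if key == "step" or not torch.is_tensor(value):
                continue
            dist.all_reduce(value, op=dist.ReduceOp.SUM)
            value.div_(world_size)
```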
This is an interesting point. Are you mainly talking about Adam? There are quite a few choices to be made that are somehow related to this:
Do you sync the batch norm state (non-optimized running averages) or not?
Do you run Adam or momentum before or after synchronization? If you average the gradients and then run Adam on the averaged gradient, you don't need to communicate Adam's state (see the sketch after this list). I think both options could be valid; I would say that averaging the raw gradients before applying any special optimization algorithm is the most 'theoretically sound'.
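For illustration, a minimal sketch of the "average raw gradients first" option, assuming PyTorch with `torch.distributed` initialized (the function name is only illustrative):

```python
# Sketch: each worker all-reduces the raw gradients, then runs its optimizer
# (e.g. Adam) locally on the averaged gradient, so the optimizer's own state
# never has to be communicated.
import torch
import torch.distributed as dist


def step_with_averaged_gradients(model: torch.nn.Module,
                                 optimizer: torch.optim.Optimizer) -> None:
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
    optimizer.step()  # local Adam/momentum step on the averaged gradient
```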
Let me know if you would be interested in having a quick chat.
@tvogels Not necessarily only Adam. It would be nice to have the option to communicate other optimizer information in general.
For now, all optimization steps are performed after synchronization with all nodes, and the only information communicated is the weight gradients or the weights themselves (no optimizer state at all).
In practice everything works fine, but as we (or anyone using MLBench) add more complex optimizers, having an easily integrated option to communicate optimizer state, rather than relying solely on the model, can be useful. Optimizers could then share information beyond the weights, which opens the door to more complex distributed optimizers. A hypothetical sketch of such an interface is shown below.
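Here is a purely hypothetical sketch of what that could look like, assuming PyTorch and an initialized `torch.distributed` backend; the mixin and function names are illustrative, not an existing MLBench interface:

```python
# Hypothetical interface: an optimizer declares which state tensors it wants
# synchronized, so aggregation is no longer tied to weights/gradients only.
import torch
import torch.distributed as dist


class SyncableOptimizerMixin:
    """Mixin for torch.optim.Optimizer subclasses that expose state to sync."""

    def tensors_to_sync(self):
        # Default: yield every floating-point tensor in the optimizer state.
        for param_state in self.state.values():
            for value in param_state.values():
                if torch.is_tensor(value) and value.is_floating_point():
                    yield value


def aggregate_optimizer(optimizer) -> None:
    """Average all tensors the optimizer asks to have synchronized."""
    world_size = dist.get_world_size()
    for tensor in optimizer.tensors_to_sync():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor.div_(world_size)
```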
For sure, let's maybe discuss with the team tomorrow at 15:30 (usual MLBench meeting)?