Change aggregation to aggregate optimizer parameters #296

Open

ehoelzl opened this issue Nov 26, 2020 · 2 comments
Labels
conceptual-change Changes related to how models are trained

Comments

@ehoelzl
Contributor

ehoelzl commented Nov 26, 2020

For now, all aggregation is done at the model level: before the optimization step, we aggregate the model weights or gradients.

However, some optimizers have state, and distributing training with such optimizers can require additional information to be shared. This is why we should aggregate the optimizer's parameters/state dict instead of the model's gradients/weights.
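
For concreteness, here is a minimal sketch of the two levels of aggregation (assuming `torch.distributed`; the helper names are hypothetical and not MLBench's actual aggregation API):

```python
# Minimal sketch (hypothetical helpers, not MLBench's aggregation API):
# contrast between aggregating at the model level and aggregating the
# optimizer's state.
import torch
import torch.distributed as dist


def aggregate_gradients(model, world_size):
    """Current approach: average raw gradients before optimizer.step()."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size


def aggregate_optimizer_state(optimizer, world_size):
    """Proposed approach: average the per-parameter optimizer state
    (e.g. momentum buffers, Adam's exp_avg / exp_avg_sq)."""
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value) and value.is_floating_point():
                dist.all_reduce(value, op=dist.ReduceOp.SUM)
                value /= world_size
```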

@tvogels

tvogels commented Nov 26, 2020

This is an interesting point. Are you mainly talking about Adam? There are quite a few choices to be made that are somehow related to this:

  • Do you sync the batch norm state (non-optimized running averages) or not?
  • Do you run Adam or momentum before or after synchronization? If you average the gradients and then run Adam on the averaged gradient, you don't need to communicate Adam's state (see the sketch below). I think both options could be valid. I would say that averaging raw gradients before using any special optimization algorithm is the most 'theoretically sound'.
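
A sketch of that second ordering, just to make the distinction concrete (the helper is hypothetical and assumes `torch.distributed` is initialized):

```python
# Sketch of the "average gradients, then run Adam locally" ordering
# (hypothetical helper; assumes torch.distributed is initialized).
import torch.distributed as dist


def step_with_averaged_gradients(model, optimizer, loss):
    loss.backward()
    world_size = dist.get_world_size()
    # Average the raw gradients across workers first ...
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    # ... then apply Adam/momentum locally. Every worker sees the same
    # averaged gradient, so the optimizer state evolves identically on
    # all workers and never needs to be communicated.
    optimizer.step()
    optimizer.zero_grad()
```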

Let me know if you would be interested in having a quick chat.

@ehoelzl
Contributor Author

ehoelzl commented Nov 26, 2020

@tvogels Not necessarily only Adam. It would be nice to have the option to communicate other optimizer information in general.

All optimization steps are currently done after synchronization with all nodes, and the only information that is communicated is the weight gradients or the weights themselves (no optimizer state whatsoever).

In practice everything works fine, but as we (or anyone using MLBench) add more complex optimizers, having an easily integrated option to communicate state, rather than relying solely on the model, can be useful. Optimizers could then share information other than just the weights, which gives us the option to implement more complex distributed optimizers; see the sketch below.
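
A rough sketch of what such an option could look like (the wrapper class is hypothetical, not an existing MLBench component):

```python
# Hypothetical wrapper (not an existing MLBench class): lets any optimizer
# declare which tensors should be aggregated across workers, instead of
# hard-coding weight/gradient aggregation at the model level.
import torch
import torch.distributed as dist


class DistributedOptimizerWrapper:
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def tensors_to_sync(self):
        """Yield every floating-point tensor in the optimizer state
        (momentum buffers, Adam moments, ...)."""
        for state in self.optimizer.state.values():
            for value in state.values():
                if torch.is_tensor(value) and value.is_floating_point():
                    yield value

    def sync_state(self):
        """Average the declared state tensors across all workers."""
        world_size = dist.get_world_size()
        for t in self.tensors_to_sync():
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            t /= world_size

    def step(self):
        self.optimizer.step()
```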

For sure, let's maybe discuss it with the team tomorrow at 15:30 (usual MLBench meeting)?

@ehoelzl ehoelzl added the conceptual-change Changes related to how models are trained label Nov 27, 2020