Hi,
Thanks for the amazing framework. I have a question about the purpose of the all_gather_list function, which gathers tensors across GPUs. When training with DDP, the gradients are already synchronized before the parameter update, so why is this step needed? Is it just to collate the loss, the number of correct predictions, or the rank (during evaluation)? If so, couldn't one gather all of those after computing the loss, instead of first exchanging the question and context representations and then proceeding from there?
Thanks!
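For context, here is a minimal sketch (not the repository's actual code) of the pattern the question refers to: with in-batch negatives, each GPU gathers the question and context embeddings from all other ranks before computing the loss, so the similarity matrix covers the global batch rather than only the local one. Gathering just the per-GPU losses afterwards would not give each rank access to the other ranks' negatives. Function and variable names below are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def gather_embeddings(local_emb: torch.Tensor) -> torch.Tensor:
    """All-gather per-GPU embeddings so each rank sees the global batch."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    # all_gather does not propagate gradients into the received copies,
    # so keep the local tensor (with its autograd graph) in its own slot.
    gathered[dist.get_rank()] = local_emb
    return torch.cat(gathered, dim=0)

# q_local, c_local: (B, d) question / context embeddings on this GPU.
# After gathering, q_all and c_all are (B * world_size, d), so the
# similarity matrix (and therefore the loss) uses cross-GPU negatives:
#   q_all = gather_embeddings(q_local)
#   c_all = gather_embeddings(c_local)
#   scores = q_all @ c_all.T   # loss computed over the enlarged batch
```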