Multi-GPU training vs Distributed training #53
Hi, thanks for the feedback, it's always interesting to compare the various possible ways to train the model. The most likely cause for (2) is that MRPC is a small dataset, so the model shows high variance in the results depending on, for example, the initialization of the weights (see the original BERT repo on that as well). The distributed and multi-GPU setups probably do not use the random generators in exactly the same order, which leads to different initializations. You can get an intuition for this by training with different seeds: you will easily see a 10% variation in the final accuracy. If you can, a better way to compare the results would therefore be to take something like 10 different seeds for each training condition and compare the mean and standard deviation of the results.
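A minimal sketch of that multi-seed comparison (assuming your copy of run_classifier.py exposes a --seed flag and writes an eval_accuracy line into eval_results.txt in the output directory; adjust flags and paths to your version of the script):

```python
# Rough sketch (not from the repo): run the same training config with several
# seeds and compare the mean/std of the final accuracy.
import re
import statistics
import subprocess

SEEDS = range(1, 11)
accuracies = []

for seed in SEEDS:
    output_dir = f"/tmp/mrpc_seed_{seed}"
    subprocess.run(
        [
            "python", "run_classifier.py",
            "--task_name", "MRPC",
            "--do_train", "--do_eval",
            "--data_dir", "glue_data/MRPC",
            "--bert_model", "bert-base-uncased",
            "--train_batch_size", "32",
            "--seed", str(seed),
            "--output_dir", output_dir,
        ],
        check=True,
    )
    # Pull the final accuracy out of the eval results file (assumed format).
    with open(f"{output_dir}/eval_results.txt") as f:
        match = re.search(r"eval_accuracy\s*=\s*([0-9.]+)", f.read())
    accuracies.append(float(match.group(1)))

print(f"mean={statistics.mean(accuracies):.4f}  std={statistics.stdev(accuracies):.4f}")
```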
Thanks for your feedback! After some investigation, I have included a fix for this in my PR #58.
Hi,
I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.
I have a 4-GPU server and was trying to run run_classifier.py in two ways:
(a) single-node distributed training with 4 processes and a minibatch of 32 per process;
(b) multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same.
Intuitively, I believe (a) and (b) should yield similar accuracy and training time. Below please find my observations: (1) the multi-GPU run is noticeably slower than the distributed run, and (2) the two setups end up with different final accuracies.
The first observation seems reasonable, since I guess the loss.mean() is done by the CPU, which may be slower than using NCCL directly. However, I don't quite understand the second observation. Can you please give any hint or reference about the possible cause?
Thanks!
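For reference, here is a rough, self-contained sketch of the two setups being compared. It uses a toy model and random data, so it is not the actual code in run_classifier.py, but it shows where the loss.mean() comes from in the multi-GPU case and how the effective batch size (4 x 32 vs 1 x 128) lines up:

```python
# Rough, simplified sketch of the two setups (toy model, random data).
#
# (a) distributed, 4 processes:
#     python -m torch.distributed.launch --nproc_per_node=4 sketch.py --distributed
# (b) multi-GPU, one process:
#     python sketch.py
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyModel(nn.Module):
    """Stand-in for BertForSequenceClassification: returns a loss given labels."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 2)

    def forward(self, x, labels):
        return F.cross_entropy(self.fc(x), labels)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--distributed", action="store_true")
    parser.add_argument("--local_rank", type=int, default=0)  # set by the launcher
    args = parser.parse_args()

    if args.distributed:
        # Setup (a): each of the 4 processes owns one GPU and a 32-sample batch;
        # gradients are all-reduced over NCCL inside backward(). The effective
        # batch per step is 4 x 32 = 128, the same as setup (b).
        torch.distributed.init_process_group(backend="nccl")
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        model = torch.nn.parallel.DistributedDataParallel(
            ToyModel().to(device), device_ids=[args.local_rank]
        )
        x = torch.randn(32, 16, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        loss = model(x, y)
    else:
        # Setup (b): one process; DataParallel scatters the 128-sample batch
        # across the visible GPUs and gathers one scalar loss per GPU back onto
        # GPU 0, hence the loss.mean() seen in the training loop.
        model = torch.nn.DataParallel(ToyModel().cuda())
        x = torch.randn(128, 16).cuda()
        y = torch.randint(0, 2, (128,)).cuda()
        loss = model(x, y).mean()  # shape (n_gpu,) before the mean

    loss.backward()


if __name__ == "__main__":
    main()
```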