Hi,
I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.
I have a 4-GPU server, and was trying to run run_classifier.py
in two ways:
(a) run single-node distributed training with 4 processes, each with a minibatch of 32
(b) run Multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same (a rough sketch of both setups is below)
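For concreteness, here is roughly how I understand the two setups in PyTorch. This is a simplified sketch with a stand-in model, not the actual run_classifier.py code; setup (a) assumes the script is launched with torch.distributed.launch, which passes --local_rank to each process:

```python
import argparse
import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # stand-in for the BERT classifier head

# (b) Multi-GPU (DataParallel): a single process; the minibatch of 128 is
# scattered across the 4 GPUs, and the per-GPU losses are gathered back
# onto the default device, where the training loop calls loss.mean().
dp_model = nn.DataParallel(model.cuda())

# (a) Distributed (DistributedDataParallel): 4 processes, one per GPU,
# launched e.g. with `python -m torch.distributed.launch --nproc_per_node=4`.
# Each process feeds its own minibatch of 32; NCCL all-reduces the gradients.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl")
ddp_model = nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[args.local_rank]
)
```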
Intuitively I believe (a) and (b) should yield roughly the same accuracy and training time. Below are my observations:
- (a) runs ~20% faster than (b).
- (b) yields a final evaluation accuracy ~4% higher than (a)
The first looks reasonable, since I guess the loss.mean() is done on the CPU, which may be slower than using NCCL directly. However, I don't quite understand the second observation. Can you please give any hint or reference about the possible cause?
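For what it's worth, here is the toy check behind my intuition that the gradients should match in the two modes (illustrative numbers only, not actual BERT gradients): averaging over the full minibatch of 128 and averaging per-process means over 4 shards of 32 give the same result.

```python
import torch

per_example_grads = torch.randn(128, 10)  # pretend gradients for 128 samples

# DataParallel-style: one mean over the whole minibatch of 128
grad_dp = per_example_grads.mean(dim=0)

# DDP-style: each of 4 processes averages its shard of 32, then the
# NCCL all-reduce averages across the 4 processes
shards = per_example_grads.chunk(4, dim=0)
grad_ddp = torch.stack([s.mean(dim=0) for s in shards]).mean(dim=0)

print(torch.allclose(grad_dp, grad_ddp))  # True (up to float rounding)
```

So unless I am missing something, the optimizer should see the same effective gradient in both cases, which is why the ~4% accuracy gap surprises me.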
Thanks!