Hi,
I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.
I have a 4-GPU server, and was trying to run run_classifier.py
in two ways:
(a) run single-node distributed training with 4 processes, each with a minibatch of 32
(b) run Multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same (a rough sketch of both setups is below)
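For concreteness, here is roughly how I understand the two setups in PyTorch. This is a simplified sketch with a stand-in model, not the actual run_classifier.py code; setup (a) assumes the script is launched with torch.distributed.launch, which passes --local_rank to each process:

```python
import argparse
import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # stand-in for the BERT classifier head

# (b) Multi-GPU (DataParallel): a single process; the minibatch of 128 is
# scattered across the 4 GPUs, and the per-GPU losses are gathered back
# onto the default device, where the training loop calls loss.mean().
dp_model = nn.DataParallel(model.cuda())

# (a) Distributed (DistributedDataParallel): 4 processes, one per GPU,
# launched e.g. with `python -m torch.distributed.launch --nproc_per_node=4`.
# Each process feeds its own minibatch of 32; NCCL all-reduces the gradients.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl")
ddp_model = nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[args.local_rank]
)
```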
Intuitively I believe (a) and (b) should yield roughly the same accuracy and training time. Below are my observations:
- (a) runs ~20% faster than (b).
- (b) yields a final evaluation accuracy ~4% higher than (a)
The first looks reasonable, since I guess the loss.mean() is done on the CPU, which may be slower than using NCCL directly. However, I don't quite understand the second observation. Can you please give any hint or reference about the possible cause?
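For what it's worth, here is the toy check behind my intuition that the gradients should match in the two modes (illustrative numbers only, not actual BERT gradients): averaging over the full minibatch of 128 and averaging per-process means over 4 shards of 32 give the same result.

```python
import torch

per_example_grads = torch.randn(128, 10)  # pretend gradients for 128 samples

# DataParallel-style: one mean over the whole minibatch of 128
grad_dp = per_example_grads.mean(dim=0)

# DDP-style: each of 4 processes averages its shard of 32, then the
# NCCL all-reduce averages across the 4 processes
shards = per_example_grads.chunk(4, dim=0)
grad_ddp = torch.stack([s.mean(dim=0) for s in shards]).mean(dim=0)

print(torch.allclose(grad_dp, grad_ddp))  # True (up to float rounding)
```

So unless I am missing something, the optimizer should see the same effective gradient in both cases, which is why the ~4% accuracy gap surprises me.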
Thanks!