
Multi-GPU training vs Distributed training #53

Closed

Description

@llidev

Hi,

I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.

I have a 4-GPU server, and was trying to run run_classifier.py in two ways:

(a) run single-node distributed training with 4 processes and a minibatch of 32 per process
(b) run Multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same (see the sketch right after this list for what I mean by each setup)
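
To be concrete, here is roughly how I set up the two cases (a minimal sketch with a toy model standing in for BERT; I am assuming this mirrors what run_classifier.py does with its --local_rank argument, not quoting its exact code):

```python
import argparse

import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # set by torch.distributed.launch
args = parser.parse_args()

model = nn.Linear(768, 2)  # toy stand-in for the BERT classifier

if args.local_rank != -1:
    # (a) single-node distributed: 4 processes (one per GPU), minibatch of 32 each;
    #     gradients are averaged across processes via NCCL all-reduce during backward().
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl")
    model = nn.parallel.DistributedDataParallel(
        model.cuda(), device_ids=[args.local_rank], output_device=args.local_rank
    )
    train_batch_size = 32   # per process
else:
    # (b) Multi-GPU (DataParallel): one process, minibatch of 128 scattered across the
    #     4 GPUs (32 samples each); per-GPU outputs are gathered back onto GPU 0.
    model = nn.DataParallel(model.cuda())
    train_batch_size = 128  # split across the GPUs
```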

Intuitively, I believe (a) and (b) should yield about the same accuracy and training time. Below are my observations:

  1. (a) runs ~20% faster than (b).
  2. (b) yields a final evaluation accuracy about 4% higher than (a).

The first seems reasonable, since I guess the loss.mean() is done on the CPU, which may be slower than using NCCL directly? However, I don't quite understand the second observation. Can you please give a hint or a reference about the possible cause?
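
For reference, the loss.mean() pattern I am referring to in case (b) looks roughly like this (a simplified stand-in I wrote for illustration, where the model returns the loss itself, as BertForSequenceClassification does when labels are passed; not the actual run_classifier.py code):

```python
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(768, 2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, labels):
        # each replica returns a scalar loss for its shard of the minibatch
        return self.loss_fn(self.fc(x), labels)

model = nn.DataParallel(ToyClassifier().cuda())
x = torch.randn(128, 768).cuda()
labels = torch.randint(0, 2, (128,)).cuda()

loss = model(x, labels)  # DataParallel gathers one scalar per GPU -> tensor of shape (n_gpu,)
loss = loss.mean()       # the explicit mean I am referring to
loss.backward()
```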

Thanks!
