Multi-GPU training vs Distributed training #53

Closed
llidev opened this issue Nov 24, 2018 · 2 comments

Comments

@llidev
Contributor

llidev commented Nov 24, 2018

Hi,

I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.

I have a 4-GPU server, and was trying to run run_classifier.py in two ways:

(a) single-node distributed training with 4 processes and a minibatch of 32 per process
(b) multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same

Intuitively, I expected (a) and (b) to yield similar accuracy and training times. Here are my observations:

  1. (a) runs ~20% faster than (b).
  2. (b) yields a final evaluation accuracy about 4% higher than (a).

The first observation seems reasonable, since I guess the loss.mean() is done on the CPU, which may be slower than using NCCL directly. However, I don't quite understand the second observation. Could you please give any hint or reference about the possible cause?
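
For reference, here is roughly how I understand the two setups (a minimal sketch with placeholder names `model`, `input_ids`, `label_ids`, `local_rank`, not the actual run_classifier.py code):

    import torch
    from torch.nn.parallel import DataParallel, DistributedDataParallel

    # (b) multi-GPU: one process; DataParallel splits the batch of 128 across the 4 GPUs,
    # gathers one loss per GPU on the default device, and the mean is taken in Python.
    model_dp = DataParallel(model)                # `model` is a placeholder BERT classifier
    loss = model_dp(input_ids, labels=label_ids)  # tensor with one loss value per GPU
    loss = loss.mean()
    loss.backward()

    # (a) distributed: 4 processes with one GPU and a batch of 32 each; gradients are
    # averaged across processes by an NCCL all-reduce inside backward().
    torch.distributed.init_process_group(backend="nccl")
    model_ddp = DistributedDataParallel(model, device_ids=[local_rank])
    loss = model_ddp(input_ids, labels=label_ids)  # scalar loss in each process
    loss.backward()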

Thanks!

@thomwolf
Member

Hi,

Thanks for the feedback, it is indeed always interesting to compare the various possible ways of training the model.

The most likely cause of (2) is that MRPC is a small dataset and the model shows high variance in its results depending on, among other things, the initialization of the weights (see the original BERT repo on this as well). The distributed and multi-GPU setups probably do not use the random number generators in exactly the same order, which leads to different initializations.

You can get an intuition for this by training with different seeds: you will easily see a 10% variation in the final accuracy...

If you can, a better way to compare the results would thus be to run each training condition with something like 10 different seeds and compare the means and standard deviations of the results.
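
Something along these lines would do it (a rough sketch; `run_one_training` is a placeholder for a complete fine-tuning + evaluation run):

    import random
    import numpy as np
    import torch

    def set_seed(seed):
        # seed everything that can influence weight init and data shuffling
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    accuracies = []
    for seed in range(10):
        set_seed(seed)
        accuracies.append(run_one_training(seed))  # placeholder: one full fine-tuning + eval run

    print("mean: %.4f, std: %.4f" % (np.mean(accuracies), np.std(accuracies)))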

@llidev
Contributor Author

llidev commented Nov 27, 2018

Thanks for your feedback!

After some investigation, it looks like t_total is not set properly for distributed training in BertAdam. Since the training data is sharded across workers, each distributed worker only performs about num_train_steps // world_size optimizer steps, so the t_total it passes to BertAdam should be divided by the worker count.

I have included the following fix in my PR #58

    # t_total controls the length of BertAdam's warmup/decay schedule. In the
    # distributed case each worker only takes its share of the optimizer steps,
    # so the schedule must be based on the per-worker step count.
    t_total = num_train_steps
    if args.local_rank != -1:
        t_total = t_total // torch.distributed.get_world_size()
    optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
                         t_total=t_total)
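
As a concrete illustration with made-up numbers: with num_train_steps = 1000, 4 workers, and warmup_proportion = 0.1, each worker only performs 250 optimizer steps. With the fix, its warmup ends around step 25 and the decay (assuming the default warmup_linear schedule) finishes at step 250, instead of the schedule being calibrated for 1000 steps that never happen.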
