--fp16 causes an issue when running example scripts in distributed mode #4657
Comments
I've also tried 3 different machines, all Ubuntu 18.04 but with different GPU sets: 2 Tesla V100-SXM2, 2 P100-SXM2, and 2 Tesla M40. I still get the same error.
Can you install the repo from source and try again? There have been some issues with PyTorch upstream that Julien addressed here: #4300. So you can try with the latest master branch.
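For reference, a standard way to do that source install would have been an editable install from a fresh clone (a generic sketch, not a command quoted from this thread):

```bash
# Clone the repo and install it in editable ("dev") mode so local edits take effect.
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
```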
@BramVanroy, that merge request appears to have been merged prior to the v2.10.0 release. I've installed both …
The one thing I can think of that you can try is specifically setting the current device for each process. Can you try cloning the library, installing it in dev mode, and adding a line here (`transformers/examples/language-modeling/run_language_modeling.py`, lines 134 to 136 at 0866669) so that it looks like this:

```python
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
torch.cuda.set_device(training_args.device)
if data_args.eval_data_file is None and training_args.do_eval:
```
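For context, `torch.distributed.launch` starts one process per GPU and passes each its own `--local_rank`, and in v2.10.0 `TrainingArguments.device` resolved per process roughly as sketched below (a simplification of the library's behavior, not its exact code):

```python
import torch

def resolve_device(local_rank: int) -> torch.device:
    # Rough sketch of how TrainingArguments.device picked a device in v2.10.0.
    if not torch.cuda.is_available():
        return torch.device("cpu")
    if local_rank == -1:
        # Non-distributed run: a bare "cuda" device with no explicit index.
        return torch.device("cuda")
    # Distributed run: one GPU per process, keyed by the local rank.
    return torch.device("cuda", local_rank)
```

Calling `torch.cuda.set_device(training_args.device)` right after argument parsing therefore pins each DDP process to its own GPU before apex initializes.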
Thanks @BramVanroy, your suggestion worked. I really appreciate it.
Re-opening so that we can close this in a PR. |
@BramVanroy, while your suggestion works for multiple GPUs, I get the following errors when trying to use a single GPU.
@CMobley7 Thanks for the update! I pushed another update to my PR; can you try that one out? When we are not using DDP (and local_rank is -1), we do not specify the GPU ID to use. It's best to strictly select that main device, so now we select it by using index 0. (This will still work if you set different devices with CUDA_VISIBLE_DEVICES; it'll just select the first device available in that environment.)
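A minimal sketch of the selection logic described above (the function name `select_device` is illustrative; the authoritative change is the diff in #4728):

```python
import torch

def select_device(local_rank: int) -> torch.device:
    # Sketch of the fix: always pin an explicit GPU index per process.
    if not torch.cuda.is_available():
        return torch.device("cpu")
    if local_rank == -1:
        # No DDP: strictly select the main device by using index 0.
        # Under CUDA_VISIBLE_DEVICES this is the first device visible
        # in that restricted environment, so it still works as expected.
        device = torch.device("cuda", 0)
    else:
        # DDP: each process takes the GPU matching its local rank.
        device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    return device
```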
@BramVanroy, I can confirm that the changes made in #4728 successfully fix the apex issues with both a single GPU and multiple GPUs. I've tested on 3 different machines, all Ubuntu 18.04 but with different GPU sets: 2 Tesla V100-SXM2, 2 P100-SXM2, and 2 Tesla M40. Thanks for your help.
* manually set device in trainer args
* check if current device is cuda before set_device
* Explicitly set GPU ID when using single GPU

This addresses #4657 (comment)
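Read together, those commit messages suggest a guard along these lines (a hedged paraphrase, not the exact diff from the PR):

```python
import torch

device = training_args.device  # assumes parsed TrainingArguments, as in the example script
if device.type == "cuda":
    # torch.cuda.set_device rejects CPU devices and devices without an
    # index, so fall back to GPU 0 when no index is set (single-GPU case).
    torch.cuda.set_device(device if device.index is not None else 0)
```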
Thank you @CMobley7 for the extensive testing, this is very valuable. And thanks @BramVanroy for fixing! |
🐛 Bug
Information
Model I am using (Bert, XLNet ...):
roberta-large
Language I am using the model on (English, Chinese ...):
English
The problem arises when using: the official example scripts.
The tasks I am working on are: language modeling with `run_language_modeling.py` and the SST-2 task with `run_glue.py`.
To reproduce
If I run either of the following commands, I get the error included below. However, if I remove `--fp16`, everything works normally. Also, if I add `--fp16` but run it non-distributed, everything works normally. So, it appears there is an issue with running `--fp16` in a distributed fashion. I haven't had an issue with this before, so I'm not sure what the problem is. Any ideas? Thanks in advance. I installed apex in two different ways, but still get the same results.
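The exact commands were not preserved in this copy of the thread. For illustration only, a two-GPU distributed launch of `run_language_modeling.py` from that era looked roughly like this (paths and hyperparameters are placeholders, not the author's):

```bash
# Illustrative only — not the author's exact command.
python -m torch.distributed.launch --nproc_per_node=2 run_language_modeling.py \
    --model_name_or_path roberta-large \
    --mlm \
    --do_train \
    --train_data_file /path/to/train.txt \
    --output_dir /path/to/output \
    --fp16
```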
Environment info
`transformers` version: 2.10.0