run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801
Comments
This problem comes from multi-GPU usage. The error you have reported says that the model's parameters or buffers are in two different locations. Moreover, the person who opened the issue, @ahotrod, says this: "Have had many successful SQuAD fine-tuning runs on PyTorch 1.2.0 with Pytorch-Transformers 1.2.0, maybe even Transformers 2.0.0, and Apex 0.1. New environment built with the latest versions (Pytorch 1.3.0, Transformers 2.1.1) spawns data parallel related error above." Please keep us updated on this topic!
Thanks a lot, it works!!!! :)
@TheEdoardo93 After the change of
As stated in the official docs, if you use torch.nn.DataParallel, the module must have its parameters and buffers on device_ids[0] before you run it. You can read more information here and here.
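For illustration, a minimal sketch of that contract; the nn.Linear stand-in and the tensor shapes are placeholders, not anything from this issue:

import torch

# Toy module standing in for the fine-tuned model.
model = torch.nn.Linear(10, 10)

# Wrong: parameters on cuda:3 while device_ids[0] defaults to cuda:0 ->
# raises "module must have its parameters and buffers on device cuda:0
# (device_ids[0]) but found one of them on device: cuda:3".
#   model.to("cuda:3")
#   torch.nn.DataParallel(model)(torch.randn(8, 10))

# Right: put parameters and buffers on device_ids[0] BEFORE wrapping.
model.to("cuda:0")
parallel_model = torch.nn.DataParallel(model)  # replicates over visible GPUs
out = parallel_model(torch.randn(8, 10, device="cuda:0"))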
@TheEdoardo93 I tested transformers/examples/run_glue.py line 425 (commit b0ee7c7), where the model is wrapped in torch.nn.DataParallel. It means that torch.nn.DataParallel is smart enough: even if you define torch.device to be cuda:0, if there are several GPUs available it will use all of them.
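Roughly, the pattern around that line looks like the following; this is a paraphrased sketch under the assumptions above, not the exact run_glue.py source:

import torch

model = torch.nn.Linear(10, 10)  # stand-in for the BERT model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

model.to(device)  # master copy lives on cuda:0, i.e. device_ids[0]
if n_gpu > 1:
    # No device_ids argument, so DataParallel replicates the module across
    # every visible GPU on each forward pass, not just cuda:0.
    model = torch.nn.DataParallel(model)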
Manually setting args.n_gpu = 1 works for me.
Just a simple
But then you are not able to use more than 1 GPU, right?
I have tried it, but it doesn't work for me.
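For reference, a hedged sketch of the single-GPU restriction being discussed; the CUDA_VISIBLE_DEVICES approach is an assumption on my part, the thread itself only mentions args.n_gpu = 1:

# Hide all but one GPU before torch initializes CUDA, so that
# torch.cuda.device_count() == 1 and run_glue.py never wraps the model
# in DataParallel. Equivalent to running:
#   CUDA_VISIBLE_DEVICES=0 python ./examples/run_glue.py ...
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
assert torch.cuda.device_count() <= 1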
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
Model I am using (Bert, XLNet....): Bert
Language I am using the model on (English, Chinese....): English
The problem arises when using:
The task I am working on is:
To Reproduce
Steps to reproduce the behavior:
I've tested using
python -m pytest -sv ./transformers/tests/
python -m pytest -sv ./examples/
and it works fine except for a couple of tests.
After testing, I downloaded the GLUE data files via
https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e
and tried run_glue.py:
pip install -r ./examples/requirements.txt
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python ./examples/run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_eval_batch_size=8 \
  --per_gpu_train_batch_size=8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
and I got this error:
11/11/2019 21:10:50 - INFO - __main__ - Total optimization steps = 345
Epoch:     0%| | 0/3 [00:00<?, ?it/s]
Iteration: 0%| | 0/115 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_glue.py", line 552, in <module>
    main()
  File "./examples/run_glue.py", line 503, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "./examples/run_glue.py", line 146, in train
    outputs = model(**inputs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3
Environment
Additional context
Thank you.