
run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801

Closed
insublee opened this issue Nov 12, 2019 · 10 comments

@insublee

insublee commented Nov 12, 2019

🐛 Bug

Model I am using (Bert, XLNet....): Bert

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • the official example scripts: (give details) : transformers/examples/run_glue.py
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name) : MRPC
  • my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

I've tested with
python -m pytest -sv ./transformers/tests/
python -m pytest -sv ./examples/
and everything passes except for a couple of tests.

After the tests, I downloaded the GLUE data via
https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e
and tried run_glue.py.
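
For reference, the download step looks roughly like this (assuming the gist's download_glue_data.py script and its --data_dir/--tasks flags; the data directory is whatever GLUE_DIR points to):

python download_glue_data.py --data_dir /path/to/glue --tasks MRPC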

pip install -r ./examples/requirements.txt
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_eval_batch_size=8 \
  --per_gpu_train_batch_size=8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/

and I got this error:

11/11/2019 21:10:50 - INFO - __main__ - Total optimization steps = 345
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
          | 0/115 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_glue.py", line 552, in <module>
    main()
  File "./examples/run_glue.py", line 503, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "./examples/run_glue.py", line 146, in train
    outputs = model(**inputs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3

Environment

  • OS: Ubuntu 16.04 LTS
  • Python version: 3.7.5
  • PyTorch version: 1.2.0
  • PyTorch Transformers version (or branch): 2.1.1
  • Using GPU? Yes, 4x RTX 2080 Ti
  • Distributed or parallel setup? CUDA 10.0, cuDNN 7.6.4
  • Any other relevant information:

Additional context

thank you.

@TheEdoardo93

This problem comes from multi-GPU usage: the error you reported says that some of the model's parameters or buffers are on a different device than device_ids[0].
That said, it's probably related to issue #1504. Reading the comments there, I saw that @h-sugi suggested 4 days ago modifying the source code in run_**.py like this:

BEFORE: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
AFTER: device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")
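
For anyone wondering why that one-character change can matter: a bare "cuda" resolves to whatever the current CUDA device happens to be, while "cuda:0" pins tensors to device 0, which is what DataParallel uses as device_ids[0] by default. A small illustrative sketch (assumes at least two GPUs and exists purely to show how "cuda" resolves):

import torch

if torch.cuda.device_count() > 1:
    torch.cuda.set_device(1)                      # suppose something set the current device to cuda:1
    t_implicit = torch.zeros(1, device="cuda")    # lands on cuda:1, the *current* device
    t_explicit = torch.zeros(1, device="cuda:0")  # always lands on cuda:0
    print(t_implicit.device, t_explicit.device)   # cuda:1 cuda:0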

Moreover, @ahotrod, who opened that issue, says: "Have had many successful SQuAD fine-tuning runs on PyTorch 1.2.0 with Pytorch-Transformers 1.2.0, maybe even Transformers 2.0.0, and Apex 0.1. New environment built with the latest versions (Pytorch 1.3.0, Transformers 2.1.1) spawns data parallel related error above."

Please, keep us updated on this topic!


@insublee
Author

thanks a lot. it works!!!! :)

@rezasanatkar

@TheEdoardo93 After changing cuda to cuda:0, will the jobs still use multiple GPUs?

@TheEdoardo93

As stated in the official docs, torch.device('cuda:0') on its own targets only a single GPU. If you want to use multiple GPUs, you can run your operations in parallel by wrapping the model in DataParallel: model = nn.DataParallel(model)

You can read more in the PyTorch documentation on CUDA semantics and on DataParallel.
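
A minimal, self-contained sketch of that pattern (a toy linear layer stands in for the real model; the names are illustrative only):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)      # toy stand-in for the BERT classifier
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch along dim 0, runs a replica on every visible GPU,
    # and gathers the outputs back on device_ids[0] (cuda:0 by default).
    model = nn.DataParallel(model)

batch = torch.randn(8, 10).to(device)    # inputs only need to be on the primary device
output = model(batch)                    # shape: (8, 2)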


@rezasanatkar

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

@zjcerwin

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

Manually setting args.n_gpu = 1 works for me.
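
For context, that works because the example scripts only wrap the model in DataParallel when more than one GPU is counted. A self-contained sketch of the idea (a paraphrase of that logic, not the actual run_glue.py code; the toy model is illustrative):

import torch
import torch.nn as nn
from argparse import Namespace

args = Namespace(no_cuda=False)
args.device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 1  # workaround: force single-GPU mode instead of torch.cuda.device_count()

model = nn.Linear(10, 2).to(args.device)  # toy stand-in for the real model
if args.n_gpu > 1:
    model = nn.DataParallel(model)        # never reached with n_gpu = 1, so everything stays on cuda:0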

@ghost

ghost commented Feb 3, 2020

Just a simple os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU_NUM' at the beginning of the script should work.
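
A small, self-contained version of that suggestion ('GPU_NUM' is a placeholder, e.g. '0'; the variable has to be set before anything initializes CUDA):

import os

# Must be set before CUDA is initialized; "0" is a placeholder for whichever GPU you want visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # reports 1, so code that keys off the device count skips DataParallel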

@yuvalkirstain
Contributor

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

Manually setting args.n_gpu = 1 works for me.

But then you are not able to use more than one GPU, right?

@ksboy

ksboy commented Mar 31, 2020

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

I have tried it, but it doesn't work for me.

@stale

stale bot commented May 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
