
run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801

Closed
insublee opened this issue Nov 12, 2019 · 10 comments

@insublee

insublee commented Nov 12, 2019

🐛 Bug

Model I am using (Bert, XLNet....): Bert

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • the official example scripts: (give details) : transformers/examples/run_glue.py
  • my own modified scripts: (give details)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name) : MRPC
  • my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

I've tested with
python -m pytest -sv ./transformers/tests/
python -m pytest -sv ./examples/
and everything passes except for a couple of tests.

After the tests, I downloaded the GLUE data via
https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e
and tried run_glue.py.
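
For reference, the download step looks roughly like this (assuming the gist's download_glue_data.py script and its --data_dir/--tasks flags; the data directory is whatever GLUE_DIR points to):

python download_glue_data.py --data_dir /path/to/glue --tasks MRPC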

pip install -r ./examples/requirements.txt
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_eval_batch_size=8 \
  --per_gpu_train_batch_size=8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/

and I got this error:

11/11/2019 21:10:50 - INFO - __main__ - Total optimization steps = 345
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
          | 0/115 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_glue.py", line 552, in <module>
    main()
  File "./examples/run_glue.py", line 503, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "./examples/run_glue.py", line 146, in train
    outputs = model(**inputs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/insublee/anaconda3/envs/py_torch4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3

Environment

  • OS: Ubuntu 16.04 LTS
  • Python version: 3.7.5
  • PyTorch version: 1.2.0
  • PyTorch Transformers version (or branch): 2.1.1
  • Using GPU? Yes, 4x RTX 2080 Ti
  • Distributed or parallel setup? CUDA 10.0, cuDNN 7.6.4
  • Any other relevant information:

Additional context

thank you.

@TheEdoardo93

This problem comes from multi-GPU usage: the error you reported says that some of the model's parameters or buffers are on a different device than device_ids[0].
That said, it's probably related to issue #1504. Reading the comments there, I saw that @h-sugi suggested 4 days ago modifying the source code in run_**.py like this:

BEFORE: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
AFTER: device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")
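
For anyone wondering why that one-character change can matter: a bare "cuda" resolves to whatever the current CUDA device happens to be, while "cuda:0" pins tensors to device 0, which is what DataParallel uses as device_ids[0] by default. A small illustrative sketch (assumes at least two GPUs and exists purely to show how "cuda" resolves):

import torch

if torch.cuda.device_count() > 1:
    torch.cuda.set_device(1)                      # suppose something set the current device to cuda:1
    t_implicit = torch.zeros(1, device="cuda")    # lands on cuda:1, the *current* device
    t_explicit = torch.zeros(1, device="cuda:0")  # always lands on cuda:0
    print(t_implicit.device, t_explicit.device)   # cuda:1 cuda:0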

Moreover, @ahotrod, who opened that issue, says: "Have had many successful SQuAD fine-tuning runs on PyTorch 1.2.0 with Pytorch-Transformers 1.2.0, maybe even Transformers 2.0.0, and Apex 0.1. New environment built with the latest versions (Pytorch 1.3.0, Transformers 2.1.1) spawns data parallel related error above."

Please, keep us updated on this topic!


@insublee
Author

thanks a lot. it works!!!! :)

@rezasanatkar

@TheEdoardo93 After changing cuda to cuda:0, will the jobs still use multiple GPUs?

@TheEdoardo93

As stated in the official docs, torch.device('cuda:0') on its own targets only a single GPU. If you want to use multiple GPUs, you can run your operations in parallel by wrapping the model in DataParallel: model = nn.DataParallel(model)

You can read more in the PyTorch documentation on CUDA semantics and on DataParallel.
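
A minimal, self-contained sketch of that pattern (a toy linear layer stands in for the real model; the names are illustrative only):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)      # toy stand-in for the BERT classifier
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch along dim 0, runs a replica on every visible GPU,
    # and gathers the outputs back on device_ids[0] (cuda:0 by default).
    model = nn.DataParallel(model)

batch = torch.randn(8, 10).to(device)    # inputs only need to be on the primary device
output = model(batch)                    # shape: (8, 2)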


@rezasanatkar

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

@zjcerwin

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

Manually setting args.n_gpu = 1 works for me.
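
For context, that works because the example scripts only wrap the model in DataParallel when more than one GPU is counted. A self-contained sketch of the idea (a paraphrase of that logic, not the actual run_glue.py code; the toy model is illustrative):

import torch
import torch.nn as nn
from argparse import Namespace

args = Namespace(no_cuda=False)
args.device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 1  # workaround: force single-GPU mode instead of torch.cuda.device_count()

model = nn.Linear(10, 2).to(args.device)  # toy stand-in for the real model
if args.n_gpu > 1:
    model = nn.DataParallel(model)        # never reached with n_gpu = 1, so everything stays on cuda:0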

@ghost

ghost commented Feb 3, 2020

Just a simple os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU_NUM' at the beginning of the script should work.
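
A small, self-contained version of that suggestion ('GPU_NUM' is a placeholder, e.g. '0'; the variable has to be set before anything initializes CUDA):

import os

# Must be set before CUDA is initialized; "0" is a placeholder for whichever GPU you want visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # reports 1, so code that keys off the device count skips DataParallel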

@yuvalkirstain
Contributor

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

Manually setting args.n_gpu = 1 works for me.

But then you are not able to use more than one GPU, right?

@ksboy

ksboy commented Mar 31, 2020

@TheEdoardo93 I tested run_glue.py on a multi-GPU machine. Even after changing "cuda" to "cuda:0" in this line

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")

the training job still uses both GPUs via torch.nn.DataParallel. In other words, even if you define torch.device as cuda:0, DataParallel will still use all available GPUs.

I have tried it, but it doesn't work for me.

@stale

stale bot commented May 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
