Hi, I got the error `RuntimeError: Expected all tensors to be on the same device` when running this command: `python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b`. Could you help me fix it? Thanks.

I'm using:
- 4x RTX 3090
- AMD Ryzen Threadripper PRO 3955WX (16 cores)
- `pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel`
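For reference, this is the quick device sanity check I can run inside the same container (a minimal sketch, nothing specific to this repo) to show how PyTorch sees the GPUs:

```python
# Quick device sanity check (minimal sketch, not specific to Open-Assistant):
# prints the PyTorch/CUDA versions and every GPU the process can see,
# to confirm all four 3090s are visible inside the container.
import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Full log of the failing run below: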
root@C.6468167:/workspace/OA/model/model_training$ python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b
[2023-06-28 01:43:44,710] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//127.0.0.1'), PosixPath('8080/jm/4/31574'), PosixPath('http')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
trainig_conf = Namespace(rng_seed=2703368087, is_reward_model=True, pooling='last', learning_rate='8e-6', gradient_checkpointing=False, gradient_accumulation_steps=4, per_device_train_batch_size=1, per_device_eval_batch_size=5, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon='1e-12', weight_decay=0.0, warmup_steps=50, eval_steps=500, save_steps=1000, save_strategy='steps', max_length=2048, num_train_epochs=2, logging_steps=10, max_grad_norm=2.0, save_total_limit=4, dtype='float32', eval_accumulation_steps=None, freeze_layer=None, cache_dir='.cache', loss_fn='RMLoss', score_l2_reg=0.001, eval_size=None, log_dir='base', quantization=False, seq2seqmodel=False, fuse_gelu=True, log_wandb=True, verbose=False, output_dir='.saved_models_rm', use_custom_sampler=True, residual_dropout=0.01, use_flash_attention=True, sort_by_length=False, per_digit_tokens=False, datasets_extra=[], metrics=['accuracy', 'kendalltau'], deepspeed_config='configs/zero_config.json', max_replies=5, residual_dropout_lima=False, datasets=[{'webgpt': {'val_split': 0.05, 'max_val_set': 1000}}], model_name='andreaskoepf/pythia-1.4b-gpt4all-pretrain', use_system_tag=False, system_property_dropout=0.5, system_add_length=False, wandb_entity='open-assistant', local_rank=-1, deepspeed=False, resume_from_checkpoint=False, show_dataset_stats=False, world_size=1)
RNG seed: 2703368087
You are using a model of type gpt_neox to instantiate a model of type gpt_neox_reward_model. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at andreaskoepf/pythia-1.4b-gpt4all-pretrain were not used when initializing GPTNeoXRewardModel: ['embed_out.weight']
- This IS expected if you are initializing GPTNeoXRewardModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPTNeoXRewardModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPTNeoXRewardModel were not initialized from the model checkpoint at andreaskoepf/pythia-1.4b-gpt4all-pretrain and are newly initialized: ['out_proj.bias', 'out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Number of trainable parameters: 1311M
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 309.25it/s]
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.4
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
WARNING:root:Custom sampler found!
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
0%| | 0/2114 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2371: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
Traceback (most recent call last):
File "/workspace/OA/model/model_training/trainer_rm.py", line 334, in <module>
main()
File "/workspace/OA/model/model_training/trainer_rm.py", line 328, in main
trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1639, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1906, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2652, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/OA/model/model_training/trainer_rm.py", line 50, in compute_loss
logits = model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/reward_model.py", line 63, in forward
outputs = self.gpt_neox(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
outputs = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
attention_layer_outputs = self.attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/patching.py", line 36, in _patched_attn_forward
out = module.old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
qkv = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/OA/model/model_training/trainer_rm.py:334 in <module> │
│ │
│ 331 │
│ 332 │
│ 333 if __name__ == "__main__": │
│ ❱ 334 │ main() │
│ 335 │
│ │
│ /workspace/OA/model/model_training/trainer_rm.py:328 in main │
│ │
│ 325 │ │ tokenizer=tokenizer, │
│ 326 │ │ compute_metrics=compute_metrics, │
│ 327 │ ) │
│ ❱ 328 │ trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint) │
│ 329 │ trainer.save_model() │
│ 330 │ tokenizer.save_pretrained(output_dir) │
│ 331 │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1639 in train │
│ │
│ 1636 │ │ inner_training_loop = find_executable_batch_size( │
│ 1637 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1638 │ │ ) │
│ ❱ 1639 │ │ return inner_training_loop( │
│ 1640 │ │ │ args=args, │
│ 1641 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1642 │ │ │ trial=trial, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1906 in _inner_training_loop │
│ │
│ 1903 │ │ │ │ │ with model.no_sync(): │
│ 1904 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1905 │ │ │ │ else: │
│ ❱ 1906 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1907 │ │ │ │ │
│ 1908 │ │ │ │ if ( │
│ 1909 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2652 in training_step │
│ │
│ 2649 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │
│ 2650 │ │ │
│ 2651 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2652 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2653 │ │ │
│ 2654 │ │ if self.args.n_gpu > 1: │
│ 2655 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │
│ │
│ /workspace/OA/model/model_training/trainer_rm.py:50 in compute_loss │
│ │
│ 47 │ def compute_loss(self, model, inputs, return_logits=False): │
│ 48 │ │ batch, cu_lens = inputs │
│ 49 │ │ │
│ ❱ 50 │ │ logits = model( │
│ 51 │ │ │ input_ids=batch["input_ids"], │
│ 52 │ │ │ attention_mask=batch["attention_mask"], │
│ 53 │ │ ).logits │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:171 in forward │
│ │
│ 168 │ │ │ if len(self.device_ids) == 1: │
│ 169 │ │ │ │ return self.module(*inputs[0], **kwargs[0]) │
│ 170 │ │ │ replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) │
│ ❱ 171 │ │ │ outputs = self.parallel_apply(replicas, inputs, kwargs) │
│ 172 │ │ │ return self.gather(outputs, self.output_device) │
│ 173 │ │
│ 174 │ def replicate(self, module, device_ids): │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:181 in parallel_apply │
│ │
│ 178 │ │ return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim) │
│ 179 │ │
│ 180 │ def parallel_apply(self, replicas, inputs, kwargs): │
│ ❱ 181 │ │ return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) │
│ 182 │ │
│ 183 │ def gather(self, outputs, output_device): │
│ 184 │ │ return gather(outputs, output_device, dim=self.dim) │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:89 in parallel_apply │
│ │
│ 86 │ for i in range(len(inputs)): │
│ 87 │ │ output = results[i] │
│ 88 │ │ if isinstance(output, ExceptionWrapper): │
│ ❱ 89 │ │ │ output.reraise() │
│ 90 │ │ outputs.append(output) │
│ 91 │ return outputs │
│ 92 │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_utils.py:644 in reraise │
│ │
│ 641 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 642 │ │ │ # instantiate since we don't know how to │
│ 643 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 644 │ │ raise exception │
│ 645 │
│ 646 │
│ 647 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/reward_model.py", line 63, in forward
outputs = self.gpt_neox(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
outputs = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
attention_layer_outputs = self.attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/patching.py", line 36, in _patched_attn_forward
out = module.old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
qkv = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/OA/model/model_training/wandb/offline-run-20230628_014420-b8s8ohdb
wandb: Find logs at: ./wandb/offline-run-20230628_014420-b8s8ohdb/logs
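From the config dump above (`deepspeed=False`, `local_rank=-1`, `world_size=1`), it looks like the HF Trainer is wrapping the model in `torch.nn.DataParallel` across the 4 visible GPUs, and the replica on `cuda:1` ends up seeing a tensor that is still on `cuda:0`. As a workaround I was thinking of pinning the run to a single GPU before torch initializes; a minimal sketch of that idea (the wrapper file name `run_single_gpu.py` is just hypothetical) would be:

```python
# run_single_gpu.py -- hypothetical wrapper, just a sketch of the workaround idea:
# expose only one GPU to the process before torch/CUDA initialize, then run the
# unmodified trainer script with the same arguments.
import os
import runpy
import sys

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # hide GPUs 1-3 from this run
sys.argv = ["trainer_rm.py", "--configs", "defaults_rm", "oasst-rm-1-pythia-1.4b"]
runpy.run_path("trainer_rm.py", run_name="__main__")
```

Equivalently I could just prefix the original command with `CUDA_VISIBLE_DEVICES=0` in the shell. What I mainly want to understand is whether the intended multi-GPU path is supposed to go through the deepspeed config (`configs/zero_config.json`) instead of DataParallel, or whether this is a bug.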