Multi-node fine-tuning getting RuntimeError: CUDA error: invalid device ordinal #90
Closed
Comments
lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue on Aug 3, 2023
@lchu-ibm Thank you for putting in the fix. Could you take a look at this as well?
@qiuosier Good catch! Yeah, I think that needs to be fixed as well.
lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue on Aug 3, 2023
I can confirm I am able to run end-to-end with these two fixes on ranks. cc @HamidShojanazeri
System Info
PyTorch version: 2.0.1
CUDA used to build PyTorch: 11.7
GCC version: (Anaconda gcc) 11.2.0
Libc version: glibc-2.17
Python platform: Linux-5.4.17-2136.319.1.3.el7uek.x86_64-x86_64-with-glibc2.17
Python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0] (64-bit runtime)
CUDA_MODULE_LOADING set to: LAZY
CUDA runtime version: 11.7.99
Is CUDA available: True
Nvidia driver version: 510.108.03
Running on 3 nodes, each with 2 A10 GPUs.
Information
🐛 Describe the bug
This line

torch.cuda.set_device(rank)

should use local_rank instead of rank. Otherwise the rank is an "invalid device ordinal" on every node except the first one (the only node where local_rank == rank).

When running on a single node, the local rank is the same as the global rank. However, when running on multiple nodes, the global rank ranges from zero to the total number of GPUs across all nodes minus one, so torch.cuda.set_device(rank) hits this error whenever the rank is greater than or equal to the number of GPUs on that particular node.

Also, in this line, evaluation() should be called with local_rank when it is available.

The error for multi-node training goes away after I made the above changes.
The command I used to start the fine tuning:
Error logs
Expected behavior
There should be no "invalid device ordinal" error.