Multi-node fine-tuning getting RuntimeError: CUDA error: invalid device ordinal #90
Closed
Comments
lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue on Aug 3, 2023
@lchu-ibm Thank you for putting in the fix. Could you take a look at this as well?
@qiuosier Good catch! Yeah, I think that needs to be fixed as well.
lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue on Aug 3, 2023
I can confirm I am able to run end-to-end with these two fixes on ranks. cc @HamidShojanazeri
System Info
PyTorch version: 2.0.1
CUDA used to build PyTorch: 11.7
GCC version: (Anaconda gcc) 11.2.0
Libc version: glibc-2.17
Python platform: Linux-5.4.17-2136.319.1.3.el7uek.x86_64-x86_64-with-glibc2.17
Python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0] (64-bit runtime)
CUDA_MODULE_LOADING set to: LAZY
CUDA runtime version: 11.7.99
Is CUDA available: True
Nvidia driver version: 510.108.03
Running on 3 nodes, each with 2 A10 GPUs.
Information
🐛 Describe the bug
This line

torch.cuda.set_device(rank)

should use local_rank instead of rank. Otherwise the rank is an "invalid device ordinal" on every node except the first one (the only node where local_rank == rank).

When running on a single node, the local rank is the same as the global rank. However, when running on multiple nodes, the global rank ranges from zero to the total number of GPUs across all nodes minus one, so torch.cuda.set_device(rank) hits this error whenever the rank is greater than or equal to the number of GPUs on that particular node.

Also, in this line, evaluation() should be called with local_rank when it is available.

The error for multi-node training goes away after I made the above changes.
The command I used to start the fine tuning:
Error logs
Expected behavior
There should be no "invalid device ordinal" error.