[BUG] Unable to run multi-node #305
Comments
Hi @iidsample, thanks for trying out HugeCTR!
Hi @shijieliu, thanks for your reply. Is there a way to launch without Slurm, for example directly on a set of nodes? It would be a great help if you could provide some direction or steps. Thank you.
The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be:
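As a rough illustration of the MPI-based launch mentioned above (not the maintainers' steps, which are not reproduced here), the sketch below builds an Open MPI command line that starts one process per node without Slurm. The host names `node1` and `node2` and the script name `train.py` are hypothetical placeholders, and the helper `build_mpirun_command` is not part of HugeCTR. It assumes Open MPI's `mpirun` with its `-np` and `-H` options and passwordless SSH between the nodes.

```python
import shlex


def build_mpirun_command(hosts, train_script):
    """Build an Open MPI launch command with one process per node.

    `hosts` and `train_script` are placeholders supplied by the caller;
    nothing here is specific to HugeCTR.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    return [
        "mpirun",
        "-np", str(len(hosts)),   # one MPI process per node (assumption)
        "-H", ",".join(hosts),    # comma-separated list of hosts
        "python3", train_script,
    ]


if __name__ == "__main__":
    # Hypothetical two-node setup; print the command instead of running it.
    cmd = build_mpirun_command(["node1", "node2"], "train.py")
    print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to inspect before running it by hand inside the container on the first node.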
Hi @shijieliu, thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training. Thank you so much for your help.
Hi, I have been trying to run HugeCTR in distributed mode. When I try to run I have made sure that the number of GPUs passed is correct in the vvgpu parameter. It would be great if you could help me with this.
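One way to double-check a `vvgpu` value before launching is a small sanity check like the one below. This is only a sketch: it assumes `vvgpu` is a list of lists with one inner list of local GPU ids per node, and the helper `check_vvgpu` and its parameters `num_nodes` and `gpus_per_node` are hypothetical, not part of HugeCTR.

```python
def check_vvgpu(vvgpu, num_nodes, gpus_per_node):
    """Return a list of problems found in a vvgpu layout.

    Assumes one inner list per node, holding local GPU ids in the
    range [0, gpus_per_node). An empty result means no problem was found.
    """
    problems = []
    if len(vvgpu) != num_nodes:
        problems.append(
            f"expected {num_nodes} inner lists (one per node), got {len(vvgpu)}"
        )
    for node_idx, gpu_ids in enumerate(vvgpu):
        if not gpu_ids:
            problems.append(f"node {node_idx}: no GPUs listed")
        if len(set(gpu_ids)) != len(gpu_ids):
            problems.append(f"node {node_idx}: duplicate GPU ids in {gpu_ids}")
        out_of_range = [g for g in gpu_ids if not 0 <= g < gpus_per_node]
        if out_of_range:
            problems.append(f"node {node_idx}: ids out of range {out_of_range}")
    return problems


if __name__ == "__main__":
    # Hypothetical layout: two nodes with two GPUs each.
    for problem in check_vvgpu([[0, 1], [0, 1]], num_nodes=2, gpus_per_node=2):
        print(problem)
```

Under these assumptions, the number of inner lists should also match the number of MPI processes started by the launcher.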
Hi @iidsample, could you provide a more detailed log and your scripts? Thanks!
Hi @iidsample, we are wondering whether you have solved the problem. Thanks!
Hi @iidsample, we have now updated the multi-node tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training). You can use the script in the tutorial to submit a multi-node task with MPI. Please check whether this update works for you.
Hi @iidsample, because this issue has been open for a long time, we will close it now. If you have another question, you can reopen this issue and comment.
Describe the bug
I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Successful execution of script.
Environment (please complete the following information):