[BUG] Unable to run multi-node #305

Closed
iidsample opened this issue Apr 8, 2022 · 9 comments
Labels
bug (It's a bug / potential bug and need verification), fea::user experience, P2 (Better to have)

Comments

@iidsample

Describe the bug
I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

To Reproduce
Steps to reproduce the behavior:

  1. Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
  2. Configure the build directory in run_multinode.sh
  3. bash run_multinode.sh

Expected behavior
Successful execution of the script.

Environment (please complete the following information):

@shijieliu
Collaborator

shijieliu commented Apr 8, 2022

Hi @iidsample, thanks for trying out HugeCTR!
Unfortunately, the multinode-training tutorial is currently out of date and will be removed in the next release. For now, we provide a Docker image in Merlin NGC that already supports multi-node training for HugeCTR. You can use a cluster job scheduler such as Slurm (srun) to launch the job on multiple nodes.
Thanks!

@iidsample
Author

Hi @shijieliu,

Thanks for your reply. Is there some way to launch without Slurm, for example directly on a set of nodes? It would be a great help if you could provide some direction or steps to do so. Thank you.

@shijieliu
Collaborator

The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be:

  1. Install and configure MPI on all of the nodes.
  2. Use the Docker image from Merlin NGC to launch a container on each node, then use MPI inside the container to launch training.
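
As a rough illustration of the launch step, here is a minimal sketch of how the `mpirun` command line could be assembled, assuming Open MPI and one MPI process per node. The host names `node1` and `node2` and the helper `build_mpirun_command` are hypothetical and only for illustration; the training script name is the one mentioned later in this thread. It does not cover SSH or container networking setup between the nodes.

```python
import shlex


def build_mpirun_command(hosts, script, extra_args=()):
    """Return an Open MPI command line that starts one process per host.

    `hosts` is a list of host names reachable from the launching node
    (hypothetical values in the example below).
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if len(set(hosts)) != len(hosts):
        raise ValueError("list each host once (one MPI rank per node)")
    return [
        "mpirun",
        "-np", str(len(hosts)),   # total number of MPI ranks = number of nodes
        "-H", ",".join(hosts),    # comma-separated list of hosts to launch on
        *extra_args,              # any additional mpirun options you need
        "python3", script,
    ]


if __name__ == "__main__":
    # Hypothetical two-node setup; replace the host names with your own.
    cmd = build_mpirun_command(["node1", "node2"], "dcn_2node_8gpu.py")
    print(shlex.join(cmd))
```

Printing the command first, rather than executing it, makes it easy to check the rank count and host list before starting a real job.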

@iidsample
Author

Hi @shijieliu,

Thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training, that is, running mpirun from within the container. Are you aware of a resource or guide on running MPI from within a container?

Thank you so much for your help.

@iidsample
Author

Hi,

I have been trying to run HugeCTR in distributed mode. When I run mpirun with dcn_2node_8gpu.py, I get the following error:
Runtime error: Error: the MPI total rank doesn't match the node count

I have made sure that the number of GPUs passed in the vvgpu parameter is correct.

It would be great if you could help me with this.
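
Read literally, the error message suggests that the number of MPI ranks has to equal the number of nodes described by `vvgpu` (one inner GPU list per node), so the GPU count alone may not be the issue. The following is a minimal sketch of that check, assuming Open MPI, which sets the `OMPI_COMM_WORLD_SIZE` environment variable for launched processes. The helper `check_rank_count` and the `vvgpu` value are hypothetical and only for illustration; this is not HugeCTR's own validation code.

```python
import os


def check_rank_count(vvgpu, world_size):
    """Check that there is one MPI rank per node listed in vvgpu.

    vvgpu is a list of lists: one inner list of GPU ids per node.
    Returns the total number of GPUs if the counts agree.
    """
    num_nodes = len(vvgpu)
    if num_nodes != world_size:
        raise ValueError(
            f"vvgpu describes {num_nodes} node(s) but MPI started "
            f"{world_size} rank(s); launch one rank per node"
        )
    return sum(len(gpus) for gpus in vvgpu)


if __name__ == "__main__":
    # Illustrative layout: 2 nodes with 4 GPUs each (an assumption).
    vvgpu = [[0, 1, 2, 3], [4, 5, 6, 7]]
    # Open MPI exports the world size; default to 1 when run without mpirun.
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    try:
        total = check_rank_count(vvgpu, world_size)
        print(f"OK: {world_size} rank(s), {total} GPU(s)")
    except ValueError as err:
        print(f"Mismatch: {err}")
```

Under this reading, a two-node `vvgpu` would need `mpirun -np 2` with one process placed on each node; starting one rank per GPU or a single rank overall would trigger the mismatch.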

@shijieliu
Collaborator

Hi @iidsample,

Could you provide a more detailed log and the scripts you used? Thanks!

@zehuanw
Collaborator

zehuanw commented May 2, 2022

Hi @iidsample, we are wondering whether you have solved the problem. Thanks!

@minseokl added the bug (It's a bug / potential bug and need verification), P2 (Better to have), and fea::user experience labels on May 17, 2022

@kanghui0204
Collaborator

Hi @iidsample, we have now updated the multi-node tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training). You can use the script in the tutorial to submit a multi-node job with MPI. Please check whether this update works for you.

@kanghui0204
Collaborator

Hi @iidsample, this issue has been open for a long time, so we are closing it now. If you have further questions, you can reopen this issue and comment.
