[BUG] Unable to run multi-node #305

iidsample · 2022-04-08T02:08:37Z

Describe the bug
I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
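Some of the causes listed in this message can be checked on each node before calling mpirun. The following is a minimal sketch, not part of HugeCTR or Open MPI: the function name `local_preflight` is hypothetical, and it only covers the PATH and temporary-directory causes on the node where it runs. It assumes the Open MPI daemon binary is named `orted`, as the message implies.

```python
import os
import shutil
import tempfile


def local_preflight(tmpdir="/tmp"):
    """Check two causes from the ORTE message on the local node (hypothetical helper)."""
    report = {}
    # Cause 1: the Open MPI daemon binary must be discoverable through PATH.
    report["orted_on_path"] = shutil.which("orted") is not None
    # Cause 3: mpirun needs to write startup files under the temporary directory.
    try:
        with tempfile.NamedTemporaryFile(dir=tmpdir):
            report["tmpdir_writable"] = True
    except OSError:
        report["tmpdir_writable"] = False
    # LD_LIBRARY_PATH is reported so it can be compared by hand across nodes.
    report["ld_library_path"] = os.environ.get("LD_LIBRARY_PATH", "")
    return report


if __name__ == "__main__":
    for key, value in local_preflight().items():
        print(f"{key}: {value}")
```

Running the script inside the container on every node and comparing the results may narrow down which cause applies. It does not test SSH access or network routing between nodes.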

To Reproduce
Steps to reproduce the behavior:

1. Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
2. Configure the build directory in run_multinode.sh.
3. Run bash run_multinode.sh.

Expected behavior
Successful execution of script.

Environment (please complete the following information):

OS: Ubuntu 18.04
Graphic card: Nvidia P100
CUDA version: CUDA 11.2
Docker image: built from the Dockerfile provided here https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr

shijieliu · 2022-04-08T08:07:19Z

Hi @iidsample, thanks for trying out HugeCTR!
Unfortunately, the multinode-training tutorial is currently out of date and will be removed in the next release. For now, we provide a Docker image on Merlin NGC that already supports multi-node training for HugeCTR. You can use a cluster job scheduler such as Slurm's srun to launch the job on multiple nodes.
Thanks!

iidsample · 2022-04-08T10:06:11Z

Hi @shijieliu,

Thanks for your reply. Is there a way to launch without Slurm, just on a set of plain nodes? It would be a great help if you could provide some direction or steps for doing so. Thank you.

shijieliu · 2022-04-11T01:16:59Z

The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be:

1. Install and configure MPI on all nodes.
2. Use the Docker image from Merlin NGC to launch a container on each node, then use MPI inside the container to launch training.
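The launch step can be sketched as a small helper that assembles the mpirun command line. This is an illustration under assumptions, not the tutorial's script: it assumes Open MPI (whose `--host` option accepts `host:slots` entries and whose `-x` option forwards an environment variable), the helper name `build_mpirun_command`, the host names, and the script name are hypothetical, and it starts one MPI process per node because the HugeCTR error reported later in this thread compares the MPI rank count with the node count.

```python
import shlex


def build_mpirun_command(hosts, script, extra_env=("LD_LIBRARY_PATH",)):
    """Return an Open MPI argv that starts one process per host (hypothetical helper)."""
    if not hosts:
        raise ValueError("at least one host is required")
    # "host:1" gives each node a single slot, so each node receives one rank.
    host_spec = ",".join(f"{h}:1" for h in hosts)
    cmd = ["mpirun", "-np", str(len(hosts)), "--host", host_spec]
    for name in extra_env:
        # -x forwards the named environment variable to the launched ranks.
        cmd += ["-x", name]
    cmd += ["python3", script]
    return cmd


if __name__ == "__main__":
    # Placeholder host names and script name, for illustration only.
    argv = build_mpirun_command(["node1", "node2"], "train.py")
    print(shlex.join(argv))
```

The printed command would be run from inside the container on one node. It still requires that the containers can reach each other, for example over SSH, which this sketch does not set up.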

iidsample · 2022-04-11T01:58:52Z

Hi @shijieliu,

Thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training, that is, running mpirun from within the container. Are you aware of a resource or guide on running MPI from within a container?

Thank you so much for your help.

iidsample · 2022-04-18T14:37:50Z

Hi,

I have been trying to run HugeCTR in distributed mode. When I run
mpirun with dcn_2node_8gpu.py, I get the following error:
Runtime error: Error: the MPI total rank doesn't match the node count

I have made sure that the number of GPUs passed in the vvgpu parameter is correct.

It would be great if you could help me with this.
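The error text compares two numbers: the total number of MPI ranks and the node count. A minimal sketch of that comparison follows, assuming the node count is the number of inner lists in vvgpu and that the job is launched by Open MPI, which exports `OMPI_COMM_WORLD_SIZE` to each process. The function name `check_rank_count` and the GPU IDs in the example are hypothetical, and this is not HugeCTR's own check.

```python
import os


def check_rank_count(vvgpu, env=None):
    """Compare the MPI world size with the number of nodes described by vvgpu."""
    env = os.environ if env is None else env
    # Open MPI sets OMPI_COMM_WORLD_SIZE; default to 1 when run without mpirun.
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    # Assumption: each inner list of vvgpu describes the GPUs of one node.
    node_count = len(vvgpu)
    return world_size == node_count, world_size, node_count


if __name__ == "__main__":
    # Illustrative two-node layout; the GPU IDs are placeholders.
    ok, ranks, nodes = check_rank_count([[0, 1, 2, 3], [4, 5, 6, 7]])
    print(f"match={ok} mpi_ranks={ranks} nodes={nodes}")
```

Under these assumptions, a two-node vvgpu launched with `-np 8` (one rank per GPU) would report a mismatch, while `-np 2` (one rank per node) would match.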

shijieliu · 2022-04-19T06:38:43Z

Hi @iidsample

Could you provide a more detailed log and your scripts? Thanks!

zehuanw · 2022-05-02T02:04:28Z

Hi @iidsample, have you been able to solve the problem? Thanks!

kanghui0204 · 2022-08-19T06:07:45Z

Hi @iidsample, we have now updated the multinode tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training). You can use the script in the tutorial to submit a multi-node task with MPI. Please check whether this update works for you.

kanghui0204 · 2022-09-05T00:36:26Z

Hi @iidsample, because this issue has been open for a long time, we will close it now. If you have another question, you can reopen this issue and comment.

minseokl added the bug, P2, and fea::user experience labels on May 17, 2022

kanghui0204 closed this as completed Sep 5, 2022
