
Running NCCL mpi test across multiple nodes #33

Closed
sharannarang opened this issue Jun 28, 2016 · 3 comments


@sharannarang

Hi,

I've built and run mpi_test successfully on a single node with 8 TitanX GPUs, launching it with srun. However, the test fails when run across 2 nodes with 8 TitanX GPUs per node. I use the following command line:

srun -N2 -n16 --gres=gpu:8 -p TitanXx8 build/test/mpi/mpi_test 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

The test fails with the following error:

WARN src/core.cu:225 failed to allocate 2101248 byte device buffer
WARN src/core.cu:596 rank 12 failed to allocate device buffer
WARN src/core.cu:683 rank 12 failed to allocate communicator
NCCL Init failed (10) 'cuda malloc failed'

Does NCCL run across multiple nodes?

@sjeaugey
Member

No, indeed, NCCL doesn't run across multiple nodes.

@sharannarang
Author

Are there any plans to add this support?

@sjeaugey
Member

sjeaugey commented Aug 4, 2017

Inter-node communication has been implemented in NCCL2, which is now available at https://developer.nvidia.com/nccl.
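With NCCL2, the usual multi-node pattern is to bootstrap NCCL with MPI: rank 0 creates a `ncclUniqueId`, broadcasts it, and every rank calls `ncclCommInitRank`. Below is a minimal sketch of that pattern (error checking omitted; the 8-GPUs-per-node device selection is an assumption matching the setup in this issue, not a general rule):

```cpp
// Minimal multi-node NCCL bootstrap via MPI (sketch; error checks omitted).
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Rank 0 creates the NCCL unique id; broadcast it to all ranks.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // One GPU per rank; assumes 8 GPUs per node as in this issue's setup.
  cudaSetDevice(rank % 8);

  // Every rank joins the same communicator identified by `id`.
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // ... collectives such as ncclAllReduce go here ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```

Launched the same way as the single-node case, e.g. `srun -N2 -n16 ...`, with one rank per GPU.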

@sjeaugey sjeaugey closed this as completed Aug 4, 2017
minsii added a commit to minsii/nccl that referenced this issue Nov 13, 2023
Summary:

When concurrent collective/p2p operations are issued via multiple NCCL communicators, the ctran mapper register/deregister/search paths can be called by multiple threads concurrently, so the global registration timer must be thread-safe.

This patch fixes the issue by guarding all accesses to the global variables with a mutex.

Differential Revision: D51083701
minsii added a commit to minsii/nccl that referenced this issue Nov 14, 2023
minsii added a commit to minsii/nccl that referenced this issue Nov 15, 2023
Summary:
Pull Request resolved: facebookresearch#33

Reviewed By: wesbland

Differential Revision: D51083701

fbshipit-source-id: ca0ba40484f9c871780fc99623e0c9d8224328e3