all_reduce_test stop. #30

ClaireYang · 2016-06-17T15:30:18Z

I am using CentOS 7.0 and CUDA 7.5 on a server with 6 Tesla cards. Running ./all_reduce_test 10000000 from the single folder stops and gives no response.

My GPU topo is as below

CPU 0 -- GPU0
      -- GPU1
      -- GPU2
CPU 1 -- GPU3
      -- GPU4
      -- GPU5
Even with ./all_reduce_test 2 0 1, it still does not run.
Do I need to install MPI even if I only use the tests in the single folder? Are the single tests valid for a multi-CPU topology like the one above?
I checked ACSCtl and all flags are negative. I don't know what else to try.

sjeaugey · 2016-06-17T17:03:14Z

Could this be the same problem as in #19, i.e. you need to turn off ACS?

ClaireYang · 2016-06-18T00:11:02Z

I used lspci -vvv | grep ACSCtl to check, and ACSCtl is disabled on all devices. So I don't know what else to try.
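For anyone repeating this check, here is a minimal sketch of how the ACSCtl lines could be inspected per device rather than by eye. It assumes the usual `lspci -vvv` layout, where each device starts on an unindented line and an enabled flag is printed with a trailing `+`. The function name `devices_with_acs_enabled` and the `SAMPLE` text are hypothetical and only for illustration; they are not output from the server in this issue.

```python
import re
import subprocess


def devices_with_acs_enabled(lspci_output):
    """Map each device header to the ACSCtl flags that are enabled ('+')."""
    enabled = {}
    current = None
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device entry.
            current = line.strip()
        elif "ACSCtl:" in line and current is not None:
            # Flags look like "SrcValid+" (enabled) or "SrcValid-" (disabled).
            flags = re.findall(r"(\w+)\+", line.split("ACSCtl:", 1)[1])
            if flags:
                enabled[current] = flags
    return enabled


def read_lspci():
    """Return `lspci -vvv` output; capabilities are usually only shown to root."""
    return subprocess.run(
        ["lspci", "-vvv"], capture_output=True, text=True, check=True
    ).stdout


# Made-up sample for illustration only.
SAMPLE = """\
00:02.0 PCI bridge: Example bridge A
    Capabilities: [110 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
00:03.0 PCI bridge: Example bridge B
    Capabilities: [110 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""

for device, flags in devices_with_acs_enabled(SAMPLE).items():
    print(device, "->", ", ".join(flags))
```

On a real machine, `devices_with_acs_enabled(read_lspci())` returning an empty dict would match the "all negative" result described above. In that case ACS is probably not the cause of the hang.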

ClaireYang · 2016-08-04T04:32:42Z

A new driver, which has been posted on the NVIDIA website, fixes this issue.

ClaireYang closed this as completed Aug 4, 2016

himanshucodz55 mentioned this issue Jul 25, 2022

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms #708

Open

raninbowlalala mentioned this issue Jul 4, 2023

2 allreduce and a allgather hang in multi-node #899

Open

acphile mentioned this issue Sep 29, 2023

Question about ncclCommAbort stuck issue #1013

Open
