The program only works on 1 GPU when using nccl 2.0 #123

Closed
NanXiao opened this issue Dec 21, 2017 · 7 comments

Comments

@NanXiao
NanXiao commented Dec 21, 2017

Hi all,

I downloaded NCCL 2.0 and tried to build and run reduce_test.cu against it (after modifying test_utilities.h to adapt to NCCL 2.0):

$ nvcc -gencode=arch=compute_60,code=sm_60 -I/usr/local/nccl/include -o reduce_test reduce_test.cu /usr/local/nccl/lib/libnccl.so -lcudart -lrt -lcuda -lcurand -lnvToolsExt

The program only runs when I specify 1 GPU:

$ ./reduce_test 10000000 1
# Using devices
#   Rank  0 uses device  0 [0x3b] Tesla P100-PCIE-16GB

#                                                       out-of-place                    in-place
#      bytes             N    type      op  root    time  algbw  busbw      res     time  algbw  busbw      res
    10000000      10000000    char     sum    0    0.050  199.48  199.48    0e+00    0.006  1674.48  1674.48    0e+00
    10000000      10000000    char    prod    0    0.048  206.75  206.75    0e+00    0.005  1882.88  1882.88    0e+00
    10000000      10000000    char     max    0    0.049  205.09  205.09    0e+00    0.005  1836.55  1836.55    0e+00
    10000000      10000000    char     min    0    0.048  210.11  210.11    0e+00    0.005  1860.12  1860.12    0e+00
    10000000       2500000     int     sum    0    0.053  188.78  188.78    0e+00    0.005  1960.40  1960.40    0e+00
    10000000       2500000     int    prod    0    0.052  190.78  190.78    0e+00    0.005  1876.88  1876.88    0e+00
    10000000       2500000     int     max    0    0.053  188.99  188.99    0e+00    0.005  1867.76  1867.76    0e+00
    10000000       2500000     int     min    0    0.054  186.78  186.78    0e+00    0.006  1814.55  1814.55    0e+00
    10000000       2500000   float     sum    0    0.054  186.52  186.52    0e+00    0.005  1858.39  1858.39    0e+00
    10000000       2500000   float    prod    0    0.053  189.09  189.09    0e+00    0.005  1882.18  1882.18    0e+00
    10000000       2500000   float     max    0    0.054  186.87  186.87    0e+00    0.005  1986.89  1986.89    0e+00
    10000000       2500000   float     min    0    0.053  188.70  188.70    0e+00    0.005  1886.08  1886.08    0e+00
    10000000       1250000  double     sum    0    0.053  188.24  188.24    0e+00    0.005  1855.63  1855.63    0e+00
    10000000       1250000  double    prod    0    0.053  189.08  189.08    0e+00    0.006  1754.39  1754.39    0e+00
    10000000       1250000  double     max    0    0.052  190.65  190.65    0e+00    0.005  1960.40  1960.40    0e+00
    10000000       1250000  double     min    0    0.053  190.10  190.10    0e+00    0.005  1906.21  1906.21    0e+00
    10000000       1250000   int64     sum    0    0.052  190.50  190.50    0e+00    0.006  1786.35  1786.35    0e+00
    10000000       1250000   int64    prod    0    0.053  187.95  187.95    0e+00    0.006  1777.46  1777.46    0e+00
    10000000       1250000   int64     max    0    0.052  192.79  192.79    0e+00    0.006  1652.35  1652.35    0e+00
    10000000       1250000   int64     min    0    0.053  189.98  189.98    0e+00    0.006  1725.33  1725.33    0e+00
    10000000       1250000  uint64     sum    0    0.053  190.16  190.16    0e+00    0.005  1819.51  1819.51    0e+00
    10000000       1250000  uint64    prod    0    0.052  191.68  191.68    0e+00    0.005  1878.29  1878.29    0e+00
    10000000       1250000  uint64     max    0    0.052  191.09  191.09    0e+00    0.005  1884.66  1884.66    0e+00
    10000000       1250000  uint64     min    0    0.052  191.30  191.30    0e+00    0.005  1893.22  1893.22    0e+00

 Out of bounds values : 0 OK
 Avg bus bandwidth    : 1018.59

When I try to use all 4 GPUs, the program seems to hang:

$ ./reduce_test 10000000
# Using devices
#   Rank  0 uses device  0 [0x3b] Tesla P100-PCIE-16GB
#   Rank  1 uses device  1 [0x5e] Tesla P100-PCIE-16GB
#   Rank  2 uses device  2 [0xaf] Tesla P100-PCIE-16GB
#   Rank  3 uses device  3 [0xd8] Tesla P100-PCIE-16GB

#                                                       out-of-place                    in-place
#      bytes             N    type      op  root    time  algbw  busbw      res     time  algbw  busbw      res

Could anyone offer suggestions on this issue, or provide an example of using NCCL 2.0?

P.S. NCCL 1.0 works fine on my server.

@cliffwoolley
Collaborator

Please take a look at #19 and see if that might be related.

@sjeaugey
Member

Please make sure you recompile the NCCL tests with the correct nccl.h when switching from NCCL 1 to NCCL 2.

nccl.h changed between NCCL 1 and NCCL 2, and compiling the tests with an NCCL 1 nccl.h will cause a hang when running with NCCL 2.
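
One quick way to see which generation of header a given include path holds is to read the version macros straight out of the file. The sketch below assumes that an NCCL 2 nccl.h contains `#define NCCL_MAJOR` and `#define NCCL_MINOR` lines and that a header without them is not an NCCL 2 header. Both points are assumptions to verify against your own install, and `nccl_header_version` is a hypothetical helper name.

```shell
# Print "MAJOR.MINOR" for the nccl.h passed as $1, or "unknown" when the
# header has no "#define NCCL_MAJOR" / "#define NCCL_MINOR" lines.
# Assumption: a header without these macros is not an NCCL 2 header.
nccl_header_version() {
  awk '
    $1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
    $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
    END {
      if (major != "" && minor != "") print major "." minor
      else print "unknown"
    }
  ' "$1"
}

# Example, using the include path from the compile command in this thread:
#   nccl_header_version /usr/local/nccl/include/nccl.h
```

If this prints `unknown` for the header passed with `-I`, the tests are probably not being compiled against NCCL 2.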

@NanXiao
Author

NanXiao commented Dec 21, 2017

@cliffwoolley @sjeaugey Thanks for your response!

I downloaded NCCL 1, built and ran the tests, and everything works, so I think this shows the hardware is OK.

Then I copied reduce_test.cu and test_utilities.h into another folder and made sure they use the NCCL 2 header file and library.

Besides the header file, is there any other possible cause, such as a compile option?

Thanks!

@sjeaugey
Member

I've seen a nccl.h (from NCCL 1) in /usr/local/include take precedence over the one specified on the command line.

I added a printf to the NCCL tests to display the version you compiled against. Just to double check, can you update the NCCL tests, compile and run again?
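
To rule out a stray header like the one described above, it can help to list every nccl.h that sits in the directories the compiler might search. This is only a sketch: `list_nccl_headers` is a hypothetical helper, the directory list in the example is illustrative, and the function does not reproduce the compiler's real search order. It only shows whether more than one copy exists.

```shell
# Print the path of every nccl.h found directly inside the directories
# given as arguments, in the order the directories were given.
list_nccl_headers() {
  for lnh_dir in "$@"; do
    if [ -f "$lnh_dir/nccl.h" ]; then
      printf '%s\n' "$lnh_dir/nccl.h"
    fi
  done
}

# Example (illustrative directory list, adjust to your system):
#   list_nccl_headers /usr/local/nccl/include /usr/local/include /usr/include
```

More than one line of output means there are several candidate headers, and the one picked up at compile time may not be the one you intended.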

@NanXiao
Author

NanXiao commented Dec 22, 2017

@sjeaugey Sorry to bother you again.

> I've seen a nccl.h (from NCCL 1) in /usr/local/include take precedence over the one specified on the command line.

I haven't installed NCCL 1, so my /usr/local/include is empty.

> I added a printf to the NCCL tests to display the version you compiled against. Just to double check, can you update the NCCL tests, compile and run again?

Do you mean you updated the NCCL code? I don't see any change in this repository.

Thanks!

@sjeaugey
Member

Oh, OK, I see the problem now. You tried running the tests from NCCL 1 with NCCL 2. This is not supposed to work.

Please use the NCCL tests instead (https://github.com/nvidia/nccl-tests).
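
A minimal sketch of fetching and building nccl-tests against an NCCL 2 install follows. `build_nccl_tests` is a hypothetical wrapper, the default CUDA path is an assumption, and the `CUDA_HOME` / `NCCL_HOME` make variables and the `reduce_perf` flags in the comments are recalled from the nccl-tests README, so check that README for the current names.

```shell
# Clone nccl-tests into the current directory and build it.
# $1: NCCL install prefix (the directory holding include/nccl.h and lib/)
# $2: CUDA install prefix (optional; /usr/local/cuda is an assumed default)
build_nccl_tests() {
  bnt_nccl_home=$1
  bnt_cuda_home=${2:-/usr/local/cuda}
  git clone https://github.com/nvidia/nccl-tests || return 1
  # Build in a subshell so the caller's working directory is unchanged.
  ( cd nccl-tests && make CUDA_HOME="$bnt_cuda_home" NCCL_HOME="$bnt_nccl_home" )
}

# Example for the 4-GPU machine in this thread (paths from the original
# compile command; flags as recalled from the nccl-tests README):
#   build_nccl_tests /usr/local/nccl
#   LD_LIBRARY_PATH=/usr/local/nccl/lib ./nccl-tests/build/reduce_perf -b 8 -e 128M -f 2 -g 4
```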

@NanXiao
Author

NanXiao commented Dec 22, 2017

@sjeaugey Thanks very much! That was the problem. I didn't know about nccl-tests, so I tried to use the test files from NCCL 1 instead. Thanks again for your time and help!

@NanXiao NanXiao closed this as completed Dec 22, 2017