-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/nccl dso #5001
Feature/nccl dso #5001
Conversation
cmake/configure.cmake
Outdated
@@ -68,6 +68,25 @@ else() | |||
if(NOT CUDNN_FOUND) | |||
message(FATAL_ERROR "Paddle need cudnn to compile") | |||
endif() | |||
if (NOT NCCL_INCLUDE_DIR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for teaching me the logic here in our offline discussion @reyoung . I think we can shorten L71-L82 into
if (NOT NCCL_INCLUDE_DIR)
message(FATAL_ERROR "Paddle needs the NCCL header file")
endif()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cmake/configure.cmake
Outdated
message(FATAL_ERROR "Paddle need nccl header to compile") | ||
endif() | ||
endif() | ||
if (NOT WITH_DSO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we can marge the two conditions on L84 and L85 into one
if (NOT WITH_DSO AND NOT NCCL_LIBRARY)
message(FATAL_ERROR "Paddle needs the NCCL libraries with WITH_DSO=OFF")
endif()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if(NOT ${CMAKE_SYSTEM_PROCESSOR}) | ||
set(TARGET_ARCH ${CMAKE_SYSTEM_PROCESSOR}) | ||
endif() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for curiosity, could we automatically download NCCL and put the header and library files into the build/third_party
directory? We don't have to do this in this PR, but could just file an issue recording this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nccl
will be packaged to official CUDA docker image.
We can download nccl
header in our Paddle source code. But to keep it simple, we just assume the developers are installed CUDA and related libraries well.
NOTE @dzhwinter 's issue, https://gitlab.com/nvidia/cuda/issues/10#note_43614309
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we shouldn't. This PR just moves NCCL from build/third_party
to the docker built-in environment. Partly of this PR is cherry-picked from #4818, which automate download NCCL and build it locally. In #4818, we have to build it early than any other module, so we need to add NCCL as a dependency, which is not correct comparing with DSO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand your point -- some users might already have a special version of NCCL installed on their systems, so PaddlePaddle should try to use it.
@@ -157,6 +162,14 @@ void GetLapackDsoHandle(void** dso_handle) { | |||
#endif | |||
} | |||
|
|||
void GetNcclDsoHandle(void** dso_handle) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nccl => NCCL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -157,6 +162,14 @@ void GetLapackDsoHandle(void** dso_handle) { | |||
#endif | |||
} | |||
|
|||
void GetNcclDsoHandle(void** dso_handle) { | |||
#if defined(__APPLE__) || defined(__OSX__) | |||
GetDsoHandleFromSearchPath(FLAGS_nccl_dir, "libnccl.dylib", dso_handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some future PR, we should rename Dso into DSO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should use DSO for abbreviation dynamic shared object.
* @param **dso_handle dso handler | ||
* | ||
*/ | ||
void GetNcclDsoHandle(void** dso_handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but should we rename
GetNcclDsoHandle => dso::LoadNCCL
GetLapackDsoHandle => dso::LoadLAPACK
cherry-pick @dzhwinter 's commit to this PR, complete cmake file, and some unit tests.