-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Fix horovod build failure when mxnet is built from source #15213
Conversation
@PatricZhao @yuxihu @TaoLv Please help to review. |
Seems cmake is not happy with the change :( |
Unfortunately this is a hotfix that can unblock user from building horovod with mxnet from source. I agree we need a better solution in the long run to solve the dual building system issue and exporting header files. For 1.5 release, we can ask users to build using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Please change mxnet_version.h
to mkldnn_version.h
in the PR description.
@@ -0,0 +1 @@ | |||
../../3rdparty/mkldnn/include/mkldnn.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to link this file? We only miss mkldnn_version.h, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier it was linked to the entire 3rdparty/mkldnn/include folder. However it does not contain mkldnn_version.h there. In this change, I only link to required files instead of linking to the entire folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify if it works? I remember mkldnn.h may depend on other header files in the original folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I could verify it works fine in my local branch.
make -j$(nproc)
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
pip install --no-cache-dir horovod
Here is the files in my include:
(mxnet) ubuntu@ip-172-31-7-88:~/src/mxnet$ ls include/mkldnn/
mkldnn.h mkldnn_version.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Please retrigger the CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the including chain is: ndarray.h -> mkldnn.hpp -> mkldnn.h -> mkldnn_types.h and mkldnn_version.h.
I don't think re-trigger CI is going to pass, is there any way to fix CMake errors in CI? |
@TaoLv This is not working because there are two issues in CI: After some more thoughts, I think a more elegant fix may be modifying the make script in MKLDNN module such that it copies mkldnn_version.h to the include folder after compilation is complete? This way it solves problem for both cmake and make, and it does not cause any CI issue in MXNet. This problem is not limited to horovod. Any downstream project that needs to build mxnet from source will fail due to this. I think this is currently a blocker for MXNet release. @PatricZhao Your advice is highly appreciated. Thanks! |
My two cents are that we must not break make as well as cmake. Both are supported and should be kept. |
@apeforest I think we need add an install target to the Makefile and copy libraries and including files to the install destination. Downstream projects like horovod should find libraries and headers there, not in the source code of mxnet or mkldnn. @szha @marcoabreu
|
@TaoLv There is an issue with this approach: |
My two cents:
|
@TaoLv Thanks for your suggestion.
Please help to review again. |
@mxnet-label-bot add [pr-awaiting-review, Distributed] |
Thanks for the fix!! I have verified this works. Please trigger CI again.
horovod:
result:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that the horovod issue is fixed :)
@TaoLv please review the PR again
@@ -68,6 +68,9 @@ | |||
# Git submodules under different licenses | |||
'3rdparty', | |||
|
|||
# 3rdparty headerfiles under different licenses | |||
'include/mkldnn', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain? MKL-DNN is also under Apache-2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine mkldnn would be licensed to intel instead of ASF even though the license is apache 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license check would fail if this is not whitelisted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve now. But just for the record, I don't really like this solution but take it as a hot fix to make horovod happy and to unblock the 1.5.0 release. We need consider a better way to expose headers to downstream projects. Looking for headers from source code is definitely not a good choice and is not guaranteed - that's why horovod's build was broken when MKL-DNN is updated.
@@ -1 +0,0 @@ | |||
../3rdparty/mkldnn/include |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just change this to the install path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we do not need to export all the header files under the install path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's simpler that way. Otherwise next time if there's any change mkldnn header we need to go through this all over again.
I have created feature request to better address this #15255 Can we merge this PR if it all looks good? |
* fix horovod build when mxnet is built from source * copy header file in mkldnn * copy mkldnn_version header file to include * remove unnecessary line * fix the untracked file * whitelist mkldnn headerfile * exlude lint path
Description
Fix #15211
The problem was due to the missing mkldnn_version.h file under include/mkldnn.
The mkldnn_version.h is generated at compile time by MKLDNN library. It is linked to 3rdparty/mkldnn/build/include when mxnet is built using make, the default build system currently supported in MXNet.
An earlier PR #14899 fixed the issue when mxnet is installed by pip. This PR is to fix the build error when mxnet is built from source.
Thanks @TaoLv for rootcausing the problem and his suggestions.