-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[BUGFIX][1.8.x] Temporary fix for RNN with oneDNN seg faults/core dumps #19308
Conversation
Hey @bgawrych , Thanks for submitting the PR
CI supported jobs: [centos-gpu, centos-cpu, windows-cpu, sanity, website, windows-gpu, clang, unix-gpu, unix-cpu, miscellaneous, edge] Note: |
@mxnet-bot run ci [windows-cpu, unix-cpu] |
Jenkins CI successfully triggered : [windows-cpu, unix-cpu] |
@mxnet-bot run ci [windows-cpu, unix-gpu] |
Jenkins CI successfully triggered : [windows-cpu, unix-gpu] |
@mxnet-bot run ci [unix-gpu] |
Jenkins CI successfully triggered : [unix-gpu] |
Hey @szha @samskalicky Do you know what's wrong with unix-gpu CI on 1.8 branch ?
|
@mxnet-bot run ci [unix-gpu] |
Jenkins CI successfully triggered : [unix-gpu] |
LGTM |
@ciyongch @zixuanweeei please help take a review. |
@anko-intel what's the timeline of 1.8? Is it possible to release a new patch version for 1.7? |
According to https://cwiki.apache.org/confluence/display/MXNET/1.8.0+Release+Plan+and+Status November the 1st will be the date of the official release. I think we are unable to prepare 1.7 patch version earlier. |
src/operator/nn/mkldnn/mkldnn_rnn.cc
Outdated
@@ -47,6 +47,14 @@ inline int GetRnnGatesNum(int mode) { | |||
} | |||
} | |||
|
|||
// Bug in oneDNN >= 1.6 in memory descriptor comparision operators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? oneDNN <= 1.6
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, done
src/operator/nn/mkldnn/mkldnn_rnn.cc
Outdated
// for specific dims and strides in descriptors == operator can return `true` | ||
// but get_size() function will return different size | ||
// TODO(bgawrych): Remove with oneDNN 1.7 upgrade | ||
bool CheckMemDescEquality(const mkldnn::memory::desc &left, const mkldnn::memory::desc &right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to make this function as static inline
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done :)
@ciyongch @pengzhao-intel @samskalicky Can we merge this change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and merging now.
Please file an issue to track the oneDNN upgrade and remove the temp solution. @bgawrych |
Thanks @bgawrych, can you also backport this PR to the v1.x branch so it stays in sync? |
Description
This fix is workaround for problem with oneDNN memory descriptors comparison (size calculations) which causes segmentation faults in operator destructor - reported issue #19022.
Fix for this issue will be delivered with oneDNN 1.7 - however we would like to fix this issue in 1.8 version as well
Checklist
Essentials
Changes