-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Enable MKL-DNN FullyConnected backward #17318
Conversation
This is ready for review now. I also changed the cpp test to address the FullyConnectedOp failure reported before. The main reason is the tensor shape is relative large and the max value in the tensor is 50. It will lead to float number accumulation error. @rongzha1 @ciyongch @pengzhao-intel |
@@ -330,7 +330,7 @@ inline void PrintVerifyMsg(const NDArrayAttrs &arr1, const NDArrayAttrs &arr2) { | |||
*/ | |||
inline std::vector<NDArrayAttrs> GetTestInputArrays( | |||
int types = ArrayTypes::All, bool rand = false, | |||
std::vector<float> scale = {1}, bool spatial_data_format = false) { | |||
std::vector<float> scale = {1}, bool spatial_data_format = false, int max = 50) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the new parameter for better usability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #17318 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, the failure bothered us a lot in the past:)
const TBlob &blob = arr->data(); | ||
mshadow::default_real_t *data = blob.dptr<mshadow::default_real_t>(); | ||
int size = blob.Size(); | ||
|
||
for (int i = 0; i < size; i++) | ||
if (is_rand) { | ||
data[i] = (std::rand() % 100) - 50; | ||
data[i] = (std::rand() % (max * 2)) - max; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about change to data[i] = std::rand() * 1.0 / RAND_MAX - 0.5;
? As max = 1
will only generate two values: -1.0and 0.0 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I don't want to affect other test case which still use the default max=50 to generate integers in [50, 50). But For the FullyConnectedOp, I want to generate relative small numbers. With the given code, the elements will be -1 and 0. Any suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've no idea about why the range was set to [-50, 50) previously, and I can't figure out any specific reasons to use this range for the tests (any upper bound test?). It'll be great if you have any background for it.
But anyway, the tensors with only two values (-1 and 0, 50% are 0) might not be a good candidate for the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i will change to generate float numbers in [-max, max) rather than integer numbers. Previously I thought sparse (say 50% zeros) is also a way to avoid float number accumulation error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ciyongch Failed to do so. There are cases doing bit correction check. It seems we cannot pass the tests with these floating numbers.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17318/17/pipeline/
https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/include/test_mkldnn.h#L605
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With memcmp
to check the results, then the only choice is integer numbers. Is it reasonable to check the results by AssertEqual
within a small enough threshold like 1e-6, then we can keep the floating number with better distribution?
Or we can just increase max
to filling more different numbers other than only -1 and 0.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted the changes for floating numbers. Changing ``memcmpto
AssertEqual` is out of the scope of this PR, so I will keep it as is.
Or we can just increase max to filling more different numbers other than only -1 and 0.
I was thinking about including number 2 into the generated tensor but found that with the given shapes, there still has chance to get error. That means for the worst case, the intermediate accumulation value will be > 2^24, so the 1 will be ignored when accumulating another 1 to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's keep it as is for now.
} else { | ||
data[i] = i % 100 - 50; | ||
data[i] = i % (max * 2) - max; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, how about change to something like data[i] = i * 2.0 / size - 1.0
to generate [-1.0, 1.0)?
This reverts commit 56d873f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* fix mkldnn fc bwd bug due to data inplace * enable mkldnn fc bwd * fix cpp tests * try: fix random seed * fix cpp test * loose rtol for fc cpp test * improve error message * limit max value for mkldnn tensors * limit the max value of test tensors * fix lint * remove fixed random seed * address review comments * Revert "address review comments" This reverts commit 56d873f. Co-authored-by: rongzha1 <rong.a.zhang@intel.com>
* fix mkldnn fc bwd bug due to data inplace * enable mkldnn fc bwd * fix cpp tests * try: fix random seed * fix cpp test * loose rtol for fc cpp test * improve error message * limit max value for mkldnn tensors * limit the max value of test tensors * fix lint * remove fixed random seed * address review comments * Revert "address review comments" This reverts commit 56d873f. Co-authored-by: rongzha1 <rong.a.zhang@intel.com>
Description
Discussed with @rongzha1 offline. This PR incorporates the code changes in #16890 which was reverted previously.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments