[v1.x] [MKLDNN] 2 Conv Tests Failed with MKLDNN + ACL #20265
Comments
Here is the source code of the first test case:
@with_seed()
def test_convolution_grouping():
    for dim in [1, 2, 3]:
        num_filter = 4
        for num_group in [1, 2]:
            kernel = (3,) * dim
            shape = (1, 4) + (9,) * dim
The test will only fail when
So the test case will check if the Conv op will produce the same output and gradients (for x, w, b) when
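As a rough illustration of the kind of equivalence a grouping check compares, here is a minimal NumPy sketch (not the MXNet test itself). It assumes a naive 1-D "valid" convolution; the helper names `conv1d`, `grouped_conv1d` and `block_diagonal_weight` are made up for this example. A grouped convolution computed by splitting channels and concatenating should match a single dense convolution whose weight is block-diagonal across groups.

```python
import numpy as np

def conv1d(x, w):
    """Naive valid cross-correlation. x: (N, C_in, L), w: (C_out, C_in, K)."""
    n, _, length = x.shape
    c_out, _, k = w.shape
    out = np.zeros((n, c_out, length - k + 1))
    for i in range(length - k + 1):
        # Contract over input channels and kernel taps for each output position.
        out[:, :, i] = np.einsum("nck,ock->no", x[:, :, i:i + k], w)
    return out

def grouped_conv1d(x, w, num_group):
    """Split input channels and filters into groups, convolve, concatenate."""
    x_groups = np.split(x, num_group, axis=1)
    w_groups = np.split(w, num_group, axis=0)
    outs = [conv1d(xg, wg) for xg, wg in zip(x_groups, w_groups)]
    return np.concatenate(outs, axis=1)

def block_diagonal_weight(w, num_group):
    """Expand a grouped weight (C_out, C_in / G, K) to a dense (C_out, C_in, K)."""
    c_out, c_in_g, k = w.shape
    out_g = c_out // num_group
    full = np.zeros((c_out, c_in_g * num_group, k))
    for g in range(num_group):
        rows = slice(g * out_g, (g + 1) * out_g)
        cols = slice(g * c_in_g, (g + 1) * c_in_g)
        full[rows, cols, :] = w[rows]
    return full

rng = np.random.default_rng(0)
num_group = 2
x = rng.standard_normal((1, 4, 9))              # same (1, 4) + (9,) shape as the 1-D case
w = rng.standard_normal((4, 4 // num_group, 3))  # num_filter = 4, kernel = 3

grouped = grouped_conv1d(x, w, num_group)
dense = conv1d(x, block_diagonal_weight(w, num_group))
outputs_match = np.allclose(grouped, dense, rtol=1e-3, atol=1e-3)
```

If a backend returned different values for the two paths, `outputs_match` would be False, which is the kind of mismatch a grouping test is designed to surface.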
command to run test:
|
Here is the source code of the second test case: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L1750-L1829
@with_seed()
def test_convolution_independent_gradients():
    ctx = mx.cpu()
    atol = 1.0e-3
    rtol = 1.0e-3
    reqs = ["null", "write", "add"]
    var_names = ["x", "w", "b"]
    dims = [2]
    num_bases = [1, 8]
    kernel_xs = [3, 5]
    stride_xs = [1, 2]
    pad_xs = [0, 1]
    in_sizes = [7, 32]
    no_biases = [True, False]
This test will only fail when
failure:
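The test name and the `reqs` list suggest the check is that the gradient of one variable does not depend on the gradient request chosen for the others. Below is a minimal NumPy sketch of that idea, assuming a toy op `y = x @ w + b` as a stand-in for Conv. The helpers `backward` and `run_with_reqs` are hypothetical and are not MXNet APIs.

```python
import itertools
import numpy as np

REQS = ["null", "write", "add"]
VAR_NAMES = ["x", "w", "b"]

def backward(x, w, out_grad):
    """Analytic gradients of y = x @ w + b (toy stand-in for the Conv op)."""
    return {"x": out_grad @ w.T, "w": x.T @ out_grad, "b": out_grad.sum(axis=0)}

def run_with_reqs(x, w, out_grad, reqs, init):
    """Apply each variable's gradient request to its pre-filled gradient buffer."""
    grads = backward(x, w, out_grad)
    result = {}
    for name in VAR_NAMES:
        if reqs[name] == "null":      # buffer is left untouched
            result[name] = init[name].copy()
        elif reqs[name] == "write":   # buffer is overwritten
            result[name] = grads[name].copy()
        else:                         # "add": gradient is accumulated
            result[name] = init[name] + grads[name]
    return result

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
w = rng.standard_normal((3, 2))
out_grad = rng.standard_normal((5, 2))
init = {"x": np.ones_like(x), "w": np.ones_like(w), "b": np.ones(2)}

# Reference: every variable uses "write".
reference = run_with_reqs(x, w, out_grad, dict.fromkeys(VAR_NAMES, "write"), init)

mismatches = []
for combo in itertools.product(REQS, repeat=len(VAR_NAMES)):
    reqs = dict(zip(VAR_NAMES, combo))
    got = run_with_reqs(x, w, out_grad, reqs, init)
    for name in VAR_NAMES:
        # A variable requested with "write" must match the reference no matter
        # what the other variables requested.
        if reqs[name] == "write" and not np.allclose(
            got[name], reference[name], rtol=1e-3, atol=1e-3
        ):
            mismatches.append((combo, name))
```

In this toy version `mismatches` stays empty by construction. With a real backend, a non-empty list would point at a gradient being affected by another variable's request.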
command to run test:
|
@szha @anko-intel @josephevans I have collected more info about the two test failures. I've also prepared a clean ec2 instance with this specific mxnet build so that others can quickly get access. Please let me know about anything to try or if I can add your secret to the instance. Help is much appreciated! |
Hi @Zha0q1, could you provide this container and explain how to build it? |
Hi @Zha0q1 , can you confirm that gemm:acl is called for the tests on grouped convolutions?
|
Hi @cfRod Thanks for taking a look at this! Yeah I will prepare a docker image for you today. I am not sure about which kernel it was using -- our observation was that (MXNet + OneDNN on aarch64) had no issues whatsoever, and (MXNet + OneDNN + ACL) would fail the two tests, so it was related to the ACL integration |
Hi @Zha0q1 ,
Cheers that would be great. |
Hi @cfRod thanks for your reply! Hmm that's weird, anyways I am creating a docker image now, will share it shortly |
@cfRod I was able to reproduce the test failure within a docker container. I uploaded the image to here
p.s. The second test would cause a crash for some reason, but this was not an issue in my non-docker Ubuntu environment. I think for now we can start with the first test.
|
@Zha0q1, Thanks I was able to pull and reproduce your issue. |
Hi @Zha0q1 , This is my WIP analysis so far (see verbose logs to support this)
We've seen a similar issue with primitive caching in TensorFlow and I am looking to confirm the hypothesis, possibly by modifying the test. |
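To make the primitive-caching hypothesis concrete, here is a toy Python sketch of how a cache keyed on too little information can hand back a stale primitive. It assumes, purely for illustration, a primitive that captures its weight when it is created. `ToyPrimitive`, `make_cached_op` and the key functions are hypothetical names, not oneDNN, ACL or MXNet code.

```python
class ToyPrimitive:
    """Stand-in for a backend primitive that captures its weight at creation."""

    def __init__(self, weight):
        self.weight = weight

    def execute(self, x):
        return x * self.weight


def make_cached_op(key_fn):
    """Build an op that looks primitives up in a cache keyed by key_fn."""
    cache = {}

    def run(x, weight, shape):
        key = key_fn(weight, shape)
        if key not in cache:
            cache[key] = ToyPrimitive(weight)
        # On a cache hit the primitive still holds the weight it was built with.
        return cache[key].execute(x)

    return run


def shape_only_key(weight, shape):
    """Key that ignores the weight, so different ops can collide."""
    return shape


def unique_key(weight, shape):
    """Key that also distinguishes the weight, so no collision occurs."""
    return (shape, weight)


run_colliding = make_cached_op(shape_only_key)
first = run_colliding(2.0, 3.0, (1, 4))   # builds a primitive with weight 3.0
stale = run_colliding(2.0, 5.0, (1, 4))   # cache hit: reuses weight 3.0

run_unique = make_cached_op(unique_key)
run_unique(2.0, 3.0, (1, 4))
fresh = run_unique(2.0, 5.0, (1, 4))      # new key, so a new primitive is built
```

Here `stale` is 6.0 rather than the expected 10.0, while `fresh` is 10.0. Whether the real failure follows exactly this pattern is the hypothesis under investigation, not something this sketch establishes.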
@cfRod, thanks for the detailed analysis, this makes a lot of sense! Just to understand it better, is this a OneDNN & ACL integration issue (rather than an MXNet & OneDNN or TensorFlow & OneDNN integration issue)? Another quick question: do you work with the OneDNN team or with Arm? I am trying to figure out whether we need to loop in more people to fix this : ) |
Hi @Zha0q1, in this case, it is an issue with oneDNN and ACL rather than a framework specific issue.
Yes. I am part of the team that is working on this. |
Thanks 👍 👍 |
Hi @cfRod I have been tracking your discussion on the other thread, thanks for working on this! Meanwhile were you able to reproduce the other test failure |
Would appreciate your input @cfRod :) |
Yes, this seems to be an issue which shows up with -D_GLIBCXX_ASSERTIONS turned on. Is your non-docker mxnet built with -DCMAKE_BUILD_TYPE=Release? |
Right I think the non-docker build is RELEASE. So you were able to reproduce |
Yes, I was able to reproduce the issue outside docker with a -DCMAKE_BUILD_TYPE=Debug build. |
Hi @Zha0q1, |
Yeah I think that's exactly right :) |
Related question raised here: #20457 |
This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL). This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list
Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
Thanks @cfRod for adding the workaround for |
Hi mseth10, I was able to reproduce the crash that occurs with the GLIBCXX assertions turned on, but I have not looked further into the second issue. |
…20482) (#20921) This fix is a workaround for the accuracy issue observed when MXNet is built with Compute Library (ACL). This change includes:
* Updating MXNet's AddSign function to generate unique hashes for MKLDNN-ACL backend.
* Adding DNNL_AARCH64_USE_ACL to CMakeLists.txt
* Adding Crefeda Rodrigues to the contributors list
Signed-off-by: cfRod <crefeda.rodrigues@arm.com>
Co-authored-by: Crefeda Rodrigues <65665931+cfRod@users.noreply.github.com>
I was trying to build MXNet 1.x with MKLDNN with ACL (Arm Compute Library) integration on an Arm instance. I used this cmake config file to integrate MKLDNN with ACL. The build was very performant and would surely benefit MXNet users hugely. I got a 3-4X boost with (16/64, 3, 512, 512) on Resnet compared to MKLDNN with no integration. However, two operator unit tests failed on this build:
I tried multiple MKLDNN versions (v1.x now points to MKLDNN release 2.0 beta 10; I also tried releases 2.1 and 2.2) and ACL versions (20.08, from Aug 2020, and 21.2, the latest), and the failures persisted.
This makes me suspect that there is an integration issue between MXNet and MKLDNN (more likely) or between MKLDNN and ACL. Could someone from the Intel team share some insights on this?
Would you help triage? @szha @leezu
CC @josephevans @mseth10 @waytrue17 @sandeep-krishnamurthy