[BUG][GFX1030] Random Memory access faults on gfx1030. #1613

Open
shurale-nkn opened this issue Jun 30, 2022 · 9 comments

Comments

@shurale-nkn
Contributor

shurale-nkn commented Jun 30, 2022

[Keywords]:
test; gfx1030;

[Description]:
Random Memory access faults on gfx1030.
Five different PRs each failed at a random stage, but always on gfx1030.

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/test-int8-mlir-nonxdlops/2/pipeline

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = ixt-sjc2-16

27/103 Test  #24: test_gru ..............................................   Passed   20.59 sec

[2022-06-26T20:30:09.726Z]         Start  26: test_handle_test

[2022-06-26T22:23:59.557Z]  28/103 Test  #12: test_conv2d ...........................................***Failed  8388.27 sec

[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 (Agent handle: 0xebf2f0) on address 0x7fb9990c8000. Reason: Page not present or supervisor privilege.

[2022-06-26T22:23:59.557Z] CMake Error at test_test_conv2d.cmake:7 (message):

[2022-06-26T22:23:59.557Z]   Test failed

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/16/pipeline

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = rocm-framework-19.amd.com

3/106 Test  #14: test_conv3d ............................................   Passed  238.85 sec

[2022-06-25T06:40:04.951Z]         Start  45: test_soft_max

[2022-06-25T06:41:12.731Z]   4/106 Test  #12: test_conv2d ............................................***Failed  307.92 sec

[2022-06-25T06:41:12.731Z] Memory access fault by GPU node-2 (Agent handle: 0x1227680) on address 0x7f1e5756a000. Reason: Page not present or supervisor privilege.

[2022-06-25T06:41:12.731Z] CMake Error at test_test_conv2d.cmake:7 (message):

[2022-06-25T06:41:12.731Z]   Test failed

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]         Start  69: test_conv_for_implicit_gemm

[2022-06-25T06:42:49.280Z]   5/106 Test  #28: test_immed_conv3d ......................................   Passed  401.59 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/17/pipeline/625

log info
Full Tests II / Fp32 OpenCL All gfx1030

NODE_NAME = ixt-sjc2-16

 
34/107 Test  #35: test_mdgraph ...........................................   Passed    0.45 sec

[2022-06-26T10:50:29.529Z]         Start  36: test_na_inference

[2022-06-26T10:50:31.821Z]  35/107 Test  #36: test_na_inference ......................................***Failed    1.99 sec

[2022-06-26T10:50:31.821Z] Memory access fault by GPU node-1 (Agent handle: 0x55e307726530) on address 0x7f3e4319e000. Reason: Page not present or supervisor privilege.

[2022-06-26T10:50:31.821Z] CMake Error at test_test_na_inference.cmake:7 (message):

[2022-06-26T10:50:31.821Z]   Test failed

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]         Start  37: test_na_train

[2022-06-26T10:52:28.950Z]  36/107 Test  #37: test_na_train ..........................................   Passed  110.72 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/64/pipeline/255

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = rocm-framework-19.amd.com


[2022-06-27T16:32:36.646Z]  61/107 Test  #97: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x4 ...   Passed  110.25 sec

[2022-06-27T16:32:36.646Z]         Start  99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8

[2022-06-27T16:32:44.922Z]  62/107 Test  #99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8 ...***Failed   12.83 sec


[2022-06-27T16:32:44.922Z] /home/jenkins/workspace/MLLibs_MIOpen_jd_ck_integration/build/bin/test_conv2d --half --cmode convfp16 --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 32 160 73 73 --weights 160 1 1 64 --batch_size 32 --input_channels 160 --output_channels 64 --spatial_dim_elements 73 73 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout CHWN --out_layout NCHW --output_type int32 --int8_vectorize 0 --vector_length 8 --tensor_vect 1

[2022-06-27T16:32:44.922Z] error: 0

[2022-06-27T16:32:44.922Z] Max diff: 0

[2022-06-27T16:32:44.922Z] Forward convolution: ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC

[2022-06-27T16:32:44.922Z] Input tensor: 32, 20, 73, 73

[2022-06-27T16:32:44.922Z] Weights tensor: 20, 1, 1, 64

[2022-06-27T16:32:44.922Z] Output tensor: 32, 8, 73, 73

[2022-06-27T16:32:44.922Z] Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},

[2022-06-27T16:32:44.922Z] Memory access fault by GPU node-2 (Agent handle: 0x1d63c00) on address 0x7fb315c7a000. Reason: Page not present or supervisor privilege.

[2022-06-27T16:32:44.922Z] Aborted (core dumped)

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/67/pipeline/310

log info
Full Tests II / Fp32 OpenCL All gfx1030

NODE_NAME = ixt-sjc2-16

57/104 Test #72: test_rnn_extra ........................................***Failed 27.72 sec

….

[2022-06-29T18:08:10.070Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32

[2022-06-29T18:08:10.070Z] error: 2.61185e-09

[2022-06-29T18:08:10.070Z] Max diff: 2.98023e-07

[2022-06-29T18:08:10.070Z] Mismatch at 3: 0.0993099 != 0.0993099

[2022-06-29T18:08:10.070Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m relu -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0

[2022-06-29T18:08:10.070Z] Backward Weights RNN vanilla:

[2022-06-29T18:08:10.070Z] Memory access fault by GPU node-1 (Agent handle: 0x559742012550) on address 0x7f7768be4000. Reason: Page not present or supervisor privilege.

[2022-06-29T18:08:10.070Z] Aborted (core dumped)

[2022-06-29T18:08:10.070Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed

[2022-06-29T18:08:10.070Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134

[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12913: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed

[2022-06-29T18:08:10.070Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2

[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12920: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed

[2022-06-29T18:08:10.071Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2

[2022-06-29T18:08:10.071Z] Makefile:2309: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed

[2022-06-29T18:08:10.071Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2

[2022-06-29T18:08:10.071Z]

[2022-06-29T18:08:10.071Z]         Start  73: test_gru_extra

[2022-06-29T18:09:03.794Z]  58/104 Test  #73: test_gru_extra ........................................   Passed   50.99 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/dfeng_int8_quantization_api/2/pipeline/1554

log info
NODE_NAME = rocm-framework-19.amd.com
 

[2022-06-28T20:36:58.595Z]  64/106 Test #101: test_conv_ck_igemm_fwd_v6r1_dlops_nchw .................***Failed   28.99 sec

[2022-06-28T20:36:58.595Z] [  2%] Built target sqlite_memvfs

[2022-06-28T20:36:58.595Z] [  2%] Built target addkernels

[2022-06-28T20:36:58.595Z] [100%] Built target MIOpen

[2022-06-28T20:36:58.595Z] [100%] Built target test_conv2d

[2022-06-28T20:36:58.595Z] Scanning dependencies of target test_conv_ck_igemm_fwd_v6r1_dlops_nchw

[2022-06-28T20:36:58.595Z] /home/jenkins/workspace/Open_dfeng_int8_quantization_api/build/bin/test_conv2d --half --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 128 1024 14 14 --weights 2048 1024 1 1 --batch_size 128 --input_channels 1024 --output_channels 2048 --spatial_dim_elements 14 14 --filter_dims 1 1 --pads_strides_dilations 0 0 2 2 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --tensor_vect 0 --vector_length 1

[2022-06-28T20:36:58.595Z] Memory access fault by GPU node-2 (Agent handle: 0x80e5a0) on address 0x7f8f694e8000. Reason: Page not present or supervisor privilege.

[2022-06-28T20:36:58.595Z] Aborted (core dumped)

[2022-06-28T20:36:58.595Z] test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw.dir/build.make:57: recipe for target 'test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw' failed

[2022-06-28T20:36:58.595Z] make[7]: *** [test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw] Error 134

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/7/pipeline/1640

log info
NODE_NAME = ixt-sjc2-22

[2022-06-28T10:19:22.347Z] 58/107 Test #72: test_gru_extra .........................................***Failed 13.69 sec
….
[2022-06-28T10:19:22.348Z] ../bin/test_gru --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 32 32 32 
[2022-06-28T10:19:22.349Z] error: 4.26209e-08
[2022-06-28T10:19:22.349Z] Max diff: 2.98023e-08
[2022-06-28T10:19:22.349Z] Mismatch at 1: -0.0144987 != -0.0144987
[2022-06-28T10:19:22.349Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m gru -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 0 -p 0
[2022-06-28T10:19:22.349Z] inputMode: 0 biasMode: 0 dirMode: 0
[2022-06-28T10:19:22.349Z] hz: 128 batch_n: 96 seqLength: 3 inputLen: 128 numLayers: 1
[2022-06-28T10:19:22.349Z] Forward Inference GRU: 
[2022-06-28T10:19:22.349Z] Output tensor output failed verification.
[2022-06-28T10:19:22.349Z] Memory access fault by GPU node-1 (Agent handle: 0x55a84333b040) on address 0x7f13174ea000. Reason: Page not present or supervisor privilege.
[2022-06-28T10:19:22.349Z] Aborted (core dumped)

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/9/pipeline/625

log info
NODE_NAME = ixt-sjc2-22

[2022-06-29T06:55:33.965Z] 57/107 Test #71: test_rnn_extra .........................................***Failed 73.52 sec

../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhx --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 1 --batch-seq 32 32 32 
[2022-06-29T06:55:33.972Z] error: 4.23637e-09
[2022-06-29T06:55:33.972Z] Max diff: 8.34465e-07
[2022-06-29T06:55:33.972Z] Mismatch at 4: 0.404517 != 0.404518
[2022-06-29T06:55:33.972Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m tanh -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0
[2022-06-29T06:55:33.972Z] Backward Weights RNN vanilla: 
[2022-06-29T06:55:33.972Z] Memory access fault by GPU node-1 (Agent handle: 0x560ccf50e640) on address 0x7fb3233b0000. Reason: Page not present or supervisor privilege.
[2022-06-29T06:55:33.972Z] Aborted (core dumped)
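
To check whether the faults cluster on a particular CI node, GPU, or test, the log excerpts above can be tallied with a short script. This is only a sketch: the function name `summarize_faults` and the `SAMPLE` constant are made up for illustration, the regular expressions are inferred from the log lines quoted in this issue, and it assumes each excerpt starts with its own `NODE_NAME = ...` line (an excerpt without one would inherit the previous node name).

```python
import re

# Patterns inferred from the log lines quoted in this issue (assumption:
# other CI logs use the same wording).
FAULT_RE = re.compile(
    r"Memory access fault by GPU (node-\d+) \(Agent handle: (0x[0-9a-fA-F]+)\) "
    r"on address (0x[0-9a-fA-F]+)\. Reason: (.+?)\s*$"
)
NODE_RE = re.compile(r"NODE_NAME\s*=\s*(\S+)")
FAILED_RE = re.compile(r"Test\s+#\d+:\s+(\S+)\s+\.+\*\*\*Failed")

# First excerpt from the issue, used here as sample input.
SAMPLE = """\
NODE_NAME = ixt-sjc2-16
[2022-06-26T22:23:59.557Z]  28/103 Test  #12: test_conv2d ...........................................***Failed  8388.27 sec
[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 (Agent handle: 0xebf2f0) on address 0x7fb9990c8000. Reason: Page not present or supervisor privilege.
"""


def summarize_faults(log_text):
    """Return one dict per 'Memory access fault' line found in log_text."""
    faults = []
    ci_node = None      # most recent NODE_NAME seen
    failed_test = None  # most recent '***Failed' test seen
    for line in log_text.splitlines():
        m = NODE_RE.search(line)
        if m:
            ci_node = m.group(1)
            failed_test = None  # a new excerpt starts here
            continue
        m = FAILED_RE.search(line)
        if m:
            failed_test = m.group(1)
            continue
        m = FAULT_RE.search(line)
        if m:
            address = int(m.group(3), 16)
            faults.append({
                "ci_node": ci_node,
                "test": failed_test,
                "gpu": m.group(1),
                "address": address,
                # True when the faulting address sits on a 4 KiB boundary.
                "page_aligned": address % 4096 == 0,
                "reason": m.group(4),
            })
    return faults
```

Calling `summarize_faults(SAMPLE)` returns a list with one entry per fault line; feeding it a full console log and counting the `ci_node` and `test` fields (for example with `collections.Counter`) would show whether specific machines or tests dominate.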
@shurale-nkn
Contributor Author

@atamazov FYI

@atamazov
Contributor

atamazov commented Jul 1, 2022

I'll try to look into this.

@shurale-nkn
Contributor Author

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/688/pipeline/1641

log info
Start  71: test_rnn_extra
[2022-07-05T02:22:03.245Z]  57/107 Test  #71: test_rnn_extra .........................................***Failed    1.04 sec
[2022-07-05T02:22:03.245Z] [  2%] Built target sqlite_memvfs
[2022-07-05T02:22:03.245Z] [  2%] Built target addkernels
[2022-07-05T02:22:03.245Z] [ 97%] Built target MIOpen
[2022-07-05T02:22:03.245Z] [100%] Built target test_rnn_vanilla
[2022-07-05T02:22:03.245Z] Scanning dependencies of target test_rnn_extra
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
[2022-07-05T02:22:03.245Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32
[2022-07-05T02:22:03.245Z] Memory access fault by GPU node-1 (Agent handle: 0x227a460) on address 0x7f5059ff2000. Reason: Page not present or supervisor privilege.
[2022-07-05T02:22:03.245Z] Aborted (core dumped)
[2022-07-05T02:22:03.245Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed
[2022-07-05T02:22:03.245Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12926: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed
[2022-07-05T02:22:03.245Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12933: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z] Makefile:2234: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z]
[2022-07-05T02:22:03.245Z]         Start  72: test_gru_extra
[2022-07-05T02:24:31.186Z]  58/107 Test  #72: test_gru_extra .........................................   Passed  138.00 sec
[2022-07-05T02:24:31.187Z]         Start  73: test_lstm_extra

@aska-0096
Collaborator

It looks like the issue has been solved, so I'm closing it.
Please feel free to re-open it if it is not resolved yet.

@shurale-nkn
Contributor Author

Not fixed!

@shurale-nkn shurale-nkn reopened this Aug 4, 2022
@aska-0096 aska-0096 unpinned this issue Aug 4, 2022
@aska-0096 aska-0096 pinned this issue Aug 4, 2022
@aska-0096
Collaborator

> Not fixed!

Sorry about that. I've pinned it back as well.

@junliume junliume unpinned this issue Aug 19, 2022
@tangerdream

So is there a way to solve this problem?

@atamazov
Contributor

@shurale-nkn Is it possible to reliably reproduce the issue?
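
One way to probe reproducibility is to rerun one of the failing commands from the logs in a loop on a gfx1030 machine and stop at the first run that prints the fault message. The helper below is a rough sketch, not an existing MIOpen tool: `run_until_fault` and `FAULT_MARKER` are hypothetical names, and the marker string is taken from the log lines in this issue.

```python
import subprocess

# Text that appears in every failing log quoted in this issue.
FAULT_MARKER = "Memory access fault by GPU"


def run_until_fault(cmd, max_runs=100):
    """Run cmd up to max_runs times.

    Returns (run_number, combined_output) for the first run whose output
    contains FAULT_MARKER, or None if no run faulted.
    """
    for run_number in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the fault message may go to stderr
            text=True,
        )
        if FAULT_MARKER in proc.stdout:
            return run_number, proc.stdout
    return None
```

For example, from the build's test directory (assuming the same relative path as in the CI logs), `run_until_fault(["../bin/test_rnn_vanilla", "--float", "--batch-size", "32", "--seq-len", "3", "--vector-len", "128", "--hidden-size", "128", "--num-layers", "1", "--no-hx", "--use-dropout", "0", "--in-mode", "0", "--bias-mode", "0", "--dir-mode", "0", "--rnn-mode", "0", "--batch-seq", "32", "32", "32"])` repeats the command from the develop/688 log. A `None` result after many runs would suggest the fault depends on something outside that single command, such as the machine or what ran before it.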

@ppanchad-amd

@shurale-nkn Is this fixed with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!
