SIGFPE (DIV/0) in ConvOclBwdWrW2::GetSolution() #70
Comments
@daniellowell As discussed, could you help try to reproduce the issue? The TF-specific log is this line:
Basically, it means grappler failed to properly execute one optimization pass.
|
@whchung FYI, this warning also appears on CUDA (NVIDIA) GPUs. |
@syoyo Is it possible to dump the detailed MIOpen log? Rerun the model with the environment variable MIOPEN_LOG_LEVEL=6 set. |
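One way to follow this request from Python is to launch the training script in a child process with MIOPEN_LOG_LEVEL set and capture its output to a file. This is only a minimal sketch: the helper names, the script name main.py, and the log file name are illustrative assumptions, not part of MIOpen or TensorFlow.

```python
import os
import subprocess
import sys


def miopen_logging_env(level=6, base_env=None):
    """Return a copy of the environment with MIOPEN_LOG_LEVEL set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MIOPEN_LOG_LEVEL"] = str(level)
    return env


def run_with_miopen_log(script="main.py", log_path="miopen.log", level=6):
    """Run `script` with MIOpen logging enabled, writing stdout and stderr
    to `log_path`. Both names are placeholders for illustration.
    """
    with open(log_path, "w") as log_file:
        proc = subprocess.run(
            [sys.executable, script],
            env=miopen_logging_env(level),
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    # On POSIX, a negative return code -N means the child was killed by signal N.
    return proc.returncode
```

On Linux, a return code equal to -signal.SIGFPE would indicate that the child process died from the floating point exception, and the captured file can then be attached to the issue.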
Here is a log with
Besides, I succeeded in building MIOpen from source, so I may try debugging with a Debug build of MIOpen or with ASan enabled to find the precise location where the segfault happens. |
A Debug build would probably help to find the exact place of the problem (there are some assertions which may fire, and gdb can show the exact file/line where the SIGFPE occurs). |
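Alongside a Debug build and gdb, the Python side of the crash can be narrowed down with the standard-library faulthandler module, which dumps the Python traceback when the process receives SIGFPE (among other fatal signals). The sketch below assumes it is called at the top of the training entry point; the helper name and the log file name are hypothetical. It only shows which Python call was active at the time of the crash, not the file/line inside MIOpen.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Enable faulthandler and return the file object that receives traces.

    faulthandler requires the file to stay open until it is disabled,
    so the caller should keep a reference to the returned object.
    """
    target = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return target


# Hypothetical usage at the top of the entry script:
# crash_log = enable_crash_traceback("crash_traceback.log")
```

The same handler can be enabled without editing the script by running the interpreter with the -X faulthandler option.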
Unfortunately, the log ends with
so it is missing the problem config information. A more detailed log can be obtained with both |
Please find the attached log with
It looks like workSpaceSize becomes an invalid value. |
Thanks. The config looks weird:
|
@syoyo Thanks for the bug report. The cause has been identified; please expect a fix soon. |
@atamazov Thanks! BTW, I found the place where the DIV/0 happens by building a Debug version of MIOpen
|
Here is the patch which solves this issue and #72. The fixes will be included in the next MIOpen release. |
@syoyo Please close the issue if the above resolves it. |
I've confirmed that the given patch solves this issue. No more DIV/0 segfault. |
Fix has been included in 1.7.1: a478ac8 |
Ubuntu 18.04
ROCm 2.0
VEGA56
python 3.6(conda) + ROCm TensorFlow 1.12
MIOpen-hip
When I run waveglow-tensorflow (https://github.com/b04901014/waveglow-tensorflow), a floating point exception (segmentation fault) happens inside miopen::solver::ConvOclBwdWrW2::GetSolution for some reason.
How to reproduce
Set up hparams.py (e.g. edit the path to LJSpeech) as described in waveglow-tensorflow's README.
Reduce wavnet_channels and wavenet_layers to 256 and 7 respectively, since the default configuration does not fit into the VEGA's 8 GB of GPU memory: https://github.com/b04901014/waveglow-tensorflow/blob/master/src/hparams.py#L80
Then run python main.py.
I have disabled auto-tuning by setting TF_CUDNN_USE_AUTOTUNE=0, but this does not affect the issue: https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why
The following is the gdb trace.