Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workSpaceSize check assertion fails in ConvolutionBackwardWeights() - DEBUG builds only #72

Closed
syoyo opened this issue Jan 12, 2019 · 10 comments

Comments

@syoyo
Copy link

syoyo commented Jan 12, 2019

System

  • Ubuntu 18.04
  • ROCm 2.0
  • MIOpen master branch + Debug build
  • VEGA56(gfx900)

I have encountered assertion failure when trying to run keithito's tacotron.

https://github.com/keithito/tacotron

How to reproduce

Download LJ-Speech data and setup as described in Tacotron's README.

Set MIOPEN_ENABLE_LOGGING=1 and MIOPEN_LOG_LEVEL=6

Then, run train.py

$ gdb python
$ set args train.py
...
python: /home/syoyo/work/MIOpen/src/ocl/convolutionocl.cpp:4111: auto miopen::ConvolutionDescriptor::ConvolutionBackwardWeights(miopen::Handle &, const void *, const miopen::TensorDescriptor &, ConstData_t, const miopen::TensorDescriptor &, ConstData_t, miopenConvBwdWeightsAlgorithm_t, const void *, const miopen::TensorDescriptor &, Data_t, Data_t, size_t)::(anonymous class)::operator()(type-parameter-0-0) const: Assertion `(workSpace != nullptr) && (workSpaceSize >= workSpaceSizeDirect)' failed.
Thread 47 "python" received signal SIGABRT, Aborted.
...
$1 = (Data_t &) @0x7fff067f8f98: 0x60f235400
(gdb) p workSpaceSize
$2 = (size_t &) @0x7fff067f8fa0: 1048576
(gdb) p workSpaceSizeDirect
$3 = 4194304

Full log: issue_72_full_log.txt

workSpaceSizeDirect is 4x larger(due to sizeof(float)?) than workSpaceSize.

I have modified MIOpen source code to print the value of BackwardWeightsGetWorkSpaceSizeDirect

@atamazov
Copy link
Contributor

atamazov commented Jan 15, 2019

Explanation:

  • More than one Solution is suitable for Direct WrW algo
  • One (ConvOclBwdWrW2) wants 4MB workspace, another one (ConvOclBwdWrW53) needs only 1MB
  • The second Solution (ConvOclBwdWrW53) is faster and then it is selected
  • The application provides 1MB workspace, just as needed for ConvOclBwdWrW53.
  • The assertion checks for max of all available Direct Solutions, which is 4MB, and assertion fails.

driver command to reproduce:

./bin/MIOpenDriver conv -n 32 -c 128 -H 1 -W 111 -k 128 -c 128 -y 1 -x 2 -p 0 -q 0 -V 0 -t 1 -w 1

I'll provide a fix soon.

@atamazov
Copy link
Contributor

@syoyo The patch which shall resolve the issue resides here. Please close the issue if it fixes the problem. Thanks.

syoyo added a commit to syoyo/MIOpen that referenced this issue Jan 16, 2019
Fixes ROCm#70(DIV/0 seg fault)
Fixes ROCm#72(assertion failure)
@syoyo
Copy link
Author

syoyo commented Jan 16, 2019

@atamazov At least the patch now avoids assertion failure, but an application now goes into infinite loop of compiling MIOpen conv kernels(by glimpsing MIOpen logs, it looks it infinitely try to search conv kernel)

@atamazov
Copy link
Contributor

@syoyo Theoretically, this could be not an infinite loop but auto-tuning (exhaustive search) process. Please check if autotuning is enabled in the environment.

Anyway, I am unable to reproduce the "loop" problem with this config (./bin/MIOpenDriver conv -n 32 -c 128 -H 1 -W 111 -k 128 -c 128 -y 1 -x 2 -p 0 -q 0 -V 0 -t 1 -w 1). Please attach log. Thanks.

@syoyo
Copy link
Author

syoyo commented Jan 16, 2019

I am unable to reproduce the "loop" problem with this config (./bin/MIOpenDriver conv -n 32 -c 128 -H 1 -W 111 -k 128 -c 128 -y 1 -x 2 -p 0 -q 0 -V 0 -t 1 -w 1).

Ah, I meant "loop" occurs when I run tacotron app, not MIOpenDriver solely.
I'll also look into autotuning document.

Currently I'm running a learning task for #70 regression. I'll attach log of #72 loop issue after that(it may take half or one day)

@atamazov
Copy link
Contributor

@syoyo I highly recommend opening another issue for the loop problem. Let's keep this issue dedicated to workspace size check assertion. Thanks.

@syoyo
Copy link
Author

syoyo commented Jan 16, 2019

@atamazov I understand. I'll open another issue once ready to submit log file

I highly recommend opening another issue for the loop problem

So this issue is ready to close once PR #73 has been merged(Inherently its @atamazov's patch)

@atamazov
Copy link
Contributor

I am not sure that #73 will be merged. AFAIK we are going to release 1.7.1 instead, which shall contain a set of fixes & improvements, including fix for #70 and #72.

@atamazov atamazov changed the title workSpaceSize check assertion fails in miopen::ConvolutionDescriptor::ConvolutionBackwardWeights workSpaceSize check assertion fails in ConvolutionBackwardWeights() - DEBUG builds only Jan 17, 2019
@atamazov
Copy link
Contributor

Note that the issue only occurs with Debug builds of MIOpen (i.e. those which have assertions enabled).

@syoyo
Copy link
Author

syoyo commented Feb 7, 2019

Fix has been included in 1.7.1: a478ac8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants