SIGFPE (DIV/0) in ConvOclBwdWrW2::GetSolution() #70

Closed
syoyo opened this issue Jan 10, 2019 · 15 comments

syoyo commented Jan 10, 2019

Ubuntu 18.04
ROCm 2.0
VEGA56
python 3.6(conda) + ROCm TensorFlow 1.12
MIOpen-hip

When I run waveglow-tensorflow

https://github.com/b04901014/waveglow-tensorflow

A floating point exception (reported as a segmentation fault) happens inside miopen::solver::ConvOclBwdWrW2::GetSolution for some reason.

How to reproduce

Set up hparams.py (e.g. edit the path to LJSpeech) as described in waveglow-tensorflow's README.

Reduce wavenet_channels and wavenet_layers to 256 and 7 respectively, since the default configuration does not fit into the VEGA 56's 8 GB of GPU memory.

https://github.com/b04901014/waveglow-tensorflow/blob/master/src/hparams.py#L80

Then run python main.py.

I have disabled auto-tuning by setting TF_CUDNN_USE_AUTOTUNE=0, but this does not affect the issue: https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why

The following is the gdb trace.

Starting program: /home/syoyo/miniconda3/envs/py36/bin/python main.py 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3933700 (LWP 15677)]
[New Thread 0x7ffff3132700 (LWP 15678)]
[New Thread 0x7fffee931700 (LWP 15679)]
[New Thread 0x7fffec130700 (LWP 15680)]
[New Thread 0x7fffe992f700 (LWP 15681)]
[New Thread 0x7fffe712e700 (LWP 15682)]
[New Thread 0x7fffe692d700 (LWP 15683)]
[New Thread 0x7fffe212c700 (LWP 15684)]
[New Thread 0x7fffe192b700 (LWP 15685)]
[New Thread 0x7fffdd12a700 (LWP 15686)]
[New Thread 0x7fffdc929700 (LWP 15687)]
[New Thread 0x7fffd8128700 (LWP 15688)]
[New Thread 0x7fffd5927700 (LWP 15689)]
[New Thread 0x7fffd3126700 (LWP 15690)]
[New Thread 0x7fffd2925700 (LWP 15691)]
[New Thread 0x7fffa9422700 (LWP 15694)]
[Thread 0x7fffd8128700 (LWP 15688) exited]
[Thread 0x7fffd2925700 (LWP 15691) exited]
[Thread 0x7fffd3126700 (LWP 15690) exited]
[Thread 0x7fffd5927700 (LWP 15689) exited]
[Thread 0x7fffdc929700 (LWP 15687) exited]
[Thread 0x7fffdd12a700 (LWP 15686) exited]
[Thread 0x7fffe192b700 (LWP 15685) exited]
[Thread 0x7fffe212c700 (LWP 15684) exited]
[Thread 0x7fffe692d700 (LWP 15683) exited]
[Thread 0x7fffe712e700 (LWP 15682) exited]
[Thread 0x7fffe992f700 (LWP 15681) exited]
[Thread 0x7fffec130700 (LWP 15680) exited]
[Thread 0x7fffee931700 (LWP 15679) exited]
[Thread 0x7ffff3132700 (LWP 15678) exited]
[Thread 0x7ffff3933700 (LWP 15677) exited]
WARNING:tensorflow:From /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
[New Thread 0x7fffd2925700 (LWP 15700)]
Time Segments of audio for training: 16384
2019-01-10 21:01:47.951037: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
[New Thread 0x7fffd3126700 (LWP 15703)]
[New Thread 0x7fffd5927700 (LWP 15704)]
[New Thread 0x7fffd8128700 (LWP 15705)]
[New Thread 0x7fff8bf9e700 (LWP 15706)]
[New Thread 0x7fff8b79d700 (LWP 15707)]
[New Thread 0x7fff8af9c700 (LWP 15708)]
[New Thread 0x7fff8a79b700 (LWP 15709)]
[New Thread 0x7fff89f9a700 (LWP 15710)]
[New Thread 0x7fff89799700 (LWP 15711)]
[New Thread 0x7fff88f98700 (LWP 15712)]
[New Thread 0x7fff5bfff700 (LWP 15713)]
[New Thread 0x7fff5b7fe700 (LWP 15714)]
[New Thread 0x7fff53fff700 (LWP 15715)]
[New Thread 0x7fff5affd700 (LWP 15716)]
[New Thread 0x7fff5a7fc700 (LWP 15717)]
[New Thread 0x7fff59ffb700 (LWP 15718)]
[New Thread 0x7fff597fa700 (LWP 15719)]
2019-01-10 21:01:47.953664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1530] Found device 0 with properties: 
name: Vega [Radeon RX Vega]
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.59
pciBusID 0000:28:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB
2019-01-10 21:01:47.953701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-01-10 21:01:47.953730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-10 21:01:47.953740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057]      0 
2019-01-10 21:01:47.953749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0:   N 
2019-01-10 21:01:47.953799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7767 MB memory) -> physical GPU (device: 0, name: Vega [Radeon RX Vega], pci bus id: 0000:28:00.0)
[New Thread 0x7fff58ff9700 (LWP 15720)]
[New Thread 0x7fff537fe700 (LWP 15721)]
[New Thread 0x7fff52ffd700 (LWP 15722)]
[New Thread 0x7fff527fc700 (LWP 15723)]
[New Thread 0x7fff51ffb700 (LWP 15724)]
[New Thread 0x7fff517fa700 (LWP 15725)]
[New Thread 0x7fff50ff9700 (LWP 15726)]
[New Thread 0x7fff23fff700 (LWP 15727)]
[New Thread 0x7fff237fe700 (LWP 15728)]
[New Thread 0x7fff22ffd700 (LWP 15729)]
[New Thread 0x7fff227fc700 (LWP 15730)]
[New Thread 0x7fff21ffb700 (LWP 15731)]
[New Thread 0x7fff217fa700 (LWP 15732)]
[New Thread 0x7fff20ff9700 (LWP 15733)]
[New Thread 0x7ffefffff700 (LWP 15734)]
[New Thread 0x7ffeff7fe700 (LWP 15735)]
[New Thread 0x7ffefeffd700 (LWP 15736)]
[New Thread 0x7ffefe7fc700 (LWP 15737)]
Total number of parameters: 102664808
Total number of audio/text pair for training: 11790
Total number of audio/text pair for validation: 1310
Error Loading Model! Training From Initial State...
[New Thread 0x7ffeef53f700 (LWP 15751)]
[Thread 0x7ffeef53f700 (LWP 15751) exited]
[New Thread 0x7ffeef53f700 (LWP 15752)]
[Thread 0x7ffeef53f700 (LWP 15752) exited]
[New Thread 0x7ffeef53f700 (LWP 15753)]
[Thread 0x7ffeef53f700 (LWP 15753) exited]
[New Thread 0x7ffeef53f700 (LWP 15754)]
[Thread 0x7ffeef53f700 (LWP 15754) exited]
[New Thread 0x7ffeef53f700 (LWP 15755)]
[Thread 0x7ffeef53f700 (LWP 15755) exited]
Initialized
[New Thread 0x7ffeef53f700 (LWP 15756)]
Sampling to '../samples'
[New Thread 0x7ffed4d45700 (LWP 15831)]
  0%|                                                                                                                  | 0/5 [00:00<?, ?it/s]Sampling to '../samples/Epoch_0-1.wav' ...
mels done
[New Thread 0x7ffecffff700 (LWP 15835)]
[Thread 0x7ffecffff700 (LWP 15835) exited]
[New Thread 0x7ffecffff700 (LWP 15836)]
[Thread 0x7ffecffff700 (LWP 15836) exited]
[New Thread 0x7ffecffff700 (LWP 15837)]
[Thread 0x7ffecffff700 (LWP 15837) exited]
[New Thread 0x7ffecffff700 (LWP 15838)]
[Thread 0x7ffecffff700 (LWP 15838) exited]
out done
write wav done
 20%|█████████████████████▏                                                                                    | 1/5 [00:38<02:32, 38.14s/it]Sampling to '../samples/Epoch_0-2.wav' ...
mels done
out done
write wav done
 40%|██████████████████████████████████████████▍                                                               | 2/5 [01:00<01:40, 33.51s/it]Sampling to '../samples/Epoch_0-3.wav' ...
mels done
out done
write wav done
 60%|███████████████████████████████████████████████████████████████▌                                          | 3/5 [01:31<01:05, 32.80s/it]Sampling to '../samples/Epoch_0-4.wav' ...
mels done
out done
write wav done
 80%|████████████████████████████████████████████████████████████████████████████████████▊                     | 4/5 [01:59<00:31, 31.23s/it]Sampling to '../samples/Epoch_0-5.wav' ...
mels done
out done
write wav done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:26<00:00, 29.85s/it]
[New Thread 0x7ffecffff700 (LWP 18163)]
[Thread 0x7ffecffff700 (LWP 18163) exited]
[New Thread 0x7ffecffff700 (LWP 18164)]
[Thread 0x7ffecffff700 (LWP 18164) exited]
[New Thread 0x7ffecffff700 (LWP 18165)]
[Thread 0x7ffecffff700 (LWP 18165) exited]
Current learning rate: 1.000000e-04
[New Thread 0x7ffecffff700 (LWP 18166)]
  0%|                                                                                                               | 0/2947 [00:00<?, ?it/s]dequeing...
dequeued.
2019-01-10 21:04:56.901119: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node ArithmeticOptimizer/HoistCommonFactor_Add_add_24 is missing output properties at position :0 (num_outputs=0)
[Thread 0x7ffed4d45700 (LWP 15831) exited]
[New Thread 0x7ffed4d45700 (LWP 18167)]
[Thread 0x7ffed4d45700 (LWP 18167) exited]
[New Thread 0x7ffed4d45700 (LWP 18168)]
[Thread 0x7ffed4d45700 (LWP 18168) exited]
[New Thread 0x7ffed4d45700 (LWP 18169)]
[Thread 0x7ffed4d45700 (LWP 18169) exited]
[New Thread 0x7ffed4d45700 (LWP 18170)]
[Thread 0x7ffed4d45700 (LWP 18170) exited]

Thread 46 "python" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fff227fc700 (LWP 15730)]
0x00007fffb4cb329d in miopen::solver::ConvOclBwdWrW2::GetSolution(miopen::ConvolutionContext const&) const ()
   from /opt/rocm/lib/libMIOpen.so.1
(gdb) bt
#0  0x00007fffb4cb329d in miopen::solver::ConvOclBwdWrW2::GetSolution(miopen::ConvolutionContext const&) const ()
   from /opt/rocm/lib/libMIOpen.so.1
#1  0x00007fffb4cb1157 in miopen::solver::ConvOclBwdWrW2::IsApplicable(miopen::ConvolutionContext const&) const ()
   from /opt/rocm/lib/libMIOpen.so.1
#2  0x00007fffb4d6393e in std::vector<miopen::solver::ConvSolution, std::allocator<miopen::solver::ConvSolution> > miopen::solver::SearchForAllSolutions<miopen::solver::ConvAsmBwdWrW1x1, miopen::solver::ConvAsmBwdWrW3x3, miopen::solver::ConvOclBwdWrW2, miopen::solver::ConvOclBwdWrW53, miopen::solver::ConvOclBwdWrW1x1, miopen::ConvolutionContext, miopen::MultiFileDb, miopen::solver::ConvSolution>(miopen::ConvolutionContext const&, miopen::MultiFileDb)::{lambda(auto:1)#1}::operator()<miopen::solver::ConvOclBwdWrW2> () from /opt/rocm/lib/libMIOpen.so.1
#3  0x00007fffb4d447c0 in std::vector<miopen::solver::ConvSolution, std::allocator<miopen::solver::ConvSolution> > miopen::solver::SearchForAllSolutions<miopen::solver::ConvAsmBwdWrW1x1, miopen::solver::ConvAsmBwdWrW3x3, miopen::solver::ConvOclBwdWrW2, miopen::solver::ConvOclBwdWrW53, miopen::solver::ConvOclBwdWrW1x1, miopen::ConvolutionContext, miopen::MultiFileDb, miopen::solver::ConvSolution>(miopen::ConvolutionContext const&, miopen::MultiFileDb) () from /opt/rocm/lib/libMIOpen.so.1
#4  0x00007fffb4d446b8 in mlo_construct_BwdWrW2D::FindAllSolutions() () from /opt/rocm/lib/libMIOpen.so.1
#5  0x00007fffb4b60da5 in miopen::ConvolutionDescriptor::BackwardWeightsGetWorkSpaceSizeDirect(miopen::Handle&, miopen::TensorDescriptor const&, miopen::TensorDescriptor const&, miopen::TensorDescriptor const&) const () from /opt/rocm/lib/libMIOpen.so.1
#6  0x00007fffb4b6153d in miopen::ConvolutionDescriptor::ConvolutionBackwardWeightsGetWorkSpaceSize(miopen::Handle&, miopen::TensorDescriptor const&, miopen::TensorDescriptor const&, miopen::TensorDescriptor const&) const () from /opt/rocm/lib/libMIOpen.so.1
#7  0x00007fffb4b6c992 in miopenConvolutionBackwardWeightsGetWorkSpaceSize () from /opt/rocm/lib/libMIOpen.so.1
#8  0x00007fffc184ac24 in bool stream_executor::rocm::MIOpenSupport::DoConvolveBackwardFilterImpl<float>(stream_executor::Stream*, int, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) () from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007fffc184b1b5 in stream_executor::rocm::MIOpenSupport::DoConvolveBackwardFilter(stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007fffb9813309 in stream_executor::Stream::ThenConvolveBackwardFilterWithAlgorithm(stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#11 0x00007fffc1030c5d in tensorflow::LaunchConv2DBackpropFilterOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, tensorflow::Tensor*, tensorflow::TensorFormat) () from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007fffc103197d in tensorflow::Conv2DSlowBackpropFilterOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007fffb9416df1 in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
---Type <return> to continue, or q <return> to quit---
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#14 0x00007fffb9417319 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#15 0x00007fffb94600b3 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#16 0x00007fffb946027f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#17 0x00007fffb95137a1 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#18 0x00007fffb95112c6 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /home/syoyo/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#19 0x00007fffb4a57678 in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
#20 0x00007ffff7bbd6db in start_thread (arg=0x7fff227fc700) at pthread_create.c:463
#21 0x00007ffff78e688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

whchung commented Jan 10, 2019

@daniellowell per our discussion, could you help try to reproduce the issue? The TF-specific log is this line:

2019-01-10 21:04:56.901119: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node ArithmeticOptimizer/HoistCommonFactor_Add_add_24 is missing output properties at position :0 (num_outputs=0)

Basically it means grappler failed to properly execute one optimization pass, ArithmeticOptimizer. But in that case the graph simply goes un-optimized and should continue to run properly. Therefore the exception seems to come from MIOpen:

miopen::solver::ConvOclBwdWrW2::GetSolution


syoyo commented Jan 10, 2019

2019-01-10 21:04:56.901119: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node ArithmeticOptimizer/HoistCommonFactor_Add_add_24 is missing output properties at position :0 (num_outputs=0)

@whchung FYI, this warning also appears on CUDA (NVIDIA) GPUs.

daniellowell (Contributor) commented:

@syoyo is it possible to dump the detailed MIOpen log? Rerun the model with the environment variable MIOPEN_LOG_LEVEL=6 set.
I'll try to get the model running later today.


syoyo commented Jan 10, 2019

Here is a log taken with MIOPEN_LOG_LEVEL=6:

miopen-log.txt

Also, I was able to build MIOpen from source, so I may try debugging with a Debug build of MIOpen or with ASAN enabled to find the more precise location where the fault happens.

atamazov (Contributor) commented:

A Debug build would probably help to find the exact place of the problem (there are some assertions which may fire, and gdb can show the exact file/line where the SIGFPE occurs).

atamazov (Contributor) commented:

Unfortunately, the log ends with

2019-01-11 02:29:53.309319: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
MIOpen(HIP): Info2 [ConvolutionBackwardWeightsGetWorkSpaceSize] 
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsmBwdWrW1x1: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsmBwdWrW3x3: Not applicable

so it is missing the problem config information. A more detailed log can be obtained by setting both MIOPEN_ENABLE_LOGGING=1 and MIOPEN_LOG_LEVEL=6.


syoyo commented Jan 11, 2019

Please find attached the log taken with MIOPEN_ENABLE_LOGGING=1 and MIOPEN_LOG_LEVEL=6.

...
MIOpen(HIP): miopenStatus_t miopenConvolutionBackwardWeightsGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){
dyDesc = 4, 80, 64, 1
xDesc = 4, 80, 16384, 1
convDesc = miopenConvolution, 384, 0, 256, 1, 1, 1, 
dwDesc = 80, 80, 1024, 1
workSpaceSize = 139957174285512
}
MIOpen(HIP): Info2 [ConvolutionBackwardWeightsGetWorkSpaceSize] 
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsmBwdWrW1x1: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsmBwdWrW3x3: Not applicable

It looks like workSpaceSize becomes an invalid value.

miopen-log.txt

atamazov (Contributor) commented:

Thanks. The config looks weird:

./bin/MIOpenDriver conv -n 4 -c 80 -k 80 -H 16384 -W 1 -y 1024 -x 1 -p 384 -q 0 -u 256 -v 1
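
For reference, the same configuration can also be exercised directly through the MIOpen C API, bypassing TensorFlow. The sketch below is only an illustration assembled from the descriptor values in the log above; the check() helper and the error handling are mine, not part of the original report. The workspace query at the end is the same call as frame #7 of the backtrace, so with this config it should reach ConvOclBwdWrW2::GetSolution() and hit the SIGFPE.

// Stand-alone sketch of the failing workspace query (illustration only).
#include <miopen/miopen.h>
#include <cstdio>
#include <cstdlib>

static void check(miopenStatus_t s, const char* what) {
    if (s != miopenStatusSuccess) {
        std::fprintf(stderr, "%s failed: %d\n", what, static_cast<int>(s));
        std::exit(1);
    }
}

int main() {
    miopenHandle_t handle;
    check(miopenCreate(&handle), "miopenCreate");

    miopenTensorDescriptor_t dy, x, dw;
    check(miopenCreateTensorDescriptor(&dy), "create dy");
    check(miopenCreateTensorDescriptor(&x), "create x");
    check(miopenCreateTensorDescriptor(&dw), "create dw");
    // NCHW shapes copied from the log dump: dyDesc, xDesc, dwDesc.
    check(miopenSet4dTensorDescriptor(dy, miopenFloat, 4, 80, 64, 1), "set dy");
    check(miopenSet4dTensorDescriptor(x, miopenFloat, 4, 80, 16384, 1), "set x");
    check(miopenSet4dTensorDescriptor(dw, miopenFloat, 80, 80, 1024, 1), "set dw");

    miopenConvolutionDescriptor_t conv;
    check(miopenCreateConvolutionDescriptor(&conv), "create conv");
    // pad_h=384, pad_w=0, stride_h=256, stride_w=1, dilations 1x1 (as dumped).
    check(miopenInitConvolutionDescriptor(conv, miopenConvolution, 384, 0, 256, 1, 1, 1),
          "init conv");

    size_t workspace = 0;
    // Same call as frame #7 of the backtrace.
    check(miopenConvolutionBackwardWeightsGetWorkSpaceSize(handle, dy, x, conv, dw, &workspace),
          "BackwardWeightsGetWorkSpaceSize");
    std::printf("workspace = %zu\n", workspace);

    miopenDestroyConvolutionDescriptor(conv);
    miopenDestroyTensorDescriptor(dw);
    miopenDestroyTensorDescriptor(x);
    miopenDestroyTensorDescriptor(dy);
    miopenDestroy(handle);
    return 0;
}

Building this against /opt/rocm (e.g. with hipcc and -lMIOpen) should make it easier to iterate on a Debug build than rerunning the full TensorFlow model.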

@atamazov atamazov changed the title Floating point exception(segmentation fault) in miopen::solver::ConvOclBwdWrW2::GetSolution SIGFPE (DIV/0) in miopen::solver::ConvOclBwdWrW2::GetSolution Jan 11, 2019
@atamazov atamazov changed the title SIGFPE (DIV/0) in miopen::solver::ConvOclBwdWrW2::GetSolution SIGFPE (DIV/0) in ConvOclBwdWrW2::GetSolution() Jan 11, 2019
atamazov (Contributor) commented:

@syoyo Thanks for the bug report. The reason has been identified; please expect a fix soon.


syoyo commented Jan 11, 2019

@atamazov Thanks!

BTW, I found the place where the DIV/0 happens by building a Debug version of MIOpen:

0x00007fffb4ad4aad in miopen::solver::ConvOclBwdWrW2::GetSolution (this=0x7fff3eff9808, params=...)
    at /home/syoyo/work/MIOpen/src/solver/conv_ocl_dir2D_bwdWrW_2.cpp:263
263	            out_wei_scan_loop = (out_width + n_wei_blk - 1) / n_wei_blk;
(gdb) p n_wei_blk
$1 = 0
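
For what it's worth, the faulting statement is a ceil-division whose divisor (n_wei_blk) comes out as zero for this configuration. Below is only a small self-contained sketch of the failure mode and the obvious kind of guard; it is not the actual MIOpen fix (see the patch in the next comment), and safe_ceil_div is a name made up for illustration.

// Not the real patch -- just a demonstration of why line 263 faults when
// n_wei_blk is 0, and of the kind of guard that avoids the integer DIV/0.
#include <cstdio>
#include <optional>

// Ceil-division as written at conv_ocl_dir2D_bwdWrW_2.cpp:263, but guarded.
static std::optional<int> safe_ceil_div(int num, int den) {
    if (den <= 0)
        return std::nullopt; // integer division by zero would raise SIGFPE
    return (num + den - 1) / den;
}

int main() {
    int out_width = 64; // example value; the crash depends only on n_wei_blk
    int n_wei_blk = 0;  // the value gdb shows for this configuration

    if (auto loop = safe_ceil_div(out_width, n_wei_blk))
        std::printf("out_wei_scan_loop = %d\n", *loop);
    else
        std::printf("n_wei_blk == 0: the solver should report itself as "
                    "not applicable instead of dividing\n");
    return 0;
}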

atamazov (Contributor) commented:

Here is the patch which solves this issue and #72. The fixes will be included in the next MIOpen release.
issue70_issue72.diff.txt

atamazov (Contributor) commented:

@syoyo Please close the issue if the above resolves it.

syoyo added a commit to syoyo/MIOpen that referenced this issue Jan 16, 2019
Fixes ROCm#70 (DIV/0 seg fault)
Fixes ROCm#72 (assertion failure)

syoyo commented Jan 16, 2019

I've confirmed the given patch solves this issue. No more DIV/0 seg fault.
I have also created a PR for this: #73


syoyo commented Feb 7, 2019

The fix has been included in 1.7.1: a478ac8

syoyo closed this as completed Feb 7, 2019