
ViennaCL fatal error in GPU mode #6259

Closed
jnschaeffer opened this issue Feb 27, 2018 · 5 comments

@jnschaeffer
Issue summary

Trying to run models in Caffe using GPU mode with amdgpu on Arch results in a ViennaCL kernel start error and a subsequent crash. CPU mode has no issues.

This doesn't seem to happen with all models, but even when the GPU does work it is several times slower than the CPU. This may or may not be related.

This issue may be related to #5804 and #6258. All Caffe tests build and run successfully, however. Additionally, the machine this was tested on uses the opencl-amd package, which appears to install the AMDGPU-PRO OpenCL libraries described in #5804.

Output from Python:

I0226 22:03:44.412575 13642 net.cpp:281] Memory required for data: 68681400
I0226 22:03:44.528407 13642 upgrade_proto.cpp:44] Attempting to upgrade input file specified using deprecated transformation parameters: ./models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
I0226 22:03:44.528434 13642 upgrade_proto.cpp:47] Successfully upgraded file specified using deprecated data transformation parameters.
W0226 22:03:44.528440 13642 upgrade_proto.cpp:49] Note that future Caffe releases will only support transform_param messages for transformation fields.
I0226 22:03:44.528446 13642 upgrade_proto.cpp:53] Attempting to upgrade input file specified using deprecated V1LayerParameter: ./models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
I0226 22:03:44.722709 13642 upgrade_proto.cpp:61] Successfully upgraded file specified using deprecated V1LayerParameter
I0226 22:03:44.783915 13642 net.cpp:796] Ignoring source layer loss
mean-subtracted values: [('B', 104.0069879317889), ('G', 116.66876761696767), ('R', 122.6789143406786)]
/home/john/src/python/env/ml/lib/python2.7/site-packages/skimage/transform/_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "
I0226 22:03:44.992676 13642 device.cpp:62] CL_DEVICE_HOST_UNIFIED_MEMORY: 0
ViennaCL: FATAL ERROR: Kernel start failed for 'fill_float'.
ViennaCL: Smaller work sizes could not solve the problem.
std::exception
Segmentation fault (core dumped)
$

Steps to reproduce

With an AMD card, on the opencl branch of Caffe, run through the entire IPython notebook in examples/00-classification.ipynb. Alternatively, this gist is a modified version of (part of) the same example, with GPU mode swapped in for CPU mode.

Your system configuration

Operating system: Arch Linux x86_64 4.15.5-1-ARCH
Compiler: g++ (GCC) 7.3.0
BLAS: OpenBLAS
Python or MATLAB version (for pycaffe and matcaffe respectively): Python
clinfo output: here
caffe device_query output: here

@naibaf7 naibaf7 self-assigned this Feb 27, 2018
@naibaf7
Member

naibaf7 commented Feb 27, 2018

So: the most likely reason for it to fail is that you cannot switch a network from CPU to GPU with OpenCL Caffe; the layers need to know which device they will use beforehand. You need to change the example to:

  • Set Caffe to GPU mode first
  • Set the device to your GPU
  • Create the network (while mode is GPU and the device is set)
  • Run classification or training

This is due to OpenCL devices being managed in an OOP way, something not required by CUDA.
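
For example, a minimal sketch of the corrected call order in pycaffe (the model paths are the ones from your log above; adjust for your setup):

  import caffe

  # Select GPU mode and the device *before* constructing the net,
  # so every layer is created on the chosen OpenCL device.
  caffe.set_mode_gpu()
  caffe.set_device(0)

  net = caffe.Net('models/bvlc_reference_caffenet/deploy.prototxt',
                  'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel',
                  caffe.TEST)

  out = net.forward()  # runs on the GPU selected above
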
If your network runs slower on OpenCL than on the CPU, there might be multiple reasons:

  • Switching to GPU after having had the model in CPU mode
  • The network lacks big enough operations to be efficient in OpenCL mode (overhead of kernel launches)
  • The first run/forward pass in OpenCL can be slower due to initialization; subsequent forward/backward passes should be faster
  • A bad choice of BLAS and convolution libraries; CLBlast (download and build separately, then link Caffe against it) and libDNN (built in, enable before compiling) are recommended for your GPU, since I mostly develop on Vega and Polaris and know these to be good choices

You can also try:

  ./build/tools/caffe time -model models/bvlc_reference_caffenet/deploy.prototxt --gpu 0

If you look at your clinfo you will also notice something else:

  Max work item sizes                             1024x1024x1024
  Max work group size                             256

Since the latest ROCm and amdgpu-pro releases, AMD seems to report the work item sizes wrong. It should be 256x256x256 (as is also evident from the max work group size parameter). What does this mean? Kernels that rely on auto-selecting the workgroup sizes fail with the current version of ViennaCL, until we use the min of "Max work group size" and "Max work item sizes" or AMD fixes the bug.
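
With the numbers from your clinfo, the workaround amounts to this (a toy sketch, not the actual ViennaCL code):

  # Values reported by the buggy driver (see the clinfo output above).
  max_work_item_sizes = [1024, 1024, 1024]  # misreported
  max_work_group_size = 256                 # correct

  # Workaround: clamp each work item dimension by the work group size.
  effective = [min(d, max_work_group_size) for d in max_work_item_sizes]
  print(effective)  # [256, 256, 256], what the device actually supports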

@naibaf7
Member

naibaf7 commented Feb 27, 2018

So, I used the same driver as you and ran it on Polaris and Vega.
I was able to fix the wrongly reported work-item size by also taking the max work group size into consideration. So you need to update and recompile Caffe after pulling this update:
https://github.com/BVLC/caffe/tree/opencl
fe2a110

The second observation I made is that ViennaCL's GEMM no longer seems to be compatible with AMD's latest OpenCL driver (you will get a memory access violation error from AMD's OpenCL), though it still works on Intel and nVidia. Therefore it is absolutely necessary to use CLBlast or clBlas (and recompile Caffe with it).
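
A rebuild along those lines might look roughly like this (a sketch; the exact option names are assumptions on my part, so check Makefile.config.example or the CMake options in your checkout):

  cd build
  cmake .. -DUSE_GREENTEA=ON -DUSE_LIBDNN=ON -DUSE_CLBLAST=ON
  make -j"$(nproc)" && make runtest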

@naibaf7 naibaf7 closed this as completed Feb 27, 2018
@jnschaeffer
Author

Thanks for the prompt response and for looking into this! Moving the calls to caffe.set_mode_gpu() and caffe.set_device(0) made the example work properly; is that documented anywhere in the Python API?

I'll take the other steps you recommended too. Thanks again.

@naibaf7
Member

naibaf7 commented Feb 27, 2018

@jnschaeffer No, unfortunately it's a bit under-documented as of now. The reason is that I'm mainly spending my time preparing the next big release, with a lot of additional features (quantized data types, faster network inference, a device-abstracted backend), before I add full documentation.
For now, I hope people will find the solutions in discussions like this one.

Please report back if you can successfully run the network.

@jnschaeffer
Author

Just checked: the network runs successfully with Arch's clblast-git package providing CLBlast. I'll build and run all of the tests and let you know if there are any issues.
