Caffe - inconsistency in the activation feature values - GPU mode #2783

aalok1993 · 2015-07-18T20:17:40Z

Hi, I am using Caffe on Ubuntu 14.04
CUDA version 7.0
cuDNN version 2
GPU: NVIDIA GT 730

In Caffe, I first do the initialization and then load the ImageNet model (AlexNet). I also enable GPU mode using set_mode_gpu().
After that I take an image. Let's call the image x.
I copy this image into the Caffe source blob. Then I run a forward pass for this image using: net.forward(end='fc7')
Then I extract the 4096-dimensional fc7 output (the activation features of the fc7 layer).

The problem I am facing is that when I run the same code multiple times, I obtain a different result every time. That is, in GPU mode, the activation features are different on each run for the same image. A forward pass through the network is supposed to be deterministic, right? So I should get the same output every time for the same image.
On the other hand, when I run Caffe on the CPU using set_mode_cpu(), everything works perfectly, i.e., I get the same output each time.
The code used and the outputs obtained are shown below. I am not able to understand what the problem is. Is it caused by GPU rounding? But the errors are very large. Or is it due to some issue with the latest cuDNN version? Or is it something else altogether?

Following is the CODE
#1) IMPORT libraries

from cStringIO import StringIO
import numpy as np
import scipy.ndimage as nd
import PIL.Image
from IPython.display import clear_output, Image, display
from google.protobuf import text_format
import scipy
import matplotlib.pyplot as plt
import caffe

#2) IMPORT Caffe Models and define utility functions

model_path = '../../../caffe/models/bvlc_alexnet/' 
net_fn   = model_path + 'deploy.prototxt'
param_fn = model_path + 'bvlc_reference_caffenet.caffemodel'

model = caffe.io.caffe_pb2.NetParameter()
text_format.Merge(open(net_fn).read(), model)
model.force_backward = True
open('tmp.prototxt', 'w').write(str(model))

net = caffe.Classifier('tmp.prototxt', param_fn,
                       mean=np.float32([104.0, 116.0, 122.0]),  # ImageNet mean, training set dependent
                       channel_swap=(2, 1, 0),  # the reference model has channels in BGR order instead of RGB
                       image_dims=(227, 227))

caffe.set_mode_gpu()
# caffe.set_mode_cpu()

# a couple of utility functions for converting to and from Caffe's input image layout
def preprocess(net, img):
    return np.float32(np.rollaxis(img, 2)[::-1]) - net.transformer.mean['data']
def deprocess(net, img):
    return np.dstack((img + net.transformer.mean['data'])[::-1])

#3) LOADING Image and setting constants

target_img = PIL.Image.open('alpha.jpg')
target_img = target_img.resize((227,227), PIL.Image.ANTIALIAS)
target_img=np.float32(target_img)
target_img=preprocess(net, target_img)

end='fc7'

#4) Setting the source image and making the forward pass to obtain fc7 activation features

src = net.blobs['data']
src.reshape(1,3,227,227) # resize the network's input image size
src.data[0] = target_img
dst = net.blobs[end]
net.forward(end=end)
target_data = dst.data[0]
print dst.data

FOLLOWING is the output that I obtained for 'print dst.data' when I ran the above code multiple times

output on 1st execution of code

[[-2.22313166 -1.66219997 -1.67641115 ..., -3.62765646 -2.78621101
  -5.06158161]]

output on 2nd execution of code

[[ -82.72431946 -372.29296875 -160.5559845  ..., -367.49728394 -138.7151947
  -343.32080078]]

output on 3rd execution of code

[[-10986.42578125 -10910.08105469 -10492.50390625 ...,  -8597.87011719
   -5846.95898438  -7881.21923828]]

output on 4th execution of code

[[-137360.3125     -130303.53125    -102538.78125    ...,  -40479.59765625
    -5832.90869141   -1391.91259766]]
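
To put a number on the run-to-run differences reported above, a small helper can repeat a forward pass and report the largest deviation from the first result. This is only a sketch: `max_run_to_run_diff` and `fake_forward` are hypothetical names, and the stand-in function replaces a real Caffe forward pass so the snippet runs without a GPU. In the original script the callable would presumably wrap net.forward(end='fc7') and return the fc7 blob data; that wiring is an assumption. It also compares runs within one process, whereas the outputs above come from separate executions of the script.

```python
import numpy as np

def max_run_to_run_diff(forward_fn, runs=5):
    """Call forward_fn repeatedly and return the largest absolute
    element-wise difference between the first result and any later one."""
    # np.array(...) copies, so the reference is not overwritten by later calls.
    reference = np.array(forward_fn(), dtype=np.float64)
    worst = 0.0
    for _ in range(runs - 1):
        current = np.array(forward_fn(), dtype=np.float64)
        worst = max(worst, float(np.abs(current - reference).max()))
    return worst

# Hypothetical stand-in for a deterministic forward pass (no Caffe needed).
def fake_forward():
    return np.linspace(-5.0, 5.0, 4096, dtype=np.float32)

print(max_run_to_run_diff(fake_forward))
```

A deterministic forward pass should give a difference of exactly zero here; differences of the size shown in the outputs above would be far beyond floating-point rounding.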


seanbell · 2015-07-20T06:37:36Z

I believe that you can't hold onto references the way you are right now. Caffe copies to/from the GPU which makes old pointers to memory invalid after any calls to forward or backward. Move the line dst = net.blobs[end] to after net.forward.

Another note: whenever grabbing results from a forward pass, make sure that you make a copy of the data with .copy() (numpy method). Otherwise, earlier results will become invalid/overwritten after any subsequent forward passes.
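
The point about copying results can be illustrated with NumPy alone. In this sketch `blob_data` and `fake_forward` are hypothetical stand-ins for a blob's data array and for a forward pass that overwrites it in place; no Caffe calls are involved, and the real memory behaviour on the GPU may differ in detail.

```python
import numpy as np

# Stand-in for a blob's data buffer that is reused on every forward pass
# (hypothetical; a real script would read net.blobs['fc7'].data instead).
blob_data = np.zeros((1, 4), dtype=np.float32)

def fake_forward(value):
    # Overwrite the shared buffer in place, as a new forward pass would.
    blob_data[...] = value

fake_forward(1.0)
view = blob_data[0]              # a view: shares memory with the buffer
snapshot = blob_data[0].copy()   # an independent copy of the current values

fake_forward(2.0)                # a later pass overwrites the buffer

print(np.shares_memory(view, blob_data))      # the view follows the buffer
print(np.shares_memory(snapshot, blob_data))  # the copy does not
print(view)      # now reflects the second pass
print(snapshot)  # still holds the first pass
```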

aalok1993 · 2015-07-25T00:35:33Z

Hi, I made the changes you suggested above, but the issue still persists.

I even ran the Python script from "caffe/examples/00-classification.ipynb".
When I run it in GPU mode, the classification output is different every time.
The actual output is supposed to be: "Predicted class is 281."

But in GPU mode the output is arbitrary every time. Following are the outputs I get:

"Predicted class is 49855."
"Predicted class is 154."
"Predicted class is 594."
"Predicted class is 835."
"Predicted class is 49462."

How can it give a value like 49462 when there are not even that many classes?
I installed Caffe by following this guide: https://github.com/tiangolo/caffe/blob/ubuntu-tutorial-b/docs/install_apt2.md
I have installed cuDNN v2. I read somewhere that it is not properly compatible with Caffe. Is that the issue, or is it something else?

Following is the code for "caffe/examples/00-classification.ipynb"

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make sure that caffe is on the python path:
caffe_root = '../'  # this file is expected to be in {caffe_root}/examples
import sys
sys.path.insert(0, caffe_root + 'python')

import caffe

plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

import os
if not os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
    print("Downloading pre-trained CaffeNet model...")
    !../scripts/download_model_binary.py ../models/bvlc_reference_caffenet


caffe.set_device(0)
caffe.set_mode_gpu()

net = caffe.Net(caffe_root + 'models/bvlc_reference_caffenet/deploy.prototxt',
                caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel',
                caffe.TEST)

# input preprocessing: 'data' is the name of the input blob == net.inputs[0]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))
transformer.set_mean('data', np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1)) # mean pixel
transformer.set_raw_scale('data', 255)  # the reference model operates on images in [0,255] range instead of [0,1]
transformer.set_channel_swap('data', (2,1,0))  # the reference model has channels in BGR order instead of RGB


net.blobs['data'].reshape(50,3,227,227)

net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image(caffe_root + 'examples/images/cat.jpg'))
out = net.forward()
print("Predicted class is #{}.".format(out['prob'].argmax()))

seanbell · 2015-08-06T00:02:54Z

Have you tried running without cuDNN? I vaguely remember seeing somewhere that it's not always deterministic, but I could be wrong.

seanbell · 2015-08-06T00:08:10Z

The last line of your code is wrong: when you call argmax, you need to give it the correct axis (axis=1). Otherwise, it computes the argmax over a flattened version of the array, which is only meaningful if your batch size is 1, but in your case the batch size is 50.

If you're processing just one image at a time (a single cat image), you should also set the batch size to 1. Right now you're making 50 copies of the input image and classifying all of them (since the assignment to net.blobs['data'].data[...] will broadcast along the first dimension).
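
The argmax and broadcasting points can be reproduced with NumPy alone. The following is a sketch with made-up scores (the `scores` array is hypothetical, not real network output): a flattened argmax over a 50 x 1000 array can return any index up to 49999, and np.unravel_index shows which row and class that index corresponds to. Under this reading, a reported value such as 49855 would decode to row 49, class 855.

```python
import numpy as np

batch_size, num_classes = 50, 1000

# Hypothetical scores: every row peaks at class 281, except the last row,
# which has an even larger peak at class 855.
scores = np.full((batch_size, num_classes), 1e-4, dtype=np.float32)
scores[:, 281] = 0.5
scores[49, 855] = 0.9

flat_index = scores.argmax()         # index into the flattened array
per_image = scores.argmax(axis=1)    # one class index per image
row, cls = np.unravel_index(flat_index, scores.shape)

print(flat_index)      # may be as large as batch_size * num_classes - 1
print(row, cls)        # which image and class the flat index refers to
print(per_image[:3])   # per-image predictions stay below num_classes

# Broadcasting: assigning one image to the whole batch fills every row.
image = np.random.rand(3, 4, 4).astype(np.float32)   # tiny stand-in image
batch = np.empty((batch_size, 3, 4, 4), dtype=np.float32)
batch[...] = image
all_rows_identical = bool((batch == image).all())
print(all_rows_identical)
```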

aalok1993 · 2015-08-08T14:54:47Z

I tried making the changes you suggested, but it still gives the same error.

When I run the above code in CPU mode, I get the same output every time. But when I run it in GPU mode, I get arbitrary values every time. The problem seems to be related to the GPU.

seanbell · 2015-08-08T15:13:49Z

> I tried making the changes you suggested, but it still gives the same error.

You should at least be getting predicted class labels in the range [0, 1000) this time.

Also, does it work on the GPU without cuDNN?

aalok1993 · 2015-08-08T16:56:36Z

Yes, the predicted classes are within [0,1000).
Another thing I noticed is that often, when the output is wrong, the final prob layer contains many zero values. This doesn't happen every time, but in roughly 50% of the trials it contains many zeros.

I didn't quite understand what you meant by

> Also, does it work on the GPU without cuDNN?

Do you mean to say that I'll need to recompile caffe without using the cUDNN files, or is there a faster way to test that ?
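
For the observation that the prob output often contains many zeros, a quick sanity check on a softmax output might look like the sketch below. `summarize_probs` is a hypothetical helper, and the "healthy" probabilities are synthetic, not produced by Caffe. A well-formed softmax output should be finite and each row should sum to roughly 1; a large fraction of exact zeros is at least a hint that the inputs to the softmax were extreme.

```python
import numpy as np

def summarize_probs(probs):
    """Return simple health indicators for a (batch, classes) softmax output."""
    probs = np.asarray(probs, dtype=np.float64)
    return {
        "all_finite": bool(np.isfinite(probs).all()),
        "zero_fraction": float((probs == 0).mean()),
        "max_row_sum_error": float(np.abs(probs.sum(axis=1) - 1.0).max()),
    }

# Synthetic, well-behaved softmax output for comparison (not Caffe output).
rng = np.random.RandomState(0)
logits = rng.randn(50, 1000)
healthy = np.exp(logits - logits.max(axis=1, keepdims=True))
healthy /= healthy.sum(axis=1, keepdims=True)

report = summarize_probs(healthy)
print(report)
```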

seanbell · 2015-08-08T17:06:50Z

> Do you mean to say that I'll need to recompile caffe without using the cUDNN files, or is there a faster way to test that ?

You could either recompile without cuDNN (disabling it in the Makefile), or you could insert "engine: caffe" inside the prototxt params for any layer that has a cuDNN version. For example: https://gist.github.com/longjon/ac410cad48a088710872#file-fcn-32s-pascal-deploy-prototxt

aalok1993 · 2015-08-10T09:44:13Z

Hi @seanbell.
I compiled Caffe with cuDNN disabled, and now I get the same output every time. The error is gone. Thanks a lot for your help and support.
But I have a small question. I am new to Caffe, hence the confusion: when I am not using cuDNN, how does Caffe still use the GPU for its computation? I thought it was through cuDNN that Caffe used the GPU.
Also, what are the drawbacks of using the GPU without cuDNN?

shelhamer · 2017-04-14T01:40:38Z

Please ask usage and system configuration questions on the mailing list. This seems to be the fault of a cuDNN installation gone wrong.

From https://github.com/BVLC/caffe/blob/master/CONTRIBUTING.md:

Please do not post usage, installation, or modeling questions, or other requests for help to Issues.
Use the caffe-users list instead. This helps developers maintain a clear, uncluttered, and efficient view of the state of Caffe.

shelhamer closed this as completed Apr 14, 2017
