[OD] 6.0 inference regression between CPU and GPU. #2955
Conversation
OMG! I'm impressed by your effort to lock this down! One fundamental question: if the input resizing produces different results, the output should be different between 5.8 and 6.0. Did we do any functional comparison between 5.8 and 6.0?
@@ -8,6 +8,7 @@
from __future__ import absolute_import as _

import numpy as np
import cv2
I may be wrong, but I'm not sure we have taken a dependency on OpenCV. Is there a compelling reason for us to take a dependency on OpenCV here?
Yes.
TensorFlow's resize function is not aligned with MPS's and MXNet's, which causes the inference regression I show in the summary.
NumPy itself doesn't have a resize method for images, which is why I'm using cv2's resize method for now; it is consistent between MPS and MXNet.
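For context, a minimal sketch of the kind of cv2-based helper being discussed (the function name is illustrative, not the exact code in this PR):

```python
import numpy as np
import cv2

def _resize_for_inference(np_img, output_shape):
    """Bilinear-resize an HWC image array to output_shape = (height, width)."""
    out_h, out_w = output_shape
    # cv2.resize takes dsize as (width, height); INTER_LINEAR is bilinear interpolation,
    # which matches the resizing behavior of MPS and MXNet in this preprocessing step.
    resized = cv2.resize(np_img, (out_w, out_h), interpolation=cv2.INTER_LINEAR)
    return resized.astype(np.float32)
```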
We would need to get legal approval to depend on cv2, and it's a huge dependency to pull in if it's needed only for image resizing. We already resize images in many other places (see, e.g., the image_deep_feature_extraction code path).
If there is a way to use PIL or one of our C++ image resizing utilities instead, we should do that.
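A hedged sketch of what the PIL-based alternative could look like (the helper name is illustrative; assumes pixel values in the 0-255 range):

```python
import numpy as np
from PIL import Image

def _resize_with_pil(np_img, output_shape):
    """Illustrative PIL-based resize to output_shape = (height, width)."""
    out_h, out_w = output_shape
    # PIL's resize also takes (width, height); BILINEAR keeps the interpolation
    # comparable to cv2's INTER_LINEAR.
    pil_img = Image.fromarray(np.uint8(np_img))
    resized = pil_img.resize((out_w, out_h), Image.BILINEAR)
    return np.asarray(resized, dtype=np.float32)
```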
@hoytak Thanks for this suggestion!
> OMG! I'm impressed by your effort to lock this down!

I guess this should be covered by the benchmark's backward-compatibility checks.
mAP difference comparison:
@nickjong @srikris It seems like, if we are not using OpenCV, the built-in image_resize function in turicreate produces the closest result.
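For reference, a minimal sketch of what using the built-in resize could look like (the helper name and the keyword arguments are assumptions about turicreate's image_analysis API; the actual PR code may differ):

```python
import numpy as np
import turicreate as tc

def _resize_with_turicreate(tc_image, output_shape):
    """Illustrative helper: resize a turicreate.Image to (height, width), return float pixels."""
    out_h, out_w = output_shape
    # 'decode' and 'resample' keyword names are assumed from turicreate's image_analysis API.
    resized = tc.image_analysis.resize(tc_image, out_w, out_h, channels=3,
                                       decode=True, resample='bilinear')
    return resized.pixel_data.astype(np.float32)
```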
    np_img /= 255.
    return np_img


def resize_turicretae_image(image, output_shape):
*turicreate
Even then, I don't love the name of this function. But in the long run, if this approach proves stable and accurate, we should move this resizing to the C++ side anyway. There's no point in going from C++ to Python just to call C++ resizing code, once we converge on the right algorithm.
pass gitlab.
@@ -8,6 +8,8 @@
from __future__ import absolute_import as _

import numpy as np
from PIL import Image
import PIL
Why do we need to import PIL? Would the line above this one suffice?
good!
close #2865
6.0's CPU and GPU have a prediction regression.
For the same image, the GPU and CPU prediction screenshots show different detections.
The regression from both predict() and evaluate() is not negligible.

[First Step] Compare 5.8's CPU, GPU and 6.0's CPU, GPU
I found that only 6.0's CPU has the inference regression; that is to say, the issue is in TensorFlow.
[Second Step] Compare TensorFlow with MXNet
I loaded the same weights into the TF and MXNet models and compared the outputs layer by layer; the max error for the output feature tensor is of magnitude 5 * 1e-5, which is good.
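A small sketch of the kind of layer-by-layer check described above (the layer handles are illustrative; the actual comparison was done against the internal TF and MXNet graphs):

```python
import numpy as np

def report_layer_errors(layer_names, tf_outputs, mx_outputs):
    """Print the max absolute error between matching TF and MXNet layer outputs."""
    for name, tf_out, mx_out in zip(layer_names, tf_outputs, mx_outputs):
        max_err = np.max(np.abs(np.asarray(tf_out) - np.asarray(mx_out)))
        print("%s: max abs error = %.2e" % (name, max_err))
```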
[Third Step] Compare raw output from TensorFlow and MXNet through tc's API
I compared the raw output tensors before NMS for TensorFlow and MXNet through tc's API; surprisingly, the output tensor itself has errors up to 0.7.

[Fourth Step] Compare raw input for the TensorFlow and MXNet models
I compared the input tensors for TensorFlow and MXNet and found they have errors up to 0.17.

[Fifth Step] Mock out the augmenter
I resized all images to 412 * 412 beforehand to mock out the effect of the TF image augmenter, and observed perfectly aligned results across TF and MXNet.
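A hedged sketch of this pre-resizing trick using turicreate's own API (the paths, column name, and resize arguments are assumptions, not the exact steps used):

```python
import turicreate as tc

# Hypothetical model/data paths, only to illustrate the idea: pre-resize every image
# to the network input size so the augmenter's in-graph resize becomes a no-op.
model = tc.load_model('my_detector.model')
data = tc.load_sframe('eval_data.sframe')

# 412 * 412 as in the summary above.
data['image'] = tc.image_analysis.resize(data['image'], 412, 412, channels=3)

predictions = model.predict(data)
```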
[Sixth Step] TF's resize_bilinear has a regression
Found existing issue reports that TensorFlow's resize_bilinear diverges from other open-source implementations like cv2 and MXNet.
So I replaced the TF resize with cv2's.
Now we finally get the same predictions and mAP!
The GPU and CPU prediction screenshots now match.
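For anyone who wants to reproduce the resize discrepancy, a hedged stand-alone sketch (assumes TensorFlow 2.x with the compat v1 resize op and OpenCV installed; the exact error magnitude depends on the TF version and on the align_corners / half_pixel_centers settings):

```python
import numpy as np
import cv2
import tensorflow as tf

# Random float image; the source size here is arbitrary.
img = np.random.rand(300, 400, 3).astype(np.float32)
target = (412, 412)  # (height, width), matching the input size used in the summary

# OpenCV bilinear resize, consistent with MXNet in this investigation.
cv_out = cv2.resize(img, (target[1], target[0]), interpolation=cv2.INTER_LINEAR)

# Legacy TF bilinear resize with its default corner alignment.
tf_out = tf.compat.v1.image.resize_bilinear(img[np.newaxis, ...], target)[0].numpy()

print("max abs difference:", np.max(np.abs(cv_out - tf_out)))
```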