Incompatibilities in BatchNorm. #276
Comments
Hi, has anything improved with regard to batch normalization in NVCaffe? Thanks. |
/cc @borisgin |
I wrote a new BatchNorm which is
It will be released with the new NVCaffe. |
Thanks @borisgin |
Thanks @borisgin for the update. Will there be reverse compatibility as well, i.e. can models trained with NVIDIA/caffe CUDNN be used for fine-tuning in BVLC/caffe in CPU and GPU modes? I ask because several frameworks (such as FasterRCNN, SSD, etc.) are built around BVLC/caffe, and I am wondering whether they will understand a model trained in NVIDIA/caffe. If complete compatibility is not there, isn't it better to give this layer a new name, such as BatchNormScale, and make sure that the BatchNorm and BatchNormScale layers coexist? That way we will be able to check the type and write a utility to do the reverse conversion, if needed. |
You can use old BVLC prototxt files and models with NVCaffe, so you can train old models on the new NVCaffe. You can't load new NVCaffe models into BVLC Caffe, since BVLC Caffe does not have the fused BN and scale layer. Having a different name for the layer would not help, since you would still need such a layer in BVLC Caffe. It would be much simpler just to replace the old BVLC BN layer with the new one from NVCaffe. |
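For readers unfamiliar with the "fused BN and scale layer" mentioned above, here is a minimal pure-Python sketch of the inference-time arithmetic for a single channel. It assumes the BVLC convention of three BatchNorm blobs (accumulated mean, accumulated variance, moving-average factor) followed by a separate Scale layer. `fused_bn_scale` is a hypothetical stand-in for a fused layer and is not NVCaffe's actual implementation or blob layout.

```python
import math

EPS = 1e-5  # default eps of BVLC Caffe's BatchNorm layer


def bvlc_batchnorm(x, mean_sum, var_sum, factor, eps=EPS):
    """Inference-time BVLC-style BatchNorm for one channel (3 blobs)."""
    # BVLC stores accumulated statistics plus a moving-average factor,
    # so the usable mean and variance are the sums divided by that factor.
    scale = 0.0 if factor == 0 else 1.0 / factor
    mean = mean_sum * scale
    var = var_sum * scale
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scale_layer(x, gamma, beta):
    """Separate Scale layer: learned per-channel affine transform."""
    return [gamma * v + beta for v in x]


def fused_bn_scale(x, mean, var, gamma, beta, eps=EPS):
    """Hypothetical fused layer: normalization and affine step in one pass."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in x]


# Made-up values for illustration only.
x = [0.5, 1.0, 2.0]
separate = scale_layer(
    bvlc_batchnorm(x, mean_sum=3.0, var_sum=12.0, factor=3.0),
    gamma=2.0, beta=0.1)
fused = fused_bn_scale(x, mean=1.0, var=4.0, gamma=2.0, beta=0.1)
```

The two paths compute the same values, but the fused form needs the scale and bias stored in the same layer as the statistics. That is why the number of blobs per layer differs between the two conventions.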
The gpu_diff() of blobs_[2] and blobs_[3] should always be 0. Is it necessary to set the 3rd and 4th params (blobs_[2] and blobs_[3]) to param { lr_mult: 0 decay_mult: 0 }? |
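One way to reason about this question is to look at a simplified SGD update with L2 weight decay. The sketch below is an assumption-laden illustration, not NVCaffe's solver code: it ignores momentum, and `sgd_step` is a hypothetical helper. It shows that a blob whose gradient is always zero can still drift through the weight-decay term unless the multipliers are zero.

```python
def sgd_step(data, diff, base_lr, weight_decay, lr_mult=1.0, decay_mult=1.0):
    """One simplified SGD step with L2 regularization (no momentum)."""
    local_rate = base_lr * lr_mult
    local_decay = weight_decay * decay_mult
    # The decay term is added to the gradient before the rate is applied.
    return [w - local_rate * (g + local_decay * w) for w, g in zip(data, diff)]


# Made-up running statistics with an all-zero gradient.
running_stats = [1.0, -2.0]
zero_diff = [0.0, 0.0]

# Default multipliers: the values shrink slightly even though diff is 0.
drifted = sgd_step(running_stats, zero_diff, base_lr=0.1, weight_decay=0.0005)

# lr_mult: 0 and decay_mult: 0 leave the values untouched.
frozen = sgd_step(running_stats, zero_diff, base_lr=0.1, weight_decay=0.0005,
                  lr_mult=0.0, decay_mult=0.0)
```

Under these assumptions, zero multipliers are what guarantee that the stored statistics stay unchanged. Whether a given layer implementation already enforces this internally is something the thread does not settle.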
Hi @borisgin, did you merge your batch norm layer into NVCaffe? I still can't use models trained with BVLC Caffe on NVCaffe, and it's really annoying :( |
Yes, it is merged. Can you send a link to the model you use, please? |
The version of NVcaffe I'm using is 0.15.14 and this is one of the models I've tested which results in this error: |
This is a very old branch. Did you try the latest branch (caffe-0.16)?
On Sun, Oct 8, 2017 at 12:47 AM, szm2015 ***@***.***> wrote:
The version of NVcaffe I'm using is 0.15.14 and this
<https://github.com/lim0606/caffe-googlenet-bn> is one of the models I've
tested which results in this error:
F1008 11:16:40.578579 5197 net.cpp:797] Check failed: target_blobs.size() == source_layer.blobs_size() (5 vs. 3) Incompatible number of blobs for layer conv1/7x7_s2/bn
|
Well, the first and most important reason I'm using this branch is that, according to BuildCaffe.md, DIGITS is currently compatible with Caffe 0.15, and since I use DIGITS it seems I have no other choice (do I?). The second reason is that I have also tried to build NVCaffe 0.16.4, but it gives me this error: /usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list |
Hi @szm2015 - what Ubuntu and what GCC do you use? What particular command do you run to build NVCaffe? |
I'm using Ubuntu 16.04.3 LTS and gcc 5.4 and after cloning the NVcaffe I use the following commands to build it: Here's the build summary: -- ******************* Caffe Configuration Summary *******************
|
DIGITS 5 and 6 do work with NVCaffe 0.16, FWIW. Sounds like that document
needs updating.
On Oct 9, 2017 5:47 AM, "szm2015" <notifications@github.com> wrote:
I'm using Ubuntu 16.04.3 LTS and gcc 5.4 and after cloning the NVcaffe I
use the following commands to build it:
mkdir build && cd build
cmake ..
make
Here's the build summary:
-- ******************* Caffe Configuration Summary *******************
-- General:
-- Version : 0.16.4
-- Git : v0.16.4-2-gcdb3d9a
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Build type : Release
-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- CPU_ONLY : OFF
-- USE_OPENCV : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- ALLOW_LMDB_NOLOCK : OFF
-- TEST_FP16 : OFF
-- Dependencies:
-- BLAS : Yes (Atlas)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 3.4.0)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 3.2.0)
-- CUDA : Yes (ver. 8.0)
-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_50
-- cuDNN : Yes (ver. 6.0)
-- NCCL : Not found
-- NVML : /usr/lib/nvidia-375/libnvidia-ml.so
-- Python:
-- Interpreter : /usr/bin/python2.7 (ver. 2.7.12)
-- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
-- NumPy : /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)
-- Documentaion:
-- Doxygen : No
-- config_file :
-- Install:
-- Install path : /home/szm/Work/Caffe/nv-caffe/build/install
-- Configuring done
-- Generating done
-- Build files have been written to: /home/szm/Work/Caffe/nv-caffe/build
|
If it's as @cliffwoolley says, then I will be really grateful if someone helps me with the NVCaffe 0.16 problem. Apart from the batch norm layer problem, I have been unable to train an object detection model with DIGITS for a couple of days; the error is the same as the one mentioned in BVLC#1833. arthurlobo, who asked that question, seems to have resolved it by switching to NVCaffe 0.16.4. The thing is that I recently had to reinstall my whole OS, so I had to install everything from scratch. Before that, everything worked fine with the same version of NVCaffe and probably of DIGITS (I'm sure that it was version 6, but I'm not sure about the exact version). |
Hi @szm2015 |
Hi @drnikolaev, you mean the whole error? I'm not sure what you mean by the "make invocation". This is the complete error list that stops make:
/usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list
1 error detected in the compilation of "/tmp/tmpxft_0000400d_00000000-7_cudnn_conv_layer.cpp1.ii".
src/caffe/CMakeFiles/caffe.dir/build.make:147: recipe for target 'src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_cudnn_conv_layer.cu.o' failed
About CUDA 9, I will try it as soon as I can and report the result. |
Hi everyone, I1010 09:00:52.221909 5509 layer_factory.hpp:136] Creating layer 'cluster' of type 'Python' |
Let's step back to GPU for a moment... :) I see the problem. Fix is coming soon. |
@szm2015 |
Hi @drnikolaev
/home/szm/Work/Caffe/nv-caffe_0.16.4_testVersion/src/caffe/layers/cudnn_conv_layer.cpp: In member function ‘void caffe::CuDNNConvolutionLayer<Ftype, Btype>::FindExConvAlgo(const std::vector<caffe::Blob*>&, const std::vector<caffe::Blob*>&)’:
As a side point, I was at last able to get NVcaffe_0.15.14 to work with DIGITS (the mysterious error regarding the clustering layer is gone). What I did was uninstall everything and reinstall from scratch. But I still need to get NVcaffe_0.16.4 to work so that I will be able to use BN models trained with BVLC Caffe in DIGITS as well. |
It's already fixed, please pull again |
I did and was able to build it successfully, but I have trouble getting it to work. When I try to deploy a model with BVLC Caffe-type BN layers (via C++ code I have written for deploying Caffe models), the program crashes while loading the model, more specifically at this line: whereas the same model runs without a problem using BVLC Caffe (the link to the model I'm testing). I tested the same code with a bvlc-googlenet and it works just fine. |
Hi @szm2015 , crash stack might help here but before getting there please consider adjusting this test to your net:
|
Hi @drnikolaev, I tested the code via the following command: python test_classification.py /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/googlenet_bn_stepsize_6400_iter_1200000.caffemodel /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/deploy.prototxt /home/gpuserver/Moosavi/Test/Input/ImageNet_val/FullPathFileList.txt --mean_file /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/mean.binaryproto --labels_file /home/gpuserver/Moosavi/Test/Input/ImageNet_val/Labels2015.txt use_gpu but nothing happens; it just returns to the prompt after about a second. I don't know exactly how to get the stack trace, but I think it's something like this, the list of functions called before the crash (when debugging with Qt): |
@szm2015
As for the Python script, you need to adjust it to your model. |
I did and the model is now running without a problem. Thank you!
On Mon, Oct 16, 2017 at 4:19 AM, Sergei Nikolaev ***@***.***> wrote:
@szm2015 <https://github.com/szm2015>
Please add this code to the very beginning of your main.cpp
vector<int> gpus(number_of_your_gpus);
gpus[0] = 0;
gpus[1] = 1;
//etc. for each gpu...
caffe::GPUMemory::Scope gpu_memory_scope(gpus);
As for the Python script, you need to adjust it to your model.
|
Well, I tested a little more and there still seem to be two problems. The code runs without crashing, but the forward time required to process a 244x244 image is twice that of the same model and image size running on BVLC Caffe. The second and more important problem is that the results are the same for all of the images (the same top-5 predictions), whereas the model running on BVLC Caffe achieves about 91% top-5 accuracy. |
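A quick way to confirm the second symptom is to compare the top-5 lists across images and compute top-5 accuracy from the raw score vectors. The following is a small, framework-independent sketch. The helper names (`top_k`, `top_k_accuracy`, `predictions_are_constant`) and the score values are made up for illustration and are not taken from the code discussed in this thread.

```python
def top_k(scores, k=5):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_accuracy(all_scores, labels, k=5):
    """Fraction of samples whose true label is among the top-k predictions."""
    hits = sum(1 for s, y in zip(all_scores, labels) if y in top_k(s, k))
    return hits / len(labels)


def predictions_are_constant(all_scores, k=5):
    """True if every sample yields exactly the same top-k list."""
    first = top_k(all_scores[0], k)
    return all(top_k(s, k) == first for s in all_scores)


# Made-up scores for two images over three classes.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
labels = [1, 2]
acc = top_k_accuracy(scores, labels, k=1)
constant = predictions_are_constant(scores, k=1)
```

If `predictions_are_constant` returns True on a varied validation set, the network output probably does not depend on the input. That would point to a problem in how the weights were loaded rather than in preprocessing alone.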
May I have your code to try? |
Hi everyone, sorry for the delay! I attached a modified version of my code (the original one is part of a bigger project written in Qt and has some irrelevant parts). I build this code using the Caffe makefile (by copying the code to the tools directory). It's executed by the following command: (I know it's very messy, but I just wanted to put together some code for testing!) The timing difference I mentioned in my previous comment is not specific to this version of NVCaffe. As a recurring pattern, I've seen that models run faster on BVLC Caffe but consume more GPU memory than the same models running on NVCaffe. |
I'm not sure I follow this. Could you paste output of both runs (bvlc vs nv)? |
Well, this is the BVLCcaffe version of the code (there are some minor differences like declaring a template type for caffe::Net and things like that): Using the BVLCcaffe code, the output of running googlenet_bn model is this (for 30 first images): But using the NVcaffe version gives this output (again for the first 30 images): |
Please check v0.16.5 and reopen the issue if the problem still exists. |
BatchNorm in NVIDIA/caffe is not compatible with BatchNorm in BVLC/caffe.
There is also no compatibility between engine:CAFFE and engine:CUDNN BatchNorm in NVIDIA/caffe itself (the blob shapes are different).
Please fix these issues so that we can use pre-trained models for fine-tuning.
Please refer to:
NVIDIA/DIGITS#629
and
BVLC#3919
as well where similar issues are discussed.
I have some suggestions to fix these issues:
1. Rename NVIDIA/caffe's BatchNorm to BatchNormScale, since it now includes scaling as well.
2. Put a check/exit in the CUDNN BatchNormScale reshape function for the case where the top and bottom blobs are the same, so that the user gets a warning.
3. Fix the inconsistency in blob shape between engine:CAFFE and engine:CUDNN.
4. Currently I have to specify many parameters in the new BatchNorm layer. This is unnecessary.
(4a). In BatchNormScale, if you change the order of the blobs to global_mean, global_variance, scale, bias, global_counter, then I don't have to specify 4 param fields for lr_mult and decay_mult, but only 2.
(4b). If the definition of the scale and bias fields in BatchNormParameter is changed to:
optional float scale_filler = 5 [default = 1];
optional float bias_filler = 6 [default = 0];
then I don't have to specify these in the prototxt either.