This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

RCNN training doesn't work with the latest master branch #5056

Closed
ksofiyuk opened this issue Feb 18, 2017 · 30 comments

Comments

@ksofiyuk

RPNLogLoss (== 0.693147) and RCNNLogLoss (== 3.044522) do not change during the training process.

INFO:root:Epoch[0] Batch [80]	Speed: 4.10 samples/sec	Train-RPNAcc=0.861304,	RPNLogLoss=0.693147,	RPNL1Loss=0.553695,	RCNNAcc=0.967207,	RCNNLogLoss=3.044522,	RCNNL1Loss=1.012108,	
INFO:root:Epoch[0] Batch [100]	Speed: 4.14 samples/sec	Train-RPNAcc=0.854386,	RPNLogLoss=0.693147,	RPNL1Loss=0.620289,	RCNNAcc=0.969291,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.866792,	
INFO:root:Epoch[0] Batch [120]	Speed: 4.07 samples/sec	Train-RPNAcc=0.853596,	RPNLogLoss=0.693147,	RPNL1Loss=0.596068,	RCNNAcc=0.970235,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.746457,	
INFO:root:Epoch[0] Batch [140]	Speed: 4.07 samples/sec	Train-RPNAcc=0.849845,	RPNLogLoss=0.693147,	RPNL1Loss=0.591450,	RCNNAcc=0.971133,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.660492,	
INFO:root:Epoch[0] Batch [160]	Speed: 4.05 samples/sec	Train-RPNAcc=0.852266,	RPNLogLoss=0.693147,	RPNL1Loss=0.599130,	RCNNAcc=0.970788,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.677796,	
INFO:root:Epoch[0] Batch [180]	Speed: 4.06 samples/sec	Train-RPNAcc=0.852555,	RPNLogLoss=0.693147,	RPNL1Loss=0.585532,	RCNNAcc=0.970261,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.786778,	
INFO:root:Epoch[0] Batch [200]	Speed: 4.03 samples/sec	Train-RPNAcc=0.855585,	RPNLogLoss=0.693147,	RPNL1Loss=0.572514,	RCNNAcc=0.971121,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.729596,	
...

It works fine with MXNet 0.9.1 (from https://github.com/precedenceguo/mxnet/tree/simple):

INFO:root:Epoch[0] Batch [20]	Speed: 4.59 samples/sec	Train-RPNAcc=0.869978,	RPNLogLoss=0.471558,	RPNL1Loss=1.076314,	RCNNAcc=0.724330,	RCNNLogLoss=1.237655,	RCNNL1Loss=2.551517,	
INFO:root:Epoch[0] Batch [40]	Speed: 4.45 samples/sec	Train-RPNAcc=0.898056,	RPNLogLoss=0.422949,	RPNL1Loss=1.062090,	RCNNAcc=0.770579,	RCNNLogLoss=1.140964,	RCNNL1Loss=2.575163,	
INFO:root:Epoch[0] Batch [60]	Speed: 4.54 samples/sec	Train-RPNAcc=0.910412,	RPNLogLoss=0.400885,	RPNL1Loss=1.071733,	RCNNAcc=0.788038,	RCNNLogLoss=1.072872,	RCNNL1Loss=2.538854,	
INFO:root:Epoch[0] Batch [80]	Speed: 4.61 samples/sec	Train-RPNAcc=0.912326,	RPNLogLoss=0.376773,	RPNL1Loss=1.044977,	RCNNAcc=0.789641,	RCNNLogLoss=1.053600,	RCNNL1Loss=2.568174,	
INFO:root:Epoch[0] Batch [100]	Speed: 4.57 samples/sec	Train-RPNAcc=0.913057,	RPNLogLoss=0.372781,	RPNL1Loss=1.001118,	RCNNAcc=0.796952,	RCNNLogLoss=1.022108,	RCNNL1Loss=2.557218,
...

Environment info

Operating System:
Ubuntu 16.04

Compiler:
gcc 4.9.2
CUDA 8.0.44 + CuDNN v5.1

Package used (Python/R/Scala/Julia):
Python

MXNet commit hash (git rev-parse HEAD):
0aeddf9

Python version and distribution:
Python 2.7.9

Steps to reproduce

  1. run ./script/vgg_voc07.sh 0 in ./example/rcnn
ksofiyuk changed the title from "rcnn training doesn't work with the latest master branch" to "RCNN training doesn't work with the latest master branch" on Feb 18, 2017
@piiswrong
Contributor

@precedenceguo

@ijkguo
Contributor

ijkguo commented Feb 19, 2017

I think v0.9.3 works, cf. https://github.com/precedenceguo/mxnet/tree/expr. Could you please try whether that works for you? I will narrow it down later.

Edit: I noticed that you have gcc 4.9.2 on Ubuntu 16.04, which isn't the default. So please try v0.9.3 to sync with me, so there are fewer commits to check.

@lilhope

lilhope commented Feb 19, 2017

It works well with MXNet 0.9.3 and gcc 5.4.0.

@ksofiyuk
Author

I think v0.9.3 works, cf. https://github.com/precedenceguo/mxnet/tree/expr. Could you please try whether that works for you? I will narrow it down later.

It's ok.

@lilhope

lilhope commented Feb 19, 2017

@precedenceguo I have a question: why add the bbox_means and bbox_stds to bbox_pred_weight_test and bbox_pred_bias_test?

@piiswrong
Contributor

Maybe it's this commit: #4644

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Cannot reproduce this with master f2c17afd76a92a4edd5f8b3429283b9f1986ba79.

Update: sorry, that is the revision of the mx-rcnn repo (master); the MXNet revision would be 23fff3f. Also tested with mxnet/example/rcnn; they are virtually the same.
My environment is Ubuntu 14.04.5, gcc 4.8.4, CUDA 8.0.44, cuDNN 5.1, Python 2.7.6.

@piiswrong
Contributor

@precedenceguo Can we add a few iterations of training to nightly tests?

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Also no problem with 0aeddf9.

Now you may want to find out why yourself.

git bisect start HEAD v0.9.3
# now build and test
# if good
git bisect good
# if bad
git bisect bad

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

@piiswrong We could do that once the new IO is in, by packing a subset of PascalVOC into ImageRec.

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

@lilhope We add the bbox_means and bbox_stds to bbox_pred_weight_test and bbox_pred_bias_test so that we don't need to apply them to every image in the test phase. Note that the training phase does this normalization image by image.

@lilhope

lilhope commented Feb 21, 2017

@precedenceguo Where does this normalization happen in the training phase? It seems I misunderstood something.

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Well, then bbox_pred_test is not important. The normalization happens somewhere in RCNN training, so just don't break it.

@lilhope

lilhope commented Feb 21, 2017

@precedenceguo OK, thanks a lot.

@ksofiyuk
Author

@precedenceguo @piiswrong There is no problem with the pure MXNet sources. I found that it's a bug in dmlc-core. The problem has been present since commit d7a89ea ("Platform independent real param parsing") in the dmlc-core project.

@piiswrong
Contributor

@sbodenstein

@ksofiyuk
Author

ksofiyuk commented Feb 22, 2017

@piiswrong @sbodenstein The problem occurs because std::stof and std::stod are used to parse floating-point numbers from strings. The behaviour of these functions depends on the current locale. My locale is Russian, which uses ',' instead of '.' to separate the integer and fractional parts. For example, std::stof parses the string "3.1415" as the float 3 in my locale. To avoid such problems it is necessary to use locale-independent functions.
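
A minimal reproduction of the behaviour described above (a sketch; the ru_RU.UTF-8 locale name is an assumption and must be installed on the machine):

#include <clocale>
#include <cstdio>
#include <string>

int main() {
    // std::stof/std::stod delegate to strtof/strtod, which honour the
    // decimal separator of the current C locale.
    std::setlocale(LC_ALL, "ru_RU.UTF-8");  // locale where ',' is the decimal separator
    float f = std::stof("3.1415");
    std::printf("%d\n", f == 3.0f);  // prints 1: parsing stopped at the '.'
    return 0;
}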

@piiswrong
Contributor

Is float('3.3') locale-dependent in Python?

@ksofiyuk
Author

Python float() is locale-independent (PEP 331)

@piiswrong
Contributor

Looks like we can use this: http://en.cppreference.com/w/cpp/utility/from_chars
It always uses the C locale.

@ksofiyuk Could you try a fix and make sure it works for Russian? Note that from_chars_result::ptr has to equal last for the parse to count as successful.
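
For reference, a sketch of what a from_chars-based parse might look like (C++17; floating-point from_chars also needs a fairly recent standard library, and the helper name is just illustrative):

#include <charconv>
#include <string>
#include <system_error>

// Locale-independent parse; succeed only if the whole string was consumed,
// i.e. from_chars_result::ptr equals last, as noted above.
bool ParseFloatFromChars(const std::string& s, float* out) {
    const char* first = s.data();
    const char* last = s.data() + s.size();
    std::from_chars_result res = std::from_chars(first, last, *out);
    return res.ec == std::errc() && res.ptr == last;
}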

@ksofiyuk
Author

@piiswrong Ok, I will try to fix it and create PR to dmlc-core.

@ksofiyuk
Author

@piiswrong from_chars is not an option because it requires C++17.

@piiswrong
Contributor

piiswrong commented Feb 22, 2017

I hate this...
Is there no platform- and locale-independent way to parse floats?

@ksofiyuk
Author

We can try sscanf, but I'm not sure it can parse INF and NaN correctly.

@piiswrong
Contributor

Worst case, we copy-paste a reference implementation of stod and stof and remove the locale part.

@piiswrong
Contributor

piiswrong commented Feb 22, 2017

@ksofiyuk
Author

ksofiyuk commented Feb 22, 2017

I tested sscanf. It works well: it parses floats locale-independently and reads +inf, -inf and nan correctly.
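
For reference, a sketch of the sscanf approach (the helper name is illustrative; %n records how many characters were consumed, so the whole string can be required to parse):

#include <cstdio>
#include <cstring>

// Parse a float with sscanf; accept the result only if the entire string
// was consumed, mirroring the strict "whole string" check discussed above.
bool ParseFloatScanf(const char* s, float* out) {
    int consumed = 0;
    if (std::sscanf(s, "%f%n", out, &consumed) != 1) return false;
    return consumed == static_cast<int>(std::strlen(s));
}

(Strictly speaking, the C standard ties %f to the same number format as strtod, which is locale-aware, so it is worth re-checking on each target C library, as was done here.)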

@sbodenstein
Contributor

Wow, it is crazy that std::stod is locale-dependent... It seems many libraries have suffered from this behaviour.

@ksofiyuk: Is this true for all possible locales?

@piiswrong
Contributor

I think this is a solution:

#ifdef __MINGW64__
#include <locale>
#include <sstream>

typedef void* locale_t;
static locale_t locale = (locale_t) 0;

void init_locale() {}

// Locale-independent replacement for strtod_l: parse with a stringstream
// imbued with the classic "C" locale so '.' is always the decimal point.
double strtod_l(const char* start, char** end, locale_t loc) {
    double d = 0.0;
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << start;
    ss >> d;
    // Clear eofbit (set when the whole string is consumed) so tellg() is valid;
    // a failed parse may still report -1, which we treat as "nothing read".
    ss.clear();
    std::streamoff nread = ss.tellg();
    *end = const_cast<char*>(start) + (nread < 0 ? 0 : nread);
    return d;
}
#endif

Just need to know how slow it is.
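
A rough way to measure that could look like the following (a sketch only, not part of the proposal; the iteration count and helper name are arbitrary):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <locale>
#include <sstream>

// Same stringstream-based parse as in the proposal above, for timing.
static double parse_classic(const char* s) {
    double d = 0.0;
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << s;
    ss >> d;
    return d;
}

int main() {
    const int kIters = 1000000;
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) sink += parse_classic("3.1415");
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) sink += std::strtod("3.1415", nullptr);
    auto t2 = std::chrono::steady_clock::now();
    auto ms = [](std::chrono::steady_clock::duration d) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("stringstream: %lld ms, strtod: %lld ms\n", ms(t1 - t0), ms(t2 - t1));
    return 0;
}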

@szha
Member

szha commented Sep 29, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

szha closed this as completed on Sep 29, 2017