This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

RCNN training doesn't work with the latest master branch #5056

Closed
ksofiyuk opened this issue Feb 18, 2017 · 30 comments

Comments

@ksofiyuk

RPNLogLoss (== 0.693147) and RCNNLogLoss (== 3.044522) do not change during the training process.

INFO:root:Epoch[0] Batch [80]	Speed: 4.10 samples/sec	Train-RPNAcc=0.861304,	RPNLogLoss=0.693147,	RPNL1Loss=0.553695,	RCNNAcc=0.967207,	RCNNLogLoss=3.044522,	RCNNL1Loss=1.012108,	
INFO:root:Epoch[0] Batch [100]	Speed: 4.14 samples/sec	Train-RPNAcc=0.854386,	RPNLogLoss=0.693147,	RPNL1Loss=0.620289,	RCNNAcc=0.969291,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.866792,	
INFO:root:Epoch[0] Batch [120]	Speed: 4.07 samples/sec	Train-RPNAcc=0.853596,	RPNLogLoss=0.693147,	RPNL1Loss=0.596068,	RCNNAcc=0.970235,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.746457,	
INFO:root:Epoch[0] Batch [140]	Speed: 4.07 samples/sec	Train-RPNAcc=0.849845,	RPNLogLoss=0.693147,	RPNL1Loss=0.591450,	RCNNAcc=0.971133,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.660492,	
INFO:root:Epoch[0] Batch [160]	Speed: 4.05 samples/sec	Train-RPNAcc=0.852266,	RPNLogLoss=0.693147,	RPNL1Loss=0.599130,	RCNNAcc=0.970788,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.677796,	
INFO:root:Epoch[0] Batch [180]	Speed: 4.06 samples/sec	Train-RPNAcc=0.852555,	RPNLogLoss=0.693147,	RPNL1Loss=0.585532,	RCNNAcc=0.970261,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.786778,	
INFO:root:Epoch[0] Batch [200]	Speed: 4.03 samples/sec	Train-RPNAcc=0.855585,	RPNLogLoss=0.693147,	RPNL1Loss=0.572514,	RCNNAcc=0.971121,	RCNNLogLoss=3.044522,	RCNNL1Loss=0.729596,	
...

It works fine with MXNet 0.9.1 (from https://github.com/precedenceguo/mxnet/tree/simple):

INFO:root:Epoch[0] Batch [20]	Speed: 4.59 samples/sec	Train-RPNAcc=0.869978,	RPNLogLoss=0.471558,	RPNL1Loss=1.076314,	RCNNAcc=0.724330,	RCNNLogLoss=1.237655,	RCNNL1Loss=2.551517,	
INFO:root:Epoch[0] Batch [40]	Speed: 4.45 samples/sec	Train-RPNAcc=0.898056,	RPNLogLoss=0.422949,	RPNL1Loss=1.062090,	RCNNAcc=0.770579,	RCNNLogLoss=1.140964,	RCNNL1Loss=2.575163,	
INFO:root:Epoch[0] Batch [60]	Speed: 4.54 samples/sec	Train-RPNAcc=0.910412,	RPNLogLoss=0.400885,	RPNL1Loss=1.071733,	RCNNAcc=0.788038,	RCNNLogLoss=1.072872,	RCNNL1Loss=2.538854,	
INFO:root:Epoch[0] Batch [80]	Speed: 4.61 samples/sec	Train-RPNAcc=0.912326,	RPNLogLoss=0.376773,	RPNL1Loss=1.044977,	RCNNAcc=0.789641,	RCNNLogLoss=1.053600,	RCNNL1Loss=2.568174,	
INFO:root:Epoch[0] Batch [100]	Speed: 4.57 samples/sec	Train-RPNAcc=0.913057,	RPNLogLoss=0.372781,	RPNL1Loss=1.001118,	RCNNAcc=0.796952,	RCNNLogLoss=1.022108,	RCNNL1Loss=2.557218,
...

Environment info

Operating System:
Ubuntu 16.04

Compiler:
gcc 4.9.2
CUDA 8.0.44 + CuDNN v5.1

Package used (Python/R/Scala/Julia):
Python

MXNet commit hash (git rev-parse HEAD):
0aeddf9

Python version and distribution:
Python 2.7.9

Steps to reproduce

  1. run ./script/vgg_voc07.sh 0 in ./example/rcnn
ksofiyuk changed the title from "rcnn training doesn't work with the latest master branch" to "RCNN training doesn't work with the latest master branch" on Feb 18, 2017
@piiswrong
Contributor

@precedenceguo

@ijkguo
Contributor

ijkguo commented Feb 19, 2017

I think v0.9.3 works, cf. https://github.com/precedenceguo/mxnet/tree/expr. Could you please try whether that works for you? I will narrow it down later.

Edit: I noticed that you have gcc 4.9.2 on Ubuntu 16.04, which isn't the default. So please try v0.9.3 to sync with me, so there are fewer commits to check.

@lilhope

lilhope commented Feb 19, 2017

It works well with MXNet 0.9.3 and gcc 5.4.0.

@ksofiyuk
Author

I think v0.9.3 works, cf. https://github.com/precedenceguo/mxnet/tree/expr. Could you please try whether that works for you? I will narrow it down later.

It's ok.

@lilhope

lilhope commented Feb 19, 2017

@precedenceguo I have a question: why add the bbox_means and bbox_stds to bbox_pred_weight_test and bbox_pred_bias_test?

@piiswrong
Contributor

Maybe it's this commit: #4644

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Cannot reproduce this with master f2c17afd76a92a4edd5f8b3429283b9f1986ba79.

Update: sorry, that is the revision of the mx-rcnn repo (master); the MXNet revision would be 23fff3f. Also tested with mxnet/example/rcnn; they are virtually the same.
My environment is Ubuntu 14.04.5, gcc 4.8.4, CUDA 8.0.44, cuDNN 5.1, Python 2.7.6.

@piiswrong
Contributor

@precedenceguo Can we add a few iterations of training to nightly tests?

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Also no problem with 0aeddf9.

Now you may want to find out why yourself.

git bisect start HEAD v0.9.3
# now build and test
# if good
git bisect good
# if bad
git bisect bad

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

@piiswrong We could do that once the new IO is in, by packing a subset of PascalVOC into ImageRec.

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

@lilhope We add the bbox_means and bbox_stds to bbox_pred_weight_test and bbox_pred_bias_test so that we don't need to apply them to every image in the test phase. Note that the training phase does this normalization image by image.

@lilhope

lilhope commented Feb 21, 2017

@precedenceguo Where does this normalization happen in the training phase? It seems I misunderstood something.

@ijkguo
Contributor

ijkguo commented Feb 21, 2017

Well, then bbox_pred_test is not important. The normalization happens somewhere in RCNN training, so just don't break it.

@lilhope

lilhope commented Feb 21, 2017

@precedenceguo OK, thanks a lot.

@ksofiyuk
Author

@precedenceguo @piiswrong There is no problem with the pure MXNet sources. I found that it's a bug in dmlc-core. The problem has been present since commit d7a89ea ("Platform independent real param parsing") in the dmlc-core project.

@piiswrong
Contributor

@sbodenstein

@ksofiyuk
Author

ksofiyuk commented Feb 22, 2017

@piiswrong @sbodenstein The problem occurs because std::stof and std::stod are used to parse floating-point numbers from strings. The behaviour of these functions depends on the current locale. My locale is Russian, which uses ',' instead of '.' to separate the integer and fractional parts. For example, std::stof parses the string "3.1415" as the float 3 in my locale. To avoid such problems it is necessary to use locale-independent functions.
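
A minimal reproduction of the behaviour described above (a sketch; the ru_RU.UTF-8 locale name is an assumption and must be installed on the machine):

#include <clocale>
#include <cstdio>
#include <string>

int main() {
    // std::stof/std::stod delegate to strtof/strtod, which honour the
    // decimal separator of the current C locale.
    std::setlocale(LC_ALL, "ru_RU.UTF-8");  // locale where ',' is the decimal separator
    float f = std::stof("3.1415");
    std::printf("%d\n", f == 3.0f);  // prints 1: parsing stopped at the '.'
    return 0;
}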

@piiswrong
Contributor

Is float('3.3') locale-dependent in Python?

@ksofiyuk
Author

Python float() is locale-independent (PEP 331)

@piiswrong
Contributor

Looks like we can use this: http://en.cppreference.com/w/cpp/utility/from_chars
It always uses the C locale.

@ksofiyuk Could you try a fix and make sure it works for Russian? Note that from_chars_result::ptr has to equal last for the parse to count as successful.
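
For reference, a sketch of what a from_chars-based parse might look like (C++17; floating-point from_chars also needs a fairly recent standard library, and the helper name is just illustrative):

#include <charconv>
#include <string>
#include <system_error>

// Locale-independent parse; succeed only if the whole string was consumed,
// i.e. from_chars_result::ptr equals last, as noted above.
bool ParseFloatFromChars(const std::string& s, float* out) {
    const char* first = s.data();
    const char* last = s.data() + s.size();
    std::from_chars_result res = std::from_chars(first, last, *out);
    return res.ec == std::errc() && res.ptr == last;
}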

@ksofiyuk
Author

@piiswrong Ok, I will try to fix it and create PR to dmlc-core.

@ksofiyuk
Author

@piiswrong from_chars is not an option because it requires C++17.

@piiswrong
Contributor

piiswrong commented Feb 22, 2017

I hate this...
Is there no platform- and locale-independent way to parse floats?

@ksofiyuk
Author

We can try sscanf, but I'm not sure it can parse INF and NaN correctly.

@piiswrong
Contributor

Worst case, we copy-paste a reference implementation of stod and stof and remove the locale part.

@piiswrong
Contributor

piiswrong commented Feb 22, 2017

@ksofiyuk
Author

ksofiyuk commented Feb 22, 2017

I tested sscanf. It works well: it parses floats locale-independently and reads +inf, -inf and nan correctly.
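
For reference, a sketch of the sscanf approach (the helper name is illustrative; %n records how many characters were consumed, so the whole string can be required to parse):

#include <cstdio>
#include <cstring>

// Parse a float with sscanf; accept the result only if the entire string
// was consumed, mirroring the strict "whole string" check discussed above.
bool ParseFloatScanf(const char* s, float* out) {
    int consumed = 0;
    if (std::sscanf(s, "%f%n", out, &consumed) != 1) return false;
    return consumed == static_cast<int>(std::strlen(s));
}

(Strictly speaking, the C standard ties %f to the same number format as strtod, which is locale-aware, so it is worth re-checking on each target C library, as was done here.)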

@sbodenstein
Contributor

Wow, it is crazy that std::stod is locale-dependent... It seems many libraries have suffered from this behaviour.

@ksofiyuk: Is this true for all possible locales?

@piiswrong
Contributor

I think this is a solution:

#ifdef __MINGW64__
#include <locale>
#include <sstream>

typedef void* locale_t;
static locale_t locale = (locale_t) 0;

void init_locale() {}

// Locale-independent replacement for strtod_l: parse with a stringstream
// imbued with the classic "C" locale so '.' is always the decimal point.
double strtod_l(const char* start, char** end, locale_t loc) {
    double d = 0.0;
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << start;
    ss >> d;
    // Clear eofbit (set when the whole string is consumed) so tellg() is valid;
    // a failed parse may still report -1, which we treat as "nothing read".
    ss.clear();
    std::streamoff nread = ss.tellg();
    *end = const_cast<char*>(start) + (nread < 0 ? 0 : nread);
    return d;
}
#endif

Just need to know how slow it is.
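
A rough way to measure that could look like the following (a sketch only, not part of the proposal; the iteration count and helper name are arbitrary):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <locale>
#include <sstream>

// Same stringstream-based parse as in the proposal above, for timing.
static double parse_classic(const char* s) {
    double d = 0.0;
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << s;
    ss >> d;
    return d;
}

int main() {
    const int kIters = 1000000;
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) sink += parse_classic("3.1415");
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) sink += std::strtod("3.1415", nullptr);
    auto t2 = std::chrono::steady_clock::now();
    auto ms = [](std::chrono::steady_clock::duration d) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("stringstream: %lld ms, strtod: %lld ms\n", ms(t1 - t0), ms(t2 - t1));
    return 0;
}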

@szha
Member

szha commented Sep 29, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

szha closed this as completed on Sep 29, 2017