Error while training the code for fcn32 model #1619

Viswa14 · 2016-03-11T04:44:51Z

File "fcn_xs.py", line 57, in main
epoch_end_callback = mx.callback.do_checkpoint(fcnxs_model_prefix))
File "solver.py", line 72, in fit
aux_states=self.aux_params)
File "symbol.py", line 718, in bind
args_handle, args = self._get_ndarray_inputs('args', args, listed_arguments, False)
File "symbol.py", line 585, in _get_ndarray_inputs
raise ValueError('Must specify all the arguments in %s' % arg_key)
ValueError: Must specify all the arguments in args

I get this error when I try to train the fcn32s model using VGG_FC_ILSVRC_16_layers as the prefix. I believe the pretrained VGG16 model provided does not contain 'bigscore_bias'. Can anyone help with this?
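One way to confirm which arguments the checkpoint lacks is to compare the names the symbol expects against the names stored in the loaded parameters. The sketch below is plain Python and assumes the two inputs come from MXNet's `symbol.list_arguments()` and the `arg_params` dict returned by `mx.model.load_checkpoint`. The helper name `missing_arguments` and the sample names are hypothetical.

```python
def missing_arguments(listed_arguments, arg_params,
                      input_names=("data", "softmax_label")):
    """Return argument names the symbol expects but the checkpoint lacks.

    listed_arguments: names as returned by symbol.list_arguments() (assumed source)
    arg_params: dict of name -> array loaded from the checkpoint (assumed source)
    input_names: data/label inputs that are fed at bind time, not loaded
    """
    provided = set(arg_params) | set(input_names)
    return [name for name in listed_arguments if name not in provided]


# Illustrative names only; a real list would come from the fcn32s symbol.
expected = ["data", "conv1_1_weight", "conv1_1_bias",
            "bigscore_weight", "bigscore_bias", "softmax_label"]
loaded = {"conv1_1_weight": None, "conv1_1_bias": None, "bigscore_weight": None}
print(missing_arguments(expected, loaded))
```

Any name this returns has to be either initialized before binding or removed from the symbol itself.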

Viswa14 · 2016-03-15T18:45:02Z

Thank you for helping me out, @zhaw. This lets me train the model successfully. But when I test with image_segmentaion.py, changing the model_prefix and epoch parameters appropriately, the result is a black image. Can you provide any insight into why that happens? I am not sure why all 0s are returned. Does the test code need any changes for other models? The same test code produces the correct result for the pre-trained FCN8s model provided by the author.

The values I get for data.shape, label.shape, out.shape and out_image are:
(1L, 3L, 335L, 500L) (1, 167500L)
(1L, 21L, 335L, 500L)
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]

@tornadomeet @tqchen: Could you please offer suggestions on this?
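For reading the shapes above, a minimal sketch, assuming that out_image is the per-pixel argmax over the 21-class axis of out (the thread does not show how it is computed). Under that assumption an all-zero image means class 0 scores highest at every pixel. The helper `argmax_per_pixel` and the tiny score arrays are hypothetical.

```python
def argmax_per_pixel(scores):
    """scores[c][y][x] -> labels[y][x], the index of the highest-scoring class."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][y][x])
             for x in range(width)]
            for y in range(height)]


# The flattened label length matches the image size reported above.
assert 335 * 500 == 167500

# Toy example: 3 classes on a 2x2 image, class 0 dominant everywhere.
all_class_zero = [
    [[0.9, 0.8], [0.7, 0.6]],
    [[0.05, 0.1], [0.2, 0.3]],
    [[0.05, 0.1], [0.1, 0.1]],
]
print(argmax_per_pixel(all_class_zero))

# Same scores, but class 2 wins the top-left pixel.
mixed = [
    [[0.1, 0.8], [0.7, 0.6]],
    [[0.05, 0.1], [0.2, 0.3]],
    [[0.85, 0.1], [0.1, 0.1]],
]
print(argmax_per_pixel(mixed))
```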

zhaw · 2016-03-16T12:41:48Z

What's your training accuracy? If it is not low, then I have no idea what could cause this problem. If it is low and stays the same, this is probably because you set the learning rate too high and some parameters have become NaN, which makes your model predict all zeros.
I don't think you need to change anything in the test code if you use your own model.
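The NaN hypothesis can be checked directly on the saved parameters. The sketch below assumes each parameter has already been flattened to a sequence of floats (for an MXNet NDArray, something like `arr.asnumpy().ravel()`; that conversion is not shown here). The helper name `params_with_nan` and the sample values are hypothetical.

```python
import math


def params_with_nan(flat_params):
    """Return the sorted names of parameters containing at least one NaN.

    flat_params: dict of name -> iterable of floats (assumed to be flattened
    copies of the checkpoint's parameter arrays).
    """
    return sorted(name for name, values in flat_params.items()
                  if any(math.isnan(v) for v in values))


# Made-up values for illustration only.
example = {
    "score_weight": [0.1, float("nan"), -0.3],
    "score_bias": [0.0, 0.2],
}
print(params_with_nan(example))
```

An empty result would suggest the all-zero prediction has a different cause than diverged parameters.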

Viswa14 · 2016-03-16T13:15:10Z

My training accuracy is around 69% and stays the same through 50 epochs. I did not change anything in either the training or the testing code. The learning rate is 1e-10, as defined by the example code.

zhaw · 2016-03-16T15:06:04Z

I think you should try a higher learning rate. The fcn32s, 16s and 8s models need different learning rates, and 1e-10 is for training the fcn8s model. If I remember correctly, the learning rates I used for these three models were 1e-4, 1e-7 and 1e-10.

Viswa14 · 2016-03-16T18:18:33Z

Okay. I based my trial on the information provided with the example. Do you suggest changing it according to your values?
The learning rates provided along with the example are as follows:
model     lr (fixed)   epoch
fcn-32s   1e-10        31
fcn-16s   1e-12        27
fcn-8s    1e-14        19

zhaw · 2016-03-17T03:14:11Z

All I can suggest is to raise your learning rate: try different values and see which works. If your training accuracy stays the same for a long time, your learning rate is too low.
I'm not sure my values will work for you, because the original ones didn't work for me either. I think the proper learning rate is related to the input image size, which may be why different learning rates are needed to train the same model.
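The "try different values" advice can be made a little more systematic. This is a minimal sketch, not part of the example code: `is_stalled` flags an accuracy curve that has stopped moving, and `candidate_learning_rates` lists powers of ten covering the range mentioned in this thread (1e-10 to 1e-4). Both helper names, the window size and the tolerance are assumptions.

```python
def is_stalled(accuracies, window=5, tol=0.005):
    """True if the last `window` accuracies vary by no more than `tol`."""
    if len(accuracies) < window:
        return False
    recent = accuracies[-window:]
    return max(recent) - min(recent) <= tol


def candidate_learning_rates(low_exp=-10, high_exp=-4):
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


# Made-up per-epoch training accuracies hovering around 69%.
history = [0.690, 0.690, 0.691, 0.690, 0.690, 0.690]
print(is_stalled(history))
print(candidate_learning_rates())
```

A flat curve alone does not say whether the rate is too low or too high, so it is worth checking the parameters for NaN before deciding which direction to move.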

tornadomeet · 2016-03-17T08:19:02Z

@Viswa14 due to a recent update to the softmax operator in mxnet, you should use a smaller lr, as @zhaw suggested.

Viswa14 · 2016-03-17T08:28:47Z

@tornadomeet @zhaw: So do you suggest a lower or a higher learning rate? zhaw had suggested that I increase it.

Viswa14 · 2016-03-17T08:30:21Z

And thank you! Sure, I will try it and provide an update. It would be great if the documentation could be updated too, as it would help people trying out this example in the future.

zhaw · 2016-03-17T08:50:23Z

Sorry, I think I misunderstood "My training accuracy comes around 69% stays same until 50 Epochs". I thought you meant that your training accuracy increased after 50 epochs. If your training accuracy stayed the same the whole time, that was because your learning rate was too high and some params became NaN. In that case, you should lower your learning rate.

tornadomeet · 2016-03-17T09:30:30Z

Yes, I made a mistake a moment ago; just use a larger lr.

zht3344 · 2017-04-14T03:37:47Z

@Viswa14 I also get a black image using the fcn32s model, just like you. Should I lower or increase my learning rate? (When I train the fcn32s model I use learning rate = 1e-10.)

zhaw mentioned this issue Mar 13, 2016

set no_bias in deconv layer in fcnxs example #1627

Merged

Viswa14 closed this as completed Apr 28, 2016
