This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Error while training the code for fcn32 model #1619

Closed
Viswa14 opened this issue Mar 11, 2016 · 12 comments

Comments

@Viswa14

Viswa14 commented Mar 11, 2016

  File "fcn_xs.py", line 57, in main
    epoch_end_callback = mx.callback.do_checkpoint(fcnxs_model_prefix))
  File "solver.py", line 72, in fit
    aux_states=self.aux_params)
  File "symbol.py", line 718, in bind
    args_handle, args = self._get_ndarray_inputs('args', args, listed_arguments, False)
  File "symbol.py", line 585, in _get_ndarray_inputs
    raise ValueError('Must specify all the arguments in %s' % arg_key)
ValueError: Must specify all the arguments in args

I get this error when I try to train the fcn32s model using VGG_FC_ILSVRC_16_layers as the prefix. I believe the pretrained VGG16 model that is provided does not have 'bigscore_bias'. Can anyone help with this?
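
The ValueError in the traceback is raised when the args passed to bind do not cover every name in the symbol's argument list. A minimal sketch of how one might find the missing names, written in plain Python so it runs without MXNet. `missing_arguments` is a hypothetical helper, every name below other than 'bigscore_bias' is made up for illustration, and treating 'data' and 'softmax_label' as input names is an assumption. With MXNet, the two inputs would presumably come from `symbol.list_arguments()` and the `arg_params` dictionary returned by `mx.model.load_checkpoint`.

```python
def missing_arguments(listed_arguments, arg_params,
                      input_names=("data", "softmax_label")):
    """Return argument names the symbol expects but the checkpoint lacks.

    Input names (assumed here to be 'data' and 'softmax_label') are skipped
    because they are fed at run time rather than loaded from a checkpoint.
    """
    return [name for name in listed_arguments
            if name not in arg_params and name not in input_names]


# Illustrative names only; 'bigscore_bias' is the one mentioned in this thread.
listed = ["data", "conv1_1_weight", "conv1_1_bias",
          "bigscore_weight", "bigscore_bias", "softmax_label"]
loaded = {"conv1_1_weight": None, "conv1_1_bias": None, "bigscore_weight": None}

print(missing_arguments(listed, loaded))  # ['bigscore_bias']
```

Any name this reports would need to be initialized before binding, since the checkpoint cannot supply it.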

@Viswa14
Author

Viswa14 commented Mar 15, 2016

Thank you for helping me out, @zhaw. This lets the model train successfully. But when I test with image_segmentaion.py, after changing the model_prefix and epoch parameters appropriately, the result I obtain is a black image. Can you provide any insight into why that happens? I am not sure why all 0s are returned. Is any change needed in the test code for other models? This test code produces the correct result for the pre-trained FCN8s model provided by the author.

The values I get for data.shape, label.shape, out.shape and out_image are:
(1L, 3L, 335L, 500L) (1, 167500L)
(1L, 21L, 335L, 500L)
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
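
The shapes are consistent with each other: 335 × 500 = 167500 pixels, which matches the flattened label, and out carries 21 class scores per pixel. An all-zero out_image means class 0 won at every pixel. A minimal sketch of how that could be diagnosed, assuming the test script takes a per-pixel argmax over the class axis (not confirmed in this thread). `summarize_prediction` is a hypothetical helper and the arrays are synthetic stand-ins for the network output. Note that NumPy's argmax returns index 0 when every score is NaN, so an output full of NaN would also show up as a black image.

```python
import numpy as np


def summarize_prediction(out):
    """Summarize a score array of shape (1, num_classes, H, W)."""
    scores = np.asarray(out, dtype=np.float64)
    labels = np.argmax(scores[0], axis=0)  # per-pixel class index, shape (H, W)
    return {
        "has_nan": bool(np.isnan(scores).any()),
        "classes_predicted": np.unique(labels).tolist(),
        "label_shape": labels.shape,
    }


# Synthetic stand-ins, kept small for readability.
healthy = np.random.RandomState(0).rand(1, 21, 4, 5)
broken = np.full((1, 21, 4, 5), np.nan)

print(summarize_prediction(healthy)["has_nan"])
print(summarize_prediction(broken)["classes_predicted"])
```

If `has_nan` is true for the real output, the parameters are the next thing to inspect. If it is false and only class 0 is predicted, the model has probably not learned anything beyond the background class.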

@tornadomeet @tqchen : Kindly provide suggestions on this.

@zhaw
Contributor

zhaw commented Mar 16, 2016

What's your training accuracy? If your training accuracy is not low, then I have no idea what could cause this problem. If your training accuracy is low and stays the same, this is probably because you set the learning rate too high and some parameters became NaN. This will make your model predict all zeros.
I don't think you need to change anything in the test code if you use your own model.
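
One way to test the NaN hypothesis directly is to scan the saved parameters. A minimal sketch, assuming the parameters are available as a dictionary of NumPy-compatible arrays (with MXNet, each NDArray can be converted with `.asnumpy()`). `find_non_finite_params` is a hypothetical helper and the parameter names and values below are made up for illustration.

```python
import numpy as np


def find_non_finite_params(params):
    """Return the sorted names of parameters containing NaN or inf values."""
    return sorted(
        name for name, value in params.items()
        if not np.isfinite(np.asarray(value, dtype=np.float64)).all()
    )


# Made-up parameters: one clean, one with NaN, one with inf.
params = {
    "conv1_1_weight": np.ones((2, 2)),
    "bigscore_weight": np.array([0.5, np.nan]),
    "bigscore_bias": np.array([np.inf]),
}

print(find_non_finite_params(params))  # ['bigscore_bias', 'bigscore_weight']
```

An empty result would suggest the parameters did not blow up, which points away from a too-high learning rate.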

@Viswa14
Author

Viswa14 commented Mar 16, 2016

My training accuracy comes around 69% stays same until 50 Epochs. I did not change anything in either the training or the testing code. The learning rate is 1e-10, as defined by the example code.

@zhaw
Contributor

zhaw commented Mar 16, 2016

I think you should try a higher learning rate. The fcn32s, fcn16s and fcn8s models need different learning rates, and 1e-10 is for training the fcn8s model. If I remember correctly, the learning rates I used for these three models are 1e-4, 1e-7 and 1e-10, respectively.

@Viswa14
Author

Viswa14 commented Mar 16, 2016

Okay. I ran my trial based on the information provided with the example. Do you suggest changing it to the values you gave?
The learning rates provided with the example are as follows:
model    lr (fixed)  epoch
fcn-32s  1e-10       31
fcn-16s  1e-12       27
fcn-8s   1e-14       19

@zhaw
Contributor

zhaw commented Mar 17, 2016

All I can suggest is to raise your learning rate, try different values, and see which one works. If your training accuracy stays the same for a long time, your learning rate is too low.
I'm not sure whether my values will work for you, because the original ones didn't work either. I think the proper learning rate is related to the input image size, and this may be the reason why you need a different learning rate to train the same model.
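
The advice to try different values can be sketched as a sweep over learning rates on a log scale. The toy problem below (plain gradient descent on f(w) = w**2) is an assumption chosen only to make the three regimes visible; it is not the FCN training loop, and the 0.99 threshold, the step count and the candidate values are arbitrary. It illustrates why an unchanged metric is ambiguous on its own: a rate that is far too low barely moves the loss, while a rate that is too high makes it blow up.

```python
import math


def run_gradient_descent(lr, steps=100, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final loss."""
    for _ in range(steps):
        w -= lr * 2.0 * w  # the gradient of w**2 is 2*w
    return w * w


def classify(lr, initial_loss=1.0):
    """Label a learning rate by what it did to the toy loss."""
    loss = run_gradient_descent(lr)
    if not math.isfinite(loss) or loss > initial_loss:
        return "diverged"
    if loss > 0.99 * initial_loss:  # arbitrary threshold for "no progress"
        return "stalled"
    return "improving"


# Candidate values spaced on a log scale, plus one that is clearly too large.
for lr in (1e-10, 1e-7, 1e-4, 1e-1, 1.5):
    print(lr, classify(lr))
```

In a real run the same idea would apply to the training loss or accuracy after a few epochs, with a separate check for NaN to tell stalling apart from divergence.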

@tornadomeet
Contributor

@Viswa14 due to a recent update to the softmax operator in mxnet, you should use a smaller lr, as @zhaw suggested.

@Viswa14
Author

Viswa14 commented Mar 17, 2016

@tornadomeet @zhaw: So do you suggest a lower learning rate or a higher one? Zhaw had suggested that I increase the learning rate.

@Viswa14
Author

Viswa14 commented Mar 17, 2016

And thank you! Sure, I will try it and provide an update. It would be great if the documentation could be updated too, as that would help people trying out this example in the future.

@zhaw
Contributor

zhaw commented Mar 17, 2016

Sorry, I think I misunderstood "My training accuracy comes around 69% stays same until 50 Epochs". I thought you meant that after 50 epochs your training accuracy increased. If your training accuracy stayed the same the whole time, that was because your learning rate was too high and some params turned into NaN. If that was the case, you should lower your learning rate.

@tornadomeet
Contributor

Yes, I made a mistake a moment ago; just use a larger lr.

Viswa14 closed this as completed Apr 28, 2016

@zht3344

zht3344 commented Apr 14, 2017

@Viswa14 I also get a black image when using the fcn32s model, just like you. Should I lower my learning rate or increase it? (When I train the fcn32s model I use learning rate = 1e-10.)
