
What is the learning rate decay and preprocessing you used in your training? #56

Open

kwotsin (Author) opened this issue

Thanks for providing the source code of this fantastic architecture. I am trying to clarify the learning rate decay mentioned in your opts.lua file: is the learning rate decayed by 1e-1 or 1e-7 every 100 epochs? From your training it seems that you didn't set the -d parameter, so would the decay default to 1e-7?

However, the comment you gave for lrDecayEvery is:

--lrDecayEvery (default 100) Decay learning rate every X epoch by 1e-1

So I'd like to ask if the decay rate should be 1e-7 or 1e-1 every 100 epochs.

Also, what do you mean by # samples in this line?

-d,--learningRateDecay (default 1e-7) learning rate decay (in # samples)


Also, could I know how you performed your preprocessing for the training/evaluation data?

Activity

The title was changed from "What is the learning rate decay you used in your training?" to "What is the learning rate decay and preprocessing you used in your training?" on Jun 6, 2017
codeAC29 (Contributor) commented on Jun 6, 2017
  1. The learning rate decay is 1e-1, as in the --lrDecayEvery comment; the "# samples" note is there by mistake and has no meaning (see the schedule sketch below).
  2. We do not perform any preprocessing.
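
For reference, a minimal sketch of the step schedule described above, in plain Lua. The variable names (baseLR, lrDecayEvery, lrDecayFactor) and the base rate of 5e-4 are illustrative assumptions, not the exact identifiers or values from opts.lua:

    -- Step learning-rate decay sketch (illustrative names and base rate,
    -- not the exact identifiers from opts.lua)
    local baseLR        = 5e-4
    local lrDecayEvery  = 100   -- decay every 100 epochs
    local lrDecayFactor = 1e-1  -- multiply the rate by 0.1 at each step

    for epoch = 1, 300 do
       local lr = baseLR * lrDecayFactor ^ math.floor((epoch - 1) / lrDecayEvery)
       -- epochs   1-100: lr = baseLR
       -- epochs 101-200: lr = baseLR * 1e-1
       -- epochs 201-300: lr = baseLR * 1e-2
       -- ... train one epoch with this lr ...
    end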
kwotsin (Author) commented on Jun 7, 2017

Thank you for your reply! May I also confirm that, for each dataset, you trained the model for a total of 300 epochs and performed the decay only every 100 epochs?

codeAC29 (Contributor) commented on Jun 7, 2017

Yes that is correct.

kwotsin (Author) commented on Jun 8, 2017

Thank you for the confirmation. Could I also know whether you kept dropout and batch norm turned on when evaluating the test data? For many models, turning them off at test time is the standard thing to do; however, on my side I see a large difference in performance when I turn off batch norm and dropout.

Also, could I confirm that the dataset you used is equivalent to the one found here: https://github.com/alexgkendall/SegNet-Tutorial/tree/master/CamVid

Thank you once again.

codeAC29 (Contributor) commented on Jun 9, 2017
  1. If by turning off you mean deleting batchnorm, then no, you cannot do that. You need to adjust the weights of the previous conv layer before getting rid of the batchnorm layer; once that is done, I don't think there will be any difference in performance (a folding sketch follows this list).
  2. Yes, the dataset used here is equivalent to the one you mentioned in your comment.
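
As one way to read "adjust the weights of the previous conv layer": a minimal folding sketch, assuming a standard nn.SpatialConvolution followed by nn.SpatialBatchNormalization and using the identity y = gamma * (x - mean) / sqrt(var + eps) + beta. foldBatchNorm is a hypothetical helper, not code from this repo:

    require 'nn'

    -- Hypothetical helper (not from this repo): fold a trained
    -- nn.SpatialBatchNormalization into the nn.SpatialConvolution that
    -- precedes it, so the batchnorm layer can then be removed.
    local function foldBatchNorm(conv, bn)
       local std   = bn.running_var:clone():add(bn.eps):sqrt()
       local gamma = bn.affine and bn.weight:clone() or torch.ones(std:nElement())
       local beta  = bn.affine and bn.bias:clone()   or torch.zeros(std:nElement())
       local scale = gamma:cdiv(std)        -- per-output-channel scale
       for c = 1, conv.nOutputPlane do
          conv.weight[c]:mul(scale[c])      -- rescale the c-th filter
          -- assumes conv has a bias term (Torch's default)
          conv.bias[c] = (conv.bias[c] - bn.running_mean[c]) * scale[c] + beta[c]
       end
    end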
kwotsin (Author) commented on Jun 9, 2017

I have tested on the test dataset with dropout and batch_norm activated, and the results seem better than with either batch_norm or dropout turned off (or both). Did you have to turn off dropout when evaluating the test dataset? In many models it's common to disable dropout for the test set.

Further, for the ordering of the classes in CamVid, I noted that the original dataset gives class labels from 0-11, where 11 is the void class. If the dataset you used is the one from the SegNet tutorial as well, did you have to relabel all the segmentations to 1-12 (given Lua's 1-based indexing), since you've put class 1 as void? Is there a particular reason why void is the first class instead of the last?

CamVid Labelling: https://github.com/alexgkendall/SegNet-Tutorial/blob/c922cc4a4fcc7ce279dd998fb2d4a8703f34ebd7/Scripts/test_segmentation_camvid.py#L60

Your Labelling:

local conClasses = {'Sky', 'Building', 'Column-Pole',

Could I also confirm whether you performed median frequency balancing to obtain the weighted cross-entropy loss? For reference, these are the class weights used for the CamVid dataset:

https://github.com/alexgkendall/SegNet-Tutorial/blob/c922cc4a4fcc7ce279dd998fb2d4a8703f34ebd7/Models/segnet_train.prototxt#L1538

Thank you for your help once again.

codeAC29 (Contributor) commented on Jun 10, 2017
  1. As I said in my previous comment, you cannot just delete the batchnorm layer; before doing that you need to modify the weights of the previous conv layer. In our case we did not get rid of these layers while testing.

  2. We do not include the Unlabelled class in our confusion matrix. Giving it label 1 made writing the code easier, because Cityscapes has Unlabelled as its first class (a remapping sketch follows this list).

  3. As mentioned in the paper, we use our own weight calculation scheme, which gave us better results than median frequency balancing.
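
To make item 2 concrete, one natural reading of the shift is the hypothetical sketch below (not code from this repo), moving the original CamVid ids (0 = Sky ... 10 = Bicyclist, 11 = void) to the 1-based ordering used here, where class 1 is Unlabelled:

    -- label: a tensor of original CamVid class ids in 0-11
    local function remapCamVid(label)
       return label:clone():apply(function(id)
          if id == 11 then return 1 end   -- void -> class 1 (Unlabelled)
          return id + 2                   -- 0-10 -> 2-12 (Sky ... Bicyclist)
       end)
    end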

kwotsin (Author) commented on Jun 14, 2017

@codeAC29 thanks for your excellent response! I have been mulling over it, and I've tried to create a version that deactivates both batch_norm and spatial dropout during evaluation; however, this gives me a very poor result. As you mentioned, during testing batch_norm and spatial dropout stay turned on. Is it correct to say these two layers are critical to evaluating images?

On the other hand, if batch_norm is critical to the model's performance, would evaluating single images give a very poor result? From my results there is quite a bit of difference in output when evaluating single images vs. a batch of images. Would there be a way to effectively evaluate single images with the network? I am currently only performing feature standardization to alleviate the effect.

Your paper has a great amount of content which I'm still learning to appreciate. Would you share how exactly p_class is calculated in the weighting formula w_class = 1.0 / ln(c + p_class)? From your code, is it right to assume that p_class is the number of occurrences of a given pixel label across all images, divided by the total number of pixels in all images? Is there a particular reason why the class weights should be restricted to between 1 and 50? Using median frequency balancing, I see that the weights do not exceed 10.

Also, to verify with you: the spatial dropout you used is 2D spatial dropout (channel-wise dropping), is that correct?

codeAC29 (Contributor) commented on Jun 15, 2017
  1. As I have said in two of my previous comments: "You cannot just delete batchnorm". Before removing batchnorm, you will have to do something like this:

         -- x is the old model and y is the new model;
         -- i indexes the container holding the parallel branches, and
         -- j indexes the batchnorm layer inside each branch
         local xsub    = x.modules[i].modules
         local xsubsub = x.modules[i].modules[1].modules
         local output  = xsubsub[j].running_mean:nElement()  -- features per branch
         local eps      = xsubsub[j].eps
         local momentum = xsubsub[j].momentum
         local affine   = xsubsub[j].affine
         y:add(nn.BatchNormalization(output*#xsub, eps, momentum, affine))
         y.modules[#y.modules].train = false

         -- concatenate the parameters distributed over the different branches
         for k = 1, #xsub do
            local range = {output*(k-1)+1, output*k}
            y.modules[#y.modules].running_mean[{range}]:copy(xsub[k].modules[j].running_mean)
            y.modules[#y.modules].running_var[{range}]:copy(xsub[k].modules[j].running_var)
            if affine then
               y.modules[#y.modules].weight[{range}]:copy(xsub[k].modules[j].weight)
               y.modules[#y.modules].bias[{range}]:copy(xsub[k].modules[j].bias)
            end
         end
  2. Yes, p_class is what you said. The values of the weights need to be such that, while training, you are giving equal importance to all the classes. If xi is the number of pixels occupied by class i, then the weight wi should be such that xi*wi gives a roughly constant value for all the classes. If there is a huge class imbalance, then weights varying between 1 and 50 are also fine, which is what you found in this case (see the sketch after this list).

  3. Yes, that is correct.
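
A minimal sketch of this weighting scheme (a hypothetical helper, not code from this repo), assuming c = 1.02, a value that bounds the weights to roughly [1, 50] for p_class in (0, 1]:

    -- pixelCounts: a 1D Tensor with the number of pixels per class,
    -- accumulated over the whole training set
    local function classWeights(pixelCounts, c)
       c = c or 1.02                                          -- assumed constant
       local p = pixelCounts:double():div(pixelCounts:sum())  -- p_class
       return p:add(c):log():pow(-1)                          -- 1 / ln(c + p_class)
    end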

kwotsin (Author) commented on Jun 15, 2017

@codeAC29 Thank you for your excellent response once again. I am currently trying a variant of ENet in TensorFlow, where batch_norm can be turned off by setting the argument is_training=False in the standard batch norm function. Setting implementation aside, theoretically speaking, would you say that spatial dropout and batch norm are crucial for getting good results?

If batch_norm and dropout are both kept on during testing, how did you handle the test images differently from the validation images? Was there any change to the model when performing inference on test images? If not, could the test images be folded into the mix of train-validation images instead, given that there is no difference between evaluating test images and validation images? That is, of course, assuming there are no changes to the model during testing.

Also, what inspired you not to perform any preprocessing on the images? Is there a conceptual reason behind this? It would be interesting to learn why no preprocessing works well for these datasets.

In your paper, you mentioned that all the ReLUs are replaced with PReLUs. However, in the decoder implementation here: https://github.com/e-lab/ENet-training/blob/master/train/models/decoder.lua
it seems that ReLUs are used again instead of PReLUs. Should ReLUs be used in the decoder rather than PReLUs?

codeAC29 (Contributor) commented on Jun 21, 2017
  1. Yes, batchnorm and spatial dropout are very important for getting good results, because they force your network to learn and become more general.
  2. We did not change our model for inference on test images. Once you have your trained model, you can of course calculate accuracies on the test set the same way they were calculated on the train-val set (the standard inference toggle is sketched below, for contrast).
  3. In the case of the decoder, ReLUs gave better results, most probably because PReLUs have extra parameters and our network was not pretrained on any other, bigger dataset.
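
For context on items 1 and 2: in Torch, model:evaluate() switches batchnorm to its running statistics and disables dropout, while model:training() keeps the training-time behavior that was retained here at test time. A minimal sketch with an assumed toy model:

    require 'nn'

    -- toy model, for illustration only
    local model = nn.Sequential()
       :add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
       :add(nn.SpatialBatchNormalization(16))
       :add(nn.PReLU())
       :add(nn.SpatialDropout(0.1))

    local input = torch.randn(4, 3, 32, 32)

    model:training()     -- batch statistics + active dropout (kept here at test time)
    local outTrain = model:forward(input):clone()

    model:evaluate()     -- running statistics, dropout disabled (usual inference mode)
    local outEval = model:forward(input):clone()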
kwotsin (Author) commented on Jun 27, 2017

@codeAC29 can I verify with you that your reported ENet test accuracy was the result of testing on the test and validation datasets combined? It seems to me this would be a natural choice, given that there are no architectural changes between the test and validation evaluations and that the only difference comes from the data. In fact, perhaps the test dataset could be distributed across the training and validation datasets?

codeAC29 (Contributor) commented on Jun 27, 2017

@kwotsin No, we performed testing only on the test dataset. Combining test data into training/validation will give you a better result, but then the whole point of the test data will be lost. So you should always train and validate your network using the respective data, and then at the end, when you have your trained network, run it on the test data.
