The num_of_steps setting for Inception_v2 #5

wesley-stone · 2018-05-30T07:53:17Z

First of all, thank you very much. I noticed that 'num_steps' is not specified in the 'faster_rcnn_inception_resnet_v2_atrous_kitti.config' file. Does this mean it would train indefinitely? If so, could you share your experience of how many steps are enough for the loss to stabilize?

sshleifer · 2018-05-30T15:52:17Z

Yes. I think my loss became stable after roughly 12 hours of training on 1 GPU.


wesley-stone · 2018-05-31T05:21:00Z

I have trained it for about 21 hours on one TITAN X GPU at 1.2 steps/second, but my loss still fluctuates between 0 and 1. Did you change any parameters in 'faster_rcnn_inception_resnet_v2_atrous_kitti.config', such as the learning rate? It seems that from step 0 to step 900k the learning rate is a constant .0003.

I found that training slows down significantly when eval.sh runs at the same time, so I am not running eval for now. Will this affect the result?

Thanks.

This is my current training loss:

INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
INFO:tensorflow:global step 95932: loss = 0.2304 (0.831 sec/step)
INFO:tensorflow:global step 95933: loss = 0.6756 (0.824 sec/step)
INFO:tensorflow:global step 95934: loss = 0.5103 (0.829 sec/step)
INFO:tensorflow:global step 95935: loss = 0.3497 (0.820 sec/step)
INFO:tensorflow:global step 95936: loss = 0.3261 (0.829 sec/step)
INFO:tensorflow:global step 95937: loss = 0.3748 (0.823 sec/step)
INFO:tensorflow:global step 95938: loss = 0.1620 (0.826 sec/step)
INFO:tensorflow:global step 95939: loss = 0.3487 (0.828 sec/step)
INFO:tensorflow:global step 95940: loss = 0.3864 (0.823 sec/step)
INFO:tensorflow:global step 95941: loss = 0.1237 (0.827 sec/step)
INFO:tensorflow:global step 95942: loss = 0.4237 (0.827 sec/step)
INFO:tensorflow:global step 95943: loss = 0.2671 (0.841 sec/step)
INFO:tensorflow:global step 95944: loss = 0.5672 (0.873 sec/step)
INFO:tensorflow:global step 95945: loss = 0.2411 (0.889 sec/step)
INFO:tensorflow:global step 95946: loss = 0.3034 (0.876 sec/step)
INFO:tensorflow:global step 95947: loss = 0.0378 (0.883 sec/step)
INFO:tensorflow:global step 95948: loss = 0.2312 (0.876 sec/step)
INFO:tensorflow:global step 95949: loss = 0.1306 (0.855 sec/step)
INFO:tensorflow:global step 95950: loss = 0.3180 (0.818 sec/step)

The default config in 'faster_rcnn_inception_resnet_v2_atrous_kitti.config' is:

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
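To check the reading of the schedule, here is a minimal Python sketch that treats the three `schedule` entries as a piecewise-constant function of the global step. It is not the Object Detection API's own implementation; `learning_rate_at` and `hours_to_reach` are hypothetical helper names, and the 0.83 sec/step figure is only a rough average taken from the log above.

```python
from bisect import bisect_right

# (step, learning_rate) pairs copied from the config above.
SCHEDULE = [(0, 3e-4), (900_000, 3e-5), (1_200_000, 3e-6)]


def learning_rate_at(step, schedule=SCHEDULE):
    """Return the rate of the last schedule entry whose step is <= `step`.

    Assumption: the schedule is piecewise constant and sorted by step.
    """
    boundaries = [boundary for boundary, _ in schedule]
    index = bisect_right(boundaries, step) - 1
    return schedule[max(index, 0)][1]


def hours_to_reach(step, sec_per_step):
    """Rough wall-clock estimate, assuming a constant time per step."""
    return step * sec_per_step / 3600.0


# Rate in effect at the last logged step, and at each later boundary.
for step in (95_950, 900_000, 1_200_000):
    print(step, learning_rate_at(step))

# Estimated hours before the first decay, at ~0.83 sec/step.
print(round(hours_to_reach(900_000, 0.83), 1))
```

Under these assumptions the rate stays at .0003 for the whole run so far, and the first decay at step 900000 would be roughly 207 hours of training away at the current speed.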

sshleifer · 2018-05-31T17:55:42Z

From here, it looks to me like you are evaluating loss on a per-image basis, which is not a very accurate proxy for your train loss over the whole dataset or for your validation loss. I'd recommend looking at some validation metrics on TensorBoard to figure out when to stop.
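If you only have the console log, a rough way to see past the per-image noise is to average the logged loss over a window of steps. The sketch below is an illustration only: `parse_losses` and `moving_average` are hypothetical helpers, the regular expression assumes the exact log format shown earlier in this thread, and a smoothed training loss is still no substitute for validation metrics.

```python
import re
from statistics import mean

# Matches lines such as:
# INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
LOG_PATTERN = re.compile(r"global step (\d+): loss = ([\d.]+)")


def parse_losses(log_text):
    """Extract (step, loss) pairs from the training log text."""
    return [(int(step), float(loss)) for step, loss in LOG_PATTERN.findall(log_text)]


def moving_average(values, window):
    """Trailing mean over at most `window` of the most recent values."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(mean(chunk))
    return smoothed


# First four lines of the log posted above.
sample = """\
INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
INFO:tensorflow:global step 95932: loss = 0.2304 (0.831 sec/step)
INFO:tensorflow:global step 95933: loss = 0.6756 (0.824 sec/step)
INFO:tensorflow:global step 95934: loss = 0.5103 (0.829 sec/step)
"""

pairs = parse_losses(sample)
losses = [loss for _, loss in pairs]
smoothed = moving_average(losses, window=4)

for (step, loss), avg in zip(pairs, smoothed):
    print(step, loss, round(avg, 4))
```

With batch_size: 1, each logged value is the loss on a single image, so in practice a window of a few hundred or a few thousand steps gives a more readable trend than the four lines used here.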

