Why doesn't the training loss decrease at the very beginning? #2051

Closed
ghost opened this issue Mar 6, 2015 · 11 comments

ghost commented Mar 6, 2015

I have tried several different network structures, as well as different learning rates, bias initializations, and weight decay values, but none of these attempts worked. I have inspected my network and I am sure it is correct. Here is part of the training log:

I0306 19:06:09.863409 38597 solver.cpp:420] Iteration 3840, lr = 0.01
I0306 19:07:02.010265 38597 solver.cpp:196] Iteration 3860, loss = 8.88737
I0306 19:07:02.010437 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:07:02.010562 38597 solver.cpp:211] Train net output #2: loss = 8.88737 (* 1 = 8.88737 loss)
I0306 19:07:02.010738 38597 solver.cpp:420] Iteration 3860, lr = 0.01
I0306 19:08:09.907526 38597 solver.cpp:196] Iteration 3880, loss = 8.85471
I0306 19:08:09.907752 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:08:09.907881 38597 solver.cpp:211] Train net output #2: loss = 8.85471 (* 1 = 8.85471 loss)
I0306 19:08:09.908064 38597 solver.cpp:420] Iteration 3880, lr = 0.01
I0306 19:09:30.316922 38597 solver.cpp:196] Iteration 3900, loss = 8.92846
I0306 19:09:30.317157 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:09:30.317291 38597 solver.cpp:211] Train net output #2: loss = 8.92846 (* 1 = 8.92846 loss)
I0306 19:09:30.317464 38597 solver.cpp:420] Iteration 3900, lr = 0.01
I0306 19:10:32.443650 38597 solver.cpp:196] Iteration 3920, loss = 8.88563
I0306 19:10:32.443862 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:10:32.443994 38597 solver.cpp:211] Train net output #2: loss = 8.88563 (* 1 = 8.88563 loss)
I0306 19:10:32.444160 38597 solver.cpp:420] Iteration 3920, lr = 0.01
I0306 19:11:34.688952 38597 solver.cpp:196] Iteration 3940, loss = 8.87257
I0306 19:11:34.689177 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:11:34.689304 38597 solver.cpp:211] Train net output #2: loss = 8.87257 (* 1 = 8.87257 loss)
I0306 19:11:34.689477 38597 solver.cpp:420] Iteration 3940, lr = 0.01
I0306 19:12:34.818542 38597 solver.cpp:196] Iteration 3960, loss = 8.88778
I0306 19:12:34.818768 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:12:34.818907 38597 solver.cpp:211] Train net output #2: loss = 8.88778 (* 1 = 8.88778 loss)
I0306 19:12:34.819080 38597 solver.cpp:420] Iteration 3960, lr = 0.01
I0306 19:13:33.571383 38597 solver.cpp:196] Iteration 3980, loss = 8.87781
I0306 19:13:35.596309 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:13:35.596472 38597 solver.cpp:211] Train net output #2: loss = 8.87781 (* 1 = 8.87781 loss)
I0306 19:13:35.597108 38597 solver.cpp:420] Iteration 3980, lr = 0.01
I0306 19:14:30.326297 38597 solver.cpp:196] Iteration 4000, loss = 8.83755
I0306 19:14:30.326486 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:14:30.326617 38597 solver.cpp:211] Train net output #2: loss = 8.83755 (* 1 = 8.83755 loss)

ghost commented Mar 6, 2015

This is my train_val.prototxt
name: "VGG_ILSVRC_16_layers"
layers {
name: "data"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/zbh/caffe/models/web/train.txt"
batch_size: 128
}
transform_param {
mean_file: "/home/zbh/caffe/models/web/mean_128.binaryproto"
mirror: false
}
include: { phase: TRAIN }
}
layers {
name: "data"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/zbh/caffe/models/web/val.txt"
batch_size: 128
}
transform_param {
mean_file: "/home/zbh/caffe/models/web/mean_128.binaryproto"
mirror: false
}
include: { phase: TEST }
}

layers {
name: "conv1_1"
type: CONVOLUTION
bottom: "data"
top: "conv1_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv1_1"
top: "conv1_1"
name: "relu1_1"
type: RELU
}
layers {
name: "conv1_2"
type: CONVOLUTION
bottom: "conv1_1"
top: "conv1_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv1_2"
top: "conv1_2"
name: "relu1_2"
type: RELU
}
layers {
bottom: "conv1_2"
top: "pool1"
name: "pool1"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv2_1"
type: CONVOLUTION
bottom: "pool1"
top: "conv2_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv2_1"
top: "conv2_1"
name: "relu2_1"
type: RELU
}
layers {
name: "conv2_2"
type: CONVOLUTION
bottom: "conv2_1"
top: "conv2_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv2_2"
top: "conv2_2"
name: "relu2_2"
type: RELU
}
layers {
bottom: "conv2_2"
top: "pool2"
name: "pool2"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv3_1"
type: CONVOLUTION
bottom: "pool2"
top: "conv3_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv3_1"
top: "conv3_1"
name: "relu3_1"
type: RELU
}
layers {
name: "conv3_2"
type: CONVOLUTION
bottom: "conv3_1"
top: "conv3_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv3_2"
top: "conv3_2"
name: "relu3_2"
type: RELU
}
layers {
bottom: "conv3_2"
top: "pool3"
name: "pool3"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv4_1"
type: CONVOLUTION
bottom: "pool3"
top: "conv4_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv4_1"
top: "conv4_1"
name: "relu4_1"
type: RELU
}
layers {
name: "conv4_2"
type: CONVOLUTION
bottom: "conv4_1"
top: "conv4_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv4_2"
top: "conv4_2"
name: "relu4_2"
type: RELU
}

layers {
bottom: "conv4_2"
top: "pool4"
name: "pool4"
type: POOLING
pooling_param {
pool: AVE
kernel_size: 13
stride: 13
}
}

layers {
bottom: "pool4"
top: "pool4"
name: "drop4"
type: DROPOUT
dropout_param {
dropout_ratio: 0.5
}
}

layers {
name: "fc5"
type: INNER_PRODUCT
bottom: "pool4"
top: "fc5"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 10575
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 0.1
}
}
}
layers {
name: "loss"
type: SOFTMAX_LOSS
bottom: "fc5"
bottom: "label"
top: "loss"
}
layers {
name: "accuracy"
type: ACCURACY
bottom: "fc5"
bottom: "label"
top: "accuracy"
include: { phase: TRAIN }
}

layers {
name: "accuracy"
type: ACCURACY
bottom: "fc5"
bottom: "label"
top: "accuracy"
include: { phase: TEST }
}

ghost commented Mar 6, 2015

And this is my solver.prototxt
net: "models/web/train_val.prototxt"
test_iter: 1500
test_interval: 12000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
display: 20
stepsize: 120000
max_iter: 480000
snapshot: 6000
snapshot_prefix: "models/web/models/web"
solver_mode: GPU

lugiavn commented Mar 10, 2015

Let it run for 10-20k iterations and see.
It may simply be that the network is too deep to train from scratch; you can read the VGG paper to see how the authors trained it (they first trained shallower variants and used them to initialize the deeper ones).

n-zhang commented Mar 12, 2015

The first few iterations each see a different batch, so it is quite possible that the learning on batch 1 won't affect the loss on batch 2 that soon. One suggestion is to wait longer and plot the training loss curve to verify that your network is actually learning.

n-zhang closed this as completed Mar 12, 2015

hli2020 commented May 20, 2015

Hi @ZHUANGBOHAN, nice to find you here. I am Hongyang Li.

  1. Why are the weight_decay values in your conv layers all zero?
  2. Try a smaller base_lr and smaller weight initializations (see the sketch after this list).
  3. Switch to a bigger batch_size (e.g. from 128 to 256), though this is the last resort.
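
A minimal sketch of what points 1 and 2 might look like, using the same old-style layer syntax as the prototxt above. The concrete values (weight_decay: 1, std: 0.005, base_lr: 0.001) are only illustrative guesses, not settings verified on this dataset:

layers {
  name: "conv1_1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1_1"
  blobs_lr: 1
  blobs_lr: 2
  weight_decay: 1      # non-zero decay multiplier on the weights (was 0)
  weight_decay: 0      # keep the bias free of decay
  convolution_param {
    num_output: 32
    kernel_size: 3
    pad: 1
    weight_filler { type: "gaussian" std: 0.005 }   # smaller init than 0.01
    bias_filler { type: "constant" value: 0 }
  }
}

# and in solver.prototxt, a smaller starting learning rate:
base_lr: 0.001         # 10x smaller than the original 0.01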

hli2020 commented May 20, 2015

where are my comments???

wqysq commented Dec 29, 2015

@ghost I have the same problem. Could you tell me how you solved it? Thank you!

mrgloom commented Sep 17, 2016

It seems VGG-16 needs a larger batch size:
NVIDIA/DIGITS#159 (comment)

ProGamerGov commented

@mrgloom Is this required? My EC2 GPU instance can't seem to go higher than a batch size of 16.

mrgloom commented Sep 23, 2016

Not sure, but it helped me. The learning rate (and maybe other parameters) also seems to depend on the batch size.

#430

In theory, when you multiply the batch_size by a factor of X you should also multiply the base_lr by a factor of sqrt(X), but Alex Krizhevsky used a factor of X (see http://arxiv.org/abs/1404.5997). For example, cutting the batch size from 256 to 64 (X = 1/4) would mean dividing the base_lr by 2 under the sqrt rule, or by 4 under the linear rule.

If you run into out-of-memory problems, you can use batch accumulation; it is set via the iter_size parameter in solver.prototxt, something like batch_size: 16 with iter_size: 8 (see the sketch below).
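
A minimal sketch of that combination, reusing the train.txt path from the prototxt above; the numbers 16 and 8 are just the example values from this comment:

# in the IMAGE_DATA layer of train_val.prototxt
image_data_param {
  source: "/home/zbh/caffe/models/web/train.txt"
  batch_size: 16       # per-pass batch that fits in GPU memory
}

# in solver.prototxt
iter_size: 8           # accumulate gradients over 8 forward/backward passes per update

The solver normalizes the accumulated gradient by iter_size, so each update should behave roughly like a single batch of 16 × 8 = 128.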

ProGamerGov commented Sep 23, 2016

@mrgloom This was my latest VGG-16 training attempt:

The training log: https://gist.github.com/ProGamerGov/21f7bf21105bbea0010bab69a2761386

Category 1 has 3177 training images and 353 validation images.
Category 2 has 2948 training images and 328 validation images.

My "create_imagenet.sh" and my "make_imagenet_mean.sh" scripts can be found here: https://gist.github.com/ProGamerGov/16eddea12aee2b49e1fce9eb506a9648

It seems as though my loss values go down a bit, then increase by a significant amount (from 4-8 up to 30-56), and then decrease again.

Could the batch size be the reason for my lack of success with training? Do you have any other suggestions for how I can train more successfully? I have been using these settings, while randomly tweaking the learning rate, for months without much success.
