Why doesn't the training loss decrease at the very beginning? #2051

Closed
ghost opened this issue Mar 6, 2015 · 11 comments

ghost commented Mar 6, 2015

I have tried several different network structures, as well as different learning rates, bias initializations, and weight decay values, but none of these attempts worked. I have inspected my network and I am sure it is correct. Here is part of the training log:

I0306 19:06:09.863409 38597 solver.cpp:420] Iteration 3840, lr = 0.01
I0306 19:07:02.010265 38597 solver.cpp:196] Iteration 3860, loss = 8.88737
I0306 19:07:02.010437 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:07:02.010562 38597 solver.cpp:211] Train net output #2: loss = 8.88737 (* 1 = 8.88737 loss)
I0306 19:07:02.010738 38597 solver.cpp:420] Iteration 3860, lr = 0.01
I0306 19:08:09.907526 38597 solver.cpp:196] Iteration 3880, loss = 8.85471
I0306 19:08:09.907752 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:08:09.907881 38597 solver.cpp:211] Train net output #2: loss = 8.85471 (* 1 = 8.85471 loss)
I0306 19:08:09.908064 38597 solver.cpp:420] Iteration 3880, lr = 0.01
I0306 19:09:30.316922 38597 solver.cpp:196] Iteration 3900, loss = 8.92846
I0306 19:09:30.317157 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:09:30.317291 38597 solver.cpp:211] Train net output #2: loss = 8.92846 (* 1 = 8.92846 loss)
I0306 19:09:30.317464 38597 solver.cpp:420] Iteration 3900, lr = 0.01
I0306 19:10:32.443650 38597 solver.cpp:196] Iteration 3920, loss = 8.88563
I0306 19:10:32.443862 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:10:32.443994 38597 solver.cpp:211] Train net output #2: loss = 8.88563 (* 1 = 8.88563 loss)
I0306 19:10:32.444160 38597 solver.cpp:420] Iteration 3920, lr = 0.01
I0306 19:11:34.688952 38597 solver.cpp:196] Iteration 3940, loss = 8.87257
I0306 19:11:34.689177 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:11:34.689304 38597 solver.cpp:211] Train net output #2: loss = 8.87257 (* 1 = 8.87257 loss)
I0306 19:11:34.689477 38597 solver.cpp:420] Iteration 3940, lr = 0.01
I0306 19:12:34.818542 38597 solver.cpp:196] Iteration 3960, loss = 8.88778
I0306 19:12:34.818768 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:12:34.818907 38597 solver.cpp:211] Train net output #2: loss = 8.88778 (* 1 = 8.88778 loss)
I0306 19:12:34.819080 38597 solver.cpp:420] Iteration 3960, lr = 0.01
I0306 19:13:33.571383 38597 solver.cpp:196] Iteration 3980, loss = 8.87781
I0306 19:13:35.596309 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:13:35.596472 38597 solver.cpp:211] Train net output #2: loss = 8.87781 (* 1 = 8.87781 loss)
I0306 19:13:35.597108 38597 solver.cpp:420] Iteration 3980, lr = 0.01
I0306 19:14:30.326297 38597 solver.cpp:196] Iteration 4000, loss = 8.83755
I0306 19:14:30.326486 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:14:30.326617 38597 solver.cpp:211] Train net output #2: loss = 8.83755 (* 1 = 8.83755 loss)

ghost commented Mar 6, 2015

This is my train_val.prototxt
name: "VGG_ILSVRC_16_layers"
layers {
name: "data"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/zbh/caffe/models/web/train.txt"
batch_size: 128
}
transform_param {
mean_file: "/home/zbh/caffe/models/web/mean_128.binaryproto"
mirror: false
}
include: { phase: TRAIN }
}
layers {
name: "data"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/zbh/caffe/models/web/val.txt"
batch_size: 128
}
transform_param {
mean_file: "/home/zbh/caffe/models/web/mean_128.binaryproto"
mirror: false
}
include: { phase: TEST }
}

layers {
name: "conv1_1"
type: CONVOLUTION
bottom: "data"
top: "conv1_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv1_1"
top: "conv1_1"
name: "relu1_1"
type: RELU
}
layers {
name: "conv1_2"
type: CONVOLUTION
bottom: "conv1_1"
top: "conv1_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv1_2"
top: "conv1_2"
name: "relu1_2"
type: RELU
}
layers {
bottom: "conv1_2"
top: "pool1"
name: "pool1"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv2_1"
type: CONVOLUTION
bottom: "pool1"
top: "conv2_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv2_1"
top: "conv2_1"
name: "relu2_1"
type: RELU
}
layers {
name: "conv2_2"
type: CONVOLUTION
bottom: "conv2_1"
top: "conv2_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv2_2"
top: "conv2_2"
name: "relu2_2"
type: RELU
}
layers {
bottom: "conv2_2"
top: "pool2"
name: "pool2"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv3_1"
type: CONVOLUTION
bottom: "pool2"
top: "conv3_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv3_1"
top: "conv3_1"
name: "relu3_1"
type: RELU
}
layers {
name: "conv3_2"
type: CONVOLUTION
bottom: "conv3_1"
top: "conv3_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv3_2"
top: "conv3_2"
name: "relu3_2"
type: RELU
}
layers {
bottom: "conv3_2"
top: "pool3"
name: "pool3"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv4_1"
type: CONVOLUTION
bottom: "pool3"
top: "conv4_1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv4_1"
top: "conv4_1"
name: "relu4_1"
type: RELU
}
layers {
name: "conv4_2"
type: CONVOLUTION
bottom: "conv4_1"
top: "conv4_2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 0
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.01
}
}
}
layers {
bottom: "conv4_2"
top: "conv4_2"
name: "relu4_2"
type: RELU
}

layers {
bottom: "conv4_2"
top: "pool4"
name: "pool4"
type: POOLING
pooling_param {
pool: AVE
kernel_size: 13
stride: 13
}
}

layers {
bottom: "pool4"
top: "pool4"
name: "drop4"
type: DROPOUT
dropout_param {
dropout_ratio: 0.5
}
}

layers {
name: "fc5"
type: INNER_PRODUCT
bottom: "pool4"
top: "fc5"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 10575
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 0.1
}
}
}
layers {
name: "loss"
type: SOFTMAX_LOSS
bottom: "fc5"
bottom: "label"
top: "loss"
}
layers {
name: "accuracy"
type: ACCURACY
bottom: "fc5"
bottom: "label"
top: "accuracy"
include: { phase: TRAIN }
}

layers {
name: "accuracy"
type: ACCURACY
bottom: "fc5"
bottom: "label"
top: "accuracy"
include: { phase: TEST }
}

ghost commented Mar 6, 2015

And this is my solver.prototxt
net: "models/web/train_val.prototxt"
test_iter: 1500
test_interval: 12000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
display: 20
stepsize: 120000
max_iter: 480000
snapshot: 6000
snapshot_prefix: "models/web/models/web"
solver_mode: GPU

lugiavn commented Mar 10, 2015

Let it run for 10-20k iterations and see.
It may simply be that the network is too deep to train from scratch; you can read the VGG paper to see how the authors trained it (they first trained shallower variants and used them to initialize the deeper ones).

n-zhang commented Mar 12, 2015

The first few iterations each see a different batch, so it is quite possible that the learning on batch 1 won't affect the loss on batch 2 that soon. One suggestion is to wait longer and plot the training loss curve to verify that your network is actually learning.

n-zhang closed this as completed Mar 12, 2015

hli2020 commented May 20, 2015

Hi @ZHUANGBOHAN, nice to find you here. I am Hongyang Li.

  1. Why are the weight_decay values in your conv layers all zero?
  2. Try a smaller base_lr and smaller weight initializations (see the sketch after this list).
  3. Switch to a bigger batch_size (e.g. from 128 to 256), though this is the last resort.
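
A minimal sketch of what points 1 and 2 might look like, using the same old-style layer syntax as the prototxt above. The concrete values (weight_decay: 1, std: 0.005, base_lr: 0.001) are only illustrative guesses, not settings verified on this dataset:

layers {
  name: "conv1_1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1_1"
  blobs_lr: 1
  blobs_lr: 2
  weight_decay: 1      # non-zero decay multiplier on the weights (was 0)
  weight_decay: 0      # keep the bias free of decay
  convolution_param {
    num_output: 32
    kernel_size: 3
    pad: 1
    weight_filler { type: "gaussian" std: 0.005 }   # smaller init than 0.01
    bias_filler { type: "constant" value: 0 }
  }
}

# and in solver.prototxt, a smaller starting learning rate:
base_lr: 0.001         # 10x smaller than the original 0.01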

hli2020 commented May 20, 2015

where are my comments???

wqysq commented Dec 29, 2015

@ghost I have the same problem. Could you tell me how you solved it? Thank you!

mrgloom commented Sep 17, 2016

It seems VGG-16 needs a larger batch size:
NVIDIA/DIGITS#159 (comment)

ProGamerGov commented

@mrgloom Is this required? My EC2 GPU instance can't seem to go higher than a batch size of 16.

mrgloom commented Sep 23, 2016

Not sure, but it helped me. The learning rate (and maybe other parameters) also seems to depend on the batch size.

#430

In theory, when you multiply the batch_size by a factor of X you should also multiply the base_lr by a factor of sqrt(X), but Alex Krizhevsky used a factor of X (see http://arxiv.org/abs/1404.5997). For example, cutting the batch size from 256 to 64 (X = 1/4) would mean dividing the base_lr by 2 under the sqrt rule, or by 4 under the linear rule.

If you run into out-of-memory problems, you can use batch accumulation; it is set via the iter_size parameter in solver.prototxt, something like batch_size: 16 with iter_size: 8 (see the sketch below).
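
A minimal sketch of that combination, reusing the train.txt path from the prototxt above; the numbers 16 and 8 are just the example values from this comment:

# in the IMAGE_DATA layer of train_val.prototxt
image_data_param {
  source: "/home/zbh/caffe/models/web/train.txt"
  batch_size: 16       # per-pass batch that fits in GPU memory
}

# in solver.prototxt
iter_size: 8           # accumulate gradients over 8 forward/backward passes per update

The solver normalizes the accumulated gradient by iter_size, so each update should behave roughly like a single batch of 16 × 8 = 128.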

ProGamerGov commented Sep 23, 2016

@mrgloom This was my latest VGG-16 training attempt:

The training log: https://gist.github.com/ProGamerGov/21f7bf21105bbea0010bab69a2761386

Category 1 has 3177 training images and 353 validation images.
Category 2 has 2948 training images and 328 validation images.

My "create_imagenet.sh" and my "make_imagenet_mean.sh" scripts can be found here: https://gist.github.com/ProGamerGov/16eddea12aee2b49e1fce9eb506a9648

It seems as though my loss values go down a bit, then increase by a significant amount (from 4-8 up to 30-56), and then decrease again.

Could the batch size be the reason for my lack of success with training? Do you have any other suggestions for how I can train more successfully? I have been using these settings, while randomly tweaking the learning rate, for months without much success.
