
Loss does not drop #3

Closed
ZQS1943 opened this issue Jul 11, 2020 · 2 comments
ZQS1943 commented Jul 11, 2020

Hi,

I'm trying to run your model, but during training the loss does not drop.

Here is the relevant part of the training log:

(qiusi) qiusi@mewtwo:~/rat-sql$ python run.py preprocess experiments/spider-glove-run.jsonnet
WARNING <class 'ratsql.models.enc_dec.EncDecModel.Preproc'>: superfluous {'name': 'EncDec'}
DB connections: 100%|████████████████████████████████████████████| 166/166 [00:01<00:00, 84.90it/s]
train section: 100%|███████████████████████████████████████████| 8659/8659 [07:43<00:00, 18.69it/s]
DB connections: 100%|███████████████████████████████████████████| 166/166 [00:01<00:00, 115.15it/s]
val section: 100%|█████████████████████████████████████████████| 1034/1034 [00:47<00:00, 21.98it/s]

(qiusi) qiusi@mewtwo:~/rat-sql$ python run.py train experiments/spider-glove-run.jsonnet
[2020-07-11T20:53:06] Logging to logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0
Loading model from logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0/model_checkpoint
[2020-07-11T20:55:38] Step 10: loss=178.1385
[2020-07-11T20:57:36] Step 20: loss=185.5836
[2020-07-11T20:59:29] Step 30: loss=165.1267
[2020-07-11T21:01:16] Step 40: loss=175.4593
[2020-07-11T21:03:07] Step 50: loss=191.4991
[2020-07-11T21:05:07] Step 60: loss=198.2391
[2020-07-11T21:07:04] Step 70: loss=204.6865
[2020-07-11T21:08:56] Step 80: loss=156.1282
[2020-07-11T21:10:47] Step 90: loss=158.8194
[2020-07-11T21:12:44] Step 100 stats, train: loss = 159.02596282958984
[2020-07-11T21:13:02] Step 100 stats, val: loss = 188.73794555664062
[2020-07-11T21:13:13] Step 100: loss=195.0599
[2020-07-11T21:15:03] Step 110: loss=166.1000
[2020-07-11T21:16:56] Step 120: loss=160.7225
[2020-07-11T21:18:48] Step 130: loss=172.8267
[2020-07-11T21:20:38] Step 140: loss=200.5286
[2020-07-11T21:22:29] Step 150: loss=194.8727
[2020-07-11T21:24:21] Step 160: loss=211.9967
[2020-07-11T21:26:17] Step 170: loss=215.9024
[2020-07-11T21:28:10] Step 180: loss=196.7601
[2020-07-11T21:30:00] Step 190: loss=186.0729
[2020-07-11T21:32:02] Step 200 stats, train: loss = 159.02596282958984
[2020-07-11T21:32:19] Step 200 stats, val: loss = 188.73794555664062
[2020-07-11T21:32:29] Step 200: loss=175.3226
[2020-07-11T21:34:22] Step 210: loss=176.8896
[2020-07-11T21:36:17] Step 220: loss=229.3411
[2020-07-11T21:38:08] Step 230: loss=183.6960
[2020-07-11T21:39:59] Step 240: loss=201.4256
[2020-07-11T21:41:50] Step 250: loss=171.4176

The train and val losses reported at the two eval_on_train steps (Step 100 and Step 200) are exactly identical, as if the model has not been updated at all.

I did not use Docker but installed the dependencies manually.
Could this be the cause of the problem?
Thank you!

@ZQS1943 changed the title from "Loss dose not drop" to "Loss does not drop" on Jul 11, 2020

ZQS1943 (Author) commented Jul 12, 2020

Sorry about that. I manually deleted all checkpoints and ran training again, and the loss dropped. I don't know why.
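For anyone hitting the same symptom, the manual fix above amounts to clearing the stale checkpoint files so run.py starts from a fresh model. A minimal sketch of that cleanup, assuming the logdir path printed in the log above (`clear_checkpoints` is a hypothetical helper, not part of rat-sql; adjust the path to your own run):

```python
from pathlib import Path

def clear_checkpoints(logdir: str) -> None:
    """Delete stale checkpoint files so the next training run
    initializes a fresh model instead of resuming."""
    for ckpt in Path(logdir).glob("model_checkpoint*"):
        ckpt.unlink()

# Path taken from the log above; change bs/lr suffix to match your config.
clear_checkpoints("logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0")
```

After this, rerunning `python run.py train experiments/spider-glove-run.jsonnet` should train from scratch rather than load the broken checkpoint.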


Chlorie commented Jun 16, 2021

I recently ran into a similar problem. Looking at your logs, it might be the same one: when the checkpoint saved at the first step (model_checkpoint-00000001) is loaded, the model just won't optimize. Later checkpoints work fine, though. This is really weird...

Also, I modified the code to support multi-GPU training, and when I set batch accumulation to 1 (no accumulation: one big batch of 24 instead of 4 accumulated micro-batches of 6), the model won't optimize. Setting it to 6x4 or 12x2 works.
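In principle the two configurations should produce the same gradient, which makes the difference in behavior surprising. A generic PyTorch sketch (not rat-sql's actual trainer) of the equivalence between one batch of 24 and 4 accumulated micro-batches of 6:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(24, 8)
target = torch.randn(24, 1)

def train_step(n_accum: int) -> None:
    """One optimizer step over 24 examples, split into n_accum equal chunks."""
    opt.zero_grad()
    for chunk_x, chunk_y in zip(data.chunk(n_accum), target.chunk(n_accum)):
        loss = torch.nn.functional.mse_loss(model(chunk_x), chunk_y)
        # Scale so the summed chunk gradients equal the full-batch gradient.
        (loss / n_accum).backward()
    opt.step()

train_step(4)  # 4 accumulations of 6
train_step(1)  # one big batch of 24; same gradient up to float rounding
```

Since the accumulated and unaccumulated gradients match mathematically, the divergence the commenter observed likely comes from something outside this loop (e.g. where `zero_grad`/`step` are called, or loss scaling in the modified multi-GPU code), rather than from accumulation itself.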

I've been scratching my head trying to find the cause of this issue, but to no avail.
