
Loss does not drop #3

Closed
ZQS1943 opened this issue Jul 11, 2020 · 2 comments
ZQS1943 commented Jul 11, 2020

Hi,

I'm trying to run your model, but during training the loss does not drop.

Here is the relevant part of the training log:

(qiusi) qiusi@mewtwo:~/rat-sql$ python run.py preprocess experiments/spider-glove-run.jsonnet
WARNING <class 'ratsql.models.enc_dec.EncDecModel.Preproc'>: superfluous {'name': 'EncDec'}
DB connections: 100%|████████████████████████████████████████████| 166/166 [00:01<00:00, 84.90it/s]
train section: 100%|███████████████████████████████████████████| 8659/8659 [07:43<00:00, 18.69it/s]
DB connections: 100%|███████████████████████████████████████████| 166/166 [00:01<00:00, 115.15it/s]
val section: 100%|█████████████████████████████████████████████| 1034/1034 [00:47<00:00, 21.98it/s]

(qiusi) qiusi@mewtwo:~/rat-sql$ python run.py train experiments/spider-glove-run.jsonnet
[2020-07-11T20:53:06] Logging to logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0
Loading model from logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0/model_checkpoint
[2020-07-11T20:55:38] Step 10: loss=178.1385
[2020-07-11T20:57:36] Step 20: loss=185.5836
[2020-07-11T20:59:29] Step 30: loss=165.1267
[2020-07-11T21:01:16] Step 40: loss=175.4593
[2020-07-11T21:03:07] Step 50: loss=191.4991
[2020-07-11T21:05:07] Step 60: loss=198.2391
[2020-07-11T21:07:04] Step 70: loss=204.6865
[2020-07-11T21:08:56] Step 80: loss=156.1282
[2020-07-11T21:10:47] Step 90: loss=158.8194
[2020-07-11T21:12:44] Step 100 stats, train: loss = 159.02596282958984
[2020-07-11T21:13:02] Step 100 stats, val: loss = 188.73794555664062
[2020-07-11T21:13:13] Step 100: loss=195.0599
[2020-07-11T21:15:03] Step 110: loss=166.1000
[2020-07-11T21:16:56] Step 120: loss=160.7225
[2020-07-11T21:18:48] Step 130: loss=172.8267
[2020-07-11T21:20:38] Step 140: loss=200.5286
[2020-07-11T21:22:29] Step 150: loss=194.8727
[2020-07-11T21:24:21] Step 160: loss=211.9967
[2020-07-11T21:26:17] Step 170: loss=215.9024
[2020-07-11T21:28:10] Step 180: loss=196.7601
[2020-07-11T21:30:00] Step 190: loss=186.0729
[2020-07-11T21:32:02] Step 200 stats, train: loss = 159.02596282958984
[2020-07-11T21:32:19] Step 200 stats, val: loss = 188.73794555664062
[2020-07-11T21:32:29] Step 200: loss=175.3226
[2020-07-11T21:34:22] Step 210: loss=176.8896
[2020-07-11T21:36:17] Step 220: loss=229.3411
[2020-07-11T21:38:08] Step 230: loss=183.6960
[2020-07-11T21:39:59] Step 240: loss=201.4256
[2020-07-11T21:41:50] Step 250: loss=171.4176

The train and val losses reported at the two eval_on_train steps (Step 100 and Step 200) are exactly identical, as if the model has not been updated at all.

I did not use Docker but installed the dependencies manually.
Could this be the cause of the problem?
Thank you!

@ZQS1943 changed the title from "Loss dose not drop" to "Loss does not drop" on Jul 11, 2020

ZQS1943 (Author) commented Jul 12, 2020

Sorry about that. I manually deleted all checkpoints and ran training again, and the loss dropped. I don't know why.
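For anyone hitting the same symptom, the manual fix above amounts to clearing the stale checkpoint files so run.py starts from a fresh model. A minimal sketch of that cleanup, assuming the logdir path printed in the log above (`clear_checkpoints` is a hypothetical helper, not part of rat-sql; adjust the path to your own run):

```python
from pathlib import Path

def clear_checkpoints(logdir: str) -> None:
    """Delete stale checkpoint files so the next training run
    initializes a fresh model instead of resuming."""
    for ckpt in Path(logdir).glob("model_checkpoint*"):
        ckpt.unlink()

# Path taken from the log above; change bs/lr suffix to match your config.
clear_checkpoints("logdir/glove_run/bs=20,lr=7.4e-04,end_lr=0e0,att=0")
```

After this, rerunning `python run.py train experiments/spider-glove-run.jsonnet` should train from scratch rather than load the broken checkpoint.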


Chlorie commented Jun 16, 2021

I recently ran into a similar problem. Looking at your logs, it might be the same one: when the checkpoint saved at the first step (model_checkpoint-00000001) is loaded, the model just won't optimize. Later checkpoints work fine, though. This is really weird...

Also, I modified the code to support multi-GPU training, and when I set batch accumulation to 1 (no accumulation: one big batch of 24 instead of 4 accumulated micro-batches of 6), the model won't optimize. Setting it to 6x4 or 12x2 works.
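In principle the two configurations should produce the same gradient, which makes the difference in behavior surprising. A generic PyTorch sketch (not rat-sql's actual trainer) of the equivalence between one batch of 24 and 4 accumulated micro-batches of 6:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(24, 8)
target = torch.randn(24, 1)

def train_step(n_accum: int) -> None:
    """One optimizer step over 24 examples, split into n_accum equal chunks."""
    opt.zero_grad()
    for chunk_x, chunk_y in zip(data.chunk(n_accum), target.chunk(n_accum)):
        loss = torch.nn.functional.mse_loss(model(chunk_x), chunk_y)
        # Scale so the summed chunk gradients equal the full-batch gradient.
        (loss / n_accum).backward()
    opt.step()

train_step(4)  # 4 accumulations of 6
train_step(1)  # one big batch of 24; same gradient up to float rounding
```

Since the accumulated and unaccumulated gradients match mathematically, the divergence the commenter observed likely comes from something outside this loop (e.g. where `zero_grad`/`step` are called, or loss scaling in the modified multi-GPU code), rather than from accumulation itself.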

I've been scratching my head trying to find the cause of this issue, but to no avail.
