NaN loss while training stage 1 VAE #47

Open
supriya-gdptl opened this issue Jun 14, 2023 · 1 comment

@supriya-gdptl

Hi @ZENGXH,

Thank you for sharing the code.

I am training the VAE (stage 1) on the ShapeNet15k dataset, following the instructions in the README.md file.
I am using the default config, except that the batch size is 16 (batch size 32 gave a CUDA out-of-memory error). The loss started increasing and eventually became NaN.
So I trained again with a lower learning rate of 1e-4 (originally 1e-3). This time the loss decreased at first, then increased, and again became NaN.

Please see the contents of the log file below:

2023-06-13 21:50:53.148 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 70/153] | [Loss] 335.14 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]    70 | [url] none
2023-06-13 21:51:53.192 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none
2023-06-13 21:51:53.251 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none | [time] 2.0m (~267h) |[best] 0 -100.000x1e-2
2023-06-13 21:52:53.518 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 81/153] | [Loss] 108.90 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   234 | [url] none
2023-06-13 21:53:45.658 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[152/153] | [Loss] 100.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   305 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:54:46.026 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 81/153] | [Loss] 79.69 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   387 | [url] none
2023-06-13 21:55:38.097 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[152/153] | [Loss] 76.43 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   458 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:56:38.487 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 81/153] | [Loss] 66.25 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   540 | [url] none
2023-06-13 21:57:30.785 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E3 iter[152/153] | [Loss] 63.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   611 | [url] none | [time] 1.9m (~250h) |[best] 0 -100.000x1e-2
2023-06-13 21:58:31.106 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E4 iter[ 81/153] | [Loss] 58.29 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   693 | [url] none
2023-06-13 21:59:23.191 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E4 iter[152/153] | [Loss] 57.15 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   764 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:00:23.558 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E5 iter[ 81/153] | [Loss] 55.49 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   846 | [url] none
2023-06-13 22:01:15.726 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E5 iter[152/153] | [Loss] 55.84 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   917 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:02:16.029 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E6 iter[ 81/153] | [Loss] 58.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   999 | [url] none
2023-06-13 22:03:08.117 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E6 iter[152/153] | [Loss] 59.70 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1070 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:04:08.409 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E7 iter[ 81/153] | [Loss] 64.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1152 | [url] none
2023-06-13 22:05:00.592 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E7 iter[152/153] | [Loss] 65.85 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1223 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:06:00.953 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E8 iter[ 81/153] | [Loss] 70.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1305 | [url] none
2023-06-13 22:06:53.085 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E8 iter[152/153] | [Loss] 72.55 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1376 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:07:53.497 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E9 iter[ 81/153] | [Loss] 77.83 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1458 | [url] none
2023-06-13 22:08:45.652 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E9 iter[152/153] | [Loss] 79.42 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1529 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:08:45.776 | INFO     | utils.exp_helper:get_evalname:94 - git hash: 13b1c
2023-06-13 22:08:47.341 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 1/36
2023-06-13 22:08:51.946 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 31/36
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:65 - best 10: tensor([ 57,   1, 349, 131, 113, 282, 271, 201, 108, 182], device='cuda:0')
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-CD: 5.0256807604398546e-09
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-EMD: 1.9488379621179774e-05
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:77 -
------------------------------------------------------------
../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/recont_1529noemas1H13b1c.pt |
MMD-CD=0.000x1e-2 MMD-EMD=0.002x1e-2  step=1529
 none
 ------------------------------------------------------------
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-CD
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-EMD
2023-06-13 22:09:00.634 | INFO     | trainers.base_trainer:save:106 - save model as : ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/checkpoints/best_eval.pth
2023-06-13 22:09:10.367 | INFO     | trainers.common_fun:validate_inspect_noprior:104 - writer: none
2023-06-13 22:09:46.203 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E10 iter[ 49/153] | [Loss] 83.91 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1579 | [url] none

I looked at previous issues #9, #17, #18, #22, and #35, but did not find a solution.
Could you please tell me how to resolve this issue?

Also, could you please share the checkpoint you mentioned in this section?

Thank you,
Supriya

@ZENGXH
Collaborator

ZENGXH commented Jun 14, 2023

Hi Supriya, I don't see the NaN loss in the log; does it happen after epoch 10?

I can think of several hyperparameters that may help stabilize the training:

  • Reducing the lr is definitely helpful, and it's great that you are already trying this. Another knob is trainer.opt.vae_lr_warmup_epochs; its default value is 0, and you could try setting it to 50 or even larger. This makes the lr start from a small value and slowly increase to the target lr over N epochs (see the sketch after this list).
  • Set sde.kl_anneal_portion_vada to a larger value; the default is 0.5, and you can try increasing it to 1 (the maximum). This controls how fast the KL weight increases (the larger the portion, the slower the increase), and increasing the KL weight more slowly can lead to smoother training dynamics.
  • Reduce shapelatent.log_sigma_offset from 6.0 to, say, 5.0; this is a constant offset pushing the sigma towards 0. Reducing the offset makes the latent points noisier at initialization and, as a result, lowers the KL loss. I am not sure whether reducing the offset will help, but it may be worth a try.
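
To make the first two knobs concrete, here is a rough sketch of how an lr warm-up over vae_lr_warmup_epochs and a KL-weight anneal over kl_anneal_portion_vada of training would typically behave. This is not the exact code in this repo; the function names, the linear schedule shapes, and the kl_const_coeff default are only for illustration.

```python
# Illustrative sketch only -- not the LION implementation.
# Assumes a linear lr warm-up over `warmup_epochs` and a KL weight that ramps
# linearly from a small constant to 1.0 over `kl_anneal_portion` of training.

def warmup_lr(epoch: int, target_lr: float, warmup_epochs: int) -> float:
    """Scale the lr linearly from ~0 up to target_lr during the first warmup_epochs."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return target_lr
    return target_lr * float(epoch + 1) / warmup_epochs


def kl_weight(step: int, total_steps: int, kl_anneal_portion: float,
              kl_const_coeff: float = 1e-4) -> float:
    """Ramp the KL weight from kl_const_coeff to 1.0 over the first
    kl_anneal_portion fraction of training; a larger portion means a slower
    ramp, which usually gives smoother early training."""
    ramp_steps = max(1, int(kl_anneal_portion * total_steps))
    frac = min(1.0, step / ramp_steps)
    return kl_const_coeff + (1.0 - kl_const_coeff) * frac


if __name__ == "__main__":
    # With warmup_epochs=50, the lr at epoch 0 is 1/50 of the target lr.
    print(warmup_lr(epoch=0, target_lr=1e-3, warmup_epochs=50))            # 2e-05
    # With kl_anneal_portion=1.0, the KL weight only reaches 1.0 at the very end.
    print(kl_weight(step=5000, total_steps=10000, kl_anneal_portion=1.0))  # ~0.5
```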

For the checkpoints, sorry, we are still going through the company's approval process to release them (they are unlikely to be released this week; I will track the process next week).
