Problem caused by calling load_checkpoint #44

Open
tailangjun opened this issue Jan 2, 2024 · 2 comments

Comments

@tailangjun

While training the landmark_generator, I wanted to be able to resume training from where it was interrupted. I found that when load_checkpoint() is called with reset_optimizer=False, the following error occurs:

Starting landmark_generator_training******************
Project_name: landmarks
Load checkpoint from: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
Load optimizer state from ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49644/49644 [00:04<00:00, 11383.00it/s]
complete,with available vids: 49475

init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 11265.29it/s]
complete,with available vids: 9976

0%| | 0/30 [00:00<?, ?it/s]Saved checkpoint: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch1166_step000035000.pth
Evaluating model for 25 epochs
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:45<00:00, 6.62s/it]
eval_L1_loss 0.005300633320584894 global_step: 35000
eval_velocity_loss 0.04097183309495449 global_step: 35000
0%| | 0/30 [02:56<?, ?it/s]
Traceback (most recent call last):
File "train_landmarks_generator.py", line 341, in <module>
optimizer.step()
File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 23, in use_grad
ret = func(self, *args, **kwargs)
File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 252, in step
found_inf=found_inf)
File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 316, in adam
found_inf=found_inf)
File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 363, in single_tensor_adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
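This error usually means the model's parameters were already moved to cuda:0, but the optimizer state tensors restored from the checkpoint stayed on the CPU. One common workaround is to move every tensor in the optimizer state onto the model's device right after loading it. A minimal sketch (the helper name `optimizer_state_to_device` is mine, not part of the IPLAP code):

```python
import torch

def optimizer_state_to_device(optimizer, device):
    # After optimizer.load_state_dict() on a checkpoint that was loaded
    # with map_location="cpu", the per-parameter state tensors
    # (exp_avg, exp_avg_sq, ...) live on the CPU while the model's
    # parameters are on cuda:0. Move them onto the model's device.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
```

For example, right after load_checkpoint(..., reset_optimizer=False), calling optimizer_state_to_device(optimizer, next(model.parameters()).device) should make Adam's step work again.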

If I change it to reset_optimizer=True, training proceeds normally, but the generated .pth files differ in size from the earlier ones by a dozen or so KB. Is that normal?

-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan 2 02:20 landmarks_epoch_166_checkpoint_step000005000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan 2 03:40 landmarks_epoch_333_checkpoint_step000010000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan 2 05:00 landmarks_epoch_500_checkpoint_step000015000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan 2 06:21 landmarks_epoch_666_checkpoint_step000020000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan 2 07:41 landmarks_epoch_833_checkpoint_step000025000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan 2 09:01 landmarks_epoch_1000_checkpoint_step000030000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan 2 10:21 landmarks_epoch_1166_checkpoint_step000035000.pth

After calling load_checkpoint() with reset_optimizer=True:

-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 13:36 landmarks_epoch1199_step000036000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 13:54 landmarks_epoch1232_step000037000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 15:01 landmarks_epoch1266_step000038000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 15:19 landmarks_epoch1299_step000039000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 15:38 landmarks_epoch1332_step000040000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 15:56 landmarks_epoch1366_step000041000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 16:15 landmarks_epoch1399_step000042000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 16:33 landmarks_epoch1432_step000043000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 16:51 landmarks_epoch1466_step000044000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 17:10 landmarks_epoch1499_step000045000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 17:28 landmarks_epoch1532_step000046000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 17:47 landmarks_epoch1566_step000047000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan 2 18:05 landmarks_epoch1599_step000048000.pth
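A small size difference between checkpoints is usually harmless: the serialized optimizer state (Adam's exp_avg/exp_avg_sq buffers and step counters) changes slightly depending on whether the optimizer was freshly reset or carried over. One way to verify is to load both files on the CPU and diff their top-level keys. A minimal sketch (the helper name `compare_checkpoints` is mine):

```python
import torch

def compare_checkpoints(path_a, path_b):
    # Load both checkpoints on the CPU and report which top-level keys
    # differ; a missing or reinitialized "optimizer" entry commonly
    # accounts for small size differences between otherwise equal files.
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    print("keys only in first:", only_a)
    print("keys only in second:", only_b)
    return only_a, only_b
```

If the key sets match, the difference is down to the contents of the shared entries (e.g. the optimizer state), which you can inspect the same way one level deeper.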

@sunjian2015

How did your loss get that low? Mine never drops below 0.006.

@tailangjun
Author

> How did your loss get that low? Mine never drops below 0.006.

At the time I just wanted to get the pipeline running end to end, so the dataset wasn't large.
