While training the landmark generator, I wanted to resume training from where it was interrupted. I found that when load_checkpoint() is called with reset_optimizer=False, the following error occurs:
Start landmark_generator_training******************
Project_name: landmarks
Load checkpoint from: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
Load optimizer state from ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49644/49644 [00:04<00:00, 11383.00it/s]
complete,with available vids: 49475
init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 11265.29it/s]
complete,with available vids: 9976
  0%|          | 0/30 [00:00<?, ?it/s]
Saved checkpoint: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch1166_step000035000.pth
Evaluating model for 25 epochs
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:45<00:00, 6.62s/it]
eval_L1_loss 0.005300633320584894 global_step: 35000
eval_velocity_loss 0.04097183309495449 global_step: 35000
0%| | 0/30 [02:56<?, ?it/s]
Traceback (most recent call last):
  File "train_landmarks_generator.py", line 341, in
    optimizer.step()
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 23, in use_grad
    ret = func(self, *args, **kwargs)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 252, in step
    found_inf=found_inf)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 316, in adam
    found_inf=found_inf)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 363, in _single_tensor_adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
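For context, this "same device" failure at optimizer.step() typically means optimizer.load_state_dict() restored Adam's exp_avg/exp_avg_sq buffers on the CPU while the model parameters already live on cuda:0. A common workaround is to move every tensor in the optimizer state onto the model's device right after loading — a minimal sketch, where move_optimizer_state is an illustrative helper, not a function from this repo:

```python
import torch

def move_optimizer_state(optimizer, device):
    """Move every tensor held in optimizer.state to `device`, in place.

    Useful after optimizer.load_state_dict() when the checkpoint was saved
    from a GPU run but torch.load() restored the Adam buffers on CPU,
    which makes exp_avg.mul_(...) fail with a "tensors on different
    devices" RuntimeError at the next optimizer.step().
    """
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
```

After `optimizer.load_state_dict(...)`, calling `move_optimizer_state(optimizer, next(model.parameters()).device)` keeps the Adam buffers on the same device as the weights. Alternatively, passing `map_location` to `torch.load` when reading the checkpoint maps all tensors in it, including the optimizer state, to the target device up front.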
If I change it to reset_optimizer=True, training runs normally, but the size of the generated .pth files differs from the earlier ones by a dozen or so KB. Is that normal?
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 02:20 landmarks_epoch_166_checkpoint_step000005000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 03:40 landmarks_epoch_333_checkpoint_step000010000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 05:00 landmarks_epoch_500_checkpoint_step000015000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 06:21 landmarks_epoch_666_checkpoint_step000020000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 07:41 landmarks_epoch_833_checkpoint_step000025000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan  2 09:01 landmarks_epoch_1000_checkpoint_step000030000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan  2 10:21 landmarks_epoch_1166_checkpoint_step000035000.pth
Checkpoints saved after calling load_checkpoint() with reset_optimizer=True:
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 13:36 landmarks_epoch1199_step000036000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 13:54 landmarks_epoch1232_step000037000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:01 landmarks_epoch1266_step000038000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:19 landmarks_epoch1299_step000039000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:38 landmarks_epoch1332_step000040000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:56 landmarks_epoch1366_step000041000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:15 landmarks_epoch1399_step000042000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:33 landmarks_epoch1432_step000043000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:51 landmarks_epoch1466_step000044000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:10 landmarks_epoch1499_step000045000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:28 landmarks_epoch1532_step000046000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:47 landmarks_epoch1566_step000047000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 18:05 landmarks_epoch1599_step000048000.pth
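A size difference of this magnitude (~18 KB) may simply come from what ends up in the optimizer entry of the checkpoint: Adam only materializes per-parameter state after its first step, and a freshly reset optimizer can serialize slightly differently from one restored from disk. One generic way to see where two .pth files actually differ is to serialize each top-level entry separately and compare the byte counts — a sketch, where entry_sizes is an illustrative helper and the checkpoint keys depend on what train_landmarks_generator.py saves:

```python
import io
import torch

def entry_sizes(checkpoint_path):
    """Return the serialized size in bytes of each top-level entry of a
    checkpoint, so two .pth files can be diffed entry by entry."""
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    sizes = {}
    for key, value in checkpoint.items():
        buffer = io.BytesIO()
        # Re-serialize just this entry to measure its contribution.
        torch.save(value, buffer)
        sizes[key] = buffer.getbuffer().nbytes
    return sizes
```

Running it on one checkpoint from each group and comparing the per-key sizes should show whether the gap sits in the optimizer state or elsewhere (e.g. a step counter or metadata).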