
The loss becomes 'nan' as early as the second epoch during fine-tuning #3

Open · Jeba-create opened this issue Sep 18, 2023 · 3 comments

@Jeba-create

I used the Swin-Tiny checkpoint (swin_tiny_cnvrtd.pth) and attempted to fine-tune the model with the command below.
CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml --resume --ckpt swin_tiny_cnvrtd.pth OUTPUT_DIR './results' SOLVER.BASE_LR 0.00003 EVAL_PERIOD 5 MODEL.BONE 'swin_tiny' INPUT.BATCH_SIZE_TRAIN 4 MODEL.SEMANTIC_WEIGHT 0.8
However, during the second epoch the loss became 'nan', as shown below:

Epoch: [1]  [ 920/2801]  eta: 0:27:47  lr: 0.000300  loss: 8.7676 (8.7104)  loss_proposal_cls: 0.2417 (0.2408)  loss_proposal_reg: 2.5734 (2.3512)  loss_box_cls: 0.6964 (0.7277)  loss_box_reg: 0.2329 (0.2421)  loss_box_reid: 4.3837 (4.5777)  loss_rpn_reg: 0.0509 (0.0585)  loss_rpn_cls: 0.4375 (0.5125)  time: 0.8908  data: 0.0003  max mem: 25115
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.1927, device='cuda:0', grad_fn=<MulBackward0>), 'loss_proposal_reg': tensor(2.3873, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_cls': tensor(0.6746, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reg': tensor(0.2970, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_reg': tensor(0.1049, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.4566, device='cuda:0', grad_fn=<MulBackward0>)}

Could you please provide a trained checkpoint so that I can run inference with it?

@YangJae96

@Jeba-create Same here. How did you solve the problem?

@AlecDusheck

AlecDusheck commented Oct 31, 2024

Currently doing a training run, and for whatever reason we also hit this at step 920 of epoch 1, with loss_box_reid coming out as nan.

We went in and modified engine.py to simply skip the batch when this is detected; the code that stops training is here.
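For reference, here is a minimal sketch of that kind of guard, assuming a torchvision-style detection training loop where the model returns a dict of per-term loss tensors; the function and variable names (train_one_epoch, loss_dict, and so on) are illustrative and may not match this repository's engine.py exactly:

```python
import math

def train_one_epoch(model, optimizer, data_loader, device):
    # Illustrative loop; the real engine.py will differ in details.
    model.train()
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)            # dict of per-term loss tensors
        losses = sum(loss for loss in loss_dict.values())

        if not math.isfinite(losses.item()):
            # Instead of printing "Loss is nan, stopping training" and exiting,
            # report the offending batch and move on to the next one.
            print("Loss is nan, skipping this batch")
            print({k: v.item() for k, v in loss_dict.items()})
            optimizer.zero_grad()
            continue

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
```

Skipping the occasional bad batch keeps the run alive, though if the guard fires repeatedly it probably points to the deeper issue mentioned below.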

Will update when our training is done; if this isn't the result of something more problematic, we'll publish our weights and training log.

@AlecDusheck

My pre-trained weights for epoch 5 are available here: https://drive.google.com/file/d/1-l-_F_ernnZxbPoUSC5xGkv5dI7z_4tT/view?usp=sharing

At this point the performance definitely leaves room for improvement, but the model works well. When we get some more GPU capacity, we will train until epoch 20. I'd be happy to provide those weights too (just send me an email; it's on my GH profile).

[Screenshot attached: 2024-10-31 at 11:38 PM]
