
The loss becomes 'nan' as early as the second epoch during fine-tuning #3

Open · Jeba-create opened this issue Sep 18, 2023 · 3 comments

@Jeba-create

I used the Swin-Tiny checkpoint (swin_tiny_cnvrtd.pth) and attempted to fine-tune the model with the command below.
CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml --resume --ckpt swin_tiny_cnvrtd.pth OUTPUT_DIR './results' SOLVER.BASE_LR 0.00003 EVAL_PERIOD 5 MODEL.BONE 'swin_tiny' INPUT.BATCH_SIZE_TRAIN 4 MODEL.SEMANTIC_WEIGHT 0.8
However, during the second epoch the loss became 'nan', as shown below:

Epoch: [1]  [ 920/2801]  eta: 0:27:47  lr: 0.000300  loss: 8.7676 (8.7104)  loss_proposal_cls: 0.2417 (0.2408)  loss_proposal_reg: 2.5734 (2.3512)  loss_box_cls: 0.6964 (0.7277)  loss_box_reg: 0.2329 (0.2421)  loss_box_reid: 4.3837 (4.5777)  loss_rpn_reg: 0.0509 (0.0585)  loss_rpn_cls: 0.4375 (0.5125)  time: 0.8908  data: 0.0003  max mem: 25115
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.1927, device='cuda:0', grad_fn=<MulBackward0>), 'loss_proposal_reg': tensor(2.3873, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_cls': tensor(0.6746, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reg': tensor(0.2970, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_reg': tensor(0.1049, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.4566, device='cuda:0', grad_fn=<MulBackward0>)}

Could you please provide a trained checkpoint so that I can run inference with it?

@YangJae96

@Jeba-create Same here. How did you solve the problem?

@AlecDusheck

AlecDusheck commented Oct 31, 2024

Currently doing a training run, and for whatever reason we also hit this at step 920 of epoch 1, with loss_box_reid coming out as nan.

We went in and modified engine.py to simply skip the batch when this is detected; the code that stops training is here.
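For reference, here is a minimal sketch of that kind of guard, assuming a torchvision-style detection training loop where the model returns a dict of per-term loss tensors; the function and variable names (train_one_epoch, loss_dict, and so on) are illustrative and may not match this repository's engine.py exactly:

```python
import math

def train_one_epoch(model, optimizer, data_loader, device):
    # Illustrative loop; the real engine.py will differ in details.
    model.train()
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)            # dict of per-term loss tensors
        losses = sum(loss for loss in loss_dict.values())

        if not math.isfinite(losses.item()):
            # Instead of printing "Loss is nan, stopping training" and exiting,
            # report the offending batch and move on to the next one.
            print("Loss is nan, skipping this batch")
            print({k: v.item() for k, v in loss_dict.items()})
            optimizer.zero_grad()
            continue

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
```

Skipping the occasional bad batch keeps the run alive, though if the guard fires repeatedly it probably points to the deeper issue mentioned below.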

Will update when our training is done; if this isn't the result of something more problematic, we'll publish our weights and training log.

@AlecDusheck

My pre-trained weights for epoch 5 are available here: https://drive.google.com/file/d/1-l-_F_ernnZxbPoUSC5xGkv5dI7z_4tT/view?usp=sharing

At this point the performance definitely leaves room for improvement, but the model works well. When we get some more GPU capacity, we will train until epoch 20. I'd be happy to provide those weights too (just send me an email; it's on my GH profile).

[Screenshot attached: 2024-10-31 at 11:38 PM]
