
Always got nan loss while token_label is False. #6

Open
lqniunjunlper opened this issue Mar 29, 2024 · 6 comments

@lqniunjunlper

[03/29 11:44:29] train INFO: Test: [520/521] eta: 0:00:00 loss: 6.7130 (6.8320) acc1: 0.0000 (0.5440) acc5: 0.0000 (2.0020) time: 0.4453 data: 0.0002 max mem: 15878
[03/29 11:44:29] train INFO: Test: Total time: 0:04:02 (0.4651 s / it)
[03/29 11:44:35] train INFO: * Acc@1 0.544 Acc@5 2.002 loss 6.832
[03/29 11:44:37] train INFO: *** Best metric: 0.5440000167846679 (epoch 1)
[03/29 11:44:37] train INFO: Accuracy of the network on the 50000 test images: 0.5%
[03/29 11:44:37] train INFO: Max accuracy: 0.54%
[03/29 11:44:37] train INFO: {'train_lr': 9.999999999999953e-07, 'train_loss': 6.907856970549011, 'test_loss': 6.832010622845959, 'test_acc1': 0.5440000167846679, 'test_acc5': 2.0021250657749174, 'epoch': 1, 'n_parameters': 66249512}
[03/29 11:44:44] train INFO: Epoch: [2] [ 0/1251] eta: 2:18:16 lr: 0.000401 loss: 6.9199 (6.9199) time: 6.6321 data: 4.9419 max mem: 15878
[03/29 11:45:01] train INFO: Epoch: [2] [ 10/1251] eta: 0:44:41 lr: 0.000401 loss: 6.9639 (6.9749) time: 2.1604 data: 0.4495 max mem: 15878
[03/29 11:45:18] train INFO: Epoch: [2] [ 20/1251] eta: 0:39:54 lr: 0.000401 loss: 6.9639 (6.9806) time: 1.7107 data: 0.0003 max mem: 15878
[03/29 11:45:35] train INFO: Epoch: [2] [ 30/1251] eta: 0:38:00 lr: 0.000401 loss: 7.0065 (6.9880) time: 1.7071 data: 0.0002 max mem: 15878
[03/29 11:45:49] train INFO: Loss is nan, stopping training
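For context, the "Loss is nan, stopping training" message comes from a guard that DeiT-style training engines run on every step: the loop checks whether the step's loss is finite and aborts immediately if not, since a NaN loss would only propagate NaN gradients. A minimal sketch of that check (hypothetical helper name, not the repo's exact code):

```python
import math

def loss_is_finite(loss_value: float) -> bool:
    # DeiT-style guard: training loops typically call this on each step's
    # loss and abort ("Loss is nan, stopping training") when it is False.
    return math.isfinite(loss_value)
```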

Ale0311 commented Apr 3, 2024

Hello! Where did you download the token-labels imagenet_efficientnet_l2_sz475_top5 from? Thanks!

Ale0311 commented Apr 4, 2024

Are you training on multiple GPUs? If you managed to get the checkpoint, can you please share it with me? Thanks!

@EasonXiao-888

@lqniunjunlper Hello, I also encountered this problem and would like to know how you solved it.

@lqniunjunlper (Author)

@EasonXiao-888 When token_label is set to False, the loss is always nan. Later I downloaded the token-label datasets, but training was still unstable (in the middle of training, the loss suddenly increased).
[image]
I have tried different training setups, including different model sizes and batch sizes, and most of the runs failed. I got only one success; part of its log follows:
[image]
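A standard mitigation for the sudden loss spikes described above is clipping the global gradient norm before each optimizer step (this is a general technique, not something the repo is confirmed to use; the helper below is a pure-Python sketch of what e.g. `torch.nn.utils.clip_grad_norm_` does):

```python
import math

def clip_grad_norm(grads, max_norm):
    # Rescale all gradients so their global L2 norm is at most max_norm,
    # a common way to keep one bad batch from blowing up training.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```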

@EasonXiao-888

Thank you very much for your reply. I would like to know whether the results reported for SiMBA in the paper use token_label.

@megumitagaki

> Hello! Where did you download the token-labels imagenet_efficientnet_l2_sz475_top5 from? Thanks!

Hello! Do you have the token-labels imagenet_efficientnet_l2_sz475_top5? Thanks!
