
Always got nan loss while token_label is False. #6

Open
lqniunjunlper opened this issue Mar 29, 2024 · 6 comments

@lqniunjunlper

[03/29 11:44:29] train INFO: Test: [520/521] eta: 0:00:00 loss: 6.7130 (6.8320) acc1: 0.0000 (0.5440) acc5: 0.0000 (2.0020) time: 0.4453 data: 0.0002 max mem: 15878
[03/29 11:44:29] train INFO: Test: Total time: 0:04:02 (0.4651 s / it)
[03/29 11:44:35] train INFO: * Acc@1 0.544 Acc@5 2.002 loss 6.832
[03/29 11:44:37] train INFO: *** Best metric: 0.5440000167846679 (epoch 1)
[03/29 11:44:37] train INFO: Accuracy of the network on the 50000 test images: 0.5%
[03/29 11:44:37] train INFO: Max accuracy: 0.54%
[03/29 11:44:37] train INFO: {'train_lr': 9.999999999999953e-07, 'train_loss': 6.907856970549011, 'test_loss': 6.832010622845959, 'test_acc1': 0.5440000167846679, 'test_acc5': 2.0021250657749174, 'epoch': 1, 'n_parameters': 66249512}
[03/29 11:44:44] train INFO: Epoch: [2] [ 0/1251] eta: 2:18:16 lr: 0.000401 loss: 6.9199 (6.9199) time: 6.6321 data: 4.9419 max mem: 15878
[03/29 11:45:01] train INFO: Epoch: [2] [ 10/1251] eta: 0:44:41 lr: 0.000401 loss: 6.9639 (6.9749) time: 2.1604 data: 0.4495 max mem: 15878
[03/29 11:45:18] train INFO: Epoch: [2] [ 20/1251] eta: 0:39:54 lr: 0.000401 loss: 6.9639 (6.9806) time: 1.7107 data: 0.0003 max mem: 15878
[03/29 11:45:35] train INFO: Epoch: [2] [ 30/1251] eta: 0:38:00 lr: 0.000401 loss: 7.0065 (6.9880) time: 1.7071 data: 0.0002 max mem: 15878
[03/29 11:45:49] train INFO: Loss is nan, stopping training
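For context, the "Loss is nan, stopping training" message comes from a guard that DeiT-style training engines run on every step: the loop checks whether the step's loss is finite and aborts immediately if not, since a NaN loss would only propagate NaN gradients. A minimal sketch of that check (hypothetical helper name, not the repo's exact code):

```python
import math

def loss_is_finite(loss_value: float) -> bool:
    # DeiT-style guard: training loops typically call this on each step's
    # loss and abort ("Loss is nan, stopping training") when it is False.
    return math.isfinite(loss_value)
```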

Ale0311 commented Apr 3, 2024

Hello! Where did you download the token-labels imagenet_efficientnet_l2_sz475_top5 from? Thanks!

Ale0311 commented Apr 4, 2024

Are you training on multiple GPUs? If you managed to get the checkpoint, can you please share it with me? Thanks!

@EasonXiao-888

@lqniunjunlper Hello, I also encountered this problem and would like to know how you solved it.

@lqniunjunlper (Author)

@EasonXiao-888 When token_label is set to False, the loss is always nan. Later I downloaded the token-label datasets, but training was still unstable (in the middle of training, the loss suddenly increased).
[image]
I have tried different training setups, including different model sizes and batch sizes, and most of the runs failed. I got only one success; part of its log follows:
[image]
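A standard mitigation for the sudden loss spikes described above is clipping the global gradient norm before each optimizer step (this is a general technique, not something the repo is confirmed to use; the helper below is a pure-Python sketch of what e.g. `torch.nn.utils.clip_grad_norm_` does):

```python
import math

def clip_grad_norm(grads, max_norm):
    # Rescale all gradients so their global L2 norm is at most max_norm,
    # a common way to keep one bad batch from blowing up training.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```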

@EasonXiao-888

Thank you very much for your reply. I would like to know whether the results reported for SiMBA in the paper use token_label.

@megumitagaki

> Hello! Where did you download the token-labels imagenet_efficientnet_l2_sz475_top5 from? Thanks!

Hello! Do you have the token-labels imagenet_efficientnet_l2_sz475_top5? Thanks!
