
Weird behaviors of AdaHessian on ResNext-50 #11

Open
XuezheMax opened this issue Nov 18, 2020 · 1 comment

XuezheMax commented Nov 18, 2020

Hi,

Thanks for this great work. Recently, we tried to train ResNext-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.
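
For context, the basic usage of that implementation looks like the following (a minimal sketch assuming the API documented in that repo; the dummy batch stands in for a real ImageNet loader). The important detail is that the backward pass must keep the autograd graph so the optimizer can form the Hessian-vector products used by Hutchinson's diagonal estimate:

```python
import torch
import torchvision
from ada_hessian import AdaHessian  # optimizer class from davda54/ada-hessian

model = torchvision.models.resnext50_32x4d()
criterion = torch.nn.CrossEntropyLoss()
optimizer = AdaHessian(model.parameters())

# dummy batch standing in for one ImageNet mini-batch
inputs = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 1000, (4,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
# create_graph=True keeps the graph so AdaHessian can run a second
# backward pass to estimate the Hessian diagonal
loss.backward(create_graph=True)
optimizer.step()
```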

However, we observed some weird behavior. Please see the training log:

Epoch: 1/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 6.1249, top1: 2.74%, top5: 8.40%, time: 9660.5s

Avg  loss: 4.7754, top1: 10.54%, top5: 27.53%

Best loss: 4.7754, top1: 10.54%, top5: 27.53%, epoch: 1

Epoch: 2/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 4.2148, top1: 18.27%, top5: 38.85%, time: 9638.9s

Avg  loss: 3.4256, top1: 27.41%, top5: 53.10%

Best loss: 3.4256, top1: 27.41%, top5: 53.10%, epoch: 2

Epoch: 3/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 3.3622, top1: 30.28%, top5: 55.08%, time: 9635.2s

Avg  loss: 2.7773, top1: 38.40%, top5: 65.36%

Best loss: 2.7773, top1: 38.40%, top5: 65.36%, epoch: 3

Epoch: 4/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.9959, top1: 36.21%, top5: 61.72%, time: 9636.2s

Avg  loss: 2.6380, top1: 40.47%, top5: 67.98%

Best loss: 2.6380, top1: 40.47%, top5: 67.98%, epoch: 4

Epoch: 5/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.8171, top1: 39.26%, top5: 64.87%, time: 9630.8s

Avg  loss: 2.5880, top1: 41.73%, top5: 68.91%

Best loss: 2.5880, top1: 41.73%, top5: 68.91%, epoch: 5

Epoch: 6/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.7149, top1: 41.07%, top5: 66.66%, time: 9640.7s

Avg  loss: 2.3805, top1: 45.68%, top5: 72.20%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 7/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.6456, top1: 42.30%, top5: 67.90%, time: 9639.8s

Avg  loss: 5.2944, top1: 13.36%, top5: 30.77%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 8/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5855, top1: 43.46%, top5: 68.86%, time: 9637.7s

Avg  loss: 14.9700, top1: 0.14%, top5: 0.49%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 9/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5401, top1: 44.36%, top5: 69.65%, time: 9642.6s

Avg  loss: 8.2867, top1: 0.10%, top5: 0.50%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

Epoch: 10/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))

Average loss: 2.5080, top1: 45.03%, top5: 70.24%, time: 9633.9s

Avg  loss: 11.4105, top1: 0.10%, top5: 0.50%

Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6

We see that AdaHessian worked well for the first 6 epochs. But starting from the 7th epoch, although the training loss still decreased normally, the test loss increased and the test accuracy declined rapidly. We have tried several hyper-parameter settings and different random seeds, but this always happens.

We provide the details of our setup below for your reference.
The implementation of ResNext-50 is the standard one in PyTorch. Training is performed across 8 V100 GPUs, with a total batch size of 256 (32 per GPU).
We searched over the hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, and weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}; for all other hyper-parameters we used the default values.
We also applied a linear warmup of the learning rate over the first 100 steps; otherwise AdaHessian crashed at the beginning of training. A sketch of this configuration is shown below.
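
Putting this together, the setup corresponds roughly to the sketch below (the AdaHessian argument names, the decoupled weight-decay parameter, and the per-step schedule are assumptions based on the davda54 implementation and the log headers above):

```python
import torch
import torchvision
from ada_hessian import AdaHessian  # davda54/ada-hessian

model = torchvision.models.resnext50_32x4d()

optimizer = AdaHessian(
    model.parameters(),
    lr=0.15,              # searched over {0.1, 0.15}
    betas=(0.9, 0.999),
    eps=1e-2,             # searched over {1e-2, 1e-4}
    weight_decay=1e-3,    # decoupled; searched over {1e-4, ..., 1e-3}
)

WARMUP_STEPS = 100
INIT_LR, BASE_LR = 1e-3, 0.15
STEPS_PER_EPOCH = 5005    # ~1.28M ImageNet images / batch size 256

def lr_factor(step):
    # linear warmup from INIT_LR to BASE_LR over the first 100 steps
    if step < WARMUP_STEPS:
        return (INIT_LR + (BASE_LR - INIT_LR) * step / WARMUP_STEPS) / BASE_LR
    # afterwards, decay by 10x at the epoch milestones [40, 80]
    epoch = step // STEPS_PER_EPOCH
    return 0.1 ** sum(epoch >= m for m in (40, 80))

# step the scheduler once per batch, right after optimizer.step()
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```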

@yaozhewei (Collaborator)

Hi Xuezhe,

Please let me know if the version in our branch can solve your problem.
