On the use of Apex AMP and hybrid stages #22
Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP? I tried to train `poolformer_s12` and `poolformer_s24` with solo-learn; with native fp16 the loss goes to `nan` after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

Comments
Hi @DonkeyShot21, thanks for your attention. We only used Apex AMP to train the models. We plan to release the implementation and more trained models with hybrid stages around March. As for the [Pool, Pool, Attention, Attention]-S12 (81.0% accuracy) shown in the paper, we trained it with LayerNorm, a batch size of 1024, and a learning rate of 1e-3. The remaining hyper-parameters are the same as those used for the vanilla PoolFormer models.
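For readers trying to reproduce this, here is a minimal PyTorch sketch of a [Pool, Pool, Attention, Attention] stage layout. The names (`Pooling`, `SelfAttention`, `MetaFormerBlock`, `build_hybrid_stages`) and the S12-like dims/depths are illustrative and not the repository's actual API; the normalization shown is a single-group GroupNorm, a common stand-in for LayerNorm on channel-first feature maps.

```python
import torch.nn as nn

class Pooling(nn.Module):
    """PoolFormer-style token mixer: average pooling minus the identity."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.pool(x) - x

class SelfAttention(nn.Module):
    """Multi-head self-attention used as the token mixer in the last two stages."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)

class MetaFormerBlock(nn.Module):
    """Pre-norm residual block: norm -> token mixer -> add, norm -> MLP -> add."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)      # single-group GroupNorm as a LayerNorm stand-in
        self.token_mixer = token_mixer
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

def build_hybrid_stages(dims=(64, 128, 320, 512), depths=(2, 2, 6, 2)):
    """[Pool, Pool, Attention, Attention]: pooling mixers in the first two stages,
    self-attention in the last two. Patch embedding / downsampling layers are omitted."""
    stages = []
    for i, (dim, depth) in enumerate(zip(dims, depths)):
        blocks = [MetaFormerBlock(dim, Pooling() if i < 2 else SelfAttention(dim))
                  for _ in range(depth)]
        stages.append(nn.Sequential(*blocks))
    return nn.ModuleList(stages)
```

The patch embedding, downsampling between stages, and classifier head are omitted; the quoted training setup (batch size 1024, learning rate 1e-3) would live in the training script rather than in the model definition.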
Hi @yuweihao, thanks for the nice reply. Apex can be hard to install without sudo, which is why I prefer native AMP. Actually, I have tried both (Apex and native) with solo-learn and both lead to NaNs in the loss quite quickly. This also happens with Swin and ViT. I am trying your implementation now with native AMP and it seems to work nicely; the logs are similar to the ones you posted on Google Drive. So I guess my problem is related to the SSL methods or to the fact that solo-learn does not support mixup and cutmix. The only way I could stabilize training was with SGD + LARS and gradient accumulation (to simulate a large batch size), but the results are very bad, much worse than ResNet-18. I guess SGD is not a good match for MetaFormers in general.

Thanks for the details on the hybrid stage. I have also seen in other issues that you said depthwise convs can be used instead of pooling with a slight increase in performance. Do you think this can be paired with the hybrid stages as well (e.g. depthwise conv in the first two stages and then attention in the last two)?
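As a reference point for the NaN discussion, here is a minimal sketch of a native-AMP training step with `torch.cuda.amp`; the toy model, data, and clipping value are placeholders, not anything from solo-learn or this repository. The `GradScaler` skips the optimizer step when it finds inf/NaN gradients, and clipping after `unscale_` is a common extra stabilizer.

```python
import math
import torch
import torch.nn as nn

# Toy stand-ins so the loop is runnable; swap in the real model and data loader.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))) for _ in range(2)]

scaler = torch.cuda.amp.GradScaler()            # native AMP loss scaler

for images, targets in loader:
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    with torch.cuda.amp.autocast():             # forward pass in mixed precision
        loss = criterion(model(images), targets)

    if not math.isfinite(loss.item()):          # bail out instead of training on NaNs
        raise RuntimeError(f"non-finite loss: {loss.item()}")

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)                      # skipped automatically if grads are inf/NaN
    scaler.update()
```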
Hi @DonkeyShot21, thanks for your wonderful work for the research community :) Yes, [DWConv, DWConv, Attention, Attention] also works very well and it is in our release plan for models with hybrid stages.
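For context, the depthwise-convolution token mixer discussed here is typically just a k×k convolution with `groups` equal to the channel count, so it could drop into the sketch above in place of `Pooling` for the first two stages; the class name `DWConvMixer` is made up for illustration.

```python
import torch.nn as nn

class DWConvMixer(nn.Module):
    """Depthwise 3x3 convolution as a token mixer: each channel is mixed only spatially."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):   # x: (B, C, H, W)
        return self.dwconv(x)
```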
Thank you again! Looking forward to the release!
Hey @yuweihao, sorry to bother you again. For the hybrid stage [Pool, Pool, Attention, Attention], did you use LayerNorm just for the attention blocks or for the pooling blocks as well? I am trying to reproduce it on ImageNet-100 but I didn't get better performance than the vanilla PoolFormer. The params and FLOPs are the same as you reported, so I guess the implementation should be correct.
Hi @DonkeyShot21, I use LayerNorm for all blocks of [Pool, Pool, Attention, Attention]-S12. I guess the attention blocks may overfit easily on small datasets, which results in worse performance than the vanilla PoolFormer on ImageNet-100.
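On the params/FLOPs check mentioned above, a quick sketch using fvcore's flop counter (one of several such tools; counting conventions differ slightly between them), with the model and input size as placeholders:

```python
import torch
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

def report_params_and_flops(model, img_size=224):
    """Print parameter count and FLOPs for one 3 x img_size x img_size input."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    flops = FlopCountAnalysis(model, torch.randn(1, 3, img_size, img_size)).total()
    print(f"params: {params / 1e6:.2f} M, flops: {flops / 1e9:.2f} G")
```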