Add option for ML-Decoder - an improved classification head #1012
Conversation
@mrT23 thanks for the PR, I'd definitely like to add this. I've had an outstanding TODO to figure out a clean mechanism for using different heads on the models. Obviously the differences across model archs are a challenge, and I probably should move the pooling for the transformer models as you pointed out. I've also got an alternative to GAP that worked well (but it fixes the resolution of the network, as I imagine yours does?) https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/attention_pool2d.py and wanted to support that too. I was thinking of trying to rework some of https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/classifier.py, adding alternate head / 'decoder' module support, and some sort of head factory and interface for changing them that works across all models.
Our solution does not require a fixed resolution. In the article we tested it at 224, 448, and 640, and switched between them seamlessly. While the name is similar, I think our proposed head (ML-Decoder) is very different from the attention_pool2d.py head.
In any case, using a factory head scheme is a great idea :-)
If you are working on or planning to refactor the implementation of the classification head in all the models, I will close the merge request. However, this is a major task. Each model has its own quirks and specific details, and you will need to edit the models one by one. It might be better to enable, via this merge request, a different classification head at least for the CNN models, so that people get the chance to experiment with other heads (GAP vs ML-Decoder vs attention-based) and compare performance and the speed-accuracy tradeoff.
@mrT23 don't close yet, I have quite a few things on the go and just haven't had much time to think about this one yet
@mrT23 I have an application where I'd like to try this, so going to get the merge going, but will leave out the factory changes for now since I don't want to support this approach long term. A question re the MLDecoder implementation: it looks a bit hardcoded to a certain size/scale of model? The decoder dim is fixed at 2048, which seems to imply a certain capacity range (medium-large ResNet, TResNet, etc.)
"2048" (dim_forward) is an internal size of the multi-head attention. num_of_features (which is 2048 for plain ResNet50) is taken from the model parameters
ML-Decoder works seamlessly with any number of input features. I think that 'EfficientNet' has a different number of features, and i tested it also there
Looking forward for your application and comparisons. :-) |
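A minimal sketch of the idea discussed above, namely that a query-based decoder head can adapt to any backbone feature width and any spatial resolution: learned class queries cross-attend over however many spatial tokens the feature map produces, and a projection maps num_features to the decoder width. The class name, embed_dim, and the single-logit-per-query output here are illustrative assumptions, not the PR's actual MLDecoder code.

```python
import torch
import torch.nn as nn

class QueryDecoderHeadSketch(nn.Module):
    # Illustrative stand-in for an ML-Decoder style head (not the PR code):
    # fixed learned queries cross-attend over the backbone's spatial features.
    def __init__(self, num_classes, num_features, embed_dim=768, dim_feedforward=2048):
        super().__init__()
        self.input_proj = nn.Linear(num_features, embed_dim)      # adapts to any backbone width
        self.query_embed = nn.Embedding(num_classes, embed_dim)   # fixed, input-independent queries
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, dim_feedforward=dim_feedforward)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.fc = nn.Linear(embed_dim, 1)                         # one logit per class query (simplified)

    def forward(self, x):                                          # x: (B, C, H, W) feature map
        b = x.shape[0]
        mem = self.input_proj(x.flatten(2).transpose(1, 2))        # (B, H*W, embed_dim)
        mem = mem.transpose(0, 1)                                  # (H*W, B, embed_dim), any H*W works
        q = self.query_embed.weight.unsqueeze(1).expand(-1, b, -1) # (num_classes, B, embed_dim)
        out = self.decoder(q, mem)                                 # cross-attention over spatial tokens
        return self.fc(out).squeeze(-1).transpose(0, 1)            # (B, num_classes)
```

Since nothing in the head depends on H or W, the same module can be fed 224, 448, or 640 inputs without change, matching the point made above.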
@mrT23 this dropout isn't making much sense to me https://github.com/rwightman/pytorch-image-models/pull/1012/files#diff-5f8df68f2f387455dc7c1b962432df77e952a5879a614205b2ef067b4324afd1R65
@mrT23 any comment on the transformer decoder layer? You have
I think below makes more sense?
TL;DR: the initial dropout acts as regularization and is needed. Long answer:
PyTorch's original transformer decoder layer is carefully engineered; notice the nice engineering features:
None of these choices is trivial, and I am sure a lot of experiments led to this good design. For our classification head, the decoder input is a set of fixed external queries. Hence the expensive self-attention module just applies a fixed transformation to them and can be removed (we discuss this thoroughly in the paper). My initial choice was also to remove the dropout and the normalization, since they also seem redundant. However, the score dropped a bit. Only when I kept them could I remove the self-attention and get exactly the same accuracy. My guess is that the initial dropout provides a needed regularization that helps the model converge well.
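For readers following along, here is a rough sketch of the layer variant this answer describes, assuming the layout of PyTorch's standard nn.TransformerDecoderLayer: the self-attention sub-block is dropped (the queries are fixed), while its dropout and layer norm are kept. This illustrates the idea discussed in the thread, not the exact code in the PR diff.

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayerNoSelfAttn(nn.Module):
    # Sketch of the variant discussed above (illustrative, not the PR code):
    # the self-attention of a standard nn.TransformerDecoderLayer is removed,
    # but its dropout + layer norm are kept, since removing them hurt accuracy.
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)   # formerly applied to the self-attention output
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory):
        # self-attention removed: its dropout + norm now act on the queries directly
        tgt = self.norm1(tgt + self.dropout1(tgt))
        # cross-attention between the (fixed) queries and the image tokens
        attn_out, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + self.dropout2(attn_out))
        # feed-forward block
        ff = self.linear2(self.dropout(F.relu(self.linear1(tgt))))
        return self.norm3(tgt + self.dropout3(ff))
```

The `tgt + self.dropout1(tgt)` residual is the line questioned above: with the queries fixed, it no longer follows a self-attention output, but keeping it apparently preserves the regularization that made dropping the self-attention accuracy-neutral.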
@mrT23 thanks for the explanation, I'll pull it in as is then (minus the factory additions for now), experiment, and then figure out how to deal with the head generically later.
awesome work! thanks |
While almost every aspect of ImageNet training has improved in the last couple of years (backbones, augmentations, losses, ...), a plain classification head, GAP + fully connected, remains the default option.
In our paper, "ML-Decoder: Scalable and Versatile Classification Head" (https://github.com/Alibaba-MIIL/ML_Decoder),
we propose a new attention-based classification head that not only improves results, but also provides a better speed-accuracy tradeoff on various classification tasks: multi-label, single-label, and zero-shot.
A technical note about the merge request: since each model has a unique coding style, systematically using a different classification head is challenging. This merge request enables the ML-Decoder head for all CNNs (I specifically checked ResNet, ResNetD, EfficientNet, RegNet and TResNet). For Transformers, the GAP operation is embedded inside the 'forward_features' pass, so it is hard to use a different classification head without editing each model separately.
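For reference, a hypothetical usage sketch using timm's existing feature-extraction arguments (num_classes=0 and global_pool='' make the backbone return the unpooled feature map). The head class is the illustrative QueryDecoderHeadSketch from earlier in this thread, not the module added by this PR, and the class count of 80 is an arbitrary example.

```python
import torch
import timm

# Backbone returns the unpooled feature map when classifier and pooling are disabled.
backbone = timm.create_model('resnet50', pretrained=False, num_classes=0, global_pool='')

# Size the sketch head from the backbone's reported feature width (2048 for resnet50).
head = QueryDecoderHeadSketch(num_classes=80, num_features=backbone.num_features)

logits_224 = head(backbone(torch.randn(2, 3, 224, 224)))   # shape (2, 80) at 224x224
logits_448 = head(backbone(torch.randn(2, 3, 448, 448)))   # shape (2, 80) at 448x448, no changes needed
```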