Make Swin work with VisionEncoderDecoderModel (#15527)
* Add attribute_map

* Add mention in docs

* Set hidden_size attribute correctly

* Add note about Transformer-based models only

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
NielsRogge and Niels Rogge authored Feb 14, 2022
1 parent ec15da2 commit b090b79
Showing 2 changed files with 9 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/source/model_doc/vision-encoder-decoder.mdx
@@ -13,8 +13,8 @@ specific language governing permissions and limitations under the License.
# Vision Encoder Decoder Models

The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any
-pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit))
-and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).
+pretrained Transformer-based vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit), [Swin](swin))
+and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert), [DistilBERT](distilbert)).

The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for
example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
7 changes: 7 additions & 0 deletions src/transformers/models/swin/configuration_swin.py
@@ -90,6 +90,10 @@ class SwinConfig(PretrainedConfig):
```"""
     model_type = "swin"

+    attribute_map = {
+        "num_attention_heads": "num_heads",
+    }
+
     def __init__(
         self,
         image_size=224,
@@ -130,3 +134,6 @@ def __init__(
         self.path_norm = patch_norm
         self.layer_norm_eps = layer_norm_eps
         self.initializer_range = initializer_range
+        # we set the hidden_size attribute in order to make Swin work with VisionEncoderDecoderModel
+        # this indicates the channel dimension after the last stage of the model
+        self.hidden_size = embed_dim * 8
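
The two changes to `SwinConfig` can be illustrated with a small, self-contained sketch. It does not use the `transformers` library: `TinyConfig` and `TinySwinConfig` are hypothetical stand-ins, and the aliasing logic is a simplified approximation of what an `attribute_map` does, not the library's actual implementation. The default values for `embed_dim` and `num_heads` are assumptions chosen for illustration. The factor 8 is taken from the diff; reading it as `2 ** 3` for a four-stage model is an interpretation.

```python
class TinyConfig:
    """Hypothetical base class: redirects aliased attribute names."""

    # Maps a generic attribute name to the model-specific one.
    attribute_map = {}

    def __getattribute__(self, key):
        # Translate an alias to the real attribute name before lookup.
        key = type(self).attribute_map.get(key, key)
        return super().__getattribute__(key)

    def __setattr__(self, key, value):
        # Writes through an alias land on the real attribute.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)


class TinySwinConfig(TinyConfig):
    """Hypothetical stand-in mirroring the two additions in the diff."""

    attribute_map = {
        "num_attention_heads": "num_heads",
    }

    def __init__(self, embed_dim=96, num_heads=(3, 6, 12, 24)):
        # embed_dim and num_heads defaults are illustrative assumptions.
        self.embed_dim = embed_dim
        self.num_heads = list(num_heads)
        # Channel dimension after the last stage, as in the diff.
        # Assumption: 8 == 2 ** 3, one doubling per downsampling step
        # between four stages.
        self.hidden_size = embed_dim * 8


config = TinySwinConfig()
print(config.hidden_size)          # computed as embed_dim * 8
print(config.num_attention_heads)  # resolved through the alias to num_heads
```

Generic code that expects `hidden_size` and `num_attention_heads` on an encoder config can then read both names without knowing Swin's own attribute names, which appears to be the purpose of the change.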
