diff --git a/README.md b/README.md index 948525d120f43a..4170e489aefd51 100644 --- a/README.md +++ b/README.md @@ -486,6 +486,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/README_es.md b/README_es.md index 5d895a6ac8e050..e0f07ca8c51972 100644 --- a/README_es.md +++ b/README_es.md @@ -463,6 +463,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. 
**[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/README_hd.md b/README_hd.md index b3640555744aa0..1b02e6f8b16a5a 100644 --- a/README_hd.md +++ b/README_hd.md @@ -435,6 +435,7 @@ conda install -c huggingface transformers 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (गूगल एआई से) कागज के साथ [एक इमेज इज़ वर्थ 16x16 वर्ड्स: ट्रांसफॉर्मर्स फॉर इमेज रिकॉग्निशन एट स्केल](https://arxiv.org/abs/2010.11929) एलेक्सी डोसोवित्स्की, लुकास बेयर, अलेक्जेंडर कोलेसनिकोव, डिर्क वीसेनबोर्न, शियाओहुआ झाई, थॉमस अनटरथिनर, मुस्तफा देहघानी, मैथियास मिंडरर, जॉर्ज हेगोल्ड, सिल्वेन गेली, जैकब उस्ज़कोरेइट द्वारा हॉल्सबी द्वारा पोस्ट किया गया। 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP से) साथ वाला पेपर [VisualBERT: A Simple and Performant Baseline for Vision and Language](https:/ /arxiv.org/pdf/1908.03557) लियुनियन हेरोल्ड ली, मार्क यात्स्कर, दा यिन, चो-जुई हसीह, काई-वेई चांग द्वारा। 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 
द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा। 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा। 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/README_ja.md b/README_ja.md index 64ee3a1df8c0b9..0c9a2648985779 100644 --- a/README_ja.md +++ b/README_ja.md @@ -497,6 +497,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (Google AI から) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby から公開された研究論文: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP から) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang から公開された研究論文: [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (Google AI から) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby から公開された研究論文: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 
diff --git a/README_ko.md b/README_ko.md index 41a8870b64a7f0..3d50ec4ae0bde8 100644 --- a/README_ko.md +++ b/README_ko.md @@ -412,6 +412,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (Google AI 에서) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 의 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 논문과 함께 발표했습니다. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP 에서) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 의 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 논문과 함께 발표했습니다. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (Google AI 에서) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 의 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 논문과 함께 발표했습니다. +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/README_zh-hans.md b/README_zh-hans.md index d0c2967350403a..105dbae7de7372 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -436,6 +436,7 @@ conda install -c huggingface transformers 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (来自 UCLA NLP) 伴随论文 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 由 Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 发布。 1. 
**[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。 +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/README_zh-hant.md b/README_zh-hant.md index 11535074863618..e5ebcf10fd4872 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -448,6 +448,7 @@ conda install -c huggingface transformers 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. 
**[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 55608e8aa69f66..72a8ee1dbfa04c 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -558,6 +558,8 @@ title: Vision Transformer (ViT) - local: model_doc/vit_hybrid title: ViT Hybrid + - local: model_doc/vitdet + title: ViTDet - local: model_doc/vit_mae title: ViTMAE - local: model_doc/vit_msn diff --git a/docs/source/en/index.md b/docs/source/en/index.md index 148832d8ed556d..39376f369b4870 100644 --- a/docs/source/en/index.md +++ b/docs/source/en/index.md @@ -252,6 +252,7 @@ The documentation is organized into five sections: 1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. @@ -471,6 +472,7 @@ Flax), PyTorch, and/or TensorFlow. 
| VisualBERT | ✅ | ❌ | ❌ | | ViT | ✅ | ✅ | ✅ | | ViT Hybrid | ✅ | ❌ | ❌ | +| VitDet | ✅ | ❌ | ❌ | | ViTMAE | ✅ | ✅ | ❌ | | ViTMSN | ✅ | ❌ | ❌ | | ViViT | ✅ | ❌ | ❌ | diff --git a/docs/source/en/model_doc/vitdet.md b/docs/source/en/model_doc/vitdet.md new file mode 100644 index 00000000000000..657e467ee31900 --- /dev/null +++ b/docs/source/en/model_doc/vitdet.md @@ -0,0 +1,39 @@ + + +# ViTDet + +## Overview + +The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +VitDet leverages the plain [Vision Transformer](vit) for the task of object detection. + +The abstract from the paper is the following: + +*We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors.* + +Tips: + +- For the moment, only the backbone is available. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet). 
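+
+A minimal usage sketch with a randomly initialized model built from the default `VitDetConfig` (the `facebook/vit-det-base` identifier referenced in the code is assumed to be loadable with `from_pretrained` once weights are published on the Hub):
+
+```python
+>>> import torch
+>>> from transformers import VitDetConfig, VitDetModel
+
+>>> # randomly initialized model with the default configuration
+>>> config = VitDetConfig()
+>>> model = VitDetModel(config)
+
+>>> # dummy batch with a single 224x224 RGB image
+>>> pixel_values = torch.randn(1, 3, 224, 224)
+
+>>> with torch.no_grad():
+...     outputs = model(pixel_values)
+
+>>> # the backbone keeps the spatial layout: (batch_size, hidden_size, height // patch_size, width // patch_size)
+>>> list(outputs.last_hidden_state.shape)
+[1, 768, 14, 14]
+```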
+ + +## VitDetConfig + +[[autodoc]] VitDetConfig + +## VitDetModel + +[[autodoc]] VitDetModel + - forward \ No newline at end of file diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index eeb0f8185bcc25..8e9e559cbbccbe 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -586,6 +586,7 @@ "models.vit_hybrid": ["VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTHybridConfig"], "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"], "models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"], + "models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"], "models.vivit": [ "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VivitConfig", @@ -2925,6 +2926,14 @@ "ViTMSNPreTrainedModel", ] ) + _import_structure["models.vitdet"].extend( + [ + "VITDET_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitDetBackbone", + "VitDetModel", + "VitDetPreTrainedModel", + ] + ) _import_structure["models.vivit"].extend( [ "VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4632,6 +4641,7 @@ from .models.vit_hybrid import VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTHybridConfig from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig + from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig from .models.wav2vec2 import ( WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -6577,6 +6587,12 @@ ViTMSNModel, ViTMSNPreTrainedModel, ) + from .models.vitdet import ( + VITDET_PRETRAINED_MODEL_ARCHIVE_LIST, + VitDetBackbone, + VitDetModel, + VitDetPreTrainedModel, + ) from .models.vivit import ( VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST, VivitForVideoClassification, diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 48520217de8262..da35fbea60bdb5 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -210,6 +210,7 @@ vit_hybrid, vit_mae, vit_msn, + vitdet, vivit, wav2vec2, wav2vec2_conformer, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index e6529cd85fb448..f3c28fa58e8753 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -218,6 +218,7 @@ ("vit_hybrid", "ViTHybridConfig"), ("vit_mae", "ViTMAEConfig"), ("vit_msn", "ViTMSNConfig"), + ("vitdet", "VitDetConfig"), ("vivit", "VivitConfig"), ("wav2vec2", "Wav2Vec2Config"), ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), @@ -408,6 +409,7 @@ ("vit_hybrid", "VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -640,6 +642,7 @@ ("vit_hybrid", "ViT Hybrid"), ("vit_mae", "ViTMAE"), ("vit_msn", "ViTMSN"), + ("vitdet", "VitDet"), ("vivit", "ViViT"), ("wav2vec2", "Wav2Vec2"), ("wav2vec2-conformer", "Wav2Vec2-Conformer"), diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index a7c07c2c31dc0f..7661f5154f3783 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ 
-204,6 +204,7 @@ ("vit_hybrid", "ViTHybridModel"), ("vit_mae", "ViTMAEModel"), ("vit_msn", "ViTMSNModel"), + ("vitdet", "VitDetModel"), ("vivit", "VivitModel"), ("wav2vec2", "Wav2Vec2Model"), ("wav2vec2-conformer", "Wav2Vec2ConformerModel"), @@ -1061,6 +1062,7 @@ ("resnet", "ResNetBackbone"), ("swin", "SwinBackbone"), ("timm_backbone", "TimmBackbone"), + ("vitdet", "VitDetBackbone"), ] ) diff --git a/src/transformers/models/vitdet/__init__.py b/src/transformers/models/vitdet/__init__.py new file mode 100644 index 00000000000000..8ccc1365820d69 --- /dev/null +++ b/src/transformers/models/vitdet/__init__.py @@ -0,0 +1,57 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_torch_available, +) + + +_import_structure = {"configuration_vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"]} + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_vitdet"] = [ + "VITDET_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitDetModel", + "VitDetPreTrainedModel", + "VitDetBackbone", + ] + +if TYPE_CHECKING: + from .configuration_vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_vitdet import ( + VITDET_PRETRAINED_MODEL_ARCHIVE_LIST, + VitDetBackbone, + VitDetModel, + VitDetPreTrainedModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/vitdet/configuration_vitdet.py b/src/transformers/models/vitdet/configuration_vitdet.py new file mode 100644 index 00000000000000..80fb60ef4547bf --- /dev/null +++ b/src/transformers/models/vitdet/configuration_vitdet.py @@ -0,0 +1,155 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" VitDet model configuration""" + + +from ...configuration_utils import PretrainedConfig +from ...utils import logging +from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices + + +logger = logging.get_logger(__name__) + +VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "facebook/vit-det-base": "https://huggingface.co/facebook/vit-det-base/resolve/main/config.json", +} + + +class VitDetConfig(BackboneConfigMixin, PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`VitDetModel`]. It is used to instantiate an + VitDet model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the VitDet + [google/vitdet-base-patch16-224](https://huggingface.co/google/vitdet-base-patch16-224) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + mlp_ratio (`int`, *optional*, defaults to 4): + Ratio of mlp hidden dim to embedding dim. + hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` are supported. + dropout_prob (`float`, *optional*, defaults to 0.0): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + pretrain_image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image during pretraining. + patch_size (`int`, *optional*, defaults to 16): + The size (resolution) of each patch. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. + qkv_bias (`bool`, *optional*, defaults to `True`): + Whether to add a bias to the queries, keys and values. + drop_path_rate (`float`, *optional*, defaults to 0.0): + Stochastic depth rate. + window_block_indices (`List[int]`, *optional*): + List of indices of blocks that should have window attention instead of regular global self-attention. + residual_block_indices (`List[int]`, *optional*): + List of indices of blocks that should have an extra residual block after the MLP. + use_absolute_position_embeddings (`bool`, *optional*, defaults to `True`): + Whether to add absolute position embeddings to the patch embeddings. + use_relative_position_embeddings (`bool`, *optional*, defaults to `False`): + Whether to add relative position embeddings to the attention maps. + window_size (`int`, *optional*, defaults to 0): + The size of the attention window. + out_features (`List[str]`, *optional*): + If used as backbone, list of features to output. 
Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc. + (depending on how many stages the model has). If unset and `out_indices` is set, will default to the + corresponding stages. If unset and `out_indices` is unset, will default to the last stage. + out_indices (`List[int]`, *optional*): + If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how + many stages the model has). If unset and `out_features` is set, will default to the corresponding stages. + If unset and `out_features` is unset, will default to the last stage. + + Example: + + ```python + >>> from transformers import VitDetConfig, VitDetModel + + >>> # Initializing a VitDet configuration + >>> configuration = VitDetConfig() + + >>> # Initializing a model (with random weights) from the configuration + >>> model = VitDetModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "vitdet" + + def __init__( + self, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + mlp_ratio=4, + hidden_act="gelu", + dropout_prob=0.0, + initializer_range=0.02, + layer_norm_eps=1e-6, + image_size=224, + pretrain_image_size=224, + patch_size=16, + num_channels=3, + qkv_bias=True, + drop_path_rate=0.0, + window_block_indices=[], + residual_block_indices=[], + use_absolute_position_embeddings=True, + use_relative_position_embeddings=False, + window_size=0, + out_features=None, + out_indices=None, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.mlp_ratio = mlp_ratio + self.hidden_act = hidden_act + self.dropout_prob = dropout_prob + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.image_size = image_size + self.pretrain_image_size = pretrain_image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.qkv_bias = qkv_bias + self.drop_path_rate = drop_path_rate + self.window_block_indices = window_block_indices + self.residual_block_indices = residual_block_indices + self.use_absolute_position_embeddings = use_absolute_position_embeddings + self.use_relative_position_embeddings = use_relative_position_embeddings + self.window_size = window_size + + self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, self.num_hidden_layers + 1)] + self._out_features, self._out_indices = get_aligned_output_features_output_indices( + out_features=out_features, out_indices=out_indices, stage_names=self.stage_names + ) diff --git a/src/transformers/models/vitdet/modeling_vitdet.py b/src/transformers/models/vitdet/modeling_vitdet.py new file mode 100644 index 00000000000000..30620f5d1b9826 --- /dev/null +++ b/src/transformers/models/vitdet/modeling_vitdet.py @@ -0,0 +1,875 @@ +# coding=utf-8 +# Copyright 2023 Meta AI and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" PyTorch ViTDet backbone.""" + + +import collections.abc +import math +from typing import Dict, List, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn + +from ...activations import ACT2FN +from ...modeling_outputs import BackboneOutput, BaseModelOutput +from ...modeling_utils import PreTrainedModel +from ...utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from ...utils.backbone_utils import BackboneMixin +from .configuration_vitdet import VitDetConfig + + +logger = logging.get_logger(__name__) + +# General docstring +_CONFIG_FOR_DOC = "VitDetConfig" + +# Base docstring +_CHECKPOINT_FOR_DOC = "facebook/vit-det-base" +_EXPECTED_OUTPUT_SHAPE = [1, 197, 768] + + +VITDET_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "facebook/vit-det-base", + # See all ViTDet models at https://huggingface.co/models?filter=vitdet +] + + +class VitDetEmbeddings(nn.Module): + """ + This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial + `hidden_states` (patch embeddings) to be consumed by a Transformer. + """ + + def __init__(self, config): + super().__init__() + image_size, patch_size = config.pretrain_image_size, config.patch_size + num_channels, hidden_size = config.num_channels, config.hidden_size + + image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size) + patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.image_size = image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.num_patches = num_patches + + if config.use_absolute_position_embeddings: + # Initialize absolute positional embedding with pretrain image size. + num_positions = num_patches + 1 + self.position_embeddings = nn.Parameter(torch.zeros(1, num_positions, config.hidden_size)) + else: + self.position_embeddings = None + + self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size) + + def get_absolute_positions(self, abs_pos_embeddings, has_cls_token, height, width): + """ + Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token dimension for the + original embeddings. + + Args: + abs_pos_embeddings (`torch.Tensor`): + Absolute positional embeddings with (1, num_position, num_channels). + has_cls_token (`bool`): + If true, has 1 embedding in abs_pos_embeddings for cls token. + height (`int`): + Height of input image tokens. + width (`int`): + Width of input image tokens. 
+ + Returns: + Absolute positional embeddings after processing with shape (1, height, width, num_channels) + """ + if has_cls_token: + abs_pos_embeddings = abs_pos_embeddings[:, 1:] + num_position = abs_pos_embeddings.shape[1] + size = int(math.sqrt(num_position)) + if size * size != num_position: + raise ValueError("Absolute position embeddings must be a square number.") + + if size != height or size != width: + new_abs_pos_embeddings = nn.functional.interpolate( + abs_pos_embeddings.reshape(1, size, size, -1).permute(0, 3, 1, 2), + size=(height, width), + mode="bicubic", + align_corners=False, + ) + + return new_abs_pos_embeddings.permute(0, 2, 3, 1) + else: + return abs_pos_embeddings.reshape(1, height, width, -1) + + def forward(self, pixel_values: torch.Tensor) -> torch.Tensor: + num_channels = pixel_values.shape[1] + if num_channels != self.num_channels: + raise ValueError( + "Make sure that the channel dimension of the pixel values match with the one set in the configuration." + f" Expected {self.num_channels} but got {num_channels}." + ) + embeddings = self.projection(pixel_values) + + if self.position_embeddings is not None: + # (batch_size, num_channels, height, width) -> (batch_size, height, width, num_channels) + embeddings = embeddings.permute(0, 2, 3, 1) + # add position embeddings + embeddings = embeddings + self.get_absolute_positions( + self.position_embeddings, True, embeddings.shape[1], embeddings.shape[2] + ) + # (batch_size, height, width, num_channels) -> (batch_size, num_channels, height, width) + embeddings = embeddings.permute(0, 3, 1, 2) + + return embeddings + + +def get_rel_pos(q_size, k_size, rel_pos): + """ + Get relative positional embeddings according to the relative positions of query and key sizes. + + Args: + q_size (`int`): + Size of query q. + k_size (`int`): + Size of key k. + rel_pos (`torch.Tensor`): + Relative position embeddings (num_embeddings, num_channels). + + Returns: + Extracted positional embeddings according to relative positions. + """ + max_rel_dist = int(2 * max(q_size, k_size) - 1) + # Interpolate rel pos if needed. + if rel_pos.shape[0] != max_rel_dist: + # Interpolate rel position embeddings. + rel_pos_resized = nn.functional.interpolate( + rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1), + size=max_rel_dist, + mode="linear", + ) + rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0) + else: + rel_pos_resized = rel_pos + + # Scale the coords with short length if shapes for q and k are different. + q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0) + k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0) + relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0) + + return rel_pos_resized[relative_coords.long()] + + +def add_decomposed_relative_positions(attn, queries, rel_pos_h, rel_pos_w, q_size, k_size): + """ + Calculate decomposed Relative Positional Embeddings as introduced in + [MViT2](https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py). + + Args: + attn (`torch.Tensor`): + Attention map. + queries (`torch.Tensor`): + Query q in the attention layer with shape (batch_size, queries_height * queries_width, num_channels). + rel_pos_h (`torch.Tensor`): + Relative position embeddings (Lh, num_channels) for height axis. + rel_pos_w (`torch.Tensor`): + Relative position embeddings (Lw, num_channels) for width axis. 
+ q_size (`Tuple[int]`): + Spatial sequence size of query q with (queries_height, queries_width). + k_size (`Tuple[int]`]): + Spatial sequence size of key k with (keys_height, keys_width). + + Returns: + attn (Tensor): attention map with added relative positional embeddings. + """ + queries_height, queries_width = q_size + keys_height, keys_width = k_size + relative_height = get_rel_pos(queries_height, keys_height, rel_pos_h) + relative_width = get_rel_pos(queries_width, keys_width, rel_pos_w) + + batch_size, _, dim = queries.shape + r_q = queries.reshape(batch_size, queries_height, queries_width, dim) + relative_height = torch.einsum("bhwc,hkc->bhwk", r_q, relative_height) + relative_weight = torch.einsum("bhwc,wkc->bhwk", r_q, relative_width) + + attn = ( + attn.view(batch_size, queries_height, queries_width, keys_height, keys_width) + + relative_height[:, :, :, :, None] + + relative_weight[:, :, :, None, :] + ).view(batch_size, queries_height * queries_width, keys_height * keys_width) + + return attn + + +class VitDetAttention(nn.Module): + """Multi-head Attention block with relative position embeddings.""" + + def __init__(self, config, input_size=None): + """ + Args: + config (`VitDetConfig`): + Model configuration. + input_size (`Tuple[int]`, *optional*): + Input resolution, only required in case relative position embeddings are added. + """ + super().__init__() + + dim = config.hidden_size + num_heads = config.num_attention_heads + + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = head_dim**-0.5 + + self.qkv = nn.Linear(dim, dim * 3, bias=config.qkv_bias) + self.proj = nn.Linear(dim, dim) + + self.use_relative_position_embeddings = config.use_relative_position_embeddings + if self.use_relative_position_embeddings: + # initialize relative positional embeddings + self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim)) + self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim)) + + def forward(self, hidden_state, output_attentions=False): + batch_size, height, width, _ = hidden_state.shape + # qkv with shape (3, batch_size, num_heads, height * width, num_channels) + qkv = self.qkv(hidden_state).reshape(batch_size, height * width, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) + # queries, keys and values have shape (batch_size * num_heads, height * width, num_channels) + queries, keys, values = qkv.reshape(3, batch_size * self.num_heads, height * width, -1).unbind(0) + + attention_scores = (queries * self.scale) @ keys.transpose(-2, -1) + + if self.use_relative_position_embeddings: + attention_scores = add_decomposed_relative_positions( + attention_scores, queries, self.rel_pos_h, self.rel_pos_w, (height, width), (height, width) + ) + + attention_probs = attention_scores.softmax(dim=-1) + + hidden_state = attention_probs @ values + hidden_state = hidden_state.view(batch_size, self.num_heads, height, width, -1) + hidden_state = hidden_state.permute(0, 2, 3, 1, 4) + hidden_state = hidden_state.reshape(batch_size, height, width, -1) + hidden_state = self.proj(hidden_state) + + if output_attentions: + attention_probs = attention_probs.reshape( + batch_size, self.num_heads, attention_probs.shape[-2], attention_probs.shape[-1] + ) + outputs = (hidden_state, attention_probs) + else: + outputs = (hidden_state,) + + return outputs + + +# Copied from transformers.models.beit.modeling_beit.drop_path +def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor: + """ + Drop paths (Stochastic Depth) 
per sample (when applied in main path of residual blocks). + + Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks, + however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... + See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the + layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the + argument. + """ + if drop_prob == 0.0 or not training: + return input + keep_prob = 1 - drop_prob + shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets + random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device) + random_tensor.floor_() # binarize + output = input.div(keep_prob) * random_tensor + return output + + +# Copied from transformers.models.beit.modeling_beit.BeitDropPath +class VitDetDropPath(nn.Module): + """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).""" + + def __init__(self, drop_prob: Optional[float] = None) -> None: + super().__init__() + self.drop_prob = drop_prob + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + return drop_path(hidden_states, self.drop_prob, self.training) + + def extra_repr(self) -> str: + return "p={}".format(self.drop_prob) + + +class VitDetLayerNorm(nn.Module): + """ + A LayerNorm variant, popularized by Transformers, that performs point-wise mean and variance normalization over the + channel dimension for inputs that have shape (batch_size, channels, height, width). + https://github.com/facebookresearch/ConvNeXt/blob/d1fa8f6fef0a165b27399986cc2bdacc92777e40/models/convnext.py#L119 + """ + + def __init__(self, normalized_shape, eps=1e-6): + super().__init__() + self.weight = nn.Parameter(torch.ones(normalized_shape)) + self.bias = nn.Parameter(torch.zeros(normalized_shape)) + self.eps = eps + self.normalized_shape = (normalized_shape,) + + def forward(self, x): + u = x.mean(1, keepdim=True) + s = (x - u).pow(2).mean(1, keepdim=True) + x = (x - u) / torch.sqrt(s + self.eps) + x = self.weight[:, None, None] * x + self.bias[:, None, None] + return x + + +class VitDetResBottleneckBlock(nn.Module): + """ + The standard bottleneck residual block without the last activation layer. It contains 3 conv layers with kernels + 1x1, 3x3, 1x1. + """ + + def __init__(self, config, in_channels, out_channels, bottleneck_channels): + """ + Args: + config (`VitDetConfig`): + Model configuration. + in_channels (`int`): + Number of input channels. + out_channels (`int`): + Number of output channels. + bottleneck_channels (`int`): + Number of output channels for the 3x3 "bottleneck" conv layers. 
+ """ + super().__init__() + self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False) + self.norm1 = VitDetLayerNorm(bottleneck_channels) + self.act1 = ACT2FN[config.hidden_act] + + self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False) + self.norm2 = VitDetLayerNorm(bottleneck_channels) + self.act2 = ACT2FN[config.hidden_act] + + self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False) + self.norm3 = VitDetLayerNorm(out_channels) + + def forward(self, x): + out = x + for layer in self.children(): + out = layer(out) + + out = x + out + return out + + +class VitDetMlp(nn.Module): + def __init__(self, config, in_features: int, hidden_features: int) -> None: + super().__init__() + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = ACT2FN[config.hidden_act] + self.fc2 = nn.Linear(hidden_features, in_features) + self.drop = nn.Dropout(config.dropout_prob) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + + return x + + +def window_partition(hidden_state, window_size): + """ + Partition into non-overlapping windows with padding if needed. + + Args: + hidden_state (`torch.Tensor`): + Input tokens with [batch_size, height, width, num_channels]. + window_size (`int`): + Window size. + + Returns: + `tuple(torch.FloatTensor)` comprising various elements: + - windows: windows after partition with [batch_size * num_windows, window_size, window_size, num_channels]. + - (patch_height, patch_width): padded height and width before partition + """ + batch_size, height, width, num_channels = hidden_state.shape + + pad_height = (window_size - height % window_size) % window_size + pad_width = (window_size - width % window_size) % window_size + if pad_height > 0 or pad_width > 0: + hidden_state = nn.functional.pad(hidden_state, (0, 0, 0, pad_width, 0, pad_height)) + patch_height, patch_width = height + pad_height, width + pad_width + + hidden_state = hidden_state.view( + batch_size, patch_height // window_size, window_size, patch_width // window_size, window_size, num_channels + ) + windows = hidden_state.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, num_channels) + return windows, (patch_height, patch_width) + + +def window_unpartition(windows, window_size, pad_height_width, height_width): + """ + Window unpartition into original sequences and removing padding. + + Args: + windows (`torch.Tensor`): + Input tokens with [batch_size * num_windows, window_size, window_size, num_channels]. + window_size (`int`): + Window size. + pad_height_width (`Tuple[int]`): + Padded height and width (patch_height, patch_width). + height_width (`Tuple[int]`): + Original height and width before padding. + + Returns: + hidden_state: unpartitioned sequences with [batch_size, height, width, num_channels]. 
+ """ + patch_height, patch_width = pad_height_width + height, width = height_width + batch_size = windows.shape[0] // (patch_height * patch_width // window_size // window_size) + hidden_state = windows.view( + batch_size, patch_height // window_size, patch_width // window_size, window_size, window_size, -1 + ) + hidden_state = hidden_state.permute(0, 1, 3, 2, 4, 5).contiguous().view(batch_size, patch_height, patch_width, -1) + + if patch_height > height or patch_width > width: + hidden_state = hidden_state[:, :height, :width, :].contiguous() + return hidden_state + + +class VitDetLayer(nn.Module): + """This corresponds to the Block class in the original implementation.""" + + def __init__( + self, config: VitDetConfig, drop_path_rate: float = 0, window_size: int = 0, use_residual_block: bool = False + ) -> None: + super().__init__() + + dim = config.hidden_size + input_size = (config.image_size // config.patch_size, config.image_size // config.patch_size) + + self.norm1 = nn.LayerNorm(dim, eps=config.layer_norm_eps) + self.attention = VitDetAttention( + config, input_size=input_size if window_size == 0 else (window_size, window_size) + ) + + self.drop_path = VitDetDropPath(drop_path_rate) if drop_path_rate > 0.0 else nn.Identity() + self.norm2 = nn.LayerNorm(dim, eps=config.layer_norm_eps) + self.mlp = VitDetMlp(config=config, in_features=dim, hidden_features=int(dim * config.mlp_ratio)) + + self.window_size = window_size + + self.use_residual_block = use_residual_block + if self.use_residual_block: + # Use a residual block with bottleneck channel as dim // 2 + self.residual = VitDetResBottleneckBlock( + config=config, + in_channels=dim, + out_channels=dim, + bottleneck_channels=dim // 2, + ) + + def forward( + self, + hidden_states: torch.Tensor, + head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]: + hidden_states = hidden_states.permute(0, 2, 3, 1) + + shortcut = hidden_states + + hidden_states = self.norm1(hidden_states) + + # Window partition + if self.window_size > 0: + height, width = hidden_states.shape[1], hidden_states.shape[2] + hidden_states, pad_height_width = window_partition(hidden_states, self.window_size) + + self_attention_outputs = self.attention( + hidden_states, + output_attentions=output_attentions, + ) + hidden_states = self_attention_outputs[0] + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + # Reverse window partition + if self.window_size > 0: + hidden_states = window_unpartition(hidden_states, self.window_size, pad_height_width, (height, width)) + + # first residual connection + hidden_states = shortcut + self.drop_path(hidden_states) + + hidden_states = hidden_states + self.drop_path(self.mlp(self.norm2(hidden_states))) + + hidden_states = hidden_states.permute(0, 3, 1, 2) + + if self.use_residual_block: + hidden_states = self.residual(hidden_states) + + outputs = (hidden_states,) + outputs + + return outputs + + +class VitDetEncoder(nn.Module): + def __init__(self, config: VitDetConfig) -> None: + super().__init__() + self.config = config + depth = config.num_hidden_layers + + # stochastic depth decay rule + drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, depth)] + + layers = [] + for i in range(depth): + layers.append( + VitDetLayer( + config, + drop_path_rate=drop_path_rate[i], + window_size=config.window_size if i in config.window_block_indices else 0, + use_residual_block=i in 
config.residual_block_indices, + ) + ) + + self.layer = nn.ModuleList(layers) + self.gradient_checkpointing = False + + def forward( + self, + hidden_states: torch.Tensor, + head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + output_hidden_states: bool = False, + return_dict: bool = True, + ) -> Union[tuple, BaseModelOutput]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + layer_head_mask, + ) + else: + layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +def caffe2_msra_fill(module: nn.Module) -> None: + """ + Initialize `module.weight` using the "MSRAFill" implemented in Caffe2. Also initializes `module.bias` to 0. + + Source: https://detectron2.readthedocs.io/en/latest/_modules/fvcore/nn/weight_init.html. + + Args: + module (torch.nn.Module): module to initialize. + """ + nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu") + if module.bias is not None: + nn.init.constant_(module.bias, 0) + + +class VitDetPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = VitDetConfig + base_model_prefix = "vitdet" + main_input_name = "pixel_values" + supports_gradient_checkpointing = True + _no_split_modules = [] + + def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None: + """Initialize the weights""" + if isinstance(module, (nn.Linear, nn.Conv2d)): + # Upcast the input in `fp32` and cast it back to desired `dtype` to avoid + # `trunc_normal_cpu` not implemented in `half` issues + module.weight.data = nn.init.trunc_normal_( + module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range + ).to(module.weight.dtype) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + elif isinstance(module, VitDetEmbeddings): + module.position_embeddings.data = nn.init.trunc_normal_( + module.position_embeddings.data.to(torch.float32), + mean=0.0, + std=self.config.initializer_range, + ).to(module.position_embeddings.dtype) + + elif isinstance(module, VitDetAttention) and self.config.use_relative_position_embeddings: + module.rel_pos_h.data = nn.init.trunc_normal_( + module.rel_pos_h.data.to(torch.float32), + mean=0.0, + std=self.config.initializer_range, + ) + module.rel_pos_w.data = nn.init.trunc_normal_( + module.rel_pos_w.data.to(torch.float32), + mean=0.0, + std=self.config.initializer_range, + ) + + elif isinstance(module, VitDetResBottleneckBlock): + for layer in [module.conv1, module.conv2, module.conv3]: + caffe2_msra_fill(layer) + for layer in [module.norm1, module.norm2]: + layer.weight.data.fill_(1.0) + layer.bias.data.zero_() + # zero init last norm layer. + module.norm3.weight.data.zero_() + module.norm3.bias.data.zero_() + + def _set_gradient_checkpointing(self, module: VitDetEncoder, value: bool = False) -> None: + if isinstance(module, VitDetEncoder): + module.gradient_checkpointing = value + + +VITDET_START_DOCSTRING = r""" + This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it + as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`VitDetConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +VITDET_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`ViTImageProcessor.__call__`] + for details. + + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+""" + + +@add_start_docstrings( + "The bare VitDet Transformer model outputting raw hidden-states without any specific head on top.", + VITDET_START_DOCSTRING, +) +class VitDetModel(VitDetPreTrainedModel): + def __init__(self, config: VitDetConfig): + super().__init__(config) + self.config = config + + self.embeddings = VitDetEmbeddings(config) + self.encoder = VitDetEncoder(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> VitDetEmbeddings: + return self.embeddings.projection + + def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None: + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(VITDET_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=BaseModelOutput, + config_class=_CONFIG_FOR_DOC, + modality="vision", + expected_output=_EXPECTED_OUTPUT_SHAPE, + ) + def forward( + self, + pixel_values: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + embedding_output = self.embeddings(pixel_values) + + encoder_outputs = self.encoder( + embedding_output, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + + if not return_dict: + return (sequence_output,) + encoder_outputs[1:] + + return BaseModelOutput( + last_hidden_state=sequence_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + ) + + +@add_start_docstrings( + """ + ViTDet backbone, to be used with frameworks like Mask R-CNN. 
+ """, + VITDET_START_DOCSTRING, +) +class VitDetBackbone(VitDetPreTrainedModel, BackboneMixin): + def __init__(self, config): + super().__init__(config) + super()._init_backbone(config) + + self.embeddings = VitDetEmbeddings(config) + self.encoder = VitDetEncoder(config) + self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)] + + # initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> VitDetEmbeddings: + return self.embeddings.projection + + @add_start_docstrings_to_model_forward(VITDET_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + pixel_values: torch.Tensor, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> BackboneOutput: + """ + Returns: + + Examples: + + ```python + >>> from transformers import AutoImageProcessor, AutoBackbone + >>> import torch + >>> from PIL import Image + >>> import requests + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224") + >>> model = AutoBackbone.from_pretrained("facebook/convnext-tiny-224") + + >>> inputs = processor(image, return_tensors="pt") + >>> outputs = model(**inputs) + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + + embedding_output = self.embeddings(pixel_values) + + outputs = self.encoder( + embedding_output, + output_hidden_states=True, + output_attentions=output_attentions, + return_dict=return_dict, + ) + + hidden_states = outputs.hidden_states if return_dict else outputs[1] + + feature_maps = () + for stage, hidden_state in zip(self.stage_names, hidden_states): + if stage in self.out_features: + feature_maps += (hidden_state,) + + if not return_dict: + if output_hidden_states: + output = (feature_maps,) + outputs[1:] + else: + output = (feature_maps,) + outputs[2:] + return output + + return BackboneOutput( + feature_maps=feature_maps, + hidden_states=outputs.hidden_states if output_hidden_states else None, + attentions=outputs.attentions, + ) diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py index c1cdc3955e97dc..7189861619b803 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -7898,6 +7898,30 @@ def __init__(self, *args, **kwargs): requires_backends(self, ["torch"]) +VITDET_PRETRAINED_MODEL_ARCHIVE_LIST = None + + +class VitDetBackbone(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class VitDetModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class VitDetPreTrainedModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None diff --git a/tests/models/vitdet/__init__.py b/tests/models/vitdet/__init__.py new file mode 100644 index 00000000000000..e69de29bb2d1d6 
diff --git a/tests/models/vitdet/test_modeling_vitdet.py b/tests/models/vitdet/test_modeling_vitdet.py new file mode 100644 index 00000000000000..f98c5b4d6f38f0 --- /dev/null +++ b/tests/models/vitdet/test_modeling_vitdet.py @@ -0,0 +1,291 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the PyTorch ViTDet model. """ + + +import inspect +import unittest + +from transformers import VitDetConfig +from transformers.testing_utils import require_torch, torch_device +from transformers.utils import is_torch_available + +from ...test_backbone_common import BackboneTesterMixin +from ...test_configuration_common import ConfigTester +from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor +from ...test_pipeline_mixin import PipelineTesterMixin + + +if is_torch_available(): + import torch + from torch import nn + + from transformers import VitDetBackbone, VitDetModel + + +class VitDetModelTester: + def __init__( + self, + parent, + batch_size=13, + image_size=30, + patch_size=2, + num_channels=3, + is_training=True, + use_labels=True, + hidden_size=32, + num_hidden_layers=2, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + type_sequence_label_size=10, + initializer_range=0.02, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.is_training = is_training + self.use_labels = use_labels + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.scope = scope + + self.num_patches_one_direction = self.image_size // self.patch_size + self.seq_length = (self.image_size // self.patch_size) ** 2 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + + labels = None + if self.use_labels: + labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + + config = self.get_config() + + return config, pixel_values, labels + + def get_config(self): + return VitDetConfig( + image_size=self.image_size, + patch_size=self.patch_size, + num_channels=self.num_channels, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + is_decoder=False, + 
initializer_range=self.initializer_range, + ) + + def create_and_check_model(self, config, pixel_values, labels): + model = VitDetModel(config=config) + model.to(torch_device) + model.eval() + result = model(pixel_values) + self.parent.assertEqual( + result.last_hidden_state.shape, + (self.batch_size, self.hidden_size, self.num_patches_one_direction, self.num_patches_one_direction), + ) + + def create_and_check_backbone(self, config, pixel_values, labels): + model = VitDetBackbone(config=config) + model.to(torch_device) + model.eval() + result = model(pixel_values) + + # verify hidden states + self.parent.assertEqual(len(result.feature_maps), len(config.out_features)) + self.parent.assertListEqual( + list(result.feature_maps[0].shape), + [self.batch_size, self.hidden_size, self.num_patches_one_direction, self.num_patches_one_direction], + ) + + # verify channels + self.parent.assertEqual(len(model.channels), len(config.out_features)) + self.parent.assertListEqual(model.channels, [config.hidden_size]) + + # verify backbone works with out_features=None + config.out_features = None + model = VitDetBackbone(config=config) + model.to(torch_device) + model.eval() + result = model(pixel_values) + + # verify feature maps + self.parent.assertEqual(len(result.feature_maps), 1) + self.parent.assertListEqual( + list(result.feature_maps[0].shape), + [self.batch_size, self.hidden_size, self.num_patches_one_direction, self.num_patches_one_direction], + ) + + # verify channels + self.parent.assertEqual(len(model.channels), 1) + self.parent.assertListEqual(model.channels, [config.hidden_size]) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values, labels = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class VitDetModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as VitDet does not use input_ids, inputs_embeds, + attention_mask and seq_length. 
+ """ + + all_model_classes = (VitDetModel, VitDetBackbone) if is_torch_available() else () + pipeline_model_mapping = {"feature-extraction": VitDetModel} if is_torch_available() else {} + + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = VitDetModelTester(self) + self.config_tester = ConfigTester(self, config_class=VitDetConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="VitDet does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_backbone(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_backbone(*config_and_inputs) + + def test_hidden_states_output(self): + def check_hidden_states_output(inputs_dict, config, model_class): + model = model_class(config) + model.to(torch_device) + model.eval() + + with torch.no_grad(): + outputs = model(**self._prepare_for_class(inputs_dict, model_class)) + + hidden_states = outputs.hidden_states + + expected_num_stages = self.model_tester.num_hidden_layers + self.assertEqual(len(hidden_states), expected_num_stages + 1) + + # VitDet's feature maps are of shape (batch_size, num_channels, height, width) + self.assertListEqual( + list(hidden_states[0].shape[-2:]), + [ + self.model_tester.num_patches_one_direction, + self.model_tester.num_patches_one_direction, + ], + ) + + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + inputs_dict["output_hidden_states"] = True + check_hidden_states_output(inputs_dict, config, model_class) + + # check that output_hidden_states also work using config + del inputs_dict["output_hidden_states"] + config.output_hidden_states = True + + check_hidden_states_output(inputs_dict, config, model_class) + + # overwrite since VitDet only supports retraining gradients of hidden states + def test_retain_grad_hidden_states_attentions(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + config.output_hidden_states = True + config.output_attentions = self.has_attentions + + # no need to test all models as different heads yield the same functionality + model_class = self.all_model_classes[0] + model = model_class(config) + model.to(torch_device) + + inputs = self._prepare_for_class(inputs_dict, model_class) + + outputs = model(**inputs) + + output = outputs[0] + + # Encoder-/Decoder-only models + 
+        hidden_states = outputs.hidden_states[0]
+        hidden_states.retain_grad()
+
+        output.flatten()[0].backward(retain_graph=True)
+
+        self.assertIsNotNone(hidden_states.grad)
+
+    @unittest.skip(reason="VitDet does not support feedforward chunking")
+    def test_feed_forward_chunking(self):
+        pass
+
+    @unittest.skip(reason="VitDet does not have standalone checkpoints since it is used as a backbone in other models")
+    def test_model_from_pretrained(self):
+        pass
+
+
+@require_torch
+class VitDetBackboneTest(unittest.TestCase, BackboneTesterMixin):
+    all_model_classes = (VitDetBackbone,) if is_torch_available() else ()
+    config_class = VitDetConfig
+
+    has_attentions = False
+
+    def setUp(self):
+        self.model_tester = VitDetModelTester(self)
diff --git a/tests/test_backbone_common.py b/tests/test_backbone_common.py
index 1700ab98eed438..a6c10b26a7da24 100644
--- a/tests/test_backbone_common.py
+++ b/tests/test_backbone_common.py
@@ -31,7 +31,8 @@ def test_config(self):
         # test default config
         config = config_class()
         self.assertIsNotNone(config)
-        expected_stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, len(config.depths) + 1)]
+        num_stages = len(config.depths) if hasattr(config, "depths") else config.num_hidden_layers
+        expected_stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, num_stages + 1)]
         self.assertEqual(config.stage_names, expected_stage_names)
         self.assertTrue(set(config.out_features).issubset(set(config.stage_names)))
diff --git a/utils/check_repo.py b/utils/check_repo.py
index 678294a1b0dcd1..9da984c24275a7 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -965,6 +965,7 @@ def find_all_documented_objects() -> List[str]:
     "SwinBackbone",
     "TimmBackbone",
     "TimmBackboneConfig",
+    "VitDetBackbone",
 ]
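For reference, a small sketch (not from the test suite; `DummyConfig` is hypothetical) of the stage-name logic that the `tests/test_backbone_common.py` change above switches to, so that configs like `VitDetConfig`, which expose `num_hidden_layers` rather than `depths`, still produce the expected `stage_names`:

```python
# Hypothetical config standing in for a VitDet-style backbone config without `depths`.
class DummyConfig:
    num_hidden_layers = 4


config = DummyConfig()

# Mirrors the updated logic in tests/test_backbone_common.py: fall back to
# num_hidden_layers when the config has no `depths` attribute.
num_stages = len(config.depths) if hasattr(config, "depths") else config.num_hidden_layers
expected_stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, num_stages + 1)]

print(expected_stage_names)  # ['stem', 'stage1', 'stage2', 'stage3', 'stage4']
```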