v4.16.0: Nyströmformer, REALM, ViTMAE, ViLT, Swin Transformer, YOSO, ...

@sgugger released this 27 Jan 18:14

New models

Nyströmformer

The Nyströmformer model was proposed in Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.

The Nyströmformer model overcomes the quadratic complexity of self-attention on the input sequence length by adapting the Nyström method to approximate standard self-attention, enabling longer sequences with thousands of tokens as input.
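To make the approximation concrete, here is a minimal NumPy sketch of a Nyström-style attention. It is not the library's implementation: the function names are hypothetical, landmarks are assumed to be means over contiguous segments of the sequence, and the exact `np.linalg.pinv` stands in for whatever pseudo-inverse scheme a real model would use. The point is only that the three small softmax factors never require an (n, n) matrix.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def exact_attention(q, k, v):
    # Standard self-attention: builds the full (n, n) score matrix.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def nystrom_attention(q, k, v, num_landmarks):
    # Hypothetical helper. Landmarks are segment means (an assumption of
    # this sketch), so n must be divisible by num_landmarks.
    n, d = q.shape
    assert n % num_landmarks == 0
    q_l = q.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    k_l = k.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    scale = np.sqrt(d)
    f = softmax(q @ k_l.T / scale)    # (n, m): queries vs. landmark keys
    a = softmax(q_l @ k_l.T / scale)  # (m, m): landmark queries vs. landmark keys
    b = softmax(q_l @ k.T / scale)    # (m, n): landmark queries vs. keys
    # Multiply right to left so that no (n, n) matrix is ever formed.
    return f @ (np.linalg.pinv(a) @ (b @ v))


rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
approx = nystrom_attention(q, k, v, num_landmarks=8)
exact = exact_attention(q, k, v)
print(approx.shape, float(np.abs(approx - exact).mean()))
```

With m landmarks, the cost of the three products grows with n·m rather than n², which is why longer inputs become affordable when m is kept small.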

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=nystromformer

REALM

The REALM model was proposed in REALM: Retrieval-Augmented Language Model Pre-Training by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.

It is a retrieval-augmented language model that first retrieves documents from a textual knowledge corpus and then uses the retrieved documents to handle question answering tasks.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=realm

ViTMAE

The ViTMAE model was proposed in Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.

The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
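The masking step behind this pre-training objective can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library's code: the helper names are hypothetical, the 224×224 image, 16×16 patches and 75% mask ratio are example values, and an all-zero array stands in for a decoder's prediction.

```python
import numpy as np


def patchify(image, patch_size):
    # Split an (H, W, C) image into flattened, non-overlapping patches.
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)


def random_masking(patches, mask_ratio, rng):
    # Keep a random subset of patches; mask[i] is True when patch i is hidden.
    n = patches.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:num_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, patch_size=16)
visible, mask, keep_idx = random_masking(patches, mask_ratio=0.75, rng=rng)

# Only `visible` would be fed to the encoder. The reconstruction loss is
# evaluated on the hidden patches; zeros stand in for a decoder output here.
prediction = np.zeros_like(patches)
loss = float(((prediction - patches) ** 2)[mask].mean())
print(patches.shape, visible.shape, int(mask.sum()), loss)
```

Because the encoder only processes the visible patches, a high mask ratio also reduces the pre-training compute per image.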

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vit_mae

ViLT

The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim.

ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP).

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vilt

Swin Transformer

The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

The Swin Transformer serves as a general-purpose backbone for computer vision. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
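The window scheme can be illustrated with a short NumPy sketch. The function names are hypothetical, and the sketch leaves out the attention itself as well as the masking a real model needs after the cyclic shift; it only shows how a feature map is split into non-overlapping windows, how a half-window shift makes the next layer's windows straddle the previous ones, and why the number of query-key pairs grows linearly with the number of tokens.

```python
import numpy as np


def window_partition(x, window_size):
    # Split an (H, W, C) feature map into non-overlapping square windows,
    # returning an array of shape (num_windows, window_size**2, C).
    h, w, c = x.shape
    assert h % window_size == 0 and w % window_size == 0
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, c)


def shifted_window_partition(x, window_size):
    # Cyclically shift the map by half a window, then partition as usual.
    shift = window_size // 2
    rolled = np.roll(x, (-shift, -shift), axis=(0, 1))
    return window_partition(rolled, window_size)


def attention_pairs(h, w, window_size):
    # Query-key pairs for global attention vs. window attention.
    tokens = h * w
    return tokens * tokens, tokens * window_size * window_size


x = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(x, window_size=4)
shifted = shifted_window_partition(x, window_size=4)
print(regular.shape, shifted.shape)
print(attention_pairs(8, 8, 4), attention_pairs(16, 16, 4))
```

In this toy example the first shifted window covers rows and columns 2 to 5 of the original map, so it overlaps all four regular windows, which is the cross-window connection described above.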

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=swin

YOSO

The YOSO model was proposed in You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.

YOSO approximates standard softmax self-attention via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with a single hash.
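The "single hash" remark can be illustrated with a small NumPy sketch. It assumes unit-norm queries and keys and a random-hyperplane (sign) hash, for which two vectors at angle θ agree on all τ hash bits with probability (1 - θ/π)^τ. The function names and the value of τ are illustrative, and the sketch builds the full collision matrix for clarity only, which a linear-cost implementation would avoid.

```python
import numpy as np


def unit_rows(x):
    # Normalize each row to unit length.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def hash_codes(x, planes):
    # Sign pattern of each row against tau hyperplanes, packed into one integer.
    bits = (x @ planes.T) > 0
    weights = 1 << np.arange(planes.shape[0])
    return bits.astype(np.int64) @ weights


def sample_collisions(q, k, tau, rng):
    # One hash draw gives a Bernoulli outcome for every (query, key) pair.
    planes = rng.standard_normal((tau, q.shape[1]))
    return hash_codes(q, planes)[:, None] == hash_codes(k, planes)[None, :]


def collision_probability(q, k, tau):
    # Success probability of each Bernoulli variable under the sign hash.
    cos = np.clip(q @ k.T, -1.0, 1.0)
    return (1.0 - np.arccos(cos) / np.pi) ** tau


rng = np.random.default_rng(0)
q = unit_rows(rng.standard_normal((4, 8)))
k = unit_rows(rng.standard_normal((4, 8)))
tau = 4

# Averaging many independent hash draws recovers the collision probabilities.
draws = 4000
estimate = sum(sample_collisions(q, k, tau, rng) for _ in range(draws)) / draws
expected = collision_probability(q, k, tau)
print(float(np.abs(estimate - expected).max()))
```

Each call to `sample_collisions` hashes every query and key once and then compares codes, so all pairwise outcomes come from the same hash draw.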

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=yoso

Add model like

To help contributors add new models to Transformers more easily, there is a new command that clones an existing model and sets up the various hooks in the library, so that you only have to write the tweaks needed in the modeling file. Just run transformers-cli add-new-model-like and fill in the questionnaire!

Training scripts

New training scripts were introduced: one for speech seq2seq models and one for image pre-training that leverages the ViTMAE models.
An image captioning example in Flax was also added to the library.

Pipelines

The automatic-speech-recognition (ASR) pipeline now supports long audio files, as well as audio models with a language model (LM), which lowers the word error rate (WER) on many tasks. See the blog post.
Work also continues on making arguments and framework support more homogeneous across all pipelines.

  • Large audio chunking for the existing ASR pipeline by @anton-l in #14896
  • Enabling TF on image-classification pipeline. by @Narsil in #15030
  • Pipeline ASR with LM. by @Narsil in #15071
  • ChunkPipeline: batch_size enabled on zero-cls and qa pipelines. by @Narsil in #14225
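As a rough illustration of how chunking can handle long audio, the sketch below splits a signal into overlapping chunks and keeps only the center of each chunk when stitching the results back together. It is a simplified stand-in, not the pipeline's code: the function names are hypothetical, the lengths are in samples, and the trimming is applied to the raw samples here, whereas a real ASR system would apply it to the model's outputs.

```python
import numpy as np


def chunk_with_stride(num_samples, chunk_len, stride):
    # Return (start, end, left, right) spans. `left` and `right` are the
    # overlapping margins whose results are dropped when stitching.
    step = chunk_len - 2 * stride
    assert step > 0, "chunk_len must be larger than twice the stride"
    spans = []
    for start in range(0, num_samples, step):
        end = min(start + chunk_len, num_samples)
        left = 0 if start == 0 else stride
        right = 0 if end == num_samples else stride
        spans.append((start, end, left, right))
        if end == num_samples:
            break
    return spans


def stitch(samples, spans):
    # Keep only the center of each chunk, so every sample appears once.
    kept = [samples[start + left:end - right] for start, end, left, right in spans]
    return np.concatenate(kept)


samples = np.arange(100)
spans = chunk_with_stride(len(samples), chunk_len=30, stride=5)
restored = stitch(samples, spans)
print(len(spans), np.array_equal(restored, samples))
```

The margins give each chunk some surrounding context, so results near a chunk boundary are taken from a neighboring chunk where that position sits closer to the center.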

PyTorch improvements

The ELECTRA model can now be used as a decoder, enabling an ELECTRA encoder-decoder model.

  • Add ElectraForCausalLM -> Enable Electra encoder-decoder model by @stancld in #14729

TensorFlow improvements

The vision encoder decoder model can now be used in TensorFlow.

CLIP gets ported to TensorFlow.

Flax improvements

RoFormer gets ported to Flax.

Deprecations

Documentation

The documentation has been fully migrated to Markdown. If you are making a contribution, make sure to read the updated guide on how to write good docstrings.

Bugfixes and improvements

Impressive community contributors

The community contributors below have significantly contributed to the v4.16.0 release. Thank you!

  • @novice03, for contributing Nyströmformer, Swin Transformer and YOSO
  • @qqaatw, for contributing REALM
  • @stancld, for adding support for ELECTRA as a decoder, and porting RoFormer to Flax
  • @ydshieh, for a myriad of documentation fixes, the port of CLIP to TensorFlow, the addition of the TensorFlow vision encoder-decoder model, and the contribution of an image captioning example in Flax.

New Contributors

Full Changelog: v4.15.0...v4.16.0