v4.16.0: Nyströmformer, REALM, ViTMAE, ViLT, Swin Transformer, YOSO, ...
New models
Nyströmformer
The Nyströmformer model was proposed in Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.
The Nyströmformer model overcomes the quadratic complexity of self-attention in the input sequence length by adapting the Nyström method to approximate standard self-attention, enabling inputs of thousands of tokens.
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=nystromformer
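To make the complexity claim concrete, the sketch below counts attention-score entries for full self-attention and for a Nyström-style approximation that goes through a small set of landmark points. The function names, the landmark count of 64, and the choice to count only score-matrix entries (ignoring the pseudo-inverse and other overheads) are illustrative assumptions, not part of the release.

```python
def full_attention_entries(seq_len: int) -> int:
    """Entries in the full seq_len x seq_len attention-score matrix."""
    return seq_len * seq_len


def nystrom_attention_entries(seq_len: int, num_landmarks: int) -> int:
    """Entries in the smaller matrices a Nystrom-style scheme computes.

    Assumed breakdown for illustration: two seq_len x num_landmarks
    matrices plus one num_landmarks x num_landmarks matrix.
    """
    return 2 * seq_len * num_landmarks + num_landmarks * num_landmarks


# Compare the two counts as the sequence grows (64 landmarks is an assumption).
for seq_len in (512, 4096, 8192):
    full = full_attention_entries(seq_len)
    approx = nystrom_attention_entries(seq_len, num_landmarks=64)
    print(f"n={seq_len}: full={full}, nystrom={approx}, ratio={full / approx:.1f}")
```

With a fixed number of landmarks, the approximate count grows linearly in the sequence length while the full count grows quadratically.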
REALM
The REALM model was proposed in REALM: Retrieval-Augmented Language Model Pre-Training by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
It is a retrieval-augmented language model that first retrieves documents from a textual knowledge corpus and then uses the retrieved documents to answer questions.
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=realm
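The retrieve-then-read flow described above can be sketched with a toy example. Everything here is hypothetical: the corpus sentences are made up, word overlap stands in for REALM's learned neural retriever, and the reader simply returns the best document instead of extracting an answer span.

```python
from collections import Counter
from typing import List

# Toy knowledge corpus (made-up sentences for illustration only).
CORPUS = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Mount Everest is the highest mountain on Earth.",
]


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [tok.strip(".,?!").lower() for tok in text.split()]


def relevance(question: str, document: str) -> int:
    """Score a document by word overlap with the question.

    This is a stand-in for a learned retriever so that the sketch
    needs no trained model.
    """
    q_counts = Counter(tokenize(question))
    d_counts = Counter(tokenize(document))
    return sum((q_counts & d_counts).values())


def retrieve(question: str, corpus: List[str], top_k: int = 1) -> List[str]:
    """Step 1: return the top_k highest-scoring documents."""
    ranked = sorted(corpus, key=lambda doc: relevance(question, doc), reverse=True)
    return ranked[:top_k]


def read(question: str, documents: List[str]) -> str:
    """Step 2: placeholder reader that returns the best retrieved document."""
    return documents[0]


def answer(question: str, corpus: List[str]) -> str:
    """Retrieve first, then read."""
    return read(question, retrieve(question, corpus))


print(answer("What is the capital of France?", CORPUS))
```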
ViTMAE
The ViTMAE model was proposed in Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
- Add MAE by @NielsRogge in #15120
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vit_mae
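As a rough illustration of the masking step in this pre-training scheme, the sketch below splits patch indices into a visible set and a masked set by uniform random sampling. The helper name, the 224x224 image with 16x16 patches, and the 0.75 mask ratio are assumptions chosen for the example; the model would only be asked to reconstruct pixels for the masked indices.

```python
import random
from typing import List, Tuple


def random_patch_mask(
    num_patches: int, mask_ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75)
print(len(visible), len(masked))
```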
ViLT
The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim.
ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP).
- Add ViLT by @NielsRogge in #14895
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=vilt
Swin Transformer
The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
The Swin Transformer serves as a general-purpose backbone for computer vision. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=swin
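The linear-complexity claim can be checked with a small counting sketch: it compares the number of query-key pairs under global attention with the number under attention restricted to non-overlapping windows. The function names, the 7x7 window, and the grid sizes are illustrative assumptions; the shifted-window step is not modelled here.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every patch attends to every patch."""
    num_patches = height * width
    return num_patches * num_patches


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention stays inside non-overlapping windows.

    Assumes height and width are divisible by the window size.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    patches_per_window = window * window
    return num_windows * patches_per_window * patches_per_window


# Doubling each side of the patch grid multiplies the number of patches by 4.
for side in (56, 112):
    print(
        side,
        global_attention_pairs(side, side),
        windowed_attention_pairs(side, side, window=7),
    )
```

Quadrupling the number of patches multiplies the global count by 16 but the windowed count by only 4, which is the linear behaviour described above.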
YOSO
The YOSO model was proposed in You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
YOSO approximates standard softmax self-attention via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with a single hash.
Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=yoso
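A minimal sketch of the sampling idea follows: each trial draws a random hash and records a Bernoulli outcome, 1 if the query and key land in the same bucket and 0 otherwise, so similar vectors collide more often. Sign-of-random-projection hashing is used as a stand-in LSH family, and the function names, number of bits, and trial count are all assumptions for illustration; this is not the YOSO implementation.

```python
import random
from typing import List


def hyperplane_hash(vector: List[float], hyperplanes: List[List[float]]) -> tuple:
    """Sign pattern of the vector against each random hyperplane."""
    return tuple(
        sum(h_i * v_i for h_i, v_i in zip(plane, vector)) >= 0.0
        for plane in hyperplanes
    )


def collision_rate(
    query: List[float],
    key: List[float],
    num_bits: int = 4,
    num_trials: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of random hash draws in which query and key collide.

    Each trial is one Bernoulli sample: 1 if the hash codes match, else 0.
    """
    rng = random.Random(seed)
    dim = len(query)
    hits = 0
    for _ in range(num_trials):
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
        if hyperplane_hash(query, planes) == hyperplane_hash(key, planes):
            hits += 1
    return hits / num_trials


q = [1.0, 0.0, 0.0]
print(collision_rate(q, q))                 # identical vectors
print(collision_rate(q, [0.0, 1.0, 0.0]))   # orthogonal vectors
print(collision_rate(q, [-1.0, 0.0, 0.0]))  # opposite vectors
```

The collision rate falls as the angle between query and key grows, which is what lets the collision indicator act as a cheap proxy for an attention weight.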
Add model like
To help contributors add new models more easily to Transformers, there is a new command that will clone an existing model and set the various hooks in the library, so that you only have to write the tweaks needed to the modeling file. Just run transformers-cli add-new-model-like and fill in the questionnaire!
Training scripts
New training scripts were introduced: one for speech seq2seq models and an image pre-training script leveraging the ViTMAE models.
Finally, an image captioning example in Flax gets added to the library.
- Add Speech Seq2Seq Training script by @patrickvonplaten in #14792
- [ViTMAE] Add image pretraining script by @NielsRogge in #15242
- Add Flax image captioning example by @ydshieh in #14864
Pipelines
Added support for long files in the automatic-speech-recognition (ASR) pipeline, as well as support for audio models with a language model (LM), which improves (lowers) the WER on many tasks. See the blog post.
Argument homogeneity and framework support also continue to improve across all pipelines.
- Large audio chunking for the existing ASR pipeline by @anton-l in #14896
- Enabling TF on image-classification pipeline. by @Narsil in #15030
- Pipeline ASR with LM. by @Narsil in #15071
- ChunkPipeline: batch_size enabled on zero-cls and qa pipelines. by @Narsil in #14225
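The idea behind chunking long audio files can be sketched as splitting a recording into overlapping windows, so that a word cut at one chunk boundary still appears whole in a neighbouring chunk. The helper below is hypothetical and simplified: its name, its parameters, and the single overlap value are assumptions, and it does not reproduce the pipeline's actual chunking logic.

```python
from typing import List, Tuple


def chunk_spans(total_s: float, chunk_s: float, stride_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering [0, total_s].

    Consecutive chunks overlap by stride_s seconds.
    """
    if not 0 <= stride_s < chunk_s:
        raise ValueError("stride_s must satisfy 0 <= stride_s < chunk_s")
    step = chunk_s - stride_s
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


# A 25-second file, 10-second chunks, 2 seconds of overlap.
for span in chunk_spans(total_s=25.0, chunk_s=10.0, stride_s=2.0):
    print(span)
```

Each chunk can then be transcribed independently and the overlapping regions reconciled when the partial transcripts are merged.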
PyTorch improvements
The ELECTRA model can now be used as a decoder, enabling an ELECTRA encoder-decoder model.
TensorFlow improvements
- Keras metric callback by @Rocketknight1 and @merveenoyan in #14867
The vision encoder decoder model can now be used in TensorFlow.
CLIP gets ported to TensorFlow.
Flax improvements
RoFormer gets ported to Flax.
Deprecations
- Deprecates AdamW and adds --optim by @manuelciosici in #14744
Documentation
The documentation has been fully migrated to Markdown. If you are making a contribution, make sure to read the upgraded guide on how to write good docstrings.
- Convert rst files by @sgugger in #14888
- Doc styler v2 by @sgugger in #14950
- Convert last rst file by @sgugger in #14952
- Doc styler examples by @sgugger in #14953
- [doc] consistent True/False/None default format by @stas00 in #14951
- [doc] :obj: hunt by @stas00 in #14954
- [doc] :class: hunt by @stas00 in #14955
Bugfixes and improvements
- Fix installation instructions for BART ONNX example by @lewtun in #14885
- Fix doc examples: ... takes no keyword arguments by @ydshieh in #14701
- Fix AttributeError from PreTrainedTokenizerFast.decoder by @aphedges in #14691
- Add 'with torch.no_grad()' to ALBERT integration test forward pass by @henholm in #14808
- Add ONNX support for MarianMT models by @lewtun in #14586
- add custom stopping criteria to human eval script by @lvwerra in #14897
- Set run_name in MLflowCallback by @YangDong2002 in #14894
- [AutoTokenizer] Fix incorrect from pretrained by @patrickvonplaten in #14900
- [Tests] Update speech diarization and WavLM tolerances by @anton-l in #14902
- [doc] post-porting by @stas00 in #14890
- [Generate] Remove attention_mask and integrate model_main_input_name by @patrickvonplaten in #14856
- Fix failing GPU trainer tests by @sgugger in #14903
- Better logic for getting tokenizer config in AutoTokenizer by @sgugger in #14906
- [doc] install - add link to jax installation by @stas00 in #14912
- [WavLM] fix wavlm docs by @patrickvonplaten in #14910
- Fix Perceiver docs by @Sanster in #14917
- fix to issue #14833 in data_collator - consider no labels by @kleinay in #14930
- Fix duplicate call to save_checkpoint when using deepspeed by @MihaiBalint in #14946
- [WavLM] give model more precision tolerance in tests by @patrickvonplaten in #14958
- [Speech Recognition Examples] Update README.md by @patrickvonplaten in #14965
- [Tests] Speed up tokenizer tests by @patrickvonplaten in #14964
- [Wav2Vec2] Rename model's feature extractor to feature encoder by @patrickvonplaten in #14959
- Replace assertion with exception by @jaketae in #14970
- remove absl workaround as it's no longer needed by @stas00 in #14909
- Fixing a pathological case for slow tokenizers by @Narsil in #14981
- [AutoProcessor] Correct AutoProcessor and automatically add processor… by @patrickvonplaten in #14881
- [Generate] correct encoder_outputs are passed without attention_mask by @patrickvonplaten in #14980
- Adding num_return_sequences support for text2text generation. by @Narsil in #14988
- Enabling tokenizers upgrade. by @Narsil in #14941
- Allow training to resume even if RNG states are not properly loaded by @sgugger in #14994
- Map model_type and doc pages names by @sgugger in #14944
- Fixing t2t pipelines lists outputs. by @Narsil in #15008
- Improve truncation_side by @Narsil in #14947
- Fix doc examples: name 'torch' is not defined by @ydshieh in #15016
- [Tests] Correct Wav2Vec2 & WavLM tests by @patrickvonplaten in #15015
- [doc] Update parallelism.mdx by @hyunwoongko in #15013
- Fix Code block speech pretraining example by @flozi00 in #14983
- Fix a little typo by @milyiyo in #15002
- Hotfix chunk_length_s instead of _ms. by @Narsil in #15029
- [doc] Update parallelism.mdx by @hyunwoongko in #15018
- [megatron convert] PYTHONPATH requirements by @stas00 in #14956
- Fix doc example: mask_time_indices (numpy) has no attribute 'to' by @ydshieh in #15033
- Adding QoL for batch_size arg (like others enabled everywhere). by @Narsil in #15027
- [CLIP] Fix PT test by @patrickvonplaten in #15041
- [SpeechEncoderDecoder] Fix from pretrained by @patrickvonplaten in #15043
- [CLIP] Fix TF test by @patil-suraj in #15042
- Wrap Roberta integration test forward passes with torch.no_grad() by @mattchurgin in #15037
- Add Detectron2 to Github actions by @NielsRogge in #15053
- Remove old asserts. by @Narsil in #15012
- Add 'with torch.no_grad()' to BertGeneration integration test forward passes by @itsTurner in #14963
- Update run_speech_recognition_seq2seq.py (max_eval_samples instead of train_samples) by @flozi00 in #14967
- [VisionTextDualEncoder] Fix doc example by @ydshieh in #15057
- Resubmit changes after rebase to master by @kct22aws in #14982
- [Fix doc examples] missing from_pretrained by @ydshieh in #15044
- [VisionTextDualEncoder] Add token_type_ids param by @ydshieh in #15073
- Fix convert for newer megatron-lm bert model by @yoquankara in #14082
- [Wav2Vec2 Speech Event] Add speech event v2 by @patrickvonplaten in #15083
- fix model table cell text alignment by @ydshieh in #14999
- Update check_repo.py by @kamalkraj in #15014
- Make OpenAIGPTTokenizer work with SpaCy 2.x and 3.x by @cody-moveworks in #15019
- Change assignee for tokenizers by @LysandreJik in #15088
- support the trocr small models by @liminghao1630 in #14893
- [Fix doc example] RagModel by @ydshieh in #15076
- Model summary doc page horizontal banners by @mishig25 in #15058
- Use tqdm.auto in Pipeline docs by @bryant1410 in #14920
- [doc] normalize HF Transformers string by @stas00 in #15023
- Happy New Year! by @sgugger in #15094
- [DOC] fix doc examples for bart-like models by @patil-suraj in #15093
- [performance doc] Power and Cooling by @stas00 in #14935
- Add test to check reported training loss by @sgugger in #15096
- Take gradient accumulation into account when defining samplers by @sgugger in #15095
- [Fix doc example] Speech2TextForConditionalGeneration by @ydshieh in #15092
- Fix cookiecutter by @NielsRogge in #15100
- [Wav2Vec2ProcessorWithLM] improve decoder download by @patrickvonplaten in #15040
- Adds IBERT to models exportable with ONNX by @MaximovaIrina in #14868
- change metric_key_prefix in seq2seq_trainer.py by @JejuWayfarer in #15099
- Print out durations of all scheduled tests by @LysandreJik in #15102
- Fix failing W2V2 test by @LysandreJik in #15104
- Doc styler tip by @sgugger in #15105
- Update ONNX docs by @lewtun in #14904
- Fix saving FlaubertTokenizer configs by @vmaryasin in #14991
- Update TF test_step to match train_step by @Rocketknight1 in #15111
- use block_size instead of max_seq_length in tf run_clm example by @riklopfer in #15036
- fix: switch from slow to generic tokenizer class by @lvwerra in #15122
- Fix TFEncoderDecoder labels handling #14357 by @ydshieh in #15001
- Add ONNX configuration classes to docs by @lewtun in #15121
- Add with torch.no_grad() to DistilBERT integration test forward pass by @jaketae in #14979
- mBART support for run_summarization.py by @banda-larga in #15125
- doc-builder -> doc-build by @LysandreJik in #15134
- [Fix doc example] - ProphetNetDecoder by @ydshieh in #15124
- [examples/flax/language-modeling] set loglevel by @stas00 in #15129
- Update model_sharing.mdx by @carlos-aguayo in #15142
- Enable AMP for xla:gpu device in trainer class by @ymwangg in #15022
- [deepspeed tests] fix summarization by @stas00 in #15149
- Check the repo consistency in model templates test by @sgugger in #15141
- Add TF glu activation function by @gante in #15146
- Make sure all submodules are properly registered by @sgugger in #15144
- [Fix doc example] - OpenAIGPTDoubleHeadsModel by @ydshieh in #15143
- fix BertTokenizerFast tokenize_chinese_chars arg by @SaulLu in #15158
- Fix typo in test_configuration_common.py by @novice03 in #15160
- Add "open in hf spaces" gradio button issue #73 by @AK391 in #15106
- TF Bert inference - support np.ndarray optional arguments by @gante in #15074
- Fixing flaky test (hopefully). by @Narsil in #15154
- Better dummies by @sgugger in #15148
- Update from keras2onnx to tf2onnx by @gante in #15162
- [doc] performance: Efficient Software Prebuilds by @stas00 in #15147
- [Speech models] Disable non-existing chunking in tests by @patrickvonplaten in #15163
- Added forward pass of test_inference_image_classification_head by @MrinalTyagi in #14777
- Fix dtype issue in TF BART by @Rocketknight1 in #15178
- [doc] new MoE paper by @stas00 in #15184
- Mark bad tokenizers version by @sgugger in #15188
- [Fix doc example] UniSpeechSatForPreTraining by @ydshieh in #15152
- is_ctc needs to be updated to self.type == "ctc". by @Narsil in #15194
- [Fix doc example] TFRagModel by @ydshieh in #15187
- Error when code examples are improperly closed by @sgugger in #15186
- Fix deprecation warnings for int div by @sgugger in #15180
- Copies and docstring styling by @sgugger in #15202
- [ASR pipeline] correct with lm pipeline by @patrickvonplaten in #15200
- Remove dependency to quiet Dependabot by @sgugger in #15205
- Ignore empty subfolders when identifying submodules by @sgugger in #15204
- [MBartTokenizer] remove dep on xlm-roberta tokenizer by @patil-suraj in #15201
- fix: #14486 do not use BertPooler in DPR by @PaulLerner in #15068
- [Fix doc example] Wrong checkpoint name by @ydshieh in #15079
- [Robust Speech Event] Add guides by @patrickvonplaten in #15155
- Enable tqdm toggling by @jaketae in #15167
- [FLAX] glue training example refactor by @kamalkraj in #13815
- Rename compute_loss in TF models by @Rocketknight1 in #15207
- Build dev documentation by @LysandreJik in #15210
- [Fix doc example] TFFunnelTokenizer' is not defined by @ydshieh in #15225
- Correct Speech Event Readme by @patrickvonplaten in #15226
- [ViTMAE] Various fixes by @NielsRogge in #15221
- [Speech Event] Fix speech event readme by @patil-suraj in #15227
- Fix typo in BERT tokenization file by @qqaatw in #15228
- Fix PR number by @LysandreJik in #15231
- Adapt Common Voice Talk Title and Abstract by @patrickvonplaten in #15233
- Update Trainer code example by @NielsRogge in #15070
- Make chuking smartly (long files) work on asr ctc_with_lm. by @Narsil in #15219
- Fix usage of additional kwargs in from_encoder_decoder_pretrained in encoder-decoder models by @jsnfly in #15056
- Update README.md by @anton-l in #15239
- Update README.md by @anton-l in #15246
- Update pipelines.mdx by @kamalkraj in #15243
- [Fix doc example] missing import by @ydshieh in #15240
- Fixes tf_default_data_collator sometimes guessing the wrong dtype for labels by @Rocketknight1 in #15234
- Make sure to raise NotImplementedError with correct method name by @kumapo in #15253
- Fix crash when logs are empty because Keras has wiped them out of spite by @Rocketknight1 in #15258
- Tentative workflow improvement by @LysandreJik in #15255
- Fix code examples by @NielsRogge in #15257
- Adds missing module_specs for usages of _LazyModule by @jkuball in #15230
- Prepare ONNX export for torch v1.11 by @lewtun in #15270
- Fix by @novice03 in #15276
- Move BART + ONNX example to research_projects by @lewtun in #15271
- Specify providers explicitly in ORT session initialization by @wangyems in #15235
- Fixes Benchmark example link by @evandrosks in #15278
- [Robust Speech Challenge] Add timeline by @patrickvonplaten in #15274
- [Fix doc example] TFLayoutLMForTokenClassification: missing import tf by @ydshieh in #15268
- [Wav2Vec2ProcessorWithLM] improve multi processing by @patrickvonplaten in #15247
- Refine errors for pretrained objects by @sgugger in #15261
- [PyTorch-nightly-test] Fix Wav2Vec2 LM & Phoneme tests by @patrickvonplaten in #15272
- Update eval.py by @patrickvonplaten in #15310
- Update CONTRIBUTING.md by @kamalkraj in #15290
- Fix a typo in tag addition by @sgugger in #15286
- Remove old debug code leftover. by @Narsil in #15306
- [Fix doc example] fix missing import jnp by @ydshieh in #15291
- [LayoutLMV2 Tests] Make sure input is on GPU by @patrickvonplaten in #15314
- Replace NystromformerTokenizer with AutoTokenizer by @novice03 in #15312
- [Beam Search] Correct returned beam scores by @patrickvonplaten in #14654
- [Examples] Correct run ner label2id for fine-tuned models by @patrickvonplaten in #15017
- Avoid using get_list_of_files by @sgugger in #15287
- [Tests] Fix test by @NielsRogge in #15324
- Add 🤗 Accelerate tutorial by @stevhliu in #15263
- Added missing code in exemplary notebook - custom datasets fine-tuning by @Pawloch247 in #15300
- Fix encoder-decoder models when labels is passed by @ydshieh in #15172
- Fix table formatting in SegFormer docs by @deppen8 in #15337
- Fix deepspeed docs by @ngoquanghuy99 in #15346
- Fix 'eval_split_name' described as defaulting to 'train' by @FremyCompany in #15348
- Update doc writing guide by @sgugger in #15350
- Add YOSO by @novice03 in #15091
- [docs] post-PR merge fix by @stas00 in #15355
- Fix YosoConfig doc by @sgugger in #15353
- [DocTests Speech] Add doc tests for all speech models by @patrickvonplaten in #15031
- Push to hub save by @sgugger in #15327
- Fix KerasMetricCallback prediction with generate() and inference of column names by @Rocketknight1 in #15351
- Add a device argument to the eval script by @anton-l in #15371
- improve saving strategy of sentencepiece tokenizer by @SaulLu in #15328
- Implement fixes for TrainingArguments doc by @sgugger in #15370
- Super-small fix stops us confusing Keras console logging by modifying… by @Rocketknight1 in #15373
- Add proper documentation for Keras callbacks by @sgugger in #15374
- Example script for PushToHubCallback by @Rocketknight1 in #15375
Impressive community contributors
The community contributors below have significantly contributed to the v4.16.0 release. Thank you!
- @novice03, for contributing Nyströmformer, Swin Transformer and YOSO
- @qqaatw, for contributing REALM
- @stancld, for adding support for ELECTRA as a decoder, and porting RoFormer to Flax
- @ydshieh, for a myriad of documentation fixes, the port of CLIP to TensorFlow, the addition of the TensorFlow vision encoder-decoder model, and the contribution of an image captioning example in Flax.
New Contributors
- @YangDong2002 made their first contribution in #14894
- @Sanster made their first contribution in #14917
- @kleinay made their first contribution in #14930
- @MihaiBalint made their first contribution in #14946
- @milyiyo made their first contribution in #15002
- @mattchurgin made their first contribution in #15037
- @itsTurner made their first contribution in #14963
- @kct22aws made their first contribution in #14982
- @yoquankara made their first contribution in #14082
- @cody-moveworks made their first contribution in #15019
- @MaximovaIrina made their first contribution in #14868
- @JejuWayfarer made their first contribution in #15099
- @novice03 made their first contribution in #14659
- @banda-larga made their first contribution in #15125
- @manuelciosici made their first contribution in #14744
- @carlos-aguayo made their first contribution in #15142
- @gante made their first contribution in #15146
- @AK391 made their first contribution in #15106
- @MrinalTyagi made their first contribution in #14777
- @jsnfly made their first contribution in #15056
- @jkuball made their first contribution in #15230
- @wangyems made their first contribution in #15235
- @evandrosks made their first contribution in #15278
- @Pawloch247 made their first contribution in #15300
- @deppen8 made their first contribution in #15337
- @ngoquanghuy99 made their first contribution in #15346
Full Changelog: v4.15.0...v4.16.0