v4.37 Qwen2, Phi-2, SigLIP, ViP-LLaVA, FastSpeech2Conformer, 4-bit serialization, Whisper longform generation
Model releases
Qwen2
Qwen2 is the new model series of large language models from the Qwen team. Previously, the Qwen series was released, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped query attention, a mixture of sliding window attention and full attention, etc. Additionally, it has an improved tokenizer adapted to multiple natural languages and code.
- Add qwen2 by @JustinLin610 in #28436
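To make the grouped query attention mentioned above more concrete, here is a minimal pure-Python sketch of how several query heads can share one key/value head. The function name `kv_head_for_query_head` and the head counts are illustrative assumptions, not code or values taken from the Qwen2 implementation.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared KV head
    return q_head // group_size


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)
```

With fewer key/value heads than query heads, the key/value cache shrinks by the same factor, which is the usual motivation for this design.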
Phi-2
Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.
- [Phi2] Add support for phi2 models by @susnato in #28211
- [Phi] Extend implementation to use GQA/MQA. by @gugarosa in #28163
- update docs to add the `phi-2` example by @susnato in #28392
- Fixes default value of `softmax_scale` in `PhiFlashAttention2`. by @gugarosa in #28537
SigLIP
The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.
- Add SigLIP by @NielsRogge in #26522
- [SigLIP] Don't pad by default by @NielsRogge in #28578
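The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the SigLIP training code: the function name, the `temperature` and `bias` defaults, and the toy embeddings are assumptions, and the embeddings are assumed to be unit-normalized.

```python
import math


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum a log-sigmoid loss over every image-text pair, averaged over images."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            # -log(sigmoid(label * logit)) == log(1 + exp(-label * logit))
            total += math.log1p(math.exp(-label * logit))
    return total / n


# Toy unit-norm embeddings: texts either line up with their images or are swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

Each pair is scored independently as a binary decision, so no softmax normalization across the batch is needed, unlike the contrastive loss used in CLIP.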
ViP-LLaVA
The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
- Adds VIP-llava to transformers by @younesbelkada in #27932
- Fix Vip-llava docs by @younesbelkada in #28085
FastSpeech2Conformer
The FastSpeech2Conformer model was proposed in the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds on FastSpeech, showing improvements in training speed, inference speed, and voice quality. It consists of a variance adapter; duration, energy, and pitch predictors; and a waveform and mel-spectrogram decoder.
- Add FastSpeech2Conformer by @connor-henderson in #23439
Wav2Vec2-BERT
The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.
- Add new meta w2v2-conformer BERT-like model by @ylacombe in #28165
- Add w2v2bert to pipeline by @ylacombe in #28585
4-bit serialization
Transformers models can now be saved and loaded in 4-bit format, so you can push bitsandbytes 4-bit weights to the Hugging Face Hub. To save a 4-bit model and push it to the Hub, install the latest `bitsandbytes` package from PyPI (`pip install -U bitsandbytes`), load your model in 4-bit precision, and call `save_pretrained` / `push_to_hub`. The snippet below pushes to an example repo:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
- [bnb] Let's make serialization of 4bit models possible by @poedator in #26037
- [`Docs`] Add 4-bit serialization docs by @younesbelkada in #28182
4D Attention mask
Enable passing in 4D attention masks to models that support it. This is useful for reducing memory footprint of certain generation tasks.
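As a rough illustration of what a 4D mask can express, the sketch below builds a mask for several sequences packed into one row, so that tokens only attend causally within their own segment. It uses nested Python lists in place of tensors. The `(batch, 1, query_length, key_length)` layout, the 0/1 encoding, and the helper name `packed_causal_mask` are assumptions for illustration; check the documentation of the model you use for the exact shape and dtype it expects.

```python
def packed_causal_mask(segment_lengths):
    """Build a [1][1][T][T] mask of 0/1 values.

    Token q may attend to token k only if k <= q and both tokens
    belong to the same packed segment.
    """
    segment_ids = [seg for seg, n in enumerate(segment_lengths) for _ in range(n)]
    total = len(segment_ids)
    rows = [
        [1 if k <= q and segment_ids[k] == segment_ids[q] else 0 for k in range(total)]
        for q in range(total)
    ]
    return [[rows]]  # batch dimension, then a singleton head dimension


# Two sequences of lengths 2 and 3 packed into a single row of 5 tokens.
mask = packed_causal_mask([2, 3])
for row in mask[0][0]:
    print(row)
```

A regular 2D padding mask cannot describe this block-diagonal pattern, which is why passing the full 4D mask is needed for such cases.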
Improved quantization support
Ability to customise which modules are quantized and which are not.
- [`Awq`] Enable the possibility to skip quantization for some target modules by @younesbelkada in #27950
- add `modules_in_block_to_quantize` arg in GPTQconfig by @SunMarc in #27956
Added fused modules support
- [docs] Fused AWQ modules by @stevhliu in #27896
- [`Awq`] Add llava fused modules support by @younesbelkada in #28239
- [`Mixtral`/`Awq`] Add mixtral fused modules for Awq by @younesbelkada in #28240
SDPA Support for LLaVa, Mixtral, Mistral
- Fix SDPA correctness following torch==2.1.2 regression by @fxmarty in #27973
- [`Llava`/`Vip-Llava`] Add SDPA into llava by @younesbelkada in #28107
- [`Mixtral` & `Mistral`] Add support for sdpa by @ArthurZucker in #28133
- [SDPA] Make sure attn mask creation is always done on CPU by @patrickvonplaten in #28400
- Fix SDPA tests by @fxmarty in #28552
Whisper: Batched state-of-the-art long-form transcription
All decoding strategies (temperature fallback, compression/log-prob/no-speech thresholds, ...) of OpenAI's long-form transcription (see https://github.com/openai/whisper or section 4.5 of the paper) have been added. Unlike https://github.com/openai/whisper, Transformers long-form transcription is fully compatible with pure FP16 and batching!
For more information see: #27658.
- [Whisper] Finalize batched SOTA long-form generation by @patrickvonplaten in #27658
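The temperature fallback strategy listed above can be sketched as a retry loop: decode greedily first, then retry at higher temperatures whenever the output looks too repetitive or too unlikely. This is a simplified stand-alone sketch, not the Transformers implementation. The helper names, the `decode_fn` callback, and the default thresholds are assumptions chosen for illustration, and the no-speech check is omitted.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode_fn,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
):
    """Retry decoding at higher temperatures until the output passes both checks."""
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break
    return result


def fake_decode(temperature):
    # Hypothetical stand-in for a model call: loops at temperature 0, recovers above it.
    if temperature == 0.0:
        return "la " * 100, -0.3
    return "The quick brown fox jumps over the lazy dog.", -0.4


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```

If every temperature fails the checks, this sketch returns the last attempt rather than raising, which keeps long transcriptions moving past a difficult segment.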
Generation: assisted generation upgrades, speculative decoding, and ngram speculation
Assisted generation was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate ngram speculation, and opens the door for new candidate generation methods. Additionally, we've added the speculative decoding strategy on top of assisted generation: when you call assisted generation with an assistant model and `do_sample=True`, you'll benefit from the faster speculative decoding sampling 🏎️💨
- Generate: `assisted_decoding` now accepts arbitrary candidate generators by @gante in #27751
- Generate: assisted decoding now uses `generate` for the assistant by @gante in #28031
- Generate: speculative decoding by @gante in #27979
- Generate: fix speculative decoding by @gante in #28166
- Adding Prompt lookup decoding by @apoorvumang in #27775
- Fix _speculative_sampling implementation by @ofirzaf in #28508
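To give a feel for ngram speculation (prompt lookup decoding), here is a minimal sketch of a candidate generator that needs no assistant model: it looks for an earlier occurrence of the most recent n-gram and proposes the tokens that followed it. The function name and parameters are hypothetical, and the implementation in Transformers may differ in detail.

```python
def prompt_lookup_candidates(token_ids, ngram_size=2, num_candidates=3):
    """Propose continuation tokens by copying what followed the most recent
    earlier occurrence of the trailing n-gram."""
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan earlier positions from right to left, excluding the tail itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_candidates]
    return []


# The sequence ends with [5, 6], which also appears at the start,
# so the tokens that followed it there are proposed as candidates.
candidates = prompt_lookup_candidates([5, 6, 7, 8, 9, 5, 6])
print(candidates)
```

The main model then checks the proposed tokens in a single forward pass and keeps only the prefix it agrees with, so a bad guess costs little while a good one saves several decoding steps.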
torch.load pickle protection
Adding pickle protection via weights_only=True in the torch.load calls.
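The risk being addressed is that unpickling can call arbitrary importable functions. The sketch below shows the general idea of restricting which globals a pickle may reference, using only the standard library. It is not how `torch.load` is implemented; the `RestrictedUnpickler` class and its allowlist are hypothetical and serve only to illustrate the principle.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist for illustration; real loaders permit a different, larger set.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global object.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers of numbers load as usual.
safe = restricted_loads(pickle.dumps(OrderedDict(weight=[1.0, 2.0])))


class Exploit:
    def __reduce__(self):
        # A malicious pickle could ask the unpickler to call any importable function.
        return (print, ("arbitrary code ran during unpickling",))


try:
    restricted_loads(pickle.dumps(Exploit()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(safe, blocked)
```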
Build methods for TensorFlow Models
Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper `build()` methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues and significantly accelerates model load times.
- Proper build() methods for TF by @Rocketknight1 in #27794
- Replace build() with build_in_name_scope() for some TF tests by @Rocketknight1 in #28046
- More TF fixes by @Rocketknight1 in #28081
- Even more TF test fixes by @Rocketknight1 in #28146
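The difference between building weights through a forward pass and building them from known shapes can be shown with toy classes. This is a framework-free sketch of the pattern only: `LazyDense` and `ToyModel` are hypothetical and do not mirror TensorFlow or Transformers code.

```python
class LazyDense:
    """Toy layer whose weight shape depends on the input's last dimension."""

    def __init__(self, units):
        self.units = units
        self.kernel_shape = None  # unknown until the layer is built

    def build(self, input_shape):
        # Only the shape is needed to create weights, not actual input data.
        self.kernel_shape = (input_shape[-1], self.units)

    def __call__(self, inputs):
        if self.kernel_shape is None:
            # Lazy path: infer the shape from the data on the first call.
            self.build((len(inputs), len(inputs[0])))
        return [[0.0] * self.units for _ in inputs]  # placeholder computation


class ToyModel:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size
        self.proj = LazyDense(units=4)

    def build(self):
        # The model knows its layer input shapes from its configuration,
        # so it can build every sublayer without running a forward pass.
        self.proj.build((None, self.hidden_size))


model = ToyModel(hidden_size=8)
model.build()
print(model.proj.kernel_shape)
```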
Remove support for torch 1.10
The last version to support PyTorch 1.10 was 4.36.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.11 and up, we do not support PyTorch 1.10 for v4.37 (i.e. we don't run the tests against torch 1.10).
Model tagging
You can now add custom tags to your model before pushing it to the Hub! This enables you to filter models that contain that tag on the Hub with a simple URL filter. For example, if you want to filter models that have the `trl` tag, you can search: https://huggingface.co/models?other=trl&sort=created
- [`core` / FEAT] Add the possibility to push custom tags using `PreTrainedModel` itself by @younesbelkada in #28405
For example:
from transformers import AutoModelForCausalLM
model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)
model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")
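The filter URL shown earlier follows a simple query-string pattern, so it can be generated for any tag. The helper name `hub_tag_filter_url` below is hypothetical; only the URL shape is taken from the example above.

```python
from urllib.parse import urlencode


def hub_tag_filter_url(tag: str, sort: str = "created") -> str:
    """Build a Hub model-listing URL filtered by a custom tag."""
    return "https://huggingface.co/models?" + urlencode({"other": tag, "sort": sort})


url = hub_tag_filter_url("trl")
print(url)
```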
Bugfixes and improvements
- Fix PatchTSMixer Docstrings by @vijaye12 in #27943
- use logger.warning_once to avoid massive outputs by @ranchlai in #27428
- Docs for AutoBackbone & Backbone by @merveenoyan in #27456
- Fix test for auto_find_batch_size on multi-GPU by @muellerzr in #27947
- Update import message by @NielsRogge in #27946
- Fix parameter count in readme for mixtral 45b by @CyberTimon in #27945
- In PreTrainedTokenizerBase add missing word in error message by @petergtz in #27949
- Fix AMD scheduled CI not triggered by @ydshieh in #27951
- Add deepspeed test to amd scheduled CI by @echarlaix in #27633
- Fix a couple of typos and add an illustrative test by @rjenc29 in #26941
- fix bug in mask2former: cost matrix is infeasible by @xuchenhao001 in #27897
- Fix for stochastic depth decay rule in the TimeSformer implementation by @atawari in #27875
- fix no sequence length models error by @AdamLouly in #27522
- [`Mixtral`] Change mistral op order by @younesbelkada in #27955
- Update bounding box format everywhere by @NielsRogge in #27944
- Support PeftModel signature inspect by @dancingpipi in #27865
- fixed typos (issue 27919) by @asusevski in #27920
- Hot-fix-mixstral-loss by @ArthurZucker in #27948
- Fix link in README.md of Image Captioning by @saswatmeher in #27969
- Better key error for AutoConfig by @Rocketknight1 in #27976
- [doc] fix typo by @stas00 in #27981
- fix typo in dvclive callback by @dberenbaum in #27983
- [`Tokenizer Serialization`] Fix the broken serialisation by @ArthurZucker in #27099
- [`Whisper`] raise better errors by @ArthurZucker in #27971
- Fix PatchTSMixer slow tests by @ajati in #27997
- [`CI slow`] Fix expected values by @ArthurZucker in #27999
- Fix bug with rotating checkpoints by @muellerzr in #28009
- [Doc] Spanish translation of glossary.md by @aaronjimv in #27958
- Add model_docs from cpmant.md to derformable_detr.md by @rajveer43 in #27884
- well well well by @ArthurZucker in #28011
- [`SeamlessM4TTokenizer`] Safe import by @ArthurZucker in #28026
- [`core`/`modeling`] Fix training bug with PEFT + GC by @younesbelkada in #28031
- Fix AMD push CI not triggered by @ydshieh in #28029
- SeamlessM4T: `test_retain_grad_hidden_states_attentions` is flaky by @gante in #28035
- Fix languages covered by M4Tv2 by @ylacombe in #28019
- Fixed spelling error in T5 tokenizer warning message (s/thouroughly/t… by @jeddobson in #28014
- Generate: Mistral/Mixtral FA2 cache fix when going beyond the context window by @gante in #28037
- [Seamless] Fix links in docs by @sanchit-gandhi in #27905
- Remove warning when Annotion enum is created by @amyeroberts in #28048
- [`FA-2`] Fix fa-2 issue when passing `config` to `from_pretrained` by @younesbelkada in #28043
- [`Modeling`/`Mixtral`] Fix GC + PEFT issues with Mixtral by @younesbelkada in #28061
- [Flax BERT] Update deprecated 'split' method by @sanchit-gandhi in #28012
- [Flax LLaMA] Fix attn dropout by @sanchit-gandhi in #28059
- Remove SpeechT5 deprecated argument by @ylacombe in #28062
- doc: Correct spelling mistake by @caiyili in #28064
- [`Mixtral`] update conversion script to reflect new changes by @younesbelkada in #28068
- Skip M4T `test_retain_grad_hidden_states_attentions` by @ylacombe in #28060
- [LLaVa] Add past_key_values to _skip_keys_device_placement to fix multi-GPU dispatch by @aismlv in #28051
- Make GPT2 traceable in meta state by @kwen2501 in #28054
- Fix bug for checkpoint saving on multi node training setting by @dumpmemory in #28078
- Update fixtures-image-utils by @lhoestq in #28080
- Fix `low_cpu_mem_usage` Flag Conflict with DeepSpeed Zero 3 in `from_pretrained` for Models with `keep_in_fp32_modules` by @kotarotanahashi in #27762
- Fix wrong examples in llava usage. by @Lyken17 in #28020
- [docs] Trainer by @stevhliu in #27986
- [docs] MPS by @stevhliu in #28016
- fix resuming from ckpt when using FSDP with FULL_STATE_DICT by @pacman100 in #27891
- Fix the deprecation warning of _torch_pytree._register_pytree_node by @cyyever in #27803
- Spelling correction by @saeneas in #28110
- in peft finetune, only the trainable parameters need to be saved by @sywangyi in #27825
- fix ConversationalPipeline docstring by @not-lain in #28091
- Disable jitter noise during evaluation in SwitchTransformers by @DaizeDong in #28077
- Remove warning if `DISABLE_TELEMETRY` is used by @Wauplin in #28113
- Fix indentation error - semantic_segmentation.md by @rajveer43 in #28117
- [docs] General doc fixes by @stevhliu in #28087
- Fix a typo in tokenizer documentation by @mssalvatore in #28118
- [Doc] Fix token link in What 🤗 Transformers can do by @aaronjimv in #28123
- When save a model on TPU, make a copy to be moved to CPU by @qihqi in #27993
- Update split string in doctest to reflect #28087 by @amyeroberts in #28135
- [`Mixtral`] Fix loss + nits by @ArthurZucker in #28115
- Update modeling_utils.py by @mzelling in #28127
- [docs] Fix mistral link in mixtral.md by @aaronjimv in #28143
- Remove deprecated CPU dockerfiles by @ashahba in #28149
- Fix FA2 integration by @pacman100 in #28142
- [gpt-neox] Add attention_bias config to support model trained without attention biases by @dalgarak in #28126
- move code to Trainer.evaluate to enable use of that function with multiple datasets by @peter-sk in #27844
- Fix weights not properly initialized due to shape mismatch by @ydshieh in #28122
- Avoid unnecessary warnings when loading `CLIPConfig` by @ydshieh in #28108
- Update FA2 exception msg to point to hub discussions by @amyeroberts in #28161
- Align backbone stage selection with out_indices & out_features by @amyeroberts in #27606
- [docs] Trainer docs by @stevhliu in #28145
- Fix yolos resizing by @amyeroberts in #27663
- disable test_retain_grad_hidden_states_attentions on SeamlessM4TModelWithTextInputTest by @dwyatte in #28169
- Fix `input_embeds` docstring in encoder-decoder architectures by @gante in #28168
- [Whisper] Use torch for stft if available by @sanchit-gandhi in #26119
- Fix slow backbone tests - out_indices must match stage name ordering by @amyeroberts in #28186
- Update YOLOS slow test values by @amyeroberts in #28187
- Update `docs/source/en/perf_infer_gpu_one.md` by @ydshieh in #28198
- Fix ONNX export for causal LM sequence classifiers by removing reverse indexing by @dwyatte in #28144
- Add Swinv2 backbone by @NielsRogge in #27742
- Fix: [SeamlessM4T - S2TT] Bug in batch loading of audio in torch.Tensor format in the SeamlessM4TFeatureExtractor class by @nicholasneo78 in #27914
- Bug: `training_args.py` fix missing import with accelerate with version `accelerate==0.20.1` by @michaelfeil in #28171
- Fix the check of models supporting FA/SDPA not run by @ydshieh in #28202
- Drop `feature_extractor_type` when loading an image processor file by @ydshieh in #28195
- [Whisper] Fix word-level timestamps with bs>1 or num_beams>1 by @ylacombe in #28114
- Fixing visualization code for object detection to support both types of bounding box. by @Anindyadeep in #27842
- update the logger message with accordant weights_file_name by @izyForever in #28181
- [`Llava`] Fix llava index errors by @younesbelkada in #28032
- fix FA2 when using quantization by @pacman100 in #28203
- small typo by @stas00 in #28229
- Update docs around mixing hf scheduler with deepspeed optimizer by @dwyatte in #28223
- Fix trainer saving safetensors: metadata is None by @hiyouga in #28219
- fix bug:divide by zero in _maybe_log_save_evaluate() by @frankenliu in #28251
- [Whisper] Fix errors with MPS backend introduced by new code on word-level timestamps computation by @ercaronte in #28288
- Remove fast tokenization warning in Data Collators by @dbuos in #28213
- fix documentation for zero_shot_object_detection by @not-lain in #28267
- Remove token_type_ids from model_input_names (like #24788) by @Apsod in #28325
- Translate contributing.md into Chinese by @Mayfsz in #28243
- [docs] Sort es/toctree.yml | Translate performance.md by @aaronjimv in #28262
- Fix error in M4T feature extractor by @ylacombe in #28340
- README: install transformers from conda-forge channel by @kevherro in #28313
- Don't check the device when device_map=auto by @yuanwu2017 in #28351
- Fix pos_mask application and update tests accordingly by @ferjorosa in #27892
- fix FA2 when using quantization for remaining models by @susnato in #28341
- Update VITS modeling to enable ONNX export by @echarlaix in #28141
- chore: Fix typo s/exclusivelly/exclusively/ by @hugo-syn in #28361
- Enhancing Code Readability and Maintainability with Simplified Activation Function Selection. by @hi-sushanta in #28349
- Fix building alibi tensor when num_heads is not a power of 2 by @abuelnasr0 in #28380
- remove two deprecated function by @statelesshz in #28220
- Bugfix / ffmpeg input device (mic) not working on Windows by @Teapack1 in #27051
- [AttentionMaskConverter] fix sdpa unmask unattended by @zspo in #28369
- Remove shell=True from subprocess.Popen to Mitigate Security Risk by @avimanyu786 in #28299
- Add segmentation map processing to SAM Image Processor by @rwood-97 in #27463
- update warning for image processor loading by @ydshieh in #28209
- Fix initialization for missing parameters in `from_pretrained` under ZeRO-3 by @XuehaiPan in #28245
- Fix `_merge_input_ids_with_image_features` for llava model by @VictorSanh in #28333
- Use mmap option to load_state_dict by @weimingzha0 in #28331
- [BUG] BarkEosPrioritizerLogitsProcessor eos_token_id use list, tensor size mismatch by @inkinworld in #28201
- Skip now failing test in the Trainer tests by @muellerzr in #28421
- Support `DeepSpeed` when using auto find batch size by @muellerzr in #28088
- Fix number of models in README.md by @prasatee in #28430
- CI: limit natten version by @gante in #28432
- Fix for checkpoint rename race condition by @tblattner in #28364
- Fix load correct tokenizer in Mixtral model documentation by @JuanFKurucz in #28437
- [docstring] Fix docstring for ErnieConfig, ErnieMConfig by @Sparty in #27029
- [Whisper] Fix slow test by @patrickvonplaten in #28407
- Assitant model may on a different device by @jiqing-feng in #27995
- Enable multi-label image classification in pipeline by @amyeroberts in #28433
- Optimize the speed of the truncate_sequences function. by @ikkvix in #28263
- Use python 3.10 for docbuild by @ydshieh in #28399
- Fix docker file by @ydshieh in #28452
- Set `cache_dir` for `evaluate.load()` in example scripts by @aphedges in #28422
- Optionally preprocess segmentation maps for MobileViT by @harisankar95 in #28420
- Correctly resolve trust_remote_code=None for AutoTokenizer by @Rocketknight1 in #28419
- Fix load balancing loss func for mixtral by @liangxuZhang in #28256
- Doc by @jiqing-feng in #28431
- Fix docstring checker issues with PIL enums by @Rocketknight1 in #28450
- Fix broken link on page by @keenranger in #28451
- Mark two logger tests as flaky by @amyeroberts in #28458
- Update metadata loading for oneformer by @amyeroberts in #28398
- Fix torch.ones usage in xlnet by @sungho-ham in #28471
- Generate: deprecate old public functions by @gante in #28478
- Docs: add model paths by @gante in #28475
- Generate: refuse to save bad generation config files by @gante in #28477
- TF: purge `TFTrainer` by @gante in #28483
- Fix docstrings and update docstring checker error message by @Rocketknight1 in #28460
- Change progress logging to once across all nodes by @siddartha-RE in #28373
- Generate: fix candidate device placement by @gante in #28493
- Fix paths to AI Sweden Models reference and model loading by @JuanFKurucz in #28423
- [`chore`] Update warning text, a word was missing by @tomaarsen in #28017
- Don't set `finetuned_from` if it is a local path by @ydshieh in #28482
- Add the XPU device check for pipeline mode by @yuanwu2017 in #28326
- Tokenizer kwargs in textgeneration pipe by @thedamnedrhino in #28362
- [GPTQ] Fix test by @SunMarc in #28018
- Fixed minor typos by @rishit5 in #28489
- Add a use_safetensors arg to TFPreTrainedModel.from_pretrained() by @Rocketknight1 in #28511
- Generate: consolidate output classes by @gante in #28494
- fix: sampling in flax keeps EOS by @borisdayma in #28378
- improve dev setup comments and hints by @4imothy in #28495
- SiLU activation wrapper for safe importing by @amyeroberts in #28509
- Remove `task` arg in `load_dataset` in image-classification example by @regisss in #28408
- Improving Training Performance and Scalability Documentation by @HamzaFB in #28497
- Fix mismatching loading in from_pretrained with/without accelerate by @fxmarty in #28414
- Fix/speecht5 bug by @NimaYaqmuri in #28481
- [`TokenizationUtils`] Fix `add_special_tokens` when the token is already there by @ArthurZucker in #28520
- [`TokenizationRoformerFast`] Fix the save and loading by @ArthurZucker in #28527
- [`SpeechT5Tokenization`] Add copied from and fix the `convert_tokens_to_string` to match the fast decoding scheme by @ArthurZucker in #28522
- Clearer error for SDPA when explicitely requested by @fxmarty in #28006
- Add is_model_supported for fx by @inisis in #28521
- Config: warning when saving generation kwargs in the model config by @gante in #28514
- [Makefile] Exclude research projects from format by @patrickvonplaten in #28551
- symbolic_trace: add past_key_values, llama, sdpa support by @fxmarty in #28447
- Allow to train dinov2 with different dtypes like bf16 by @StarCycle in #28504
- Fix Switch Transformers When sparse_step = 1 by @agemagician in #28564
- Save `Processor` by @ydshieh in #27761
- Use `weights_only` only if torch >= 1.13 by @ydshieh in #28506
- [`Core Tokenization`] Support a fix for spm fast models by @ArthurZucker in #26678
- Use `LoggingLevel` context manager in 3 tests by @ydshieh in #28575
- Fix the documentation checkpoint for xlm-roberta-xl by @jeremyfowers in #28567
- [ASR Pipe] Update init to set model type and subsequently call parent init method by @sanchit-gandhi in #28486
- [Whisper Tok] Move token ids to CPU when computing offsets by @sanchit-gandhi in #28485
- [Whisper] Fix audio classification with weighted layer sum by @sanchit-gandhi in #28563
- Making CTC training example more general by @ylacombe in #28582
- Don't save `processor_config.json` if a processor has no extra attribute by @ydshieh in #28584
- Fix wrong xpu device in DistributedType.MULTI_XPU mode by @faaany in #28386
- [GPTNeoX] Fix BC issue with 4.36 by @ArthurZucker in #28602
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @aaronjimv
- @rajveer43
- @poedator
- @connor-henderson
  - Add FastSpeech2Conformer (#23439)
- @JustinLin610
  - Add qwen2 (#28436)
- @SangbumChoi
  - enable training mask2former and maskformer for transformers trainer (#28277)
  - [DETA] Improvement and Sync from DETA especially for training (#27990)
  - fix auxiliary loss training in DetrSegmentation (#28354)