Update dependency transformers to v4.55.0 #196

renovate · 2024-10-08T17:41:27Z

This PR contains the following updates:

Package	Change	Age	Confidence
transformers	`4.45.1` -> `4.55.0`

Release Notes

huggingface/transformers (transformers)

`v4.55.0`: New openai GPT OSS model!

Compare Source

Welcome GPT OSS, the new open-source model family from OpenAI!

For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss

GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.

Overview of Capabilities and Architecture

21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
4-bit quantization scheme using mxfp4 format. Only applied on the MoE weights. As stated, the 120B fits in a single 80 GB GPU and the 20B fits in a single 16GB GPU.
Reasoning, text-only models; with chain-of-thought and adjustable reasoning effort levels.
Instruction following and tool use support.
Inference implementations using transformers, vLLM, llama.cpp, and ollama.
Responses API is recommended for inference.
License: Apache 2.0, with a small complementary use policy.
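As a rough sanity check on the memory figures above, weight storage can be estimated from the parameter count and the number of bits per parameter. The sketch below is not from the release notes. It assumes 4.25 bits per MXFP4 parameter (4-bit elements plus one 8-bit scale shared by each block of 32 values) and, as a simplification, treats every weight as quantized even though only the MoE weights are. Activations and the KV cache are ignored, so the mxfp4 numbers are lower bounds.

```python
def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: MXFP4 stores 4-bit elements plus one 8-bit scale per 32 values.
MXFP4_BITS = 4 + 8 / 32  # 4.25 bits per parameter
BF16_BITS = 16

for name, params in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
    bf16 = weight_gb(params, BF16_BITS)
    mxfp4 = weight_gb(params, MXFP4_BITS)
    print(f"{name}: bf16 ~{bf16:.0f} GB, mxfp4 (lower bound) ~{mxfp4:.0f} GB")
```

Under these assumptions the estimates sit below the 16 GB and 80 GB budgets quoted above, which leaves room for the unquantized layers, activations, and the KV cache.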

Architecture

Token-choice MoE with SwiGLU activations.
When calculating the MoE weights, a softmax is taken over selected experts (softmax-after-topk).
Each attention layer uses RoPE with 128K context.
Alternate attention layers: full-context, and sliding 128-token window.
Attention layers use a learned attention sink per-head, where the denominator of the softmax has an additional additive value.
It uses the same tokenizer as GPT-4o and other OpenAI API models.
Some new tokens have been incorporated to enable compatibility with the Responses API.
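Two of the points above, softmax-after-topk routing and the attention sink, can be illustrated with a small dependency-free sketch. This is a toy illustration under stated assumptions, not the transformers implementation: the function names are made up for this example, and the learned per-head sink is represented by a plain float `sink_logit`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_softmax_after_topk(router_logits, k):
    """Pick the k highest-scoring experts, then normalize only their logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return dict(zip(top, weights))


def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator has an extra exp(sink_logit) term.

    The sink absorbs some probability mass, so the returned attention
    weights sum to less than 1.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


if __name__ == "__main__":
    # Expert weights are normalized over the selected experts only.
    print(route_softmax_after_topk([2.0, -1.0, 0.5, 2.0], k=2))
    # With a sink, the attention weights no longer sum to 1.
    print(sum(softmax_with_sink([1.0, 0.0, -1.0], sink_logit=0.5)))
```

Because the softmax runs after the top-k selection, the weights of the chosen experts always sum to 1 regardless of how the unselected experts scored.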

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]  

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Flash Attention 3

The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to run pip install --upgrade kernels and add the following lines (marked with +) to your snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)  

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",    # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
     {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
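The last line of the snippet above keeps only the text after the final-channel marker. A slightly more defensive version of that step might look like the sketch below. The helper name, the sample string, and the list of stop tokens (`<|return|>`, `<|end|>`) are assumptions made for illustration; only the `<|channel|>final<|message|>` marker comes from the snippet itself.

```python
FINAL_MARKER = "<|channel|>final<|message|>"


def extract_final(decoded: str, stop_tokens=("<|return|>", "<|end|>")) -> str:
    """Return the text after the last final-channel marker, without stop tokens."""
    if FINAL_MARKER not in decoded:
        # No final answer yet, e.g. generation stopped at max_new_tokens.
        return ""
    text = decoded.split(FINAL_MARKER)[-1]
    for tok in stop_tokens:
        text = text.split(tok)[0]
    return text.strip()


# Hypothetical decoded output, for illustration only.
sample = (
    "<|start|>assistant<|channel|>analysis<|message|>Count the letters.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>There are three.<|return|>"
)
print(extract_final(sample))
```

Unlike a bare `split(...)[-1]`, which returns the whole string when the marker is missing, this helper returns an empty string in that case, so truncated generations are easier to detect.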

Other optimizations

If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!

[!TIP]
If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

[!TIP]
MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.
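The recommendations in this section can be summarized as a small decision helper. This is a sketch only: `pick_load_kwargs` is a hypothetical function, the two boolean flags must be supplied by the caller (no hardware detection is attempted), and it assumes Flash Attention 3 is only considered on GPUs that also support mxfp4.

```python
def pick_load_kwargs(supports_mxfp4: bool, supports_fa3: bool) -> dict:
    """Build keyword arguments for from_pretrained following the advice above."""
    kwargs = {"device_map": "auto", "torch_dtype": "auto"}
    if supports_mxfp4:
        # Preferred path: the MoE weights stay in mxfp4.
        if supports_fa3:
            kwargs["attn_implementation"] = "kernels-community/vllm-flash-attn3"
    else:
        # Fallback: MegaBlocks MoE kernels, which run in bfloat16.
        kwargs["use_kernels"] = True
    return kwargs


if __name__ == "__main__":
    print(pick_load_kwargs(supports_mxfp4=True, supports_fa3=True))
    print(pick_load_kwargs(supports_mxfp4=False, supports_fa3=False))
```

The resulting dictionary would then be unpacked into the call, for example `AutoModelForCausalLM.from_pretrained(model_id, **kwargs)`.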

transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just:
transformers serve

You can then send requests to it using the Responses API.


# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'

You can also send requests using the standard Completions API:


# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
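The same chat completions request can be prepared from Python with only the standard library. The sketch below mirrors the curl payload above; the helper name `build_chat_request` is hypothetical, and the request is only built, not sent, so it runs without a server. Actually sending it assumes a `transformers serve` instance listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(messages, model, base_url="http://localhost:8000",
                       max_tokens=1000):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": messages,
        "model": model,
        "temperature": 1.0,
        "max_tokens": max_tokens,
        "stream": True,  # same as the curl example above
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        [{"role": "system", "content": "hello"}], "openai/gpt-oss-120b"
    )
    print(req.get_method(), req.full_url)
    # Sending requires a running server; the streamed body arrives line by line:
    # with urllib.request.urlopen(req) as resp:
    #     for line in resp:
    #         print(line.decode("utf-8").rstrip())
```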

Command A Vision

Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.

The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.

Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.

[Model] Cohere2 Vision by @zucchini-nlp in #39810

MM Grounding DINO

The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.

MM Grounding DINO improves upon Grounding DINO by refining the contrastive class head and removing the parameter sharing in the decoder, which raises zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).

You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.

Add MM Grounding DINO by @rziga in #37925

Bugfixes and improvements

More robust tied weight test by @Cyrilvallez in #39681
fix missing model._tp_size from ep refactor by @winglian in #39688
Fix missing initialization of FastSpeech2Conformer by @bvantuan in #39689
fix(tokenization): check token.content for trie by @pjo256 in #39587
xpu optimization for generation case by @sywangyi in #39573
[processors] add tests for helper fn by @zucchini-nlp in #39629
update ernie model card by @jzhang533 in #39657
[configuration] remove redundant classmethod by @zucchini-nlp in #38812
Add self-hosted runner scale set workflow for mi325 CI by @jitesh-gupta in #39651
PATCH: add back n-dim device-mesh + fix tp trainer saving by @S1ro1 in #39693
[CI] Add Eric to comment slow ci by @vasqu in #39601
Remove all expired deprecation cycles by @Cyrilvallez in #39725
mllama outputs refactor by @itazap in #39643
Update QAPipelineTests::test_large_model_course after #39193 by @ydshieh in #39666
skip Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in #39670
Fix Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in #39503
Fix Layer device placement in Caches by @Cyrilvallez in #39732
Fix cache-related tests by @zucchini-nlp in #39676
Fix AMD dockerfile for audio models by @remi-or in #39669
Superpoint fast image processor by @arkhamHack in #37804
Add Fast Segformer Processor by @capnmav77 in #37024
BLIPs clean-up by @zucchini-nlp in #35560
extend more trainer test cases to XPU, all pass by @yao-matrix in #39652
fix cache inheritance by @ArthurZucker in #39748
[Fix] import two missing typos in models/__init__.py for typo checking by @hebangwen in #39745
Fix: add back base model plan by @S1ro1 in #39733
update GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in #39731
Update IMPORTANT_MODELS list by @ivarflakstad in #39734
Fix mamba regression by @manueldeprada in #39728
Apply several ruff SIM rules by @cyyever in #37283
Use --gpus all in workflow files by @ydshieh in #39752
AMD disable torchcodec by @ivarflakstad in #39757
Avoid OOM when other tests are failing by @ydshieh in #39758
Fix GPT2 with cross attention by @zucchini-nlp in #39754
Support loading Qwen3 MoE GGUF by @ctcanbol in #39638
Enable xpu allocator on caching_allocator_warmup by @jiqing-feng in #39654
Fix version issue in modeling_utils.py by @Cyrilvallez in #39759
add libcst to extras["testing"] in setup.py by @ydshieh in #39761
[modenbert] fix regression by @zucchini-nlp in #39750
🌐 [i18n-KO] Translated main_classes/peft.md by @luckyvickyricky in #39515
🌐 [i18n-KO] Translated albert.md to Korean by @ahnjj in #39524
🌐 [i18n-KO] Translated tvp.md to Korean by @Kim-Ju-won in #39578
🌐 [i18n-KO] Translated tokenizer.md to Korean by @seopp in #39532
🌐 [i18n-KO] Translated pipeline_gradio.md to Korean by @AhnJoonSung in #39520
🌐 [i18n-KO] Translated perf_train_gpu_one.md to Korean by @D15M4S in #39552
🌐 [i18n-KO] Translated how_to_hack_models.md to Korean by @skwh54 in #39536
fix(trainer): Correct loss scaling for incomplete gradient accumulation steps by @hutaiHang in #39659
Fix Cache.max_cache_len max value for Hybrid models by @manueldeprada in #39737
[docs] Ko doc fixes after toc update by @gante in #39660
Remove python3.7 reference from doc link by @st81 in #39706
Fix OmDet test after arg deprecation by @Cyrilvallez in #39766
docs: Update EfficientLoFTR documentation by @sbucaille in #39620
Standardize CLAP model card format by @yanamis in #39738
Don't set run_name when none by @qgallouedec in #39695
Fix Evolla and xLSTM tests by @Cyrilvallez in #39769
enable static cache on vision encoder decoder by @jiqing-feng in #39773
[ASR pipline] fix with datasets 4.0 by @eustlb in #39504
more info in model_results.json by @ydshieh in #39783
Super tiny update by @zucchini-nlp in #39727
fix chameleonvision UT failure by @yao-matrix in #39646
Fix an invalid condition by @cyyever in #39762
Simplify conditional code by @cyyever in #39781
Fix re-compilations for cross attention cache by @zucchini-nlp in #39788
standardized BARThez model card by @EthanV431 in #39701
Update model card for Cohere2 (Command R7B) by @arpon-kapuria in #39604
Update mT5 model card by @dross20 in #39702
Add callback to monitor progress in whisper transcription by @poke1024 in #37483
fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test by @gante in #39300
feat(tokenization): add encode_message to tokenize messages one by one by @pco111 in #39507
[docs] fix korean docs yet again by @gante in #39813
Update documentation for Cohere2Vision models by @kyle-cohere in #39817
[cohere2 vision] move doc to multimodal section by @zucchini-nlp in #39820
Fix broken links by @oToToT in #39809
Fix bad markdown links by @ebezzam in #39819
Fix tp cb by @ArthurZucker in #39838
[VLMs] split out "get placeholder mask" to helper by @zucchini-nlp in #39777
[attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
[typecheck] proper export of private symbols by @cyyever in #39729
Update ux cb by @ArthurZucker in #39845
Fix responses add tests by @LysandreJik in #39848
Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid by @yonigozlan in #39739
[image-processing] deprecate plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in #39830
Allow TrackioCallback to work when pynvml is not installed by @qgallouedec in #39851
remove dtensors, not explicit by @ArthurZucker in #39840
Improve is_wandb_available function to verify WandB installation by @qgallouedec in #39875
Refactor label name handling for PEFT models in Trainer class by @qgallouedec in #39265
Use comment to build doc on PRs by @ydshieh in #39846
Add support for including in-memory videos (not just files/urls) in apply_chat_template by @akibjawad in #39494
[core] Fix attn_implementation setter with missing sub_configs by @qubvel in #39855
Fix quant docker for fp-quant by @SunMarc in #39641
Rework add-new-model-like with modular and make test filenames coherent by @Cyrilvallez in #39612
Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in #39858
Set torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in #39885
[typing] better return type hint for AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in #39881
Fix link to models in README by @qubvel in #39880
[DOCS] : Improved mimi model card by @rohitthewanderer in #39824
Update cohere2 vision test by @ydshieh in #39888
send some feedback when manually building doc via comment by @ydshieh in #39889
Add support for ModernBertForMultipleChoice by @netique in #39232
chore: update DETR model card by @arpon-kapuria in #39822
Reorder serving docs by @LysandreJik in #39634
[Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906
fix test_working_of_tp failure of accelerate ut by @yao-matrix in #39828
[qwen] remove unnecessary CUDA sync in qwen2_5_vl by @cyyever in #39870
Avoid aliasing in cond's branches for torch 2.8 by @ydwu4 in #39488
Fix misleading WandB error when WANDB_DISABLED is set by @notkisk in #39891
Replace video_fps with fps in tests by @cyyever in #39898
Fix eval thread fork bomb by @JustinVanHeek in #39717
Fix aria tests by @zucchini-nlp in #39879

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@capnmav77
- Add Fast Segformer Processor (#37024)
@cyyever
- Apply several ruff SIM rules (#37283)
- Fix an invalid condition (#39762)
- Simplify conditional code (#39781)
- [typecheck] proper export of private symbols (#39729)
- [qwen] remove unnecessary CUDA sync in qwen2_5_vl (#39870)
- Replace video_fps with fps in tests (#39898)
@rziga
- Add MM Grounding DINO (#37925)

`v4.54.1`: Patch release 4.54.1

Compare Source

Patch release 4.54.1

We had quite a lot of bugs that got through! Release was a bit rushed, sorry everyone! 🤗
Mostly cache fixes, as we now have a layered cache, and fixes to distributed.

Fix Cache.max_cache_len max value for Hybrid models, @manueldeprada, @Cyrilvallez, #39737
[modenbert] fix regression, @zucchini-nlp, #39750
Fix version issue in modeling_utils.py, @Cyrilvallez, #39759
Fix GPT2 with cross attention, @zucchini-nlp, #39754
Fix mamba regression, @manueldeprada, #39728
Fix: add back base model plan, @S1ro1, #39733
fix cache inheritance, #39748
Fix cache-related tests, @zucchini-nlp, #39676
Fix Layer device placement in Caches, @Cyrilvallez, #39732
PATCH: add back n-dim device-mesh + fix tp trainer saving, @S1ro1, @SunMarc, #39693
fix missing model._tp_size from ep refactor, @winglian, #39688

`v4.54.0`: Kernels, Transformers Serve, Ernie, Voxtral, LFM2, DeepSeek v2, ModernBERT Decoder...

Compare Source

Important news!

In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:

transformers is bloated
transformers is slow

Our team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."

The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).

This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!

This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.

We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!

New models

Ernie 4.5

The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by baidu.
This family of models contains multiple different architectures and model sizes. This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.

Other models from the family can be found at Ernie 4.5 MoE.

[Ernie 4.5] Add ernie text models by @vasqu in #39228

Voxtral

Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.

You can read more in Mistral's release blog post.

The model is available in two checkpoints:

3B: mistralai/Voxtral-Mini-3B-2507
24B: mistralai/Voxtral-Small-24B-2507

Key Features

Voxtral builds on Ministral-3B by adding audio processing capabilities:

Transcription mode: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
Long-form context: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
Integrated Q&A and summarization: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
Multilingual support: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
Function calling via voice: Can trigger functions or workflows directly from spoken input based on detected user intent.
Text capabilities: Maintains the strong text processing performance of its Ministral-3B foundation.

Add voxtral by @eustlb in #39429

LFM2

LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.

The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.

LFM2 by @paulpak58 in #39340

DeepSeek v2

The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.

Add DeepSeek V2 Model into Transformers by @VladOS95-cyber in #36400

ModernBERT Decoder models

ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.

Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
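The difference between causal attention over the full context and causal attention restricted to a local window can be made concrete with boolean masks. The sketch below is illustrative only: it assumes that "alternating attention patterns" means layers alternate between global and local-window attention, and the window size of 2 is chosen purely to keep the printout small.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    With window=None every earlier position is visible (global causal
    attention); with an integer window only the last `window` positions,
    including q itself, are visible.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def show(mask: List[List[bool]]) -> str:
    """Render a mask with 'x' for visible and '.' for masked positions."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(5)))            # global causal layer
print()
print(show(causal_mask(5, window=2)))  # local causal layer
```

In both cases nothing above the diagonal is visible, which is what makes autoregressive generation possible; the local variant additionally hides positions that are too far in the past.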

Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! by @orionw in #38967

EoMT

The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.
EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.

✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation by @yaswanth19 in #37610

Doge

Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.

Doge architecture diagram: https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png

Add Doge model by @LoserCheems in #35891

AIM v2

The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.

The abstract from the paper is the following:

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

Add Aimv2 model by @yaswanth19 in #36625

PerceptionLM

The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.

PerceptionLM by @shuminghu in #37878

Efficient LoFTR

The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.

This model matches two images by finding pixel correspondences between them, which can then be used to estimate the relative pose between the images.
It is useful for tasks such as image matching, homography estimation, etc.

Add EfficientLoFTR model by @sbucaille in #36355

EVOLLA

Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.

Add evolla rebase main by @zhoubay in #36232

DeepSeek VL

Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.

Add support for DeepseekAI's DeepseekVL by @geetu040 in #36248

xLSTM

The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter.
xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.

The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.

Add xlstm model by @Cyrilvallez in #39665

EXAONE 4.0

EXAONE 4.0 is a language model that integrates a non-reasoning mode and a reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.

The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.

Add EXAONE 4.0 model by @lgai-exaone in #39129

Parallelisation

We've added Expert Parallel support for Llama4; the next release will include it for all models! You can just set a distributed_config with enable_expert_parallel=True. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer can run in parallel (instead of the previous TP approach, which requires more communication), significantly improving scalability and memory efficiency.
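To make the idea of expert parallelism concrete, here is a toy, framework-free sketch of how experts might be partitioned across ranks and how routed tokens are grouped by the rank that owns each expert. It does not reflect how transformers implements this internally; the function names, the contiguous sharding scheme, and the routing table are all assumptions chosen for illustration.

```python
from collections import defaultdict


def shard_experts(num_experts: int, world_size: int) -> dict:
    """Assign contiguous blocks of experts to ranks (assumes an even split)."""
    if num_experts % world_size:
        raise ValueError("num_experts must be divisible by world_size")
    per_rank = num_experts // world_size
    return {expert: expert // per_rank for expert in range(num_experts)}


def dispatch(token_to_experts: dict, expert_to_rank: dict) -> dict:
    """Group (token, expert) pairs by the rank that owns the expert."""
    per_rank = defaultdict(list)
    for token, experts in token_to_experts.items():
        for expert in experts:
            per_rank[expert_to_rank[expert]].append((token, expert))
    return dict(per_rank)


owners = shard_experts(num_experts=8, world_size=4)
# Hypothetical top-2 routing decisions for three tokens.
routing = {"t0": [0, 5], "t1": [1, 4], "t2": [6, 7]}
print(dispatch(routing, owners))
```

Each rank only holds and runs its own experts, so the per-device memory for expert weights shrinks with the number of ranks, at the cost of exchanging the routed tokens between devices.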

Add ep by @ArthurZucker in #39501

Quantization

FP Quant

FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.

Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:

import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
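For intuition about what emulated MXFP4 quantization does to a tensor, the sketch below quantizes and immediately dequantizes one block of values in pure Python. It is an illustration of the number format only and does not reproduce FP-Quant or QuTLASS. It assumes the element grid of a 4-bit E2M1 float and a shared power-of-two scale per block, and it simplifies rounding to "nearest grid value"; the function name is hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2  # exponent of the largest value, 6 = 1.5 * 2**2


def mxfp4_fake_quantize(block):
    """Quantize and dequantize one block using a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Shared scale: align the largest magnitude with the top of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])        # saturate at 6
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid value
        out.append(math.copysign(q * scale, v))
    return out


print(mxfp4_fake_quantize([12.0, 1.0, -5.5, 0.3]))
```

Small values in a block that also contains large ones tend to be rounded to zero, which is the main source of error in block-scaled 4-bit formats.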

The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.

FP-Quant support by @BlackSamorez in #38696

Kernels

The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!

You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it, and use it right away with no extra setup.

Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:

model.set_attn_implementation("kernels-community/flash-attn3")

This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).

We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.

Kernels flash attn by @ArthurZucker in #39474

Transformers Serve


Over the past few months, we have been putting more and more functionality into the transformers chat utility, which offers a CLI-based app for chatting with chat models. We've chosen to push this further by splitting the backend of transformers chat into a new, separate utility called transformers serve.

This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.

Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that is compatible with the OpenAI SDK, which is the de facto standard for LLM conversations and other related tasks. This way, you can use the server from many third-party applications, or test it using the transformers chat CLI (docs).

The server supports the following REST APIs:

/v1/chat/completions
/v1/responses
/v1/audio/transcriptions
/v1/models
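As a rough illustration of calling one of these endpoints without the OpenAI SDK, the sketch below builds a POST request for /v1/chat/completions using only the Python standard library. The base URL `http://localhost:8000` is an assumption about where the server listens, the helper names `build_chat_request` and `send` are hypothetical, and the payload follows the general OpenAI chat-completions shape rather than anything confirmed by these release notes.

```python
import json
import urllib.request

# Assumption: `transformers serve` is reachable at this address.
BASE_URL = "http://localhost:8000"


def build_chat_request(model, user_message, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `send(build_chat_request("qwen/Qwen3-8B", "Hello"))` would return the response body as text; it is left unparsed here because the exact response format (for example, streamed or not) is not described above.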

Relevant commits:

Split transformers chat and transformers serve by @LysandreJik in #38443
[serve] Cursor support, move docs into separate page, add more examples by @gante in #39133

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

pigri · 2024-10-08T17:46:55Z

Blocked by this issue explosion/spaCy#13649

sonarqubecloud · 2025-04-14T12:00:00Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sonarqubecloud · 2025-08-05T16:36:54Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

renovate bot force-pushed the renovate/transformers-4.x branch from 310a637 to 88e0cf6 Compare October 15, 2024 09:04

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.45.2~~ fix(deps): update dependency transformers to v4.46.0 Oct 24, 2024

renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from a56915d to 4fb4b29 Compare October 29, 2024 18:39

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.46.0~~ fix(deps): update dependency transformers to v4.46.1 Oct 29, 2024

renovate bot force-pushed the renovate/transformers-4.x branch from 4fb4b29 to 63d9975 Compare November 5, 2024 21:42

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.46.1~~ fix(deps): update dependency transformers to v4.46.2 Nov 5, 2024

renovate bot force-pushed the renovate/transformers-4.x branch from 63d9975 to 4d29005 Compare November 6, 2024 18:33

renovate bot requested review from pigri, krichard1212 and waroca as code owners November 6, 2024 18:33

renovate bot force-pushed the renovate/transformers-4.x branch 3 times, most recently from 935a07d to e4bca8a Compare November 6, 2024 20:26

renovate bot force-pushed the renovate/transformers-4.x branch from e4bca8a to 0ab21e1 Compare November 19, 2024 00:07

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.46.2~~ fix(deps): update dependency transformers to v4.46.3 Nov 19, 2024

renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from 77a84e6 to d847960 Compare December 5, 2024 19:50

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.46.3~~ fix(deps): update dependency transformers to v4.47.0 Dec 5, 2024

renovate bot force-pushed the renovate/transformers-4.x branch from d847960 to 8c48a28 Compare December 17, 2024 20:33

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.47.0~~ fix(deps): update dependency transformers to v4.47.1 Dec 17, 2024

renovate bot force-pushed the renovate/transformers-4.x branch from 8c48a28 to 4243d97 Compare January 10, 2025 15:58

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.47.1~~ fix(deps): update dependency transformers to v4.48.0 Jan 10, 2025

renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from 6339d86 to 43c36db Compare January 20, 2025 17:18

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.48.0~~ fix(deps): update dependency transformers to v4.48.1 Jan 20, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 43c36db to e1f43d9 Compare January 23, 2025 12:58

renovate bot changed the title ~~fix(deps): update dependency transformers to v4.48.1~~ Update dependency transformers to v4.48.1 Jan 27, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from e1f43d9 to a5e811b Compare January 30, 2025 22:18

renovate bot changed the title ~~Update dependency transformers to v4.51.1~~ Update dependency transformers to v4.51.2 Apr 10, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from c58756a to 2ab5408 Compare April 14, 2025 11:59

renovate bot changed the title ~~Update dependency transformers to v4.51.2~~ Update dependency transformers to v4.51.3 Apr 14, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 2ab5408 to 974d4b1 Compare May 20, 2025 17:04

renovate bot changed the title ~~Update dependency transformers to v4.51.3~~ Update dependency transformers to v4.52.0 May 20, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 974d4b1 to cf37dd2 Compare May 20, 2025 23:11

renovate bot changed the title ~~Update dependency transformers to v4.52.0~~ Update dependency transformers to v4.52.1 May 20, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from cf37dd2 to d324412 Compare May 21, 2025 14:04

renovate bot changed the title ~~Update dependency transformers to v4.52.1~~ Update dependency transformers to v4.52.2 May 21, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from d324412 to 026c3ad Compare May 22, 2025 17:14

renovate bot changed the title ~~Update dependency transformers to v4.52.2~~ Update dependency transformers to v4.52.3 May 22, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 026c3ad to b821770 Compare May 30, 2025 15:06

renovate bot changed the title ~~Update dependency transformers to v4.52.3~~ Update dependency transformers to v4.52.4 May 30, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from b821770 to c080d9c Compare June 26, 2025 20:45

renovate bot changed the title ~~Update dependency transformers to v4.52.4~~ Update dependency transformers to v4.53.0 Jun 26, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from c080d9c to d487eb4 Compare July 4, 2025 12:42

renovate bot changed the title ~~Update dependency transformers to v4.53.0~~ Update dependency transformers to v4.53.1 Jul 4, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from d487eb4 to 4995ef5 Compare July 11, 2025 15:51

renovate bot changed the title ~~Update dependency transformers to v4.53.1~~ Update dependency transformers to v4.53.2 Jul 11, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 4995ef5 to 417082c Compare July 22, 2025 08:32

renovate bot changed the title ~~Update dependency transformers to v4.53.2~~ Update dependency transformers to v4.53.3 Jul 22, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from 417082c to b73cd5d Compare July 25, 2025 20:14

renovate bot changed the title ~~Update dependency transformers to v4.53.3~~ Update dependency transformers to v4.54.0 Jul 25, 2025

renovate bot force-pushed the renovate/transformers-4.x branch from b73cd5d to 3dc713f Compare July 29, 2025 20:27

renovate bot changed the title ~~Update dependency transformers to v4.54.0~~ Update dependency transformers to v4.54.1 Jul 29, 2025

Update dependency transformers to v4.55.0

cf5d511

renovate bot force-pushed the renovate/transformers-4.x branch from 3dc713f to cf5d511 Compare August 5, 2025 16:36

renovate bot changed the title ~~Update dependency transformers to v4.54.1~~ Update dependency transformers to v4.55.0 Aug 5, 2025
