-
Notifications
You must be signed in to change notification settings - Fork 9
Update dependency transformers to v4.55.0 #196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
renovate
wants to merge
1
commit into
main
Choose a base branch
from
renovate/transformers-4.x
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Blocked by this issue explosion/spaCy#13649 |
310a637
to
88e0cf6
Compare
a56915d
to
4fb4b29
Compare
4fb4b29
to
63d9975
Compare
63d9975
to
4d29005
Compare
935a07d
to
e4bca8a
Compare
e4bca8a
to
0ab21e1
Compare
77a84e6
to
d847960
Compare
d847960
to
8c48a28
Compare
8c48a28
to
4243d97
Compare
6339d86
to
43c36db
Compare
43c36db
to
e1f43d9
Compare
e1f43d9
to
a5e811b
Compare
c58756a
to
2ab5408
Compare
|
2ab5408
to
974d4b1
Compare
974d4b1
to
cf37dd2
Compare
cf37dd2
to
d324412
Compare
d324412
to
026c3ad
Compare
026c3ad
to
b821770
Compare
b821770
to
c080d9c
Compare
c080d9c
to
d487eb4
Compare
d487eb4
to
4995ef5
Compare
4995ef5
to
417082c
Compare
417082c
to
b73cd5d
Compare
b73cd5d
to
3dc713f
Compare
3dc713f
to
cf5d511
Compare
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR contains the following updates:
4.45.1
->4.55.0
Release Notes
huggingface/transformers (transformers)
v4.55.0
: : New openai GPT OSS model!Compare Source
Welcome GPT OSS, the new open-source model family from OpenAI!
For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.
Overview of Capabilities and Architecture
Architecture
The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.
Flash Attention 3
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to pip install –upgrade kernels and add the following line to your snippet:
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
Other optimizations
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
transformers serve
You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just:
transformers serve
To which you can send requests using the Responses API.
You can also send requests using the standard Completions API:
Command A Vision
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
MM Grounding DINO
MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang>.
MM Grounding DINO improves upon the Grounding DINO by improving the contrastive class head and removing the parameter sharing in the decoder, improving zero-shot detection performance on both COCO (50.6(+2.2) AP) and LVIS (31.9(+11.8) val AP and 41.4(+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
Bugfixes and improvements
FastSpeech2Conformer
by @bvantuan in #39689classmethod
by @zucchini-nlp in #38812CI
] Add Eric to comment slow ci by @vasqu in #39601QAPipelineTests::test_large_model_course
after #39193 by @ydshieh in #39666Glm4MoeModelTest::test_torch_compile_for_training
by @ydshieh in #39670Qwen2AudioForConditionalGeneration.forward()
andtest_flash_attn_kernels_inference_equivalence
by @ebezzam in #39503models/__init__.py
for typo checking by @hebangwen in #39745GemmaIntegrationTest::test_model_2b_bf16_dola
again by @ydshieh in #39731--gpus all
in workflow files by @ydshieh in #39752libcst
toextras["testing"]
insetup.py
by @ydshieh in #39761main_classes/peft.md
by @luckyvickyricky in #39515tvp.md
to Korean by @Kim-Ju-won in #39578tokenizer.md
to Korean by @seopp in #39532pipeline_gradio.md
to Korean by @AhnJoonSung in #39520perf_train_gpu_one.md
to Korean by @D15M4S in #39552how_to_hack_models.md
to Korean by @skwh54 in #39536run_name
when none by @qgallouedec in #39695model_results.json
by @ydshieh in #39783attn_implementation
] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823plot_keypoint_matching
, makevisualize_keypoint_matching
as a standard by @sbucaille in #39830TrackioCallback
to work when pynvml is not installed by @qgallouedec in #39851is_wandb_available
function to verify WandB installation by @qgallouedec in #39875sub_configs
by @qubvel in #39855Tokenizer
withPreTrainedTokenizerFast
inContinuousBatchProcessor
by @qgallouedec in #39858torch.backends.cudnn.allow_tf32 = False
for CI by @ydshieh in #39885AutoModelForCausalLM
andAutoModelForImageTextToText
by @qubvel in #39881ModernBertForMultipleChoice
by @netique in #39232Exaone4
] Fixes the attn implementation! by @ArthurZucker in #39906Significant community contributions
The following contributors have made significant changes to the library over the last release:
v4.54.1
: Patch release 4.54.1Compare Source
Patch release 4.54.1
We had quite a lot of bugs that got through! Release was a bit rushed, sorry everyone! 🤗
Mostly cache fixes, as we now have layered cache, and fixed to distributed.
v4.54.0
: : Kernels, Transformers Serve, Ernie, Voxtral, LFM2, DeepSeek v2, ModernBERT Decoder...Compare Source
Important news!
In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about
transformers
:transformers
is bloatedtransformers
is slowOur team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard
Llama
models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."The MoEs are getting some kernel magic, enabling the use of the efficient
megablocks
kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging
flash-attention
on Metal (MPS Torch backend).This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
New models
Ernie 4.5
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by baidu.
This family of models contains multiple different architectures and model sizes. This model in specific targets the base text
model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard Llama at its core.
Other models from the family can be found at Ernie 4.5 MoE.
Ernie 4.5
] Add ernie text models by @vasqu in #39228Voxtral
Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's realease blog post.
The model is available in two checkpoints:
Key Features
Voxtral builds on Ministral-3B by adding audio processing capabilities:
LFM2
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
DeepSeek v2
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
ModernBERT Decoder models
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
EoMT
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.
EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
Doge
Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the
wsd_scheduler
scheduler to pre-train on thesmollm-corpus
, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model\_doc/doge\_architecture.png" alt="drawing" width="600"
AIM v2
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
PerceptionLM
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of
a vision encoder with a small scale (<8B parameters) LLM decoder.
Efficient LoFTR
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them.
This model is useful for tasks such as image matching, homography estimation, etc.
EVOLLA
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
DeepSeek VL
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
xLSTM
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter.
xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
EXAONE 4.0
EXAONE 4.0 model is the language model, which integrates a Non-reasoning mode and Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
Parallelisation
We've added Expert Parallel support for Llama4, next release will include it for all model! You can just set a
distributed_config
withenable_expert_parallel=True
. This is enabling efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices. This allows each expert in the MoE layer to run in parallel (instead of previous TP which requires more communication), significantly improving scalability and memory efficiency.Quantization
FP Quant
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use
FPQuantConfig(pseudoquant=True)
to emulate quantization (no QuTLASS needed).The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
Kernels
The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting
use_kernels=True
infrom_pretrained
. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this hereEven better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
Transformers Serve
short_clip.mp4
Over the past few months, we have been putting more and more functionality in the
transformers chat
utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend oftransformers chat
in a new, separate utility calledtransformers serve
.This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
/v1/chat/completions
/v1/responses
/v1/audio/transcriptions
/v1/models
Relevant commits:
transformers chat
andtransformers serve
by @LysandreJik in #38443Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.