Commit 77999d9

Transformers backend -> Transformers modeling backend (#116)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
1 parent bc7f724 commit 77999d9

_posts/2025-04-11-transformers-backend.md

Lines changed: 19 additions & 19 deletions
@@ -1,21 +1,21 @@
 ---
 layout: post
-title: "Transformers backend integration in vLLM"
+title: "Transformers modeling backend integration in vLLM"
 author: "The Hugging Face Team"
 image: /assets/figures/transformers-backend/transformers-backend.png
 ---
 
 The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
 offers a flexible, unified interface to a vast ecosystem of model architectures. From research to
-fine-tuning on custom dataset, transformers is the go-to toolkit for all.
+fine-tuning on custom dataset, Transformers is the go-to toolkit for all.
 
 But when it comes to *deploying* these models at scale, inference speed and efficiency often take
 center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput
 inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.
 
-A recent addition to the vLLM codebase enables leveraging transformers as a backend to run models.
-vLLM will therefore optimize throughput/latency on top of existing transformers architectures.
-In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
+A recent addition to the vLLM codebase enables leveraging Transformers as a backend for model implementations.
+vLLM will therefore optimize throughput/latency on top of existing Transformers architectures.
+In this post, we’ll explore how vLLM leverages the Transformers modeling backend to combine **flexibility**
 with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.
 
 ## Updates
@@ -24,10 +24,10 @@ This section will hold all the updates that have taken place since the blog post
 
 ### Support for Vision Language Models (21st July 2025)
 
-vLLM with the transformers backend now supports **Vision Language Models**. When user adds `model_impl="transformers"`,
+vLLM with the Transformers modeling backend now supports **Vision Language Models**. When user adds `model_impl="transformers"`,
 the correct class for text-only and multimodality will be deduced and loaded.
 
-Here is how one can serve a multimodal model using the transformers backend.
+Here is how one can serve a multimodal model using the Transformers modeling backend.
 ```bash
 vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
   --model_impl transformers \
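The remaining flags of this `vllm serve` command fall outside the hunk. For orientation, loading the same checkpoint offline through the backend looks roughly like the following; a minimal sketch, where only the model name and the `model_impl` value come from the diff above and everything else is assumed:

```py
from vllm import LLM

# Sketch only: with model_impl="transformers", vLLM deduces the correct (here multimodal)
# Transformers class for the checkpoint and loads it through the modeling backend.
llm = LLM(model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", model_impl="transformers")
```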
@@ -119,7 +119,7 @@ for o in outputs:
 Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
 these libraries stack up.
 
-**Infer with transformers**
+**Infer with Transformers**
 
 The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:
 
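The `pipeline` snippet this context refers to sits outside the hunk; a minimal sketch of such a call, where only the model name comes from the diff and the prompt and generation arguments are assumptions:

```py
from transformers import pipeline

# Sketch of the kind of pipeline call the surrounding text describes.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
print(generator("The future of AI is", max_new_tokens=30)[0]["generated_text"])
```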
@@ -186,29 +186,29 @@ print("Completion result:", completion.choices[0].text)
 
 This compatibility slashes costs and boosts control, letting you scale inference locally with vLLM’s optimizations.
 
-## Why do we need the transformers backend?
+## Why do we need the Transformers modeling backend?
 
-The transformers library is optimized for contributions and
+The Transformers library is optimized for contributions and
 [addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new
 model to vLLM on the other hand is a little
 [more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html).
 
 In the **ideal world**, we would be able to use the new model in vLLM as soon as it is added to
-transformers. With the integration of the transformers backend, we step towards that ideal world.
+Transformers. With the integration of the Transformers modeling backend, we step towards that ideal world.
 
 Here is the [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#custom-models)
-on how to make your transformers model compatible with vLLM for the integration to kick in.
+on how to make your Transformers model compatible with vLLM for the integration to kick in.
 We followed this and made `modeling_gpt2.py` compatible with the integration! You can follow the
-changes in this [transformers pull request](https://github.com/huggingface/transformers/pull/36934).
+changes in this [Transformers pull request](https://github.com/huggingface/transformers/pull/36934).
 
-For a model already in transformers (and compatible with vLLM), this is what we would need to:
+For a model already in Transformers (and compatible with vLLM), this is what we would need to:
 
 ```py
 llm = LLM(model="new-transformers-model", model_impl="transformers")
 ```
 
 > [!NOTE]
-> It is not a strict necessity to add `model_impl` parameter. vLLM switches to the transformers
+> It is not a strict necessity to add `model_impl` parameter. vLLM switches to the Transformers
 > implementation on its own if the model is not natively supported in vLLM.
 
 Or for a custom model from the Hugging Face Hub:
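The custom-Hub call itself is only visible as the truncated context line of the next hunk; a rough sketch of it, where `custom-hub-model` is the post's placeholder name and `trust_remote_code=True` is assumed to complete that truncated line:

```py
from vllm import LLM

# Sketch: a Hub model with custom code, loaded through the Transformers modeling backend.
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```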
@@ -218,12 +218,12 @@ llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code
 ```
 
 This backend acts as a **bridge**, marrying transformers’ plug-and-play flexibility with vLLM’s
-inference prowess. You get the best of both worlds: rapid prototyping with transformers
+inference prowess. You get the best of both worlds: rapid prototyping with Transformers
 and optimized deployment with vLLM.
 
 ## Case Study: Helium
 
-[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet supported by vLLM. You might want to run optimized inference on the model with vLLM, and this is where the transformers backend shines.
+[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet supported by vLLM. You might want to run optimized inference on the model with vLLM, and this is where the Transformers modeling backend shines.
 
 Let’s see this in action:
 
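The serving and client code for Helium sits outside this hunk; a minimal sketch of the client side, where the model name and the final print come from the next hunk's context, and the base URL, API key, and prompt are assumptions:

```py
from openai import OpenAI

# Sketch: query a vLLM OpenAI-compatible server that is running
# kyutai/helium-1-preview-2b via the Transformers modeling backend.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="kyutai/helium-1-preview-2b",
    prompt="What is the capital of France?",
)
print("Completion result:", completion)
```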
@@ -248,8 +248,8 @@ completion = client.completions.create(model="kyutai/helium-1-preview-2b", promp
 print("Completion result:", completion)
 ```
 
-Here, vLLM efficiently processes inputs, leveraging the transformers backend to load
-`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in transformers,
+Here, vLLM efficiently processes inputs, leveraging the Transformers modeling backend to load
+`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in Transformers,
 vLLM delivers lower latency and better resource utilization.
 
 By pairing Transformers’ model ecosystem with vLLM’s inference optimizations, you unlock a workflow
