Commit e9756cd

[docs] Serving LLMs (#36522)

* initial
* fix
* model-impl

1 parent af9b2ea

File tree

2 files changed: +66 -0 lines changed


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -74,6 +74,8 @@
     title: Optimizing inference
   - local: kv_cache
     title: KV cache strategies
+  - local: serving
+    title: Serving
   - local: cache_explanation
     title: Caching
   - local: llm_tutorial_optimization
```

docs/source/en/serving.md

Lines changed: 64 additions & 0 deletions

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Serving

Transformer models can be served for inference with specialized libraries such as Text Generation Inference (TGI) and vLLM. These libraries are specifically designed to optimize performance with LLMs and include many unique optimization features that may not be included in Transformers.

## TGI

[TGI](https://huggingface.co/docs/text-generation-inference/index) can serve models that aren't [natively implemented](https://huggingface.co/docs/text-generation-inference/supported_models) by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.

> [!TIP]
> Refer to the [Non-core model serving](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) guide for more details.

Serve a Transformers implementation the same way you'd serve a TGI model.

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```

Add `--trust-remote-code` to the command to serve a custom Transformers model.

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
```
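
Once the container is running, you can send generation requests to it. A minimal sketch using TGI's `generate` endpoint, assuming the container's port is mapped to `localhost:8080` as in the commands above:

```shell
# Query the TGI server; the port matches the -p 8080:80 mapping above
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}' \
    -H 'Content-Type: application/json'
```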
## vLLM

[vLLM](https://docs.vllm.ai/en/latest/index.html) can also serve a Transformers implementation of a model if it isn't [natively implemented](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) in vLLM.

Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.

> [!TIP]
> Refer to the [Transformers fallback](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers-fallback) section for more details.

By default, vLLM serves the native implementation, and falls back on the Transformers implementation if a native one doesn't exist. You can also set `--model-impl transformers` to explicitly use the Transformers model implementation.

```shell
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers
```

Add the `--trust-remote-code` parameter to enable loading a remote code model.

```shell
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --trust-remote-code
```
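
Once the server is up, you can query it through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port `8000` and the model served above:

```shell
# Send a completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 20
    }'
```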
