Commit 767ae97

qthequartermasterman authored and charlifu committed
[docs] Prompt Embedding feature support (vllm-project#25288)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: charlifu <charlifu@amd.com>
1 parent e585586 commit 767ae97

File tree

2 files changed: +18 -19 lines changed


docs/features/README.md

Lines changed: 18 additions & 16 deletions
@@ -36,22 +36,23 @@ th:not(:first-child) {
 }
 </style>

-| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
-|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-| [CP][chunked-prefill] || | | | | | | | | | | | | |
-| [APC](automatic_prefix_caching.md) ||| | | | | | | | | | | | |
-| [LoRA](lora.md) |||| | | | | | | | | | | |
-| [SD](spec_decode.md) ||||| | | | | | | | | | |
-| CUDA graph |||||| | | | | | | | | |
-| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* ||||| | | | | | | | |
-| <abbr title="Encoder-Decoder Models">enc-dec</abbr> || [](gh-issue:7366) || [](gh-issue:7366) |||| | | | | | | |
-| <abbr title="Logprobs">logP</abbr> ||||||||| | | | | | |
-| <abbr title="Prompt Logprobs">prmpt logP</abbr> |||||||||| | | | | |
-| <abbr title="Async Output Processing">async output</abbr> ||||||||||| | | | |
-| multi-step |||||||||||| | | |
-| [mm](multimodal_inputs.md) ||| [🟠](gh-pr:4194)<sup>^</sup> |||||||||| | |
-| best-of |||| [](gh-issue:6137) ||||||| [](gh-issue:7968) ||| |
-| beam-search |||| [](gh-issue:6137) ||||||| [](gh-issue:7968) ||||
+| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| [CP][chunked-prefill] || | | | | | | | | | | | | | |
+| [APC](automatic_prefix_caching.md) ||| | | | | | | | | | | | | |
+| [LoRA](lora.md) |||| | | | | | | | | | | | |
+| [SD](spec_decode.md) ||||| | | | | | | | | | | |
+| CUDA graph |||||| | | | | | | | | | |
+| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* ||||| | | | | | | | | |
+| <abbr title="Encoder-Decoder Models">enc-dec</abbr> || [](gh-issue:7366) || [](gh-issue:7366) |||| | | | | | | | |
+| <abbr title="Logprobs">logP</abbr> ||||||||| | | | | | | |
+| <abbr title="Prompt Logprobs">prmpt logP</abbr> |||||||||| | | | | | |
+| <abbr title="Async Output Processing">async output</abbr> ||||||||||| | | | | |
+| multi-step |||||||||||| | | | |
+| [mm](multimodal_inputs.md) ||| [🟠](gh-pr:4194)<sup>^</sup> |||||||||| | | |
+| best-of |||| [](gh-issue:6137) ||||||| [](gh-issue:7968) ||| | |
+| beam-search |||| [](gh-issue:6137) ||||||| [](gh-issue:7968) |||| |
+| [prompt-embeds](prompt_embeds.md) || [](gh-issue:25096) | ? ||||||| ? | ? || ? | ? ||

 \* Chunked prefill and prefix caching are only applicable to last-token pooling.
 <sup>^</sup> LoRA is only applicable to the language backbone of multimodal models.
@@ -76,3 +77,4 @@ th:not(:first-child) {
 | multi-step |||||| [](gh-issue:8477) |||
 | best-of |||||||||
 | beam-search |||||||||
+| [prompt-embeds](prompt_embeds.md) ||||||| ? | [](gh-issue:25097) |

docs/features/prompt_embeds.md

Lines changed: 0 additions & 3 deletions
@@ -6,9 +6,6 @@ This page teaches you how to pass prompt embedding inputs to vLLM.

 The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.

-!!! note
-    Prompt embeddings are currently only supported in the v0 engine.
-
 ## Offline Inference

 To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
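As context for the `EmbedsPrompt` schema referenced in the diff above, here is a minimal sketch of the dict shape involved. The helper name `make_embeds_prompt` is hypothetical, and the commented-out `LLM(..., enable_prompt_embeds=True)` call reflects the vLLM offline-inference API as I understand it rather than anything in this commit; in real use `prompt_embeds` must be a `torch.Tensor` of shape `(sequence_length, hidden_size)` produced by the model's embedding layer, not a plain list.

```python
# Minimal sketch of the EmbedsPrompt dict shape (hypothetical helper name).
def make_embeds_prompt(prompt_embeds):
    """Wrap precomputed prompt embeddings in the dict form vLLM expects."""
    return {"prompt_embeds": prompt_embeds}

# Stand-in for an embedding-layer output: 4 tokens, hidden size 8.
# Real usage requires a torch.Tensor of shape (seq_len, hidden_size).
fake_embeds = [[0.0] * 8 for _ in range(4)]
prompt = make_embeds_prompt(fake_embeds)

# With vLLM installed, the prompt could then be passed to generate(),
# assuming the engine was started with prompt-embeds support enabled:
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
#             enable_prompt_embeds=True)
#   outputs = llm.generate(prompt)
print(sorted(prompt.keys()))  # → ['prompt_embeds']
```

The dict form mirrors how text prompts are passed as `{"prompt": ...}`, so embeddings-based and text-based requests share the same `generate()` entry point.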
