Merged
Changes from all commits
Commits
108 commits
cef6894
(vllm) add input embedding
Jan 2, 2025
c51d8fb
improve embedding input
Bryce1010 Jan 6, 2025
9564b40
(vllm) fix import error
Bryce1010 Mar 6, 2025
c60298a
(vllm) fix pre commit error
Bryce1010 Mar 6, 2025
0c24a82
apply ruff and isort fixes
qthequartermasterman Mar 25, 2025
403a165
apply ruff and isort fixes
qthequartermasterman Mar 25, 2025
b1ac072
styling
qthequartermasterman Mar 25, 2025
0390c33
fix missing imports from rebase
qthequartermasterman Mar 25, 2025
0ca4dae
typing fixes
qthequartermasterman Mar 25, 2025
35320fe
type fix
qthequartermasterman Mar 25, 2025
0a77630
type fix
qthequartermasterman Mar 25, 2025
11b6c02
remove unnecessary changes
qthequartermasterman Mar 25, 2025
cb92a3d
remove unnecessary changes
qthequartermasterman Mar 25, 2025
375bd5b
re-add deleted whitespace
qthequartermasterman Mar 25, 2025
c9d8024
Include unit tests from #6869.
qthequartermasterman Mar 25, 2025
a64e627
remove unrelated qwen2 changes
qthequartermasterman Mar 26, 2025
6ab349e
guard clause around fully consumed prompt embeds to avoid returning e…
qthequartermasterman Mar 27, 2025
26c8784
use v0 for prompt embeds model runner tests
qthequartermasterman Mar 27, 2025
b71a13c
fix batching of input embeddings
qthequartermasterman Apr 2, 2025
4aa9ade
style formatting
qthequartermasterman Apr 2, 2025
e2c4c26
remove incorrect overload
qthequartermasterman Apr 3, 2025
26d108a
remove incorrect overload
qthequartermasterman Apr 3, 2025
af20435
Update representations
qthequartermasterman Apr 4, 2025
25aaf3f
remove unrelated changes to docs
qthequartermasterman Apr 4, 2025
bc05860
remove unrelated typing change
qthequartermasterman Apr 4, 2025
b55800d
fix missing syntax
qthequartermasterman Apr 4, 2025
be42a17
do not schedule prompt embeds and non-prompt embeds in the same batch
qthequartermasterman Apr 4, 2025
c8fcfe4
fix style linelength
qthequartermasterman Apr 4, 2025
b21688f
Merge branch 'main' into feature/vllm/add-input-embedding
qthequartermasterman Apr 7, 2025
1e359ae
propogate embeddings for sampled output tokens for decoding
qthequartermasterman Apr 11, 2025
59fbe70
fix type check
qthequartermasterman Apr 11, 2025
c152a3a
do not schedule decode sequence groups with batches containing both p…
qthequartermasterman Apr 11, 2025
42ad800
Merge branch 'main' into feature/vllm/add-input-embedding
qthequartermasterman Apr 11, 2025
e7ab2a2
fix type check
qthequartermasterman Apr 11, 2025
911adbe
add default value to optional parameter
qthequartermasterman Apr 11, 2025
82d923d
remove unused comments
qthequartermasterman Apr 14, 2025
c951479
properly pass in placeholder token ids when testing prompt embeds
qthequartermasterman Apr 15, 2025
01e1a6e
do not test mixed token_ids/prompt_embeds batches in the model_runner
qthequartermasterman Apr 15, 2025
193ad5c
refactor cuda_prepare_decode test
qthequartermasterman Apr 15, 2025
74bd9f4
use correct expected input embeds length for prepare_decode_cuda_grap…
qthequartermasterman Apr 15, 2025
d949f1b
add scheduler test to ensure prompt embeds and prompt tokens are not …
qthequartermasterman Apr 15, 2025
62bbc88
support inputs_embeds in compiled mode
qthequartermasterman Apr 16, 2025
1d1ae4b
fix typing in test
qthequartermasterman Apr 16, 2025
1914676
use corrector operator precedence for handling empty strings
qthequartermasterman Apr 16, 2025
70198f6
only test decoder models with input embeds in v0 backend
qthequartermasterman Apr 16, 2025
934ceae
Merge branch 'vllm-project:main' into feature/vllm/add-input-embedding
qthequartermasterman Apr 16, 2025
5595b45
adjust type hints for modelinputforgpubuilder.build
qthequartermasterman Apr 18, 2025
3343d3e
simplify conditional logic
qthequartermasterman Apr 18, 2025
5010ea0
simplify compilation conditional logic
qthequartermasterman Apr 18, 2025
2075e53
refactor decoder only language model tests to reduce number of times …
qthequartermasterman Apr 18, 2025
9a4fb3c
break up multiple assignments for readability
qthequartermasterman Apr 18, 2025
8ad4091
update type hints in scheduler
qthequartermasterman Apr 18, 2025
9055daf
clear existing lists instead of instantiating new ones
qthequartermasterman Apr 18, 2025
9a57aca
preprocess tensors to handle batched/misshaped prompt embeds to avoid…
qthequartermasterman Apr 18, 2025
bbfb0f0
use seperate Embedsprompt class for preprocessing inputs embeddings
qthequartermasterman Apr 18, 2025
933e567
fix typing
qthequartermasterman Apr 18, 2025
4e0d12f
fix type errors
qthequartermasterman Apr 19, 2025
164aeb5
Merge branch 'vllm-project:main' into feature/vllm/add-input-embedding
qthequartermasterman Apr 19, 2025
9e6909e
fix mistaken type change
qthequartermasterman Apr 19, 2025
90b950a
add missing type hint
qthequartermasterman Apr 19, 2025
01d83f4
add spaces for style
qthequartermasterman Apr 20, 2025
6985452
seperate EmbedsInputs from TokenInputs and embeds_inputs from token_i…
qthequartermasterman Apr 20, 2025
e916551
fix docstrings for EmbedsInputs
qthequartermasterman Apr 20, 2025
69f8725
fix typing for token_type_ids
qthequartermasterman Apr 20, 2025
9c2c89f
fix typing for embeds_tokens in InputRegistry and InputsAdapter
qthequartermasterman Apr 20, 2025
499dc6a
remove prompts and prompt_token_ids from EmbedsPrompts
qthequartermasterman Apr 21, 2025
20668ca
Merge branch 'main' into feature/vllm/add-input-embedding
qthequartermasterman Apr 28, 2025
6712ba6
fight mypy to get correct typing for not embeds prompts
qthequartermasterman Apr 28, 2025
740b290
remove incorrect call to embeds_inputs
qthequartermasterman Apr 28, 2025
8f9bd51
wrestle with mypy and typeddict type narrowing
qthequartermasterman Apr 29, 2025
b8d36c6
wrestle with mypy and typeddict type narrowing
qthequartermasterman Apr 29, 2025
b764c19
support indexing graph runners that with inputs_embeds
qthequartermasterman Apr 29, 2025
0e75db4
feat: completions using embeddings
Nan2018 Oct 28, 2024
cb6ff22
Merge branch 'main' into feature/vllm/add-input-embedding
qthequartermasterman May 1, 2025
85642d0
support encoder decoder models with inputs_embeds
qthequartermasterman May 1, 2025
b226fd6
simplify redundant ternary statement
qthequartermasterman May 1, 2025
b738d3f
explicitly remove support for inputs embeds with speculative decoding…
qthequartermasterman May 1, 2025
2340119
fix occasional device mismatch errors when appending output tokens to…
qthequartermasterman May 1, 2025
6a3173a
Merge remote-tracking branch 'andrew/feature/vllm/add-input-embedding…
Nan2018 May 1, 2025
06215c0
Merge remote-tracking branch 'nan/main' into feature/vllm/input-embed…
Nan2018 May 2, 2025
4776355
Merge remote-tracking branch 'vllm/main' into feature/vllm/input-embe…
Nan2018 May 5, 2025
ab5ea30
fix typing
Nan2018 May 9, 2025
2c2dc0a
torch load weights only; raise error if prompt embeds and lora or pro…
Nan2018 May 9, 2025
6147e3c
refactor to resolve type errors in serving_completion
qthequartermasterman May 9, 2025
61d2641
refactor to resolve type errors in serving_engine.py
qthequartermasterman May 9, 2025
4af2b64
serving completions typing
qthequartermasterman May 9, 2025
27ed406
prefer prompt embeds for completion requests when available
qthequartermasterman May 12, 2025
72e1244
explicitly do not support echo and prompt embeds
qthequartermasterman May 12, 2025
db00178
refactor tests for completions endpoints with prompt embeds to requir…
qthequartermasterman May 12, 2025
12faae6
Merge branch 'main' into feature/vllm/input-embedding-completion-api
qthequartermasterman May 12, 2025
318ee3f
style
qthequartermasterman May 12, 2025
78754b0
add None check
qthequartermasterman May 12, 2025
719168d
appease mypy
qthequartermasterman May 12, 2025
1ea957e
pass in empty string prompts to preprocess to allow downstream handling
qthequartermasterman May 13, 2025
03db71a
re-add ability to allow model to be None in completion requests (acci…
qthequartermasterman May 13, 2025
c7122c4
update type hint
qthequartermasterman May 13, 2025
c0e0647
pass in env_dict instead of failing to mock properly
qthequartermasterman May 13, 2025
56f10df
enable lora with prompt embeds
qthequartermasterman May 14, 2025
92b336a
disable chunked prefill in openai + prompt embeds checks
qthequartermasterman May 14, 2025
72674e0
move prompt embeds completions endpoint tests to their own file to av…
qthequartermasterman May 14, 2025
7134fe1
allow mixed embeds/text prompts to completions endpoint
qthequartermasterman May 14, 2025
38c366d
refactor serving engine to allow mixed embeds/text prompts to complet…
qthequartermasterman May 14, 2025
a56b7f4
remove vestigial comments
qthequartermasterman May 14, 2025
db62b8c
Merge branch 'main' into feature/vllm/input-embedding-completion-api
qthequartermasterman May 15, 2025
8c1dde9
add documentation for serving prompt embeddings
qthequartermasterman May 16, 2025
204952c
remove explicit dependence on v0 for prompt embeddings test since the…
qthequartermasterman May 16, 2025
9396f8a
Merge branch 'main' into feature/vllm/input-embedding-completion-api
qthequartermasterman May 16, 2025
1351bdd
add prompt embeds docs to toctree
qthequartermasterman May 16, 2025
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -118,6 +118,7 @@ training/rlhf.md
serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
+serving/prompt_embeds
serving/distributed_serving
serving/metrics
serving/engine_args
142 changes: 142 additions & 0 deletions docs/source/serving/prompt_embeds.md
@@ -0,0 +1,142 @@
# Prompt Embedding Inputs

This page teaches you how to pass prompt embedding inputs to vLLM.

## What are prompt embeddings?

The traditional flow of text data into a Large Language Model goes from text to token ids (via a tokenizer), and then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), the conversion from token ids to prompt embeddings is a look-up into a learned embedding matrix, but the model is not limited to processing only the embeddings that correspond to its token vocabulary.
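
For concreteness, here is a minimal sketch of that look-up step using Hugging Face Transformers (the model name and prompt are illustrative; any decoder-only checkpoint exposes the same API):

```python
import transformers

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

# Text -> token ids (tokenizer), then token ids -> embeddings (learned look-up).
token_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
embeddings = model.get_input_embeddings()(token_ids)
print(embeddings.shape)  # torch.Size([1, sequence_length, hidden_size])
```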

:::{note}
Prompt embeddings are currently only supported in the v0 engine.
:::

## Offline Inference

To input prompt embeddings, follow this schema for {class}`vllm.inputs.EmbedsPrompt`:

- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. It has shape `(sequence_length, hidden_size)`, where `sequence_length` is the number of token embeddings and `hidden_size` is the hidden (embedding) size of the model.
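
For example, a well-formed `EmbedsPrompt` dictionary can be built as follows (a sketch with illustrative shapes: random embeddings will not produce meaningful text, and the hidden size must match your model's `config.hidden_size`):

```python
import torch

# 2048 is the hidden size of Llama-3.2-1B; substitute your model's value.
sequence_length, hidden_size = 16, 2048
embeds_prompt = {"prompt_embeds": torch.randn(sequence_length, hidden_size)}

# This dictionary can be passed to `llm.generate(...)`, provided the engine
# was constructed with `LLM(..., enable_prompt_embeds=True)`.
```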

### Hugging Face Transformers Inputs

You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:

```python
from vllm import LLM
import transformers

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

llm = LLM(model=model_name, enable_prompt_embeds=True)

# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Single prompt inference
outputs = llm.generate({
    "prompt_embeds": prompt_embeds,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference

chats = [
    [{"role": "user", "content": "Please tell me about the capital of France."}],
    [{"role": "user", "content": "When is the day longest during the year?"}],
    [{"role": "user", "content": "Which is bigger, the moon or the sun?"}]
]

token_ids_list = [
    tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')
    for chat in chats
]
prompt_embeds_list = [embedding_layer(token_ids).squeeze(0) for token_ids in token_ids_list]

outputs = llm.generate(
    [
        {
            "prompt_embeds": prompt_embeds,
        }
        for prompt_embeds in prompt_embeds_list
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

## Online Serving

Our OpenAI-compatible server accepts prompt embedding inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embedding inputs are added via a new `'prompt_embeds'` key in the JSON payload.

When a mixture of `'prompt_embeds'` and `'prompt'` inputs is provided in a single request, the completions for the prompt embeds are always returned first.

Prompt embeddings are passed in as base64-encoded torch tensors.
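
The encoding itself is just `torch.save` into an in-memory buffer followed by base64, as the client example below inlines. A pair of helpers (the function names are ours, not part of vLLM) makes the round trip explicit:

```python
import base64
import io

import torch


def encode_prompt_embeds(prompt_embeds: torch.Tensor) -> str:
    """Serialize a (sequence_length, hidden_size) tensor to a base64 string."""
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def decode_prompt_embeds(encoded: str) -> torch.Tensor:
    """Inverse of `encode_prompt_embeds`; handy for sanity-checking a payload."""
    return torch.load(io.BytesIO(base64.b64decode(encoded)), weights_only=True)
```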

### Transformers Inputs via OpenAI Client

First, launch the OpenAI-compatible server:

```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
--max-model-len 4096 --enable-prompt-embeds
```

Then, you can use the OpenAI client as follows:

```python
from openai import OpenAI
import base64
import io
import torch
import transformers

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)


# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Prompt embeddings
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode('utf-8')


completion = client.completions.create(
    model=model_name,
    # NOTE: The OpenAI client does not allow `None` as an input to
    # `prompt`. Use an empty string if you have no text prompts.
    prompt="",
    max_tokens=5,
    temperature=0.0,
    # NOTE: The OpenAI client allows passing in extra JSON body via the
    # `extra_body` argument.
    extra_body={"prompt_embeds": encoded_embeds}
)

print(completion.choices[0].text)
```
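
Building on the example above (reusing `client`, `model_name`, and `encoded_embeds`), a single request can also mix a text `prompt` with `prompt_embeds`; as noted earlier, the completions for the prompt embeddings are returned first:

```python
completion = client.completions.create(
    model=model_name,
    prompt="Tell me about the capital of Germany.",
    max_tokens=5,
    temperature=0.0,
    extra_body={"prompt_embeds": encoded_embeds},
)

# The choice for the prompt embeddings comes first, followed by the
# choice for the text prompt.
for choice in completion.choices:
    print(choice.text)
```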