
Conversation

@Nan2018 Nan2018 (Contributor) commented May 2, 2025

Adds support for passing prompt_embeds as base64-encoded bytes to the Completions API.

Start the server with

VLLM_USE_V1=0 vllm serve HuggingFaceH4/zephyr-7b-beta
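The query example below assumes you already have inputs_embeds, a list of 2-D tensors of shape (seq_len, embed_dim). One way to produce them (an illustrative sketch using Hugging Face transformers and the model's input embedding layer; not part of this PR, and the prompts are made up) is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: build (seq_len, embed_dim) tensors to serialize below.
model_name = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
embedding_layer = model.get_input_embeddings()

prompts = ["Hello, my name is", "The capital of France is"]
inputs_embeds = []
for prompt in prompts:
    token_ids = tokenizer(prompt, return_tensors="pt").input_ids  # (1, seq_len)
    with torch.no_grad():
        embeds = embedding_layer(token_ids).squeeze(0)  # (seq_len, embed_dim)
    inputs_embeds.append(embeds)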

query example:

import io
from base64 import b64encode

import requests
import torch
from openai import OpenAI

url = "http://localhost:8000/v1/completions"
request_model = "HuggingFaceH4/zephyr-7b-beta"

# inputs_embeds is a list of 2-D tensors of shape (seq_len, embed_dim)
prompt_embeds = []
for input_embeds in inputs_embeds:
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))

resps = requests.post(
    url,
    json={
        "model": request_model,
        "prompt_embeds": prompt_embeds,
    },
)

# or with the OpenAI client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completions = client.completions.create(
    model=request_model,
    prompt="",
    extra_body={"prompt_embeds": prompt_embeds},
)
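
The response follows the standard Completions schema, so the generated text can be read as sketched below (with several prompt_embeds entries there is one choice per prompt):

# requests path
for choice in resps.json()["choices"]:
    print(choice["text"])

# OpenAI client path
for choice in completions.choices:
    print(choice.text)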

Note

This does not work with LoRA or prompt adapters.
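
For reference, turning one of the encoded entries back into a tensor is roughly the inverse of the client-side encoding above (a sketch, not the PR's server implementation; the weights_only restriction and CPU map location are assumptions):

import io
from base64 import b64decode

import torch

def decode_prompt_embeds(encoded: str) -> torch.Tensor:
    # base64 string -> raw bytes -> tensor of shape (seq_len, embed_dim)
    return torch.load(io.BytesIO(b64decode(encoded)),
                      map_location="cpu", weights_only=True)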

临景 and others added 30 commits April 2, 2025 14:37
@Nan2018 Nan2018 requested a review from DarkLight1337 May 15, 2025 20:31
@Nan2018 Nan2018 (Contributor, Author) commented May 15, 2025

I spent the better part of the afternoon trying to pinpoint exactly where the failure is occurring. It seems to affect only vLLM instances launched via subprocess (in particular via tests.utils.RemoteOpenAIServer in the tests). Launching a vLLM instance normally with the exact same environment variables and arguments works perfectly.

@DarkLight1337 any ideas about this? Do you think it is a blocker for this PR?

@DarkLight1337 (Member)

I'm fine with not supporting LoRA for now, unless LoRA is a very important use case for this.

@DarkLight1337 (Member)

Can you add an example script to the documentation for both offline and online inference?

@mergify mergify bot added the documentation (Improvements or additions to documentation) label May 16, 2025
@qthequartermasterman (Contributor)

> I'm fine with not supporting LoRA for now, unless LoRA is a very important use case for this.

I don't think this is an important use case at this time. I think it only came up because the existing completion tests checked for LoRA compatibility and @Nan2018 tried to use both of them together.

> Can you add an example script to the documentation for both offline and online inference?

I added the docs/source/serving/prompt_embeds.md file. I don't need to do anything else to add the page to the Sphinx site, correct? It will be picked up automatically? I couldn't find anything in the Sphinx configuration that explicitly mentions the other files in that directory.

@DarkLight1337 (Member)

Yeah, they should be added automatically.

@qthequartermasterman (Contributor)

@DarkLight1337 It looks like the docs build timed out. All of the fast checks are passing. I do think this PR is ready for review.

Thanks for your help with this!

@DarkLight1337 DarkLight1337 (Member) commented May 19, 2025

Regarding the subprocess issue, it may be related to #18308 (comment)

@DarkLight1337 DarkLight1337 (Member) left a comment

Let's merge this first though

@vllm-bot vllm-bot merged commit 221cfc2 into vllm-project:main May 19, 2025
12 of 13 checks passed
@CandiedCode
@DarkLight1337 will this make it into the v0.9.0 release?

@DarkLight1337 (Member)

Yes



@pytest.fixture(scope="module")
def zephyr_lora_added_tokens_files(zephyr_lora_files):
What is this LoRA module used for?

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: Nan2018 <nan@protopia.ai>
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com>
Co-authored-by: Bryce1010 <bryceyx@gmail.com>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

Labels: documentation (Improvements or additions to documentation), frontend


8 participants