Add support for the /rerank endpoint in vllm bench serve #26602
Conversation
The /rerank API can be served both by embedding models and by native reranker models. However, with reranker models the query is concatenated with each document, with a separator token in between, so the number of tokens that passes through the model has to be accounted for differently in each case. Because of these details, this PR adds a specialized random dataset that generates requests sending the expected number of tokens. When the user sets `random-input-len`, `num-prompts` and `random-batch-size`, in both cases we generate requests such that the total number of tokens is num-prompts * input-len, sent in batches of batch-size * input-len tokens.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
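To make the difference concrete, here is a minimal illustrative sketch (not code from this PR) of how the token count seen by the model differs in the two cases; the function names and example numbers are made up for illustration:

```python
# Illustrative only: why /rerank token accounting differs between an
# embedding model and a native reranker model.

def tokens_embedding_model(query_len: int, doc_lens: list[int]) -> int:
    # An embedding model embeds the query and each document separately,
    # so every sequence's tokens pass through the model exactly once.
    return query_len + sum(doc_lens)

def tokens_reranker_model(query_len: int, doc_lens: list[int]) -> int:
    # A native reranker scores (query, document) pairs: the query is
    # concatenated with each document with one separator token in between,
    # so the query tokens are processed once per document.
    return sum(query_len + 1 + doc_len for doc_len in doc_lens)

# Example: a 32-token query against 8 documents of 96 tokens each.
print(tokens_embedding_model(32, [96] * 8))  # 800 tokens
print(tokens_reranker_model(32, [96] * 8))   # 1032 tokens
```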
Documentation preview: https://vllm--26602.org.readthedocs.build/en/26602/

cc: @noooop, @DarkLight1337, @ZJY0516
Code Review
This pull request adds valuable support for benchmarking the /rerank endpoint, including a new specialized random dataset and documentation. The implementation is well-structured, refactoring existing embedding benchmark logic into a more general _run_pooling_request function to accommodate both embeddings and reranking. However, I've identified a critical issue that can cause the benchmark to crash under specific default conditions. Please see the detailed comment for the fix.
Related to #21796
This pull request has merge conflicts that must be resolved before it can be merged.
Here is an example of how this works. With the server running a reranker or embedding model, run a benchmark using the `vllm-rerank` backend and the `random-rerank` dataset:
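A sketch of the commands (the model name and the specific flag values below are illustrative assumptions, not taken from this PR):

```bash
# Serve a reranker (or embedding) model; the model name is only an example.
vllm serve BAAI/bge-reranker-v2-m3

# Benchmark the /rerank endpoint with the specialized random dataset.
vllm bench serve \
  --backend vllm-rerank \
  --dataset-name random-rerank \
  --model BAAI/bge-reranker-v2-m3 \
  --random-input-len 512 \
  --random-batch-size 8 \
  --num-prompts 128
```

With these values, the benchmark would send 128 * 512 total input tokens, grouped into rerank requests of 8 documents each.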