[benchmark] Make request IDs unique across clients by default #27723
Conversation
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Code Review
This pull request effectively addresses a crash issue caused by duplicate request IDs when running benchmark clients in parallel. The fix, which prepends the process ID to the request ID, ensures uniqueness for clients on a single machine. I've provided one suggestion to make this even more robust by using a random prefix, which would guarantee uniqueness in distributed scenarios across multiple machines.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Thank you for the PR! Indeed, it looks like when we added … See this comment in #26929 for where we saw this issue recently in P/D setups. The reporter confirmed on Slack that they were using …
I think the simplest solution here is to just make …
vllm/v1/core/sched/scheduler.py
Outdated
def add_request(self, request: Request) -> None:
    request_id = request.request_id
    if request_id in self.requests:
        raise ValueError(f"Request id {request_id} already exists.")
This is not safe - it will cause the engine to exit
Ideally we would just return an error for this single request, but that is not a trivial change
Let's remove the scheduler part from this PR and just fix the request IDs used by vllm serve bench
I think either way, the engine will crash due to duplicate IDs. But point taken, removed in favor of backing out/failing the request the right way.
xref #27189
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Thanks for the review @markmc! Updated the PR with the suggestion.
     type=str,
     required=False,
-    default="benchmark-serving",
+    default=f"bench-{uuid.uuid4().hex[:8]}-",
Could we keep this and simply add the uuid to the prefix?
So request_id will be `<prefix>-<uuid>-<cnt>`
Do you mean add the uuid in main() so that we will also add it to any user-supplied prefix?
yep
@kouroshHakha I'm not sure this works, because then users won't have full control over their prefix. I.e. a user must edit the code to get around us adding a uuid to their chosen prefix.
Not a hard limit on this PR, obviously, but I was thinking we would need to guarantee uniqueness of the request IDs even if the end user wants to override the prefix. That way the uuid is part of the auto-appended suffix.
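A minimal sketch of the `<prefix>-<uuid>-<cnt>` scheme discussed in this thread, assuming the uuid portion is appended automatically even when the user overrides the prefix; the function and variable names are made up for illustration.

```python
import uuid

def build_request_ids(user_prefix: str, num_requests: int) -> list[str]:
    # The uuid segment is generated once per client, so two clients passing
    # the same prefix still produce disjoint request IDs.
    client_uuid = uuid.uuid4().hex[:8]
    return [f"{user_prefix}-{client_uuid}-{cnt}" for cnt in range(num_requests)]

print(build_request_ids("my-bench", 3))
# e.g. ['my-bench-1a2b3c4d-0', 'my-bench-1a2b3c4d-1', 'my-bench-1a2b3c4d-2']
```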
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm serve bench` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See vllm-project#27723

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

However, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend but is still not completed in the engine. See vllm-project#15326 for an attempt to fix this.

2) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks.

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by prepending a short random prefix, given that we already add a prefix to the client-provided ID.

A full 32-character random UUID would be overkill as a prefix, so how many random characters would be sufficient? 8 characters give us 32 bits of entropy, or 16^8 possible prefixes. Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities: with N = 16^8 and k the number of generated prefixes, the probability of collision is (k^2)/(2N). So if a client somehow caused vLLM to hold 10k requests that reuse the same client-provided ID, there would be a 1.16% chance of collision:

```
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
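The birthday-bound arithmetic from the commit message can be reproduced directly; this is just the calculation above restated as a runnable snippet, not code from the patch.

```python
# 8 hex characters give N = 16**8 possible prefixes; with k prefixes in
# flight, the collision probability is roughly k**2 / (2 * N).
N = 16 ** 8    # 4,294,967,296 possible 8-hex-character prefixes
k = 10_000     # in-flight requests reusing the same client-provided ID
p = k ** 2 / (2 * N)
print(f"{p:.4%}")  # ~1.1642%, matching the 1.16% figure above
```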
…roject#27723) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Purpose
- `vllm bench serve` clients (#27711) … `request_states`, leading to the engine receiving non-unique request IDs (and bad downstream effects) instead of raising a `ValueError` that a given request ID is already running. I think the standard `ValueError` is not seen with `api-server-count 1` due to queuing.
- `vllm bench serve` sends unique request IDs when clients run in parallel

Test Plan
Executed twice, simultaneously:
Test Result
Essential Elements of an Effective PR Description Checklist
… `supported_models.md` and `examples` for a new model.