
Conversation

eicherseiji (Contributor) commented Oct 29, 2025

Purpose

Make benchmark request IDs unique across clients by default, so that multiple `vllm bench serve` clients running against the same server no longer crash the engine with duplicate request IDs.
Test Plan

MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"
MODEL_ID="qw-0.5B"
vllm serve "$MODEL_PATH" \
  --served-model-name "$MODEL_ID" \
  --api-server-count 2

Executed twice, simultaneously:

MODEL_ID="qw-0.5B"
MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"

vllm bench serve \
  --model "$MODEL_PATH" \
  --served-model-name "$MODEL_ID" \

Test Result

  • No engine crash due to duplicate request IDs was observed.
(Screenshot attached: 2025-10-29, 1:28 AM.)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@mergify mergify bot added the performance Performance-related issues label Oct 29, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses a crash issue caused by duplicate request IDs when running benchmark clients in parallel. The fix, which prepends the process ID to the request ID, ensures uniqueness for clients on a single machine. I've provided one suggestion to make this even more robust by using a random prefix, which would guarantee uniqueness in distributed scenarios across multiple machines.
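
For illustration, a minimal sketch contrasting the two approaches mentioned above (a PID-based prefix versus a short random one); this is illustrative only, not the actual diff in this PR:

```
import os
import uuid

# PID-based prefix: unique per process on a single machine, but two clients
# on different machines (or pods) can end up with the same PID and collide.
pid_prefix = f"benchmark-serving-{os.getpid()}-"

# Random prefix: a short slice of uuid4 makes a collision between any two
# clients extremely unlikely, no matter where they run.
random_prefix = f"benchmark-serving-{uuid.uuid4().hex[:8]}-"

print(pid_prefix + "0")     # e.g. benchmark-serving-4321-0
print(random_prefix + "0")  # e.g. benchmark-serving-9f3c1a7b-0
```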

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
markmc (Member) commented Oct 29, 2025

Thank you for the PR

Indeed, it looks like this started when we added `X-Request-Id: benchmark-serving-{request index}` to requests from `vllm bench serve` in #23065 (v0.10.2):

parser.add_argument(
    "--request-id-prefix",
    type=str,
    required=False,
    default="benchmark-serving",
    help="Specify the prefix of request id.",
)

ind = 0
for item in self.data:
    sampled_requests.append(
        SampleRequest(
            ...
            request_id=request_id_prefix + str(ind),
        )
    )
    ind += 1

See this comment in #26929 for where we saw this issue recently in P/D setups. The reporter confirmed on Slack that they were using `vllm bench serve`:

> the workload generator here was vllm bench serve launched in multiple pods near simultaneously

I think the simplest solution here is to just default `--request-id-prefix` to something unique, for example:

default=f"bench-{uuid.uuid4().hex[:8]}-"

def add_request(self, request: Request) -> None:
    request_id = request.request_id
    if request_id in self.requests:
        raise ValueError(f"Request id {request_id} already exists.")
Member commented on this change:

This is not safe - it will cause the engine to exit.

Ideally we would just return an error for this single request, but that is not a trivial change.

Let's remove the scheduler part from this PR and just fix the request IDs used by `vllm bench serve`.
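
For context, the general shape of rejecting only the offending request at the layer that accepts it, rather than raising deep in the engine, might look something like this; a hypothetical sketch, not vLLM's actual code path:

```
# Hypothetical sketch: reject the duplicate up front so only this request
# fails (e.g. mapped to a 400 response), while the engine keeps running.
class DuplicateRequestError(ValueError):
    pass

def accept_request(in_flight: set[str], request_id: str) -> None:
    if request_id in in_flight:
        raise DuplicateRequestError(f"Request id {request_id} already exists.")
    in_flight.add(request_id)
```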

eicherseiji (Author) replied Oct 29, 2025:

I think either way, the engine will crash due to duplicate IDs. But point taken, removed in favor of backing out/failing the request the right way.

markmc (Member) commented Oct 29, 2025

xref #27189

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
eicherseiji (Author) commented:

Thanks for the review @markmc! Updated the PR with the suggestion.

  type=str,
  required=False,
- default="benchmark-serving",
+ default=f"bench-{uuid.uuid4().hex[:8]}-",
kouroshHakha (Collaborator) commented:

Could we keep this and simply add the uuid to the prefix?

So request_id will be

`<prefix>-<uuid>-<cnt>`
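
Roughly what that composition could look like, as a hypothetical sketch (the names here are illustrative, not code from the PR):

```
import uuid

# Generated once per benchmark client process.
_client_uuid = uuid.uuid4().hex[:8]

def make_request_id(prefix: str, count: int) -> str:
    # <prefix>-<uuid>-<cnt>: the user keeps their chosen prefix, the uuid
    # segment distinguishes concurrent clients, and the counter only needs
    # to be unique within one client.
    return f"{prefix}-{_client_uuid}-{count}"

print(make_request_id("benchmark-serving", 0))
# e.g. benchmark-serving-9f3c1a7b-0
```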

Member replied:

Do you mean add the uuid in main() so that we will also add it to any user-supplied prefix?

kouroshHakha (Collaborator) replied:

yep

eicherseiji (Author) replied:

@kouroshHakha I'm not sure this works, because then users won't have full control over their prefix. That is, a user would have to edit the code to stop us from adding a uuid to their chosen prefix.

kouroshHakha (Collaborator) replied:

Not a hard limit on this PR, obviously, but I was thinking we would need to guarantee uniqueness of the request IDs even if the end user overrides the prefix; that way the uuid is part of the automatically appended suffix.

@markmc markmc added ready ONLY add when PR is ready to merge/full CI is needed and removed performance Performance-related issues labels Oct 30, 2025
@mergify mergify bot added the performance Performance-related issues label Oct 30, 2025
@eicherseiji eicherseiji changed the title [benchmark] Make request IDs unique [benchmark] Make request IDs unique across clients by default Oct 30, 2025
@njhill njhill merged commit b2e65cb into vllm-project:main Oct 31, 2025
50 of 51 checks passed
markmc added a commit to markmc/vllm that referenced this pull request Nov 3, 2025
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom
request ID. The motivation for this is that it can be very helpful
when you need to correlate vLLM logs with logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique
key, it obviously is problematic if we ever have multiple in-flight
requests using the same client-provided request ID.

We saw this happening recently when `vllm bench serve` started
including a request ID and the request IDs from multiple concurrent
instances caused collisions. See vllm-project#27723

We try to guard against request ID collisions currently in the
frontend in OutputProcessor:

```
    def add_request(...):
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
```

however, this is not always effective:

1) We can have abort race conditions where a request is no longer
   tracked by the frontend, but still not completed in the engine.
   See vllm-project#15326 for an attempt to fix this.
2) With P/D, a request will continue to be tracked by the prefill
   engine long after the prefill request has been completed in
   the frontend, while we wait for the decode side to fetch the
   KV blocks

Let's instead ensure we use a unique request ID internally, even
when a client provides a custom request ID. We can do this simply
by prepending a short random prefix given that we already add
a prefix to the client-provided ID.

A full 32-character random UUID would be overkill as a prefix,
so how many random characters would be sufficient? 8 hex characters
give us 32 bits of entropy, or 16^8 possible prefixes.

Using the collision probability approximation from
https://preshing.com/20110504/hash-collision-probabilities:

If N = 16^8 and k is the number of generated prefixes, then the
probability of collision is approximately (k^2)/(2N). So if a client
somehow caused vLLM to hold 10k requests that reuse the same
client-provided ID, there would be a 1.16% chance of collision:

```
>>> N = 16**8
>>> k = 10_000
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).
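
For reference, the same birthday-bound approximation swept over a few values of k (assuming the 8-hex-character prefix, so N = 16^8); note the approximation only holds while k^2 is much smaller than 2N:

```
# Birthday-bound approximation p ≈ k^2 / (2N) for k prefixes drawn
# uniformly from N = 16^8 possible 8-hex-character values.
N = 16 ** 8

for k in (100, 1_000, 10_000):
    p = (k ** 2) / (2 * N)
    print(f"k={k:>6,}: collision probability ≈ {p:.6%}")
# k=   100: collision probability ≈ 0.000116%
# k= 1,000: collision probability ≈ 0.011642%
# k=10,000: collision probability ≈ 1.164153%
```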

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this pull request Nov 3, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug][0.11.1rc3]: Engine crash with multiple API servers + multiple vllm bench serve clients

4 participants