[V1] add generate optional in health api #24491
Conversation
Code Review
This pull request adds an optional generate parameter to the /health endpoint for a more thorough health check. The implementation is a good start, but the error handling can be made more robust. I've suggested a change to catch a broader range of exceptions and to handle all parts of the health check consistently within a single try...except block. This will prevent unhandled exceptions and ensure that any failure during the health check correctly returns a 500 status code.
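A minimal sketch of the kind of handler the review is describing, assuming a FastAPI-style route in the vLLM API server; `router`, `engine_client`, and `logger` are taken to exist in the surrounding server code, and the exact names in the PR may differ:

```python
from fastapi import Response

from vllm import SamplingParams
from vllm.utils import random_uuid


@router.get("/health")
async def health(generate: bool = False) -> Response:
    """Basic health check; optionally runs a tiny generation."""
    if not generate:
        return Response(status_code=200)
    try:
        # Keep the whole optional check inside one try/except so any
        # failure during generation maps to an HTTP 500.
        sampling_params = SamplingParams(temperature=0, max_tokens=2)
        async for _ in engine_client.generate("Hi", sampling_params,
                                              random_uuid()):
            pass
        return Response(status_code=200)
    except Exception:
        logger.exception("Health-check generation failed")
        return Response(status_code=500)
```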
2 ideas:
prompt = "Hi"
sampling_params = SamplingParams(temperature=0, max_tokens=2)
request_id = random_uuid()
async for _ in client.generate(prompt, sampling_params,
request_id):
pass
except Exception as e:
logger.exception(e)
return Response(status_code=500) |
@tomasruizt hi, I've updated the PR according to your suggestion.
I have some concerns about this design. The request will be enqueued, but if the queue is long (while vLLM is still healthy), the request will time out. Priority scheduling is not enabled by default. This creates a false-positive signal for health-check generation. Instead, we should only run the generation when the engine is idle, so we are not interrupting the current batch. Also cc @njhill @robertgshaw2-redhat on the AsyncLLM interface addition.
@simon-mo very good point! I'd nevertheless argue that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint. That is a different outcome than an HTTP 500 error, which signals that the server is completely dead. This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs. The difference in outcomes is also clear by the fact that in the timeout outcome the …
We could check the scheduler's running queue length: if it is greater than 0, skip the generate call, I think.
It is often helpful to distinguish between readiness and liveness for health checks: ready meaning the server can handle more load, and live meaning the server process is up and either recovering or starting up. Separately, can we add a test for this? Otherwise it is difficult to rely on this feature, as it could break at any time.
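A hedged sketch of how that split could look from the probe side, assuming the server listens on localhost:8000 and exposes the `generate=true` query parameter added by this PR; these helpers are illustrative and not part of vLLM:

```python
import requests


def liveness_probe() -> bool:
    # Cheap check: is the API server process up and responding?
    try:
        r = requests.get("http://localhost:8000/health", timeout=2)
        return r.status_code == 200
    except requests.RequestException:
        return False


def readiness_probe() -> bool:
    # Deeper check: can the engine still produce tokens within a bound?
    try:
        r = requests.get("http://localhost:8000/health",
                         params={"generate": "true"}, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False
```

A readiness failure under load can be transient and recoverable, whereas a liveness failure suggests the server process itself is unresponsive.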
This pull request has merge conflicts that must be resolved before it can be merged.
result_text = ""
count = await self.engine_core.get_request_count()
num_running_reqs, num_waiting_reqs = count[0], count[1]
if num_running_reqs > 0 or num_waiting_reqs > 0:
The assumption here is that if there is any running or waiting request, a new request would complete successfully? Why is that a valid assumption?
Oh, I see Simon's comment:
Instead, we should only run the generation when the engine is idle, so we are not interrupting the current batch.
Isn't it likely that if the engine or worker processes are stuck in some way, there would also be stuck in-flight requests?
Yes, I can only assume that the engine is still working and ready. Do you have a better suggestion? @markmc
I'm saying we need to actually try to execute a request here if we want to detect hung engine or worker processes
Yes, there are several issues with this approach:
- If there are too many requests in the queue, the /health request will time out.
- If the /health request is executed every 3 seconds, it may affect actual user requests.
We need to make a careful trade-off.
return result_text
async for output in self.generate(prompt, sampling_params, request_id):
    for completion in output.outputs:
        result_text = completion.text
No need to collect the outputs, since you don't use them
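In other words, the loop could simply drain the generator, along the lines of this sketch:

```python
async for _ in self.generate(prompt, sampling_params, request_id):
    pass  # the outputs are irrelevant; only completion of the call matters
```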
custom_stat_loggers=None,
)

async def minimal_generation(self) -> str:
This doesn't feel like something that should be part of the AsyncLLM API. Can it just be a helper function in the API server code?
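A rough sketch of that alternative, as a free-standing helper next to the route handlers rather than a method on AsyncLLM; `engine_client` and the helper name are assumptions, not code from the PR:

```python
from vllm import SamplingParams
from vllm.utils import random_uuid


async def minimal_generation_check(engine_client) -> None:
    """Run a tiny generation purely as a health signal (sketch)."""
    sampling_params = SamplingParams(temperature=0, max_tokens=2)
    async for _ in engine_client.generate("Hi", sampling_params,
                                          random_uuid()):
        pass
```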
num_running_reqs, num_waiting_reqs = count[0], count[1]
if num_running_reqs > 0 or num_waiting_reqs > 0:
    return result_text
async for output in self.generate(prompt, sampling_params, request_id):
Can this hang indefinitely? Can we time out and return unhealthy if it hangs?
Currently it is blocking, and a timeout can be set.
However, if we generate directly without checking whether the queue is empty, a misjudgment may occur, for example when a large number of user requests are being processed at that time.
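One way such a timeout could be applied, sketched with `asyncio.wait_for` around the drained generator; the 5-second bound and the `timed_health_generation` helper name are placeholders, not part of the PR:

```python
import asyncio

from fastapi import Response


async def _drain(gen) -> None:
    async for _ in gen:
        pass


async def timed_health_generation(engine_client, sampling_params,
                                  request_id) -> Response:
    """Bound the health-check generation so the probe cannot hang (sketch)."""
    try:
        await asyncio.wait_for(
            _drain(engine_client.generate("Hi", sampling_params, request_id)),
            timeout=5.0,  # arbitrary placeholder bound
        )
        return Response(status_code=200)
    except asyncio.TimeoutError:
        # The engine accepted the request but did not finish in time:
        # report unhealthy instead of letting the probe hang.
        return Response(status_code=500)
```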
Purpose
FIX #24207
Add an optional generate param to the /health?generate=true API, which runs a tiny generation with max_tokens=2.
Test Plan
If the EngineCore process crashes, this API should respond with a 500 HTTP status code.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.