[V1] add generate optional in health api #24491
Conversation
Code Review
This pull request adds an optional generate parameter to the /health endpoint for a more thorough health check. The implementation is a good start, but the error handling can be made more robust. I've suggested a change to catch a broader range of exceptions and to handle all parts of the health check consistently within a single try...except block. This will prevent unhandled exceptions and ensure that any failure during the health check correctly returns a 500 status code.
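A minimal sketch of the kind of handler the review is describing, assuming a FastAPI-style route in the vLLM API server; `router`, `engine_client`, and `logger` are taken to exist in the surrounding server code, and the exact names in the PR may differ:

```python
from fastapi import Response

from vllm import SamplingParams
from vllm.utils import random_uuid


@router.get("/health")
async def health(generate: bool = False) -> Response:
    """Basic health check; optionally runs a tiny generation."""
    if not generate:
        return Response(status_code=200)
    try:
        # Keep the whole optional check inside one try/except so any
        # failure during generation maps to an HTTP 500.
        sampling_params = SamplingParams(temperature=0, max_tokens=2)
        async for _ in engine_client.generate("Hi", sampling_params,
                                              random_uuid()):
            pass
        return Response(status_code=200)
    except Exception:
        logger.exception("Health-check generation failed")
        return Response(status_code=500)
```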
2 ideas:
prompt = "Hi"
sampling_params = SamplingParams(temperature=0, max_tokens=2)
request_id = random_uuid()
async for _ in client.generate(prompt, sampling_params,
request_id):
pass
except Exception as e:
logger.exception(e)
return Response(status_code=500) |
@tomasruizt hi, I've updated the PR according to your suggestion.
I have some concerns about this design. The request will be enqueued, but if the queue is long (while vLLM is still healthy), the request will time out. Priority scheduling is not enabled by default. This creates a false-positive signal for health-check generation. Instead, we should only run the generation when the engine is idle, so we are not interrupting the current batch. Also cc @njhill @robertgshaw2-redhat on the AsyncLLM interface addition.
@simon-mo very good point! I'd nevertheless argue that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint. That is a different outcome than an HTTP 500 error, which signals that the server is completely dead. This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs. The difference in outcomes is also clear by the fact that in the timeout outcome the …
We could check the scheduler's running queue length: if it is greater than 0, skip the generate call, I think.
It is often helpful to distinguish between readiness and liveness for health checks: ready meaning the server can handle more load, and live meaning the server process is up and either recovering or starting up. Separately, can we add a test for this? Otherwise it is difficult to rely on this feature, as it could break at any time.
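A hedged sketch of how that split could look from the probe side, assuming the server listens on localhost:8000 and exposes the `generate=true` query parameter added by this PR; these helpers are illustrative and not part of vLLM:

```python
import requests


def liveness_probe() -> bool:
    # Cheap check: is the API server process up and responding?
    try:
        r = requests.get("http://localhost:8000/health", timeout=2)
        return r.status_code == 200
    except requests.RequestException:
        return False


def readiness_probe() -> bool:
    # Deeper check: can the engine still produce tokens within a bound?
    try:
        r = requests.get("http://localhost:8000/health",
                         params={"generate": "true"}, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False
```

A readiness failure under load can be transient and recoverable, whereas a liveness failure suggests the server process itself is unresponsive.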
This pull request has merge conflicts that must be resolved before it can be merged.
result_text = ""
count = await self.engine_core.get_request_count()
num_running_reqs, num_waiting_reqs = count[0], count[1]
if num_running_reqs > 0 or num_waiting_reqs > 0:
The assumption here is that if there is any running or waiting request, a new request would complete successfully? Why is that a valid assumption?
Oh, I see Simon's comment:
Instead, we should only run the generation when the engine is idle, so we are not interrupting the current batch.
Isn't it likely that if the engine or worker processes are stuck in some way, there would also be stuck in-flight requests?
Yes, I can only assume that the engine is still working and ready. Do you have a better suggestion? @markmc
I'm saying we need to actually try to execute a request here if we want to detect hung engine or worker processes
Yes, there are several issues with this approach:
- If there are too many requests in the queue, the /health request will time out.
- If the /health request is executed every 3 seconds, it may affect actual user requests.
We need to make a careful trade-off.
return result_text
async for output in self.generate(prompt, sampling_params, request_id):
    for completion in output.outputs:
        result_text = completion.text
No need to collect the outputs, since you don't use them
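In other words, the loop could simply drain the generator, along the lines of this sketch:

```python
async for _ in self.generate(prompt, sampling_params, request_id):
    pass  # the outputs are irrelevant; only completion of the call matters
```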
custom_stat_loggers=None,
)

async def minimal_generation(self) -> str:
This doesn't feel like something that should be part of the AsyncLLM API. Can it just be a helper function in the API server code?
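A rough sketch of that alternative, as a free-standing helper next to the route handlers rather than a method on AsyncLLM; `engine_client` and the helper name are assumptions, not code from the PR:

```python
from vllm import SamplingParams
from vllm.utils import random_uuid


async def minimal_generation_check(engine_client) -> None:
    """Run a tiny generation purely as a health signal (sketch)."""
    sampling_params = SamplingParams(temperature=0, max_tokens=2)
    async for _ in engine_client.generate("Hi", sampling_params,
                                          random_uuid()):
        pass
```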
num_running_reqs, num_waiting_reqs = count[0], count[1]
if num_running_reqs > 0 or num_waiting_reqs > 0:
    return result_text
async for output in self.generate(prompt, sampling_params, request_id):
Can this hang indefinitely? Can we time out and return unhealthy if it hangs?
Currently it is blocking, and a timeout can be set.
However, if we generate directly without checking whether the queue is empty, a misjudgment may occur, for example when a large number of user requests are being processed at that time.
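One way such a timeout could be applied, sketched with `asyncio.wait_for` around the drained generator; the 5-second bound and the `timed_health_generation` helper name are placeholders, not part of the PR:

```python
import asyncio

from fastapi import Response


async def _drain(gen) -> None:
    async for _ in gen:
        pass


async def timed_health_generation(engine_client, sampling_params,
                                  request_id) -> Response:
    """Bound the health-check generation so the probe cannot hang (sketch)."""
    try:
        await asyncio.wait_for(
            _drain(engine_client.generate("Hi", sampling_params, request_id)),
            timeout=5.0,  # arbitrary placeholder bound
        )
        return Response(status_code=200)
    except asyncio.TimeoutError:
        # The engine accepted the request but did not finish in time:
        # report unhealthy instead of letting the probe hang.
        return Response(status_code=500)
```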
Purpose
FIX #24207
Add an optional generate param to the /health?generate=true API, which runs a tiny generation with max_tokens=2.
Test Plan
If the EngineCore process crashes, this API should respond with a 500 HTTP status code.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.