server: monitoring - add /metrics prometheus compatible endpoint #5708
@ggerganov I understand we want to globally refactor the server code, but this is a small change that adds no logic and helps people monitor the server's performance. I would also be happy to submit an end-to-end Kubernetes example based on this later on.
…one call thread is notified
@@ -441,7 +441,7 @@ struct llama_server_response {
     {
         LOG_VERBOSE("queue_results.push_back", {});
         queue_results.push_back(result);
-        condition_results.notify_one();
+        condition_results.notify_all();
@ggerganov @ngxson I spent a while trying to understand why the health, metrics and slots endpoints were blocked while a completion request was being processed: only one caller thread was being notified. Please confirm my fix.
Thanks for noticing, LGTM. This looks like the root cause of many stability issues with concurrent requests.
Yes, notify_all is correct.
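To illustrate why the review thread settles on notify_all, here is a minimal sketch, not the server's actual code. It assumes several request threads share one condition variable while each waits for a result with its own id. The names result_queue, send, recv and run_demo are hypothetical and chosen for illustration only.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a shared result queue.
struct result_queue {
    struct result {
        int id;
        std::string payload;
    };

    std::mutex mutex;
    std::condition_variable cond;
    std::vector<result> results;

    // Called by the worker when a task finishes.
    void send(result r) {
        std::lock_guard<std::mutex> lock(mutex);
        results.push_back(std::move(r));
        // Wake every waiter: each one rechecks the queue for its own id.
        // With notify_one, the single woken thread may be waiting for a
        // different id, go back to sleep, and leave the real owner blocked.
        cond.notify_all();
    }

    // Called by a request thread; blocks until the result with this id arrives.
    result recv(int id) {
        std::unique_lock<std::mutex> lock(mutex);
        while (true) {
            for (std::size_t i = 0; i < results.size(); ++i) {
                if (results[i].id == id) {
                    result r = std::move(results[i]);
                    results.erase(results.begin() + static_cast<std::ptrdiff_t>(i));
                    return r;
                }
            }
            cond.wait(lock);  // releases the lock while sleeping
        }
    }
};

// Starts n waiters, then delivers results in reverse id order.
// Returns the payload each waiter received, indexed by id.
std::vector<std::string> run_demo(int n) {
    result_queue q;
    std::vector<std::string> out(static_cast<std::size_t>(n));
    std::vector<std::thread> waiters;
    for (int id = 0; id < n; ++id) {
        waiters.emplace_back([&q, &out, id] {
            out[static_cast<std::size_t>(id)] = q.recv(id).payload;
        });
    }
    for (int id = n - 1; id >= 0; --id) {
        q.send({id, "result-" + std::to_string(id)});
    }
    for (auto & t : waiters) {
        t.join();
    }
    return out;
}
```

Calling run_demo(4) should return once every waiter has picked up its own result. Replacing notify_all with notify_one in send can leave some waiters asleep indefinitely, which matches the blocking behaviour described in the thread.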
…ible endpoint (ggml-org#5708)
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
Motivation
Monitoring inference performance in a time-series dashboard is a must-have for LLM serving.
The Prometheus text format is the de facto standard for exposing metrics.
Other LLM inference servers, for example vLLM, already offer this feature.
Changes
* /metrics endpoint
* --metrics option
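For readers unfamiliar with what a Prometheus-compatible endpoint returns, here is a minimal sketch of rendering metrics in the Prometheus text exposition format. It is not the code added by this PR: the metric_def struct, the render_prometheus function and the metric names in the usage note are hypothetical, and the actual metrics exposed by the server may differ.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of a single metric sample.
struct metric_def {
    std::string name;   // e.g. "example_requests_total" (illustrative name)
    std::string help;   // one-line human-readable description
    std::string type;   // "counter" or "gauge"
    double      value;
};

// Renders metrics in the Prometheus text exposition format:
//   # HELP <name> <help>
//   # TYPE <name> <type>
//   <name> <value>
// A scraper expects this body with content type "text/plain; version=0.0.4".
std::string render_prometheus(const std::vector<metric_def> & metrics) {
    std::ostringstream out;
    for (const auto & m : metrics) {
        out << "# HELP " << m.name << " " << m.help << "\n";
        out << "# TYPE " << m.name << " " << m.type << "\n";
        out << m.name << " " << m.value << "\n";
    }
    return out.str();
}
```

As a usage example, render_prometheus({{"example_requests_total", "Number of requests handled.", "counter", 3.0}}) produces three lines: a HELP line, a TYPE line and the sample example_requests_total 3. A real endpoint would return this string as the body of GET /metrics when the --metrics option is enabled.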