
server: monitoring - add /metrics prometheus compatible endpoint #5708

Merged · 3 commits into ggml-org:master · Feb 25, 2024

Conversation

phymbert
Collaborator

Motivation

Monitoring inference performance in a time-series dashboard is a must-have for an LLM server.

The Prometheus text format is the de facto standard for exposing metrics.

Other LLM inference servers offer this feature, for example vLLM.

Changes

  • Use the existing health task type to collect metrics
  • Expose a simple Prometheus text format at /metrics
  • Must be enabled with the --metrics option
  • Added tests that collect metrics with the official Prometheus Python client
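
For orientation, the Prometheus text exposition format that such an endpoint returns looks like the sketch below. The metric name, value, and server port are illustrative placeholders, not necessarily the exact series this PR exports:

```text
# Hypothetical scrape: curl http://localhost:8080/metrics  (server started with --metrics)
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1024
```

Each series is a `# HELP` description, a `# TYPE` declaration (counter, gauge, histogram), and one or more sample lines; a Prometheus server scraping this endpoint turns the samples into time series.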

@phymbert phymbert requested review from ngxson and ggerganov February 25, 2024 09:44
@phymbert
Collaborator Author

@ggerganov I understand we want to refactor the server code globally, but this is a small change with no new logic that helps people monitor server performance. I would also be happy to submit an end-to-end Kubernetes example later on, based on this.

@@ -441,7 +441,7 @@ struct llama_server_response {
         {
             LOG_VERBOSE("queue_results.push_back", {});
             queue_results.push_back(result);
-            condition_results.notify_one();
+            condition_results.notify_all();
Collaborator Author


@ggerganov @ngxson I spent a while investigating why the health/metrics/slots endpoints were blocking while a completion request was being processed: only one waiting thread was notified. Please confirm my fix.

Collaborator

@ngxson ngxson Feb 25, 2024


Thanks for noticing, LGTM. This seems to be the root cause of many stability issues with concurrent requests.

Member


Yes, notify_all is correct

@phymbert phymbert requested a review from ngxson February 25, 2024 11:45
@phymbert phymbert merged commit d52d781 into ggml-org:master Feb 25, 2024
50 of 108 checks passed
@phymbert phymbert deleted the feature/server-metrics-exporter branch February 25, 2024 12:49
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
…ible endpoint (ggml-org#5708)

* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue: when 2 tasks are waiting for results, only one caller thread is notified

* server: metrics - move to a dedicated struct
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
…ible endpoint (ggml-org#5708)

* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue: when 2 tasks are waiting for results, only one caller thread is notified

* server: metrics - move to a dedicated struct
3 participants