server: monitoring - add /metrics prometheus compatible endpoint #5708

phymbert · 2024-02-25T09:44:01Z

Motivation

Monitoring inference performance in a time-series dashboard is a must-have for LLM serving.

The Prometheus format is the standard way to expose metrics.

Other LLM inference servers offer this feature, for example vLLM.

Changes

Use the existing health task type to collect metrics
Expose a simple Prometheus metric format at /metrics
Must be activated with the --metrics option
Added tests that use the official Prometheus Python client to collect metrics
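
For readers unfamiliar with the Prometheus text exposition format, here is a minimal sketch of what rendering a /metrics body can look like. It is not the code from this PR: the `metric_def` struct, the `format_prometheus` function, and the metric name used in the usage note are all hypothetical and only illustrate the `# HELP` / `# TYPE` / sample-line layout.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical metric record, for illustration only (not the struct used in the PR).
struct metric_def {
    std::string name;  // metric name, e.g. "example_tokens_total"
    std::string help;  // one-line description
    std::string type;  // "counter" or "gauge"
    double      value; // current sample value
};

// Render metrics in the Prometheus text exposition format:
// a "# HELP" line, a "# TYPE" line, then "<name> <value>" for each metric.
std::string format_prometheus(const std::vector<metric_def> & metrics) {
    std::ostringstream out;
    for (const auto & m : metrics) {
        out << "# HELP " << m.name << " " << m.help << "\n";
        out << "# TYPE " << m.name << " " << m.type << "\n";
        out << m.name << " " << m.value << "\n";
    }
    return out.str();
}
```

Calling `format_prometheus` with a single counter named `example_tokens_total` yields three lines: the help text, the type, and the sample. A Prometheus scraper or the Python client's text parser can read a body of that shape.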

phymbert · 2024-02-25T09:44:20Z

@ggerganov I understand that we want to refactor the server code globally. This is a small change that adds no logic, though, and it helps people monitor the server's performance. I would also be happy to submit an end-to-end Kubernetes example based on it later.

examples/server/server.cpp

…one call thread is notified

phymbert · 2024-02-25T11:21:47Z

examples/server/utils.hpp

@@ -441,7 +441,7 @@ struct llama_server_response {
            {
                LOG_VERBOSE("queue_results.push_back", {});
                queue_results.push_back(result);
-                condition_results.notify_one();
+                condition_results.notify_all();


@ggerganov @ngxson I spent a while looking into why the health/metrics/slots endpoints were blocking while a completion request was being processed: only one caller thread was notified. Please confirm my fix.

Thanks for noticing, LGTM. This seems to be the root cause of many stability issues with concurrent requests.

Yes, notify_all is correct
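
To illustrate why `notify_all` matters here, below is a simplified sketch, assuming several caller threads wait on one shared condition variable, each for its own task id. The names `result_queue`, `send`, `recv`, and `run_two_waiters` are hypothetical and do not mirror the actual `llama_server_response` code. With `notify_one`, the single woken thread may be one whose result has not arrived yet; it goes back to sleep and the thread whose result is ready is never woken.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result queue shared by several waiting caller threads.
struct result_queue {
    std::mutex              mtx;
    std::condition_variable cv;
    std::vector<int>        results; // ids of finished tasks

    // Publish a result and wake every waiter so each can check for its own id.
    void send(int task_id) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            results.push_back(task_id);
        }
        cv.notify_all(); // notify_one could wake a waiter for a different id
    }

    // Block until the result for task_id is available, then consume it.
    int recv(int task_id) {
        std::unique_lock<std::mutex> lock(mtx);
        auto it = results.end();
        cv.wait(lock, [&] {
            it = std::find(results.begin(), results.end(), task_id);
            return it != results.end();
        });
        results.erase(it);
        return task_id;
    }
};

// Two callers wait for different task ids; returns how many received a result.
int run_two_waiters() {
    result_queue     q;
    std::atomic<int> received{0};
    std::thread a([&] { q.recv(1); ++received; });
    std::thread b([&] { q.recv(2); ++received; });
    q.send(2);
    q.send(1);
    a.join();
    b.join();
    return received.load();
}
```

In this sketch both waiters always finish because every `send` wakes all of them and each rechecks its own predicate. Swapping in `notify_one` makes completion depend on which thread the runtime happens to wake.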

…ible endpoint (ggml-org#5708) * server: monitoring - add /metrics prometheus compatible endpoint * server: concurrency issue, when 2 task are waiting for results, only one call thread is notified * server: metrics - move to a dedicated struct

server: monitoring - add /metrics prometheus compatible endpoint

b8322be

phymbert requested review from ngxson and ggerganov February 25, 2024 09:44

ggerganov approved these changes Feb 25, 2024


ngxson reviewed Feb 25, 2024



server: concurrency issue, when 2 task are waiting for results, only one call thread is notified

7b29648

phymbert commented Feb 25, 2024


server: metrics - move to a dedicated struct

542f42a

phymbert requested a review from ngxson February 25, 2024 11:45

phymbert mentioned this pull request Feb 25, 2024

Server: Improve work queue stability #5710

Closed

phymbert merged commit d52d781 into ggml-org:master Feb 25, 2024
50 of 108 checks passed

phymbert deleted the feature/server-metrics-exporter branch February 25, 2024 12:49

phymbert mentioned this pull request Feb 25, 2024

server: tests - slow inference causes timeout on the CI #5715

Merged
