diff --git a/docs/design/metrics.md b/docs/design/metrics.md
index 5cec253e9699..72616ad97b9b 100644
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -1,12 +1,12 @@
# Metrics

-Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
+vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.

## Objectives

-- Achieve parity of metrics between v0 and v1.
-- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
-- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
+- Provide comprehensive coverage of engine- and request-level metrics to aid production monitoring.
+- Prioritize the Prometheus integration, as this is what we expect to be used in production environments.
+- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.

## Background

@@ -17,45 +17,34 @@ Metrics in vLLM can be categorized as follows:

The mental model is that server-level metrics help explain the values of request-level metrics.

-### v0 Metrics
-
-In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
-
-- `vllm:num_requests_running` (Gauge)
-- `vllm:num_requests_swapped` (Gauge)
-- `vllm:num_requests_waiting` (Gauge)
-- `vllm:gpu_cache_usage_perc` (Gauge)
-- `vllm:cpu_cache_usage_perc` (Gauge)
-- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:prompt_tokens_total` (Counter)
-- `vllm:generation_tokens_total` (Counter)
-- `vllm:request_success_total` (Counter)
-- `vllm:request_prompt_tokens` (Histogram)
-- `vllm:request_generation_tokens` (Histogram)
-- `vllm:time_to_first_token_seconds` (Histogram)
-- `vllm:time_per_output_token_seconds` (Histogram)
-- `vllm:e2e_request_latency_seconds` (Histogram)
-- `vllm:request_queue_time_seconds` (Histogram)
-- `vllm:request_inference_time_seconds` (Histogram)
-- `vllm:request_prefill_time_seconds` (Histogram)
-- `vllm:request_decode_time_seconds` (Histogram)
-- `vllm:request_max_num_generation_tokens` (Histogram)
-- `vllm:num_preemptions_total` (Counter)
-- `vllm:cache_config_info` (Gauge)
-- `vllm:lora_requests_info` (Gauge)
-- `vllm:tokens_total` (Counter)
-- `vllm:iteration_tokens_total` (Histogram)
-- `vllm:time_in_queue_requests` (Histogram)
-- `vllm:model_forward_time_milliseconds` (Histogram)
-- `vllm:model_execute_time_milliseconds` (Histogram)
-- `vllm:request_params_n` (Histogram)
-- `vllm:request_params_max_tokens` (Histogram)
-- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
-- `vllm:spec_decode_efficiency` (Gauge)
-- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
-- `vllm:spec_decode_num_draft_tokens_total` (Counter)
-- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
+### v1 Metrics
+
+In v1, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
+
+- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
+- `vllm:num_requests_waiting` (Gauge) - Number of requests currently waiting.
+- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
+- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
+- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
+- `vllm:mm_cache_queries` (Counter) - (For multimodal models) Number of multimodal cache queries.
+- `vllm:mm_cache_hits` (Counter) - (For multimodal models) Number of multimodal cache hits.
+- `vllm:num_preemptions_total` (Counter) - Number of preemptions.
+- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
+- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
+- `vllm:iteration_tokens_total` (Histogram) - Histogram of tokens processed in each engine step.
+- `vllm:cache_config_info` (Gauge) - Information about the cache configuration.
+- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
+- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
+- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
+- `vllm:request_params_n` (Histogram) - Histogram of the `n` request parameter.
+- `vllm:request_params_max_tokens` (Histogram) - Histogram of the max_tokens parameter in requests.
+- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
+- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
+- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
+- `vllm:request_queue_time_seconds` (Histogram) - Time spent in the queue.
+- `vllm:request_inference_time_seconds` (Histogram) - Request inference time.
+- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
+- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.

These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).

@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul

Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.

-With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
+During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):

```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
```

### Multi-process Mode

-In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See .
+Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See .
+
+More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).

### Built in Python/Process Metrics

@@ -116,14 +109,15 @@ The following metrics are supported by default by `prometheus_client`, but they
- `process_open_fds`
- `process_max_fds`

-This is relevant because if we move away from multiprocess mode in v1,
-we get these back. However, it's questionable how relevant these are
-if they don't aggregate these stats for all processes that make up a
-vLLM instance.
+Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.
+
+## Metrics Design

-### v0 PRs and Issues
+The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).

-For background, these are some of the relevant PRs which added the v0 metrics:
+### Legacy PRs
+
+To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:

- 
- 
@@ -131,14 +125,9 @@ For background, these are some of the relevant PRs which added the v0 metrics:
- 
- 

-Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
-
-## v1 Design
+### Metrics Implementation PRs

-### v1 PRs
-
-For background, here are the relevant v1 PRs relating to the v1
-metrics issue :

- 
- 
@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
However, `prometheus_client` has
[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
-for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
+for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and `multiprocess_mode="mostrecent"`
instead.
@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.

This was added in and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).

-If we revisit this design and deprecate the old metric, we should reduce
-the need for a significant deprecation period by making the change in
-v0 also and asking this project to move to the new metric.
+If we revisit this design and deprecate the old metric, we should
+coordinate with downstream users so they can migrate before the removal.

### Prefix Cache metrics
@@ -478,22 +466,28 @@ us with:
```python
if seq_group.is_finished():
-    if (
-        seq_group.metrics.first_scheduled_time is not None
-        and seq_group.metrics.first_token_time is not None
-    ):
+    if (seq_group.metrics.first_scheduled_time is not None and
+            seq_group.metrics.first_token_time is not None):
        time_queue_requests.append(
            seq_group.metrics.first_scheduled_time -
-            seq_group.metrics.arrival_time
-        )
+            seq_group.metrics.arrival_time)
    ...
    if seq_group.metrics.time_in_queue is not None:
-        time_in_queue_requests.append(seq_group.metrics.time_in_queue)
+        time_in_queue_requests.append(
+            seq_group.metrics.time_in_queue)
```

This seems duplicative, and one of them should be removed. The latter
is used by the Grafana dashboard, so we should deprecate or remove the
-former from v0.
+former.
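+
+For reference, since `vllm:request_queue_time_seconds` is a histogram, `prometheus_client` exposes it as the standard `_bucket`/`_sum`/`_count` series that dashboard queries typically aggregate. A quick way to inspect the summary series (a sketch; the label values and numbers below are illustrative, not real output):
+
+```bash
+$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -E '^vllm:request_queue_time_seconds_(sum|count)'
+vllm:request_queue_time_seconds_sum{model_name="example-model"} 12.3
+vllm:request_queue_time_seconds_count{model_name="example-model"} 201
+```
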
### Prefix Cache Hit Rate @@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a ### KV Cache Offloading -Two v0 metrics relate to a "swapped" preemption mode that is no +Two legacy metrics relate to a "swapped" preemption mode that is no longer relevant in v1: - `vllm:num_requests_swapped` @@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`. -In v0, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The +Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The SequenceGroup encapsulated the idea of N Sequences which all shared the same prompt kv blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. CPU @@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed. SequenceGroup was removed in V1, although a replacement will be required for "parallel sampling" (`n>1`). -[Beam search was moved out of the core (in V0)](https://github.com/vllm-project/vllm/issues/8306). There was a +[Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a lot of complex code for a very uncommon feature. In V1, with prefix caching being better (zero over head) and therefore @@ -537,7 +523,7 @@ better. ### Parallel Sampling -Some v0 metrics are only relevant in the context of "parallel +Some legacy metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt. @@ -556,7 +542,7 @@ also add these metrics. ### Speculative Decoding -Some v0 metrics are specific to "speculative decoding". This is where +Some legacy metrics are specific to "speculative decoding". This is where we generate candidate tokens using a faster, approximate method or model and then validate those tokens with the larger model. @@ -568,7 +554,7 @@ model and then validate those tokens with the larger model. There is a PR under review () to add "prompt lookup (ngram)" speculative decoding to v1. Other techniques will follow. We should -revisit the v0 metrics in this context. +revisit these metrics in this context. !!! note We should probably expose acceptance rate as separate accepted @@ -641,7 +627,7 @@ metrics are often relatively straightforward to add: metrics are usually of very limited use unless they can be enabled by default and in production. 3. They have an impact on development and maintenance of the - project. Every metric added to v0 has made this v1 effort more + project. Every metric added over time has made this effort more time-consuming, and perhaps not all metrics justify this ongoing investment in their maintenance. @@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual requests as they move through different services and components. Both fall under the more general heading of "Observability". 
-v0 has support for OpenTelemetry tracing: +vLLM has support for OpenTelemetry tracing: -- Added by +- Added by and reinstated by - Configured with `--oltp-traces-endpoint` and `--collect-detailed-traces` - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/) - [User-facing docs](../examples/online_serving/opentelemetry.md) - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview) - + OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md). -Since metrics is a big enough topic on its own, we are going to tackle -the topic of tracing in v1 separately. +Since metrics is a big enough topic on its own, we consider the topic +of tracing to be quite separate from metrics. ### OpenTelemetry Model Forward vs Execute Time -In v0, we have the following two metrics: +The current implementation exposes the following two metrics: - `vllm:model_forward_time_milliseconds` (Histogram) - The time spent in the model forward pass when this request was in the batch.