* Add e2e request latency histogram to Prometheus metrics.
Add reportHistogramValue function to be used for reporting values in histogram metrics
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
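A minimal sketch of what this histogram and the reporting helper could look like with client_golang; the package name, bucket boundaries, label set, and the helper's exact signature are assumptions for illustration, not the PR's actual code:

```go
package vllmsim

import "github.com/prometheus/client_golang/prometheus"

// e2eRequestLatency mirrors vllm:e2e_request_latency_seconds. The
// bucket boundaries and the model_name label are illustrative guesses.
var e2eRequestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "vllm:e2e_request_latency_seconds",
		Help:    "Histogram of end to end request latency in seconds.",
		Buckets: []float64{0.3, 0.5, 0.8, 1, 1.5, 2, 5, 10, 20, 40, 80},
	},
	[]string{"model_name"},
)

func init() {
	prometheus.MustRegister(e2eRequestLatency)
}

// reportHistogramValue records one observation on a histogram metric;
// the real helper added by the PR may take different parameters.
func reportHistogramValue(h *prometheus.HistogramVec, model string, value float64) {
	h.WithLabelValues(model).Observe(value)
}
```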
* Additional metrics - vllm:request_queue_time_seconds, vllm:request_inference_time_seconds, vllm:request_prefill_time_seconds, and vllm:request_decode_time_seconds
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
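These phase metrics could be derived from timestamps captured as a request moves through the simulator. A sketch reusing the reportHistogramValue helper above; the timestamp fields and the requestQueueTime, requestPrefillTime, requestDecodeTime, and requestInferenceTime histogram variables (registered like e2eRequestLatency) are hypothetical:

```go
import "time"

// requestTimes holds hypothetical timestamps captured as a request
// moves through the simulator's queue and processing phases.
type requestTimes struct {
	arrived    time.Time // request received and enqueued
	started    time.Time // dequeued; prefill begins
	firstToken time.Time // first output token produced
	finished   time.Time // last output token produced
}

// reportPhaseMetrics derives the four phase durations and observes
// them into the corresponding histograms.
func reportPhaseMetrics(t requestTimes, model string) {
	reportHistogramValue(requestQueueTime, model, t.started.Sub(t.arrived).Seconds())
	reportHistogramValue(requestPrefillTime, model, t.firstToken.Sub(t.started).Seconds())
	reportHistogramValue(requestDecodeTime, model, t.finished.Sub(t.firstToken).Seconds())
	reportHistogramValue(requestInferenceTime, model, t.finished.Sub(t.started).Seconds())
}
```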
* Fix typo in metric name
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Initial tests for new metrics + create a constant for the common part of metric names
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Fix bug in metrics test + add latency test for streaming mode
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Move common simulator test helper functions to test_utils.go, use the same model name in all tests, refactor the server start functions
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Add test for vllm:request_queue_time_seconds and vllm:request_inference_time_seconds
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Define constants for metric names, use helper functions in the metrics test for histogram bucket validation
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
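One plausible shape for such bucket-validation helpers is to gather metric families from the registry and walk the cumulative bucket counts. This sketch uses the standard testing package (the repository's tests may use a different framework) and is an assumption about how the PR's helpers are structured:

```go
import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// findHistogram gathers metric families from the default registry and
// returns the first histogram with the given name, or nil if absent.
func findHistogram(t *testing.T, name string) *dto.Histogram {
	t.Helper()
	families, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		t.Fatalf("gather failed: %v", err)
	}
	for _, mf := range families {
		if mf.GetName() == name && len(mf.GetMetric()) > 0 {
			return mf.GetMetric()[0].GetHistogram()
		}
	}
	return nil
}

// checkBucket verifies the cumulative count of the bucket with the
// given upper bound.
func checkBucket(t *testing.T, h *dto.Histogram, upperBound float64, want uint64) {
	t.Helper()
	for _, b := range h.GetBucket() {
		if b.GetUpperBound() == upperBound {
			if got := b.GetCumulativeCount(); got != want {
				t.Errorf("bucket le=%v: got %d, want %d", upperBound, got, want)
			}
			return
		}
	}
	t.Errorf("no bucket with upper bound %v", upperBound)
}
```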
* - Add full list of supported metrics to the README
- Create constants for all metrics
- Define all latency-related fake metrics in the config
- Add validation for the new fake metrics in the config
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
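The fake-metrics configuration and its validation might look roughly like the following; the struct layout, field names, and YAML keys are hypothetical illustrations, not the PR's actual config schema:

```go
import "fmt"

// FakeMetrics is a hypothetical configuration block for pre-set metric
// values; the latency fields hold samples in seconds to be observed
// into the corresponding histograms.
type FakeMetrics struct {
	E2ERequestLatency    []float64 `yaml:"e2e-request-latency"`
	RequestQueueTime     []float64 `yaml:"request-queue-time"`
	RequestInferenceTime []float64 `yaml:"request-inference-time"`
	RequestPrefillTime   []float64 `yaml:"request-prefill-time"`
	RequestDecodeTime    []float64 `yaml:"request-decode-time"`
}

// validate rejects negative latency samples, since a duration in
// seconds can never be below zero.
func (m *FakeMetrics) validate() error {
	for name, samples := range map[string][]float64{
		"e2e-request-latency":    m.E2ERequestLatency,
		"request-queue-time":     m.RequestQueueTime,
		"request-inference-time": m.RequestInferenceTime,
		"request-prefill-time":   m.RequestPrefillTime,
		"request-decode-time":    m.RequestDecodeTime,
	} {
		for _, v := range samples {
			if v < 0 {
				return fmt.Errorf("fake metric %s: negative value %v", name, v)
			}
		}
	}
	return nil
}
```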
* add license to test_utils.go
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Set fake latency metrics if defined in the configuration, add tests for fake latency metrics
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
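Applying the configured values could be as simple as replaying each sample into the matching histogram at startup; a sketch building on the hypothetical FakeMetrics struct and histogram variables from the earlier sketches:

```go
// setFakeLatencyMetrics replays configured samples into the latency
// histograms at startup so that scrapes expose non-empty histograms.
func setFakeLatencyMetrics(cfg *FakeMetrics, model string) {
	for _, v := range cfg.E2ERequestLatency {
		reportHistogramValue(e2eRequestLatency, model, v)
	}
	for _, v := range cfg.RequestQueueTime {
		reportHistogramValue(requestQueueTime, model, v)
	}
	// ...and likewise for the inference, prefill and decode histograms.
}
```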
* add fake latency metrics test
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Fix sending of latency metrics, use the WriteToChannel function
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
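The actual WriteToChannel implementation isn't shown in this commit list; purely as an illustration, a non-blocking channel write of the kind such a helper typically performs might look like this (the name, signature, and overflow behavior are assumptions):

```go
import "log"

// writeToChannel sends a value without blocking the caller; the real
// WriteToChannel in the simulator may block or log differently.
func writeToChannel[T any](ch chan<- T, v T, name string) {
	select {
	case ch <- v:
	default:
		log.Printf("channel %s is full, dropping value", name)
	}
}
```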
* fix merge
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
---------
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
README.md: 12 additions & 1 deletion
@@ -26,7 +26,18 @@ In addition, it supports a subset of vLLM's Prometheus metrics. These metrics ar
| vllm:lora_requests_info| Running stats on LoRA requests |
| vllm:num_requests_running| Number of requests currently running on GPU |
| vllm:num_requests_waiting| Prometheus metric for the number of queued requests |
-
+| vllm:e2e_request_latency_seconds| Histogram of end to end request latency in seconds |
+| vllm:request_inference_time_seconds| Histogram of time spent in RUNNING phase for request |
+| vllm:request_queue_time_seconds| Histogram of time spent in WAITING phase for request |
+| vllm:request_prefill_time_seconds| Histogram of time spent in PREFILL phase for request |
+| vllm:request_decode_time_seconds| Histogram of time spent in DECODE phase for request |
+| vllm:time_to_first_token_seconds| Histogram of time to first token in seconds |
+| vllm:time_per_output_token_seconds| Histogram of time per output token in seconds |
+| vllm:request_generation_tokens| Number of generation tokens processed |
+| vllm:request_params_max_tokens| Histogram of the max_tokens request parameter |
+| vllm:request_prompt_tokens| Number of prefill tokens processed |
+| vllm:request_success_total| Count of successfully processed requests |
+
The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters or via the /v1/load_lora_adapter HTTP REST endpoint. The /v1/models endpoint returns simulated results based on those same command line parameters and those loaded via the /v1/load_lora_adapter HTTP REST endpoint.
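As a usage illustration, the endpoints mentioned above can be exercised as follows; the base URL, adapter name, and adapter path are placeholders, and the request body follows vLLM's load_lora_adapter API:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// loadAdapterAndListModels registers a LoRA adapter with the simulator
// and prints the model list, which should then include the adapter.
func loadAdapterAndListModels(base string) error {
	body := bytes.NewBufferString(`{"lora_name": "my-lora", "lora_path": "/adapters/my-lora"}`)
	resp, err := http.Post(base+"/v1/load_lora_adapter", "application/json", body)
	if err != nil {
		return err
	}
	resp.Body.Close()

	resp, err = http.Get(base + "/v1/models")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	models, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	fmt.Println(string(models))
	return nil
}
```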