
Commit de71f5d

Authored by Maya Barnea <mayab@il.ibm.com>
Additional latency related metrics (#237)
* Add e2e request latency histogram to Prometheus metrics; add a reportHistogramValue function to be used for reporting values in histogram metrics
* Additional metrics: vllm:request_queue_time_seconds, vllm:request_inference_time_seconds, vllm:request_prefill_time_seconds, and vllm:request_decode_time_seconds
* Fix typo in metric name
* Initial tests for new metrics + create constant for part of metrics names
* Fix bug in metrics test + add latency test for streaming mode
* Move common simulator test helper functions to test_utils.go, use the same model name in all tests, refactoring in server start functions
* Add tests for vllm:request_queue_time_seconds and vllm:request_inference_time_seconds
* Define constants for metrics names, use helper functions in metrics test for histogram buckets validation
* Add full list of supported metrics to README; create constants for all metrics; define all latency related fake metrics in config; add validation for new fake metrics in config
* Add license to test_utils.go
* Set fake latency metrics if defined in configuration, add tests for latency fake metrics
* Add fake latency metrics test
* Fix sending latency metrics, use the WriteToChannel function
* Fix merge

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
1 parent cd611e8 commit de71f5d

17 files changed, +1193 -706 lines changed

README.md

Lines changed: 12 additions & 1 deletion
@@ -26,7 +26,18 @@ In addition, it supports a subset of vLLM's Prometheus metrics. These metrics ar
 | vllm:lora_requests_info | Running stats on LoRA requests |
 | vllm:num_requests_running | Number of requests currently running on GPU |
 | vllm:num_requests_waiting | Prometheus metric for the number of queued requests |
+| vllm:e2e_request_latency_seconds | Histogram of end to end request latency in seconds |
+| vllm:request_inference_time_seconds | Histogram of time spent in RUNNING phase for request |
+| vllm:request_queue_time_seconds | Histogram of time spent in WAITING phase for request |
+| vllm:request_prefill_time_seconds | Histogram of time spent in PREFILL phase for request |
+| vllm:request_decode_time_seconds | Histogram of time spent in DECODE phase for request |
+| vllm:time_to_first_token_seconds | Histogram of time to first token in seconds |
+| vllm:time_per_output_token_seconds | Histogram of time per output token in seconds |
+| vllm:request_generation_tokens | Number of generation tokens processed |
+| vllm:request_params_max_tokens | Histogram of the max_tokens request parameter |
+| vllm:request_prompt_tokens | Number of prefill tokens processed |
+| vllm:request_success_total | Count of successfully processed requests |

 The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters or via the /v1/load_lora_adapter HTTP REST endpoint. The /v1/models endpoint returns simulated results based on those same command line parameters and those loaded via the /v1/load_lora_adapter HTTP REST endpoint.

 The simulator supports two modes of operation:
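To sanity-check that the new latency histograms are actually exposed, one can scrape the simulator's /metrics endpoint and filter for one of the names in the table above. The snippet below is a minimal sketch, not part of this commit; it assumes a simulator instance listening on http://localhost:8000.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes the simulator is running locally; adjust the address as needed.
	resp, err := http.Get("http://localhost:8000/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the e2e request latency histogram series (buckets, sum, count).
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "vllm:e2e_request_latency_seconds") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```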

pkg/common/config.go

Lines changed: 53 additions & 5 deletions
@@ -232,16 +232,17 @@ type Metrics struct {
 	WaitingRequests int64 `yaml:"waiting-requests" json:"waiting-requests"`
 	// KVCacheUsagePercentage is the fraction of KV-cache blocks currently in use (from 0 to 1)
 	KVCacheUsagePercentage float32 `yaml:"kv-cache-usage" json:"kv-cache-usage"`
-	// TTFTBuckets is an array of values for time-to-first-token buckets,
-	// each value in this array is a value for the corresponding bucket.
+
+	// Histogram metrics - defined by array of values.
+	// Each value in this array is a value for the corresponding bucket.
 	// Array may contain less values than number of buckets, all trailing missing values assumed as 0.
+
+	// TTFTBuckets is an array of values for time-to-first-token buckets.
 	// Buckets upper boundaries in seconds are:
 	// 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
 	// 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
 	TTFTBucketValues []int `yaml:"ttft-buckets-values" json:"ttft-buckets-values"`
-	// TPOTBuckets is an array of values for time-per-output-token buckets,
-	// each value in this array is a value for the corresponding bucket.
-	// Array may contain less values than number of buckets, all trailing missing values assumed as 0.
+	// TPOTBuckets is an array of values for time-per-output-token buckets.
 	// Buckets upper boundaries in seconds are:
 	// 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75,
 	// 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf

@@ -253,6 +254,21 @@ type Metrics struct {
 	RequestParamsMaxTokens []int `yaml:"request-params-max-tokens" json:"request-params-max-tokens"` // max_tokens parameter samples
 	// RequestSuccessTotal is the number of successful requests, key: finish-reason (stop, length, etc.).
 	RequestSuccessTotal map[string]int64 `yaml:"request-success-total" json:"request-success-total"`
+
+	// Latency histograms - have same buckets upper boundaries in seconds are:
+	// 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0,
+	// 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
+
+	// E2ERequestLatencyBucketValues is an array of values for e2e request latency buckets.
+	E2ERequestLatencyBucketValues []int `yaml:"e2erl-buckets-values" json:"e2erl-buckets-values"`
+	// ReqQueueTimeBucketValues is an array of values for request queue time buckets.
+	ReqQueueTimeBucketValues []int `yaml:"queue-time-buckets-values" json:"queue-time-buckets-values"`
+	// ReqInfTimeBucketValues is an array of values for request inference time buckets.
+	ReqInfTimeBucketValues []int `yaml:"inf-time-buckets-values" json:"inf-time-buckets-values"`
+	// ReqPrefillTimeBucketValues is an array of values for request prefill time buckets.
+	ReqPrefillTimeBucketValues []int `yaml:"prefill-time-buckets-values" json:"prefill-time-buckets-values"`
+	// ReqDecodeTimeBucketValues is an array of values for request decode time buckets.
+	ReqDecodeTimeBucketValues []int `yaml:"decode-time-buckets-values" json:"decode-time-buckets-values"`
 }

 type LorasMetrics struct {
@@ -588,6 +604,38 @@ func (c *Configuration) validate() error {
 				return errors.New("fake metrics request-params-max-tokens cannot contain negative values")
 			}
 		}
+
+		for _, v := range c.FakeMetrics.RequestParamsMaxTokens {
+			if v < 0 {
+				return errors.New("fake metrics request-params-max-tokens cannot contain negative values")
+			}
+		}
+
+		for _, v := range c.FakeMetrics.E2ERequestLatencyBucketValues {
+			if v < 0 {
+				return errors.New("fake metrics e2erl-buckets-values cannot contain negative values")
+			}
+		}
+		for _, v := range c.FakeMetrics.ReqQueueTimeBucketValues {
+			if v < 0 {
+				return errors.New("fake metrics queue-time-buckets-values cannot contain negative values")
+			}
+		}
+		for _, v := range c.FakeMetrics.ReqInfTimeBucketValues {
+			if v < 0 {
+				return errors.New("fake metrics inf-time-buckets-values cannot contain negative values")
+			}
+		}
+		for _, v := range c.FakeMetrics.ReqPrefillTimeBucketValues {
+			if v < 0 {
+				return errors.New("fake metrics prefill-time-buckets-values cannot contain negative values")
+			}
+		}
+		for _, v := range c.FakeMetrics.ReqDecodeTimeBucketValues {
+			if v < 0 {
+				return errors.New("fake metrics decode-time-buckets-values cannot contain negative values")
+			}
+		}
 	}

 	if c.DPSize < 1 || c.DPSize > 8 {
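Given the yaml tags added to the Metrics struct above, a fake latency-metrics fragment could look like the sketch below. This is an illustrative, hypothetical example (the surrounding fake-metrics configuration layout is not shown in this diff): it mirrors only the new fields and their yaml keys, and relies on the documented semantics that each array entry is the count for the corresponding bucket, with trailing buckets defaulting to 0.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// fakeLatencyMetrics mirrors only the new latency-histogram fields of the
// Metrics struct above (field names and yaml tags taken from the diff).
type fakeLatencyMetrics struct {
	E2ERequestLatencyBucketValues []int `yaml:"e2erl-buckets-values"`
	ReqQueueTimeBucketValues      []int `yaml:"queue-time-buckets-values"`
	ReqInfTimeBucketValues        []int `yaml:"inf-time-buckets-values"`
	ReqPrefillTimeBucketValues    []int `yaml:"prefill-time-buckets-values"`
	ReqDecodeTimeBucketValues     []int `yaml:"decode-time-buckets-values"`
}

func main() {
	// Hypothetical fragment: per-bucket counts; buckets not listed default to 0.
	doc := `
e2erl-buckets-values: [0, 3, 10, 5]
queue-time-buckets-values: [7, 2]
inf-time-buckets-values: [1, 4, 4]
prefill-time-buckets-values: [6]
decode-time-buckets-values: [0, 0, 9]
`
	var m fakeLatencyMetrics
	if err := yaml.Unmarshal([]byte(doc), &m); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", m)
}
```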

pkg/common/utils.go

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ var TTFTBucketsBoundaries = []float64{0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08
 var TPOTBucketsBoundaries = []float64{0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75,
 	1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0}

+var RequestLatencyBucketsBoundaries = []float64{0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0,
+	20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0}
+
 // ValidateContextWindow checks if the request fits within the model's context window
 // Returns validation result, actual completion tokens, and total tokens
 func ValidateContextWindow(promptTokens int, maxCompletionTokens *int64, maxModelLen int) (bool, int64, int64) {
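The commit message also mentions a reportHistogramValue helper used to record values into histogram metrics. Its implementation is not shown in this view, so the following is only a rough sketch of how the new RequestLatencyBucketsBoundaries could back a client_golang histogram such as vllm:e2e_request_latency_seconds; the metric variable, label set, and helper signature here are assumptions, not the simulator's actual code.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Bucket upper boundaries as added in pkg/common/utils.go.
var RequestLatencyBucketsBoundaries = []float64{0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0,
	20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0}

// e2eRequestLatency is a stand-in for the simulator's vllm:e2e_request_latency_seconds
// histogram; the real metric definition and labels may differ.
var e2eRequestLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "vllm:e2e_request_latency_seconds",
	Help:    "Histogram of end to end request latency in seconds.",
	Buckets: RequestLatencyBucketsBoundaries,
}, []string{"model_name"})

// reportHistogramValue records one observation on a histogram; this is a guess at the
// helper's intent, not its actual signature.
func reportHistogramValue(o prometheus.Observer, value float64) {
	o.Observe(value)
}

func main() {
	registry := prometheus.NewRegistry()
	registry.MustRegister(e2eRequestLatency)

	// Record a 1.7s end-to-end latency for one request of a hypothetical model.
	reportHistogramValue(e2eRequestLatency.WithLabelValues("my-model"), 1.7)
}
```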

pkg/llm-d-inference-sim/failures_test.go

Lines changed: 23 additions & 23 deletions
@@ -126,15 +126,15 @@ var _ = Describe("Failures", func() {
 		BeforeEach(func() {
 			ctx = context.Background()
 			var err error
-			client, err = startServerWithArgs(ctx, "", []string{
-				"cmd", "--model", model,
+			client, err = startServerWithArgs(ctx, []string{
+				"cmd", "--model", testModel,
 				"--failure-injection-rate", "100",
-			}, nil)
+			})
 			Expect(err).ToNot(HaveOccurred())
 		})

 		It("should always return an error response for chat completions", func() {
-			openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+			openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)
 			_, err := openaiClient.Chat.Completions.New(ctx, params)
 			Expect(err).To(HaveOccurred())

@@ -147,7 +147,7 @@ var _ = Describe("Failures", func() {
 		})

 		It("should always return an error response for text completions", func() {
-			openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+			openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)
 			_, err := openaiClient.Chat.Completions.New(ctx, params)
 			Expect(err).To(HaveOccurred())

@@ -164,16 +164,16 @@ var _ = Describe("Failures", func() {
 		BeforeEach(func() {
 			ctx = context.Background()
 			var err error
-			client, err = startServerWithArgs(ctx, "", []string{
-				"cmd", "--model", model,
+			client, err = startServerWithArgs(ctx, []string{
+				"cmd", "--model", testModel,
 				"--failure-injection-rate", "100",
 				"--failure-types", common.FailureTypeRateLimit,
-			}, nil)
+			})
 			Expect(err).ToNot(HaveOccurred())
 		})

 		It("should return only rate limit errors", func() {
-			openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+			openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)
 			_, err := openaiClient.Chat.Completions.New(ctx, params)
 			Expect(err).To(HaveOccurred())

@@ -182,24 +182,24 @@ var _ = Describe("Failures", func() {
 			Expect(ok).To(BeTrue())
 			Expect(openaiError.StatusCode).To(Equal(429))
 			Expect(openaiError.Type).To(Equal(openaiserverapi.ErrorCodeToType(429)))
-			Expect(strings.Contains(openaiError.Message, model)).To(BeTrue())
+			Expect(strings.Contains(openaiError.Message, testModel)).To(BeTrue())
 		})
 	})

 	Context("with multiple specific failure types", func() {
 		BeforeEach(func() {
 			ctx = context.Background()
 			var err error
-			client, err = startServerWithArgs(ctx, "", []string{
-				"cmd", "--model", model,
+			client, err = startServerWithArgs(ctx, []string{
+				"cmd", "--model", testModel,
 				"--failure-injection-rate", "100",
 				"--failure-types", common.FailureTypeInvalidAPIKey, common.FailureTypeServerError,
-			}, nil)
+			})
 			Expect(err).ToNot(HaveOccurred())
 		})

 		It("should return only specified error types", func() {
-			openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+			openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)

 			// Make multiple requests to verify we get the expected error types
 			for i := 0; i < 10; i++ {

@@ -222,35 +222,35 @@ var _ = Describe("Failures", func() {
 		BeforeEach(func() {
 			ctx = context.Background()
 			var err error
-			client, err = startServerWithArgs(ctx, "", []string{
-				"cmd", "--model", model,
+			client, err = startServerWithArgs(ctx, []string{
+				"cmd", "--model", testModel,
 				"--failure-injection-rate", "0",
-			}, nil)
+			})
 			Expect(err).ToNot(HaveOccurred())
 		})

 		It("should never return errors and behave like random mode", func() {
-			openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+			openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)
 			resp, err := openaiClient.Chat.Completions.New(ctx, params)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(resp.Choices).To(HaveLen(1))
 			Expect(resp.Choices[0].Message.Content).ToNot(BeEmpty())
-			Expect(resp.Model).To(Equal(model))
+			Expect(resp.Model).To(Equal(testModel))
 		})
 	})

 	Context("testing all predefined failure types", func() {
 		DescribeTable("should return correct error for each failure type",
 			func(failureType string, expectedStatusCode int, expectedErrorType string) {
 				ctx := context.Background()
-				client, err := startServerWithArgs(ctx, "", []string{
-					"cmd", "--model", model,
+				client, err := startServerWithArgs(ctx, []string{
+					"cmd", "--model", testModel,
 					"--failure-injection-rate", "100",
 					"--failure-types", failureType,
-				}, nil)
+				})
 				Expect(err).ToNot(HaveOccurred())

-				openaiClient, params := getOpenAIClientAndChatParams(client, model, userMessage, false)
+				openaiClient, params := getOpenAIClientAndChatParams(client, testModel, testUserMessage, false)
 				_, err = openaiClient.Chat.Completions.New(ctx, params)
 				Expect(err).To(HaveOccurred())
pkg/llm-d-inference-sim/lora_test.go

Lines changed: 6 additions & 6 deletions
@@ -34,22 +34,22 @@ var _ = Describe("LoRAs", func() {
 	Context("LoRAs config and load", func() {
 		It("Should config, load and load LoRAs correctly", func() {
 			ctx := context.TODO()
-			client, err := startServerWithArgs(ctx, "",
-				[]string{"cmd", "--model", model, "--mode", common.ModeEcho,
+			client, err := startServerWithArgs(ctx,
+				[]string{"cmd", "--model", testModel, "--mode", common.ModeEcho,
 					"--lora-modules", "{\"name\":\"lora3\",\"path\":\"/path/to/lora3\"}",
-					"{\"name\":\"lora4\",\"path\":\"/path/to/lora4\"}"}, nil)
+					"{\"name\":\"lora4\",\"path\":\"/path/to/lora4\"}"})
 			Expect(err).NotTo(HaveOccurred())

 			// Request to lora3
-			openaiclient, params := getOpenAIClientAndChatParams(client, "lora3", userMessage, false)
+			openaiclient, params := getOpenAIClientAndChatParams(client, "lora3", testUserMessage, false)
 			resp, err := openaiclient.Chat.Completions.New(ctx, params)
 			Expect(err).ToNot(HaveOccurred())

 			Expect(resp.Choices).ShouldNot(BeEmpty())
 			Expect(string(resp.Object)).To(Equal(chatCompletionObject))

 			msg := resp.Choices[0].Message.Content
-			Expect(msg).Should(Equal(userMessage))
+			Expect(msg).Should(Equal(testUserMessage))

 			// Unknown model, should return 404
 			params.Model = "lora1"

@@ -88,7 +88,7 @@ var _ = Describe("LoRAs", func() {
 			Expect(string(resp.Object)).To(Equal(chatCompletionObject))

 			msg = resp.Choices[0].Message.Content
-			Expect(msg).Should(Equal(userMessage))
+			Expect(msg).Should(Equal(testUserMessage))

 			// Unload lora3
 			payload = map[string]string{