Commit a51e074
SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment (#1839)

* Add latency predictor plugins, deployment, and runner.go integration
* Update Dockerfile; fix issues with SLO context not being set when prediction is off
* Remove outdated inferencepool-resources deployment
* Fix streamed request handler being called one final time after the request completes; add a predictor check to the beginning of each requestcontrol hook
* Add guide, update Helm charts and README, minor scorer changes
* Make small guide update
* Add Helm values and polish README and SLO routing guide
* Clean up errors from rebase, add running-requests metric to datasource, add predictor to the new two-phase configuration parser
* Fix EPP image and add placeholder Docker repos for latency sidecars
* Update guide, README, and values.yaml
* Move predictor setup logic into the plugin
* Move predictor startup logic completely out of the manager and into the plugin, running its routines there; move the predictor Helm section into a new tpl file; rename the slo-aware-routing guide and names in docs
* Remove max-score-picker from the list of plugin types in the Helm chart
* Fix formatting
* Revert go.mod to main
* Fix typo in config, remove deprecated runtime flag
* Rename latency prediction plugins, change docs accordingly, make sidecars not fail immediately during EPP spin-up
* Update docs with new total running requests metric
* Small plugin bugfix
1 parent ecf1139 commit a51e074

File tree

28 files changed: +818 −60 lines changed

Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ COPY internal ./internal
 COPY apix ./apix
 COPY api ./api
 COPY version ./version
+COPY sidecars ./sidecars
 WORKDIR /src/cmd/epp
 RUN go build -ldflags="-X sigs.k8s.io/gateway-api-inference-extension/version.CommitSHA=${COMMIT_SHA} -X sigs.k8s.io/gateway-api-inference-extension/version.BuildRef=${BUILD_REF}" -o /epp

cmd/epp/runner/runner.go

Lines changed: 4 additions & 0 deletions
@@ -69,6 +69,7 @@ import (
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/saturationdetector"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/multi/prefix"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/multi/slo_aware_router"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/picker"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/profile"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework/plugins/scorer"
@@ -430,6 +431,9 @@ func (r *Runner) registerInTreePlugins() {
 	plugins.Register(scorer.KvCacheUtilizationScorerType, scorer.KvCacheUtilizationScorerFactory)
 	plugins.Register(scorer.QueueScorerType, scorer.QueueScorerFactory)
 	plugins.Register(scorer.LoraAffinityScorerType, scorer.LoraAffinityScorerFactory)
+	// Latency predictor plugins
+	plugins.Register(slo_aware_router.SLOAwareRouterPluginType, slo_aware_router.SLOAwareRouterFactory)
+	plugins.Register(profile.SLOAwareProfileHandlerType, profile.SLOAwareProfileHandlerFactory)
 	// register filter for test purpose only (used in conformance tests)
 	plugins.Register(testfilter.HeaderBasedTestingFilterType, testfilter.HeaderBasedTestingFilterFactory)
 	// register response received plugin for test purpose only (used in conformance tests)
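The registration calls in this hunk follow a simple name-to-factory registry pattern: a plugin type string maps to a factory that constructs the plugin. A minimal, self-contained sketch of that pattern (the `Plugin` interface and `Factory` signature here are simplified stand-ins, not the framework's actual types):

```go
package main

import "fmt"

// Plugin is a stand-in for the framework's plugin interface
// (assumption: the real interface carries scheduling hooks too).
type Plugin interface{ TypedName() string }

// Factory builds a plugin instance, mirroring the second argument
// of plugins.Register(type, factory) in runner.go.
type Factory func() Plugin

// registry maps a plugin type string to its factory.
var registry = map[string]Factory{}

func Register(pluginType string, f Factory) { registry[pluginType] = f }

type sloAwareRouter struct{}

func (s sloAwareRouter) TypedName() string { return "slo-aware-router" }

func main() {
	// Register, then look up by type string and construct — the same
	// flow the EPP runner uses when parsing its plugin configuration.
	Register("slo-aware-router", func() Plugin { return sloAwareRouter{} })
	p := registry["slo-aware-router"]()
	fmt.Println(p.TypedName()) // → slo-aware-router
}
```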

config/charts/inferencepool/README.md

Lines changed: 29 additions & 0 deletions
@@ -121,6 +121,35 @@
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
+
+### Install with Latency-Based Routing
+
+For full details, see the dedicated [Latency-Based Routing Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor.md).
+
+#### Latency-Based Router Configuration
+
+The behavior of the latency-based router can be fine-tuned using the configuration parameters under `inferenceExtension.latencyPredictor.sloAwareRouting` in your `values.yaml` file.
+
+| Parameter | Description | Default |
+| --------- | ----------- | ------- |
+| `samplingMean` | The sampling mean (lambda) for the Poisson distribution of token sampling. | `100.0` |
+| `maxSampledTokens` | The maximum number of tokens to sample for TPOT prediction. | `20` |
+| `sloBufferFactor` | A buffer applied to the SLO to make it more or less strict. | `1.0` |
+| `negHeadroomTTFTWeight` | The weight given to TTFT when a pod has negative headroom. | `0.8` |
+| `negHeadroomTPOTWeight` | The weight given to TPOT when a pod has negative headroom. | `0.2` |
+| `headroomTTFTWeight` | The weight given to TTFT when a pod has positive headroom. | `0.8` |
+| `headroomTPOTWeight` | The weight given to TPOT when a pod has positive headroom. | `0.2` |
+| `headroomSelectionStrategy` | The strategy for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
+| `compositeKVWeight` | The weight for KV cache in the composite score. | `1.0` |
+| `compositeQueueWeight` | The weight for queue size in the composite score. | `1.0` |
+| `compositePrefixWeight` | The weight for prefix cache in the composite score. | `1.0` |
+| `epsilonExploreSticky` | Exploration factor for sticky sessions. | `0.01` |
+| `epsilonExploreNeg` | Exploration factor for negative headroom. | `0.01` |
+| `affinityGateTau` | Affinity gate threshold. | `0.80` |
+| `affinityGateTauGlobal` | Global affinity gate threshold. | `0.99` |
+| `selectionMode` | The selection mode (e.g., `linear`). | `linear` |
+
+**Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
+
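As a rough illustration of how the `compositeKVWeight`, `compositeQueueWeight`, and `compositePrefixWeight` parameters might interact (this is an assumption from the parameter names, not the plugin's actual scoring code — the real scorer's normalization may differ), a composite strategy presumably blends normalized per-pod signals like so:

```go
package main

import "fmt"

// compositeScore blends per-pod signals using the compositeKVWeight,
// compositeQueueWeight, and compositePrefixWeight chart parameters.
// Hypothetical sketch: lower KV utilization and queue load score
// higher, a higher prefix-cache hit ratio scores higher.
func compositeScore(kvUtil, queueLoad, prefixHit, wKV, wQueue, wPrefix float64) float64 {
	score := wKV*(1-kvUtil) + wQueue*(1-queueLoad) + wPrefix*prefixHit
	return score / (wKV + wQueue + wPrefix) // normalize to [0, 1]
}

func main() {
	// Pod A: busy but with a strong prefix-cache hit; Pod B: idle, cold cache.
	// With the default equal weights, B's idleness outweighs A's cache hit.
	fmt.Printf("podA=%.3f podB=%.3f\n",
		compositeScore(0.8, 0.5, 0.9, 1.0, 1.0, 1.0),
		compositeScore(0.1, 0.0, 0.0, 1.0, 1.0, 1.0))
	// → podA=0.533 podB=0.633
}
```

Raising `compositePrefixWeight` relative to the other two would flip this comparison in favor of the cache-warm pod.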
### Install with High Availability (HA)

To deploy the EndpointPicker in a high-availability (HA) active-passive configuration set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+{{/*
+Latency Predictor Env
+*/}}
+{{- define "gateway-api-inference-extension.latencyPredictor.env" -}}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+- name: PREDICTION_SERVER_URL
+  value: "{{- $count := int .Values.inferenceExtension.latencyPredictor.predictionServers.count -}}
+  {{- $startPort := int .Values.inferenceExtension.latencyPredictor.predictionServers.startPort -}}
+  {{- range $i := until $count -}}
+  {{- if $i }},{{ end }}http://localhost:{{ add $startPort $i }}
+  {{- end }}"
+- name: TRAINING_SERVER_URL
+  value: "http://localhost:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.eppEnv }}
+- name: {{ $key }}
+  value: {{ $value | quote }}
+{{- end }}
+{{- end }}
+{{- end }}
+
+{{/*
+Latency Predictor Sidecar Containers
+*/}}
+{{- define "gateway-api-inference-extension.latencyPredictor.containers" -}}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+# Training Server Sidecar Container
+- name: training-server
+  image: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.hub }}/{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.name }}:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.tag }}
+  imagePullPolicy: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.pullPolicy }}
+  ports:
+  - containerPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+    name: training-port
+  livenessProbe:
+    {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.livenessProbe | nindent 4 }}
+  readinessProbe:
+    {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.readinessProbe | nindent 4 }}
+  resources:
+    {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.resources | nindent 4 }}
+  envFrom:
+  - configMapRef:
+      name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+  env:
+  - name: POD_NAME
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.name
+  - name: SERVER_TYPE
+    value: "training"
+  volumeMounts:
+  - name: training-server-storage
+    mountPath: /models
+{{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+# Prediction Server Sidecar Container {{ add $i 1 }}
+- name: prediction-server-{{ add $i 1 }}
+  image: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.hub }}/{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.name }}:{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.tag }}
+  imagePullPolicy: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.pullPolicy }}
+  command: ["uvicorn"]
+  args: ["prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"]
+  ports:
+  - containerPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+    name: predict-port-{{ add $i 1 }}
+  livenessProbe:
+    httpGet:
+      path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.httpGet.path }}
+      port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+    initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.initialDelaySeconds }}
+    periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.periodSeconds }}
+  readinessProbe:
+    httpGet:
+      path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.httpGet.path }}
+      port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+    initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.initialDelaySeconds }}
+    periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.periodSeconds }}
+    failureThreshold: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.failureThreshold }}
+  resources:
+    {{- toYaml $.Values.inferenceExtension.latencyPredictor.predictionServers.resources | nindent 4 }}
+  envFrom:
+  - configMapRef:
+      name: {{ include "gateway-api-inference-extension.name" $ }}-latency-predictor-prediction
+  env:
+  - name: PREDICT_PORT
+    value: "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"
+  - name: POD_NAME
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.name
+  - name: SERVER_TYPE
+    value: "prediction-{{ add $i 1 }}"
+  - name: TRAINING_SERVER_URL
+    value: "http://localhost:{{ $.Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+  volumeMounts:
+  - name: prediction-server-{{ add $i 1 }}-storage
+    mountPath: /server_models
+{{- end }}
+{{- end }}
+{{- end }}
+
+{{/*
+Latency Predictor Volumes
+*/}}
+{{- define "gateway-api-inference-extension.latencyPredictor.volumes" -}}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+- name: training-server-storage
+  emptyDir:
+    sizeLimit: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.volumeSize }}
+{{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+- name: prediction-server-{{ add $i 1 }}-storage
+  emptyDir:
+    sizeLimit: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.volumeSize }}
+{{- end }}
+{{- end }}
+{{- end }}
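The `PREDICTION_SERVER_URL` template in this new file renders a comma-separated list of localhost URLs, one per prediction-server sidecar. The Helm `range`/`until` loop is equivalent to this small Go sketch (`count` and `startPort` stand in for the chart's `predictionServers.count` and `predictionServers.startPort` values):

```go
package main

import (
	"fmt"
	"strings"
)

// predictionServerURLs mirrors the Helm loop that builds
// PREDICTION_SERVER_URL: one http://localhost:<port> entry per
// prediction-server sidecar, joined by commas.
func predictionServerURLs(count, startPort int) string {
	urls := make([]string, 0, count)
	for i := 0; i < count; i++ {
		urls = append(urls, fmt.Sprintf("http://localhost:%d", startPort+i))
	}
	return strings.Join(urls, ",")
}

func main() {
	// Chart defaults are count=10, startPort=8001; use 3 here for brevity.
	fmt.Println(predictionServerURLs(3, 8001))
	// → http://localhost:8001,http://localhost:8002,http://localhost:8003
}
```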

config/charts/inferencepool/templates/epp-config.yaml

Lines changed: 61 additions & 0 deletions
@@ -11,7 +11,45 @@ data:
     - type: queue-scorer
     - type: kv-cache-utilization-scorer
     - type: prefix-cache-scorer
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - type: predicted-latency-scorer
+      parameters:
+{{- with .Values.inferenceExtension.latencyPredictor.sloAwareRouting | default dict }}
+        samplingMean: {{ .samplingMean | default 100.0 }}
+        maxSampledTokens: {{ .maxSampledTokens | default 20 }}
+        sloBufferFactor: {{ .sloBufferFactor | default 1.0 }}
+        negHeadroomTTFTWeight: {{ .negHeadroomTTFTWeight | default 0.8 }}
+        negHeadroomTPOTWeight: {{ .negHeadroomTPOTWeight | default 0.2 }}
+        headroomTTFTWeight: {{ .headroomTTFTWeight | default 0.8 }}
+        headroomTPOTWeight: {{ .headroomTPOTWeight | default 0.2 }}
+        headroomSelectionStrategy: {{ .headroomSelectionStrategy | default "least" | quote }}
+        compositeKVWeight: {{ .compositeKVWeight | default 1.0 }}
+        compositeQueueWeight: {{ .compositeQueueWeight | default 1.0 }}
+        compositePrefixWeight: {{ .compositePrefixWeight | default 1.0 }}
+        epsilonExploreSticky: {{ .epsilonExploreSticky | default 0.01 }}
+        epsilonExploreNeg: {{ .epsilonExploreNeg | default 0.01 }}
+        affinityGateTau: {{ .affinityGateTau | default 0.80 }}
+        affinityGateTauGlobal: {{ .affinityGateTauGlobal | default 0.99 }}
+        selectionMode: {{ .selectionMode | default "linear" | quote }}
+{{- end }}
+    - type: predicted-latency-profile-handler
+{{- end }}
     schedulingProfiles:
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - name: predicted-latency-prefix
+      plugins:
+      - pluginRef: prefix-cache-scorer
+    - name: predicted-latency-no-routing
+      plugins:
+      - pluginRef: prefix-cache-scorer
+      - pluginRef: predicted-latency-scorer
+        weight: 0
+      - pluginRef: queue-scorer
+      - pluginRef: kv-cache-utilization-scorer
+    - name: predicted-latency-routing
+      plugins:
+      - pluginRef: predicted-latency-scorer
+{{- else }}
     - name: default
       plugins:
       - pluginRef: queue-scorer
@@ -20,6 +58,7 @@ data:
         weight: 2
       - pluginRef: prefix-cache-scorer
         weight: 3
+{{- end }}
 {{- if (hasKey .Values.inferenceExtension "pluginsCustomConfig") }}
 {{- .Values.inferenceExtension.pluginsCustomConfig | toYaml | nindent 2 }}
 {{- end }}
@@ -34,3 +73,25 @@ metadata:
 data:
 {{- .Values.inferenceExtension.sidecar.configMap.data | toYaml | nindent 2 }}
 {{- end }}
+---
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+  namespace: {{ .Release.Namespace }}
+data:
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.trainingServer.config }}
+  {{ $key }}: {{ $value | quote }}
+{{- end }}
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-prediction
+  namespace: {{ .Release.Namespace }}
+data:
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.predictionServers.config }}
+  {{ $key }}: {{ $value | quote }}
+{{- end }}
+{{- end }}

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 3 additions & 0 deletions
@@ -151,6 +151,7 @@ spec:
           valueFrom:
             fieldRef:
               fieldPath: metadata.name
+        {{- include "gateway-api-inference-extension.latencyPredictor.env" . | nindent 8 }}
         {{- if .Values.inferenceExtension.tracing.enabled }}
         - name: OTEL_SERVICE_NAME
           value: "gateway-api-inference-extension"
@@ -181,13 +182,15 @@ spec:
         volumeMounts:
         - name: plugins-config-volume
           mountPath: "/config"
+      {{- include "gateway-api-inference-extension.latencyPredictor.containers" . | nindent 6 }}
       volumes:
      {{- if .Values.inferenceExtension.sidecar.volumes }}
      {{- tpl (toYaml .Values.inferenceExtension.sidecar.volumes) $ | nindent 6 }}
      {{- end }}
      - name: plugins-config-volume
        configMap:
          name: {{ include "gateway-api-inference-extension.name" . }}
+      {{- include "gateway-api-inference-extension.latencyPredictor.volumes" . | nindent 6 }}
      {{- if .Values.inferenceExtension.affinity }}
      affinity:
        {{- toYaml .Values.inferenceExtension.affinity | nindent 8 }}

config/charts/inferencepool/values.yaml

Lines changed: 83 additions & 0 deletions
@@ -71,6 +71,89 @@ inferenceExtension:
     sampler: "parentbased_traceidratio"
     samplerArg: "0.1"
 
+  # Latency Predictor Configuration
+  latencyPredictor:
+    enabled: false
+
+    # Training Server Configuration
+    trainingServer:
+      image:
+        hub: path/to/your/docker/repo # NOTE: Update with your Docker repository path for sidecars
+        name: latencypredictor-training-server
+        tag: latest
+        pullPolicy: Always
+      port: 8000
+      resources:
+        requests:
+          cpu: "2000m"
+          memory: "4Gi"
+        limits:
+          cpu: "4000m"
+          memory: "8Gi"
+      livenessProbe:
+        httpGet:
+          path: /healthz
+          port: 8000
+        initialDelaySeconds: 30
+        periodSeconds: 20
+      readinessProbe:
+        httpGet:
+          path: /readyz
+          port: 8000
+        initialDelaySeconds: 45
+        periodSeconds: 10
+      volumeSize: "20Gi"
+      config:
+        LATENCY_RETRAINING_INTERVAL_SEC: "1"
+        LATENCY_MIN_SAMPLES_FOR_RETRAIN: "100"
+        LATENCY_TTFT_MODEL_PATH: "/models/ttft.joblib"
+        LATENCY_TPOT_MODEL_PATH: "/models/tpot.joblib"
+        LATENCY_TTFT_SCALER_PATH: "/models/ttft_scaler.joblib"
+        LATENCY_TPOT_SCALER_PATH: "/models/tpot_scaler.joblib"
+        LATENCY_MODEL_TYPE: "xgboost"
+        LATENCY_MAX_TRAINING_DATA_SIZE_PER_BUCKET: "5000"
+        LATENCY_QUANTILE_ALPHA: "0.9"
+
+    # Prediction Server Configuration
+    predictionServers:
+      count: 10
+      startPort: 8001
+      image:
+        hub: path/to/your/docker/repo # NOTE: Update with your Docker repository path for sidecars
+        name: latencypredictor-prediction-server
+        tag: latest
+        pullPolicy: Always
+      resources:
+        requests:
+          cpu: "500m"
+          memory: "1Gi"
+        limits:
+          cpu: "1000m"
+          memory: "2Gi"
+      livenessProbe:
+        httpGet:
+          path: /healthz
+        initialDelaySeconds: 15
+        periodSeconds: 15
+      readinessProbe:
+        httpGet:
+          path: /readyz
+        initialDelaySeconds: 10
+        periodSeconds: 5
+        failureThreshold: 10
+      volumeSize: "10Gi"
+      config:
+        LATENCY_MODEL_TYPE: "xgboost"
+        PREDICT_HOST: "0.0.0.0"
+        LOCAL_TTFT_MODEL_PATH: "/server_models/ttft.joblib"
+        LOCAL_TPOT_MODEL_PATH: "/server_models/tpot.joblib"
+        LOCAL_TTFT_SCALER_PATH: "/server_models/ttft_scaler.joblib"
+        LOCAL_TPOT_SCALER_PATH: "/server_models/tpot_scaler.joblib"
+
+    # EPP Environment Variables for Latency Predictor
+    eppEnv:
+      LATENCY_MAX_SAMPLE_SIZE: "10000"
+
 inferencePool:
   targetPorts:
   - number: 8000
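To turn the feature on, a user-supplied values override only needs to flip the flag and point the sidecar images at a real registry, since the chart ships placeholder `hub` paths. A minimal sketch, assuming the value keys added above (the registry path shown is hypothetical):

```yaml
inferenceExtension:
  latencyPredictor:
    enabled: true
    trainingServer:
      image:
        hub: us-docker.pkg.dev/my-project/sidecars  # hypothetical registry
    predictionServers:
      count: 4  # run fewer prediction sidecars than the default 10
      image:
        hub: us-docker.pkg.dev/my-project/sidecars  # hypothetical registry
```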

docs/proposals/003-model-server-protocol/README.md

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ effort.
 | Metric | Type | Description | vLLM metric | Triton TensorRT-LLM | SGLang |
 | ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` | `nv_trt_llm_request_metrics{request_type=waiting}` | `sglang:num_queue_reqs` |
+| TotalRunningRequests | Gauge | The current total number of requests actively being served on the model server. | `vllm:num_requests_running` | `nv_trt_llm_request_metrics{request_type=scheduled}` | `sglang:num_running_reqs` |
 | KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` | `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}` | `sglang:token_usage` |
 | [Optional] BlockSize | Labeled | The block size in tokens to allocate memory, used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin). | name: `vllm:cache_config_info`, label name: `block_size` | | |
 | [Optional] NumGPUBlocks | Labeled | The total number of blocks in the HBM KV cache, used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin). | name: `vllm:cache_config_info`, label name: `num_gpu_blocks` | | |
