Dockerfile (1 line changed: 1 addition, 0 deletions)
@@ -24,6 +24,7 @@ COPY internal ./internal
 COPY apix ./apix
 COPY api ./api
 COPY version ./version
+COPY sidecars ./sidecars
 WORKDIR /src/cmd/epp
 RUN go build -ldflags="-X sigs.k8s.io/gateway-api-inference-extension/version.CommitSHA=${COMMIT_SHA} -X sigs.k8s.io/gateway-api-inference-extension/version.BuildRef=${BUILD_REF}" -o /epp
totalQueuedRequestsMetric = flag.String("total-queued-requests-metric", runserver.DefaultTotalQueuedRequestsMetric, "Prometheus metric for the number of queued requests.")
+	totalRunningRequestsMetric = flag.String("total-running-requests-metric", runserver.DefaultTotalRunningRequestsMetric, "Prometheus metric for the number of running requests.")
 	kvCacheUsagePercentageMetric = flag.String("kv-cache-usage-percentage-metric", runserver.DefaultKvCacheUsagePercentageMetric, "Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).")
 	// LoRA metrics
 	loraInfoMetric = flag.String("lora-info-metric", runserver.DefaultLoraInfoMetric, "Prometheus metric for the LoRA info metrics (must be in vLLM label format).")
@@ -139,8 +141,9 @@ var (
 	configFile = flag.String("config-file", runserver.DefaultConfigFile, "The path to the configuration file")
 	configText = flag.String("config-text", runserver.DefaultConfigText, "The configuration specified as text, in lieu of a file")
 
-	modelServerMetricsPort = flag.Int("model-server-metrics-port", 0, "Port to scrape metrics from pods. "+
-		"Default value will be set to the InferencePool.Spec.TargetPorts[0].Number if not set.")
+	modelServerMetricsPort = flag.Int("model-server-metrics-port", 0, "[DEPRECATED] Port to scrape metrics from pods. "+
+		"Default value will be set to the InferencePool.Spec.TargetPorts[0].Number if not set. "+
+		"This option will be removed in the next release.")
 	modelServerMetricsPath = flag.String("model-server-metrics-path", "/metrics", "Path to scrape metrics from pods")
 	modelServerMetricsScheme = flag.String("model-server-metrics-scheme", "http", "Scheme to scrape metrics from pods")
 	modelServerMetricsHttpsInsecureSkipVerify = flag.Bool("model-server-metrics-https-insecure-skip-verify", true, "When using 'https' scheme for 'model-server-metrics-scheme', configure 'InsecureSkipVerify' (default to true)")
Alternatively, you can define flags in the `values.yaml` file:
+
+```yaml
+bbr:
+  flags:
+    FLAG_NAME: <FLAG_VALUE>
+    v: 3 ## Log verbosity
+    ...
+```
+
 ## Uninstall
 
 Run the following command to uninstall the chart:
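To illustrate what passing the `bbr.flags` map through to the binary could look like, here is a minimal sketch that turns a map of flag names and values into `--name=value` arguments. The chart's actual template is not shown in this diff, so the helper `flagsToArgs` and the `--name=value` rendering are assumptions, not the chart's implementation. Names are sorted so the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// flagsToArgs is a hypothetical helper: it renders a flags map (as it might
// be read from values.yaml) into command-line arguments. Go map iteration
// order is random, so names are sorted first to keep the result stable.
func flagsToArgs(flags map[string]any) []string {
	names := make([]string, 0, len(flags))
	for name := range flags {
		names = append(names, name)
	}
	sort.Strings(names)

	args := make([]string, 0, len(names))
	for _, name := range names {
		args = append(args, fmt.Sprintf("--%s=%v", name, flags[name]))
	}
	return args
}

func main() {
	// Mirrors the placeholder entries from the values.yaml snippet above.
	args := flagsToArgs(map[string]any{
		"FLAG_NAME": "FLAG_VALUE",
		"v":         3,
	})
	fmt.Println(args) // prints [--FLAG_NAME=FLAG_VALUE --v=3]
}
```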
@@ -46,6 +67,7 @@ The following table list the configurable parameters of the chart.
 | `bbr.image.hub` | Registry URL where the image is hosted. |
 | `bbr.image.tag` | Image tag. |
 | `bbr.image.pullPolicy` | Image pull policy for the container. Possible values: `Always`, `IfNotPresent`, or `Never`. Defaults to `Always`. |
+| `bbr.flags` | Map of flags that are passed through to bbr. Refer to [runner.go](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/cmd/bbr/runner/runner.go) for the complete list. |
 | `provider.name` | Name of the Inference Gateway implementation being used. Possible values: `istio`, `gke`. Defaults to `none`. |
 | `inferenceGateway.name` | The name of the Gateway. Defaults to `inference-gateway`. |
For full details, see the dedicated [Latency-Based Routing Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor.md).
+
+#### Latency-Based Router Configuration
+
+The behavior of the latency-based router can be fine-tuned using the configuration parameters under `inferenceExtension.latencyPredictor.sloAwareRouting` in your `values.yaml` file.
| `samplingMean` | The sampling mean (lambda) for the Poisson distribution of token sampling. | `100.0` |
+| `maxSampledTokens` | The maximum number of tokens to sample for TPOT prediction. | `20` |
+| `sloBufferFactor` | A buffer to apply to the SLO to make it more or less strict. | `1.0` |
+| `negHeadroomTTFTWeight` | The weight to give to the TTFT when a pod has negative headroom. | `0.8` |
+| `negHeadroomTPOTWeight` | The weight to give to the TPOT when a pod has negative headroom. | `0.2` |
+| `headroomTTFTWeight` | The weight to give to the TTFT when a pod has positive headroom. | `0.8` |
+| `headroomTPOTWeight` | The weight to give to the TPOT when a pod has positive headroom. | `0.2` |
+| `headroomSelectionStrategy` | The strategy to use for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
+| `compositeKVWeight` | The weight for KV cache in the composite score. | `1.0` |
+| `compositeQueueWeight` | The weight for queue size in the composite score. | `1.0` |
+| `compositePrefixWeight` | The weight for prefix cache in the composite score. | `1.0` |
| `affinityGateTauGlobal` | Global affinity gate threshold. | `0.99` |
+| `selectionMode` | The mode for selection (e.g., "linear"). | `linear` |
+
+**Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
+
 ### Install with High Availability (HA)
 
 To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
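The `sloAwareRouting` parameters added in the latency-based router table above can be mirrored in a small Go struct, which makes the documented defaults and the allowed `headroomSelectionStrategy` values easy to check. This is a sketch under stated assumptions: the type `SLOAwareRoutingConfig`, its field names, and the non-negative-weight rule are hypothetical and not taken from the project. Only the default values and the strategy names come from the table.

```go
package main

import (
	"errors"
	"fmt"
)

// SLOAwareRoutingConfig is a hypothetical mirror of some keys under
// inferenceExtension.latencyPredictor.sloAwareRouting. Field names are
// assumptions made for illustration.
type SLOAwareRoutingConfig struct {
	SamplingMean              float64
	MaxSampledTokens          int
	SLOBufferFactor           float64
	NegHeadroomTTFTWeight     float64
	NegHeadroomTPOTWeight     float64
	HeadroomTTFTWeight        float64
	HeadroomTPOTWeight        float64
	HeadroomSelectionStrategy string
}

// DefaultSLOAwareRoutingConfig returns the defaults listed in the table.
func DefaultSLOAwareRoutingConfig() SLOAwareRoutingConfig {
	return SLOAwareRoutingConfig{
		SamplingMean:              100.0,
		MaxSampledTokens:          20,
		SLOBufferFactor:           1.0,
		NegHeadroomTTFTWeight:     0.8,
		NegHeadroomTPOTWeight:     0.2,
		HeadroomTTFTWeight:        0.8,
		HeadroomTPOTWeight:        0.2,
		HeadroomSelectionStrategy: "least",
	}
}

// validStrategies holds the options documented for headroomSelectionStrategy.
var validStrategies = map[string]bool{
	"least":           true,
	"most":            true,
	"composite-least": true,
	"composite-most":  true,
	"composite-only":  true,
}

// Validate rejects unknown strategies. The non-negative weight check is an
// assumed sanity rule, not a documented constraint.
func (c SLOAwareRoutingConfig) Validate() error {
	if !validStrategies[c.HeadroomSelectionStrategy] {
		return fmt.Errorf("unknown headroomSelectionStrategy %q", c.HeadroomSelectionStrategy)
	}
	weights := []float64{
		c.NegHeadroomTTFTWeight, c.NegHeadroomTPOTWeight,
		c.HeadroomTTFTWeight, c.HeadroomTPOTWeight,
	}
	for _, w := range weights {
		if w < 0 {
			return errors.New("headroom weights must be non-negative")
		}
	}
	return nil
}

func main() {
	cfg := DefaultSLOAwareRoutingConfig()
	cfg.HeadroomSelectionStrategy = "composite-least"
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```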