In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend automatically tracks which Backends have connectivity issues and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request may be migrated to another Backend if connectivity to the current Backend is lost.

For example, `--migration-limit=3` indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, should the Frontend detect a connectivity issue with the current Backend.

The migrated request continues responding to the original request, which allows a seamless transition between Backends and reduces the overall request failure rate at the Frontend, improving the user experience.
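
The bounded migration described above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the Frontend's actual code: `BackendUnreachable`, `run_with_migration`, and the `down`/`up` stand-in Backends are hypothetical names introduced for the example.

```python
from typing import Callable, Sequence


class BackendUnreachable(Exception):
    """Hypothetical error raised when the link to a Backend is lost."""


def run_with_migration(
    backends: Sequence[Callable[[str], str]],
    prompt: str,
    migration_limit: int,
) -> str:
    """Send `prompt` to a Backend, migrating at most `migration_limit` times."""
    if not backends:
        raise ValueError("at least one backend is required")
    pool = iter(backends)
    backend = next(pool)
    migrations = 0
    while True:
        try:
            return backend(prompt)
        except BackendUnreachable:
            replacement = next(pool, None)
            # Fail the request once the limit is used up or no Backend is left.
            if migrations >= migration_limit or replacement is None:
                raise
            backend = replacement
            migrations += 1


def down(prompt: str) -> str:
    """Stand-in Backend that is always unreachable."""
    raise BackendUnreachable("connection lost")


def up(prompt: str) -> str:
    """Stand-in Backend that always answers."""
    return f"response to {prompt!r}"


if __name__ == "__main__":
    # Two failed Backends are skipped because the limit of 3 allows it.
    print(run_with_migration([down, down, up], "hello", migration_limit=3))
```

With `migration_limit=1` the same call would fail, because the second unreachable Backend exhausts the single allowed migration.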

    help="Max model context length. Defaults to models max, usually model_max_length from tokenizer_config.json. Reducing this reduces VRAM requirements.",
)
parser.add_argument(
    "--migration-limit",
    type=int,
    default=0,
    help="Maximum number of times a request may be migrated to a different engine worker. The number may be overridden by the engine.",
)
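
To see how the new argument behaves when parsed, here is a small self-contained sketch using the standard `argparse` module. The `build_parser` helper and the parser description are assumptions made for the example; only the `--migration-limit` definition mirrors the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser that contains only the new flag."""
    parser = argparse.ArgumentParser(description="Hypothetical worker launcher")
    parser.add_argument(
        "--migration-limit",
        type=int,
        default=0,
        help="Maximum number of times a request may be migrated to a "
        "different engine worker. The number may be overridden by the engine.",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    # argparse exposes the dashed flag as the attribute `migration_limit`.
    print(parser.parse_args(["--migration-limit=3"]).migration_limit)
    # Without the flag, the default from the diff applies.
    print(parser.parse_args([]).migration_limit)
```

Both `--migration-limit=3` and `--migration-limit 3` are accepted by `argparse`. With the default of 0, the value read from the command line allows no migrations unless the engine overrides it.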
components/backends/sglang/README.md: 16 additions & 0 deletions
@@ -139,6 +139,22 @@ cd $DYNAMO_ROOT/components/backends/sglang
```
./launch/disagg_dp_attn.sh
```

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend automatically tracks which Backends have connectivity issues and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request may be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.sglang ... --migration-limit=3
```
indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, should the Frontend detect a connectivity issue with the current Backend.

The migrated request continues responding to the original request, which allows a seamless transition between Backends and reduces the overall request failure rate at the Frontend, improving the user experience.
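
The routing behavior for new requests (skip Backends with known connectivity issues) could look roughly like the sketch below. It is not Dynamo's router: the `Router` class, its method names, and the round-robin policy are all assumptions chosen to keep the example small.

```python
import itertools
from typing import Iterable, List, Set


class Router:
    """Hypothetical round-robin router that skips unhealthy Backends."""

    def __init__(self, backends: Iterable[str]) -> None:
        self.backends: List[str] = list(backends)
        self.unhealthy: Set[str] = set()
        self._counter = itertools.count()

    def mark_unhealthy(self, name: str) -> None:
        """Record a connectivity issue so new requests avoid this Backend."""
        self.unhealthy.add(name)

    def mark_healthy(self, name: str) -> None:
        """Make the Backend eligible for new requests again."""
        self.unhealthy.discard(name)

    def pick(self) -> str:
        """Return the next Backend that has no known connectivity issue."""
        healthy = [b for b in self.backends if b not in self.unhealthy]
        if not healthy:
            raise RuntimeError("no healthy backend available")
        return healthy[next(self._counter) % len(healthy)]


if __name__ == "__main__":
    router = Router(["backend-a", "backend-b", "backend-c"])
    router.mark_unhealthy("backend-b")
    # Only backend-a and backend-c are returned while backend-b is marked.
    print([router.pick() for _ in range(4)])
```

Keeping the unhealthy set separate from the Backend list makes it cheap to restore a Backend once connectivity returns.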

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend automatically tracks which Backends have connectivity issues and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request may be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.trtllm ... --migration-limit=3
```
indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, should the Frontend detect a connectivity issue with the current Backend.

The migrated request continues responding to the original request, which allows a seamless transition between Backends and reduces the overall request failure rate at the Frontend, improving the user experience.
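
One way a migrated request might keep answering the original request is for the Frontend to remember what it has already streamed and ask the next Backend to continue from there. The source does not describe the mechanism, so the sketch below is purely an assumption: `ConnectionLost`, `make_backend`, and `stream_with_migration` are hypothetical, and the toy Backend simply echoes the prompt word by word.

```python
from typing import Callable, Iterator, List, Optional, Sequence

Backend = Callable[[str, List[str]], Iterator[str]]


class ConnectionLost(Exception):
    """Hypothetical error raised when a stream from a Backend breaks."""


def make_backend(fail_after: Optional[int] = None) -> Backend:
    """Toy Backend: echoes the prompt's words, resuming after `already_emitted`."""

    def backend(prompt: str, already_emitted: List[str]) -> Iterator[str]:
        remaining = prompt.split()[len(already_emitted):]
        for n, word in enumerate(remaining):
            if fail_after is not None and n >= fail_after:
                raise ConnectionLost("link dropped mid-stream")
            yield word

    return backend


def stream_with_migration(
    backends: Sequence[Backend], prompt: str, migration_limit: int
) -> Iterator[str]:
    """Yield one continuous stream, migrating at most `migration_limit` times."""
    pool = iter(backends)
    backend = next(pool, None)
    if backend is None:
        raise ValueError("at least one backend is required")
    emitted: List[str] = []
    migrations = 0
    while True:
        try:
            # Pass a copy of what the client has already received.
            for token in backend(prompt, list(emitted)):
                emitted.append(token)
                yield token
            return
        except ConnectionLost:
            replacement = next(pool, None)
            if migrations >= migration_limit or replacement is None:
                raise
            backend = replacement
            migrations += 1


if __name__ == "__main__":
    flaky, stable = make_backend(fail_after=2), make_backend()
    print(list(stream_with_migration([flaky, stable], "one two three four", 1)))
```

The client sees a single uninterrupted stream; the hand-off between the flaky and the stable Backend happens inside the generator.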
components/backends/vllm/README.md: 16 additions & 0 deletions
@@ -186,3 +186,19 @@ vLLM workers are configured through command-line arguments. Key parameters inclu
See `args.py` for the full list of configuration options and their defaults.

The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend automatically tracks which Backends have connectivity issues and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request may be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.vllm ... --migration-limit=3
```
indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, should the Frontend detect a connectivity issue with the current Backend.

The migrated request continues responding to the original request, which allows a seamless transition between Backends and reduces the overall request failure rate at the Frontend, improving the user experience.
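
To get a feel for why a migration limit lowers the failure rate, consider a simplified model that is an assumption of this example, not a measurement: every attempt on a Backend fails independently with the same probability. A request then fails only if the first attempt and all allowed migrations fail. The function name and the model are hypothetical.

```python
def request_failure_probability(p_fail: float, migration_limit: int) -> float:
    """Failure probability under the independent-attempts assumption.

    A request gets 1 initial attempt plus `migration_limit` migrations,
    and fails only if every one of those attempts fails.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if migration_limit < 0:
        raise ValueError("migration_limit must be non-negative")
    return p_fail ** (migration_limit + 1)


if __name__ == "__main__":
    for limit in (0, 1, 3):
        print(limit, request_failure_probability(0.1, limit))
```

In practice connectivity failures are often correlated (for example, a network partition affecting several Backends), so real improvements are likely to be smaller than this model suggests.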