Commit 88f9726

docs: Add --migration-limit to vllm trtllm sglang llama_cpp README
1 parent aecf4c3 commit 88f9726

4 files changed: +63 -0 lines changed

components/backends/llama_cpp/README.md

Lines changed: 15 additions & 0 deletions
@@ -4,3 +4,18 @@ Usage:
 - `pip install -r requirements.txt` # Need a recent pip, `uv pip` might be too old.
 - `python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]`

+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.
+
+For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.
+
+For example,
+```bash
+python3 -m dynamo.llama_cpp ... --migration-limit=3
+```
+indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, if the Frontend detects a connectivity issue with the current Backend.
+
+The migrated request continues responding to the original request, allowing a seamless transition between Backends and a lower overall request failure rate at the Frontend.
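
The migration behaviour described in the Request Migration text can be illustrated with a minimal sketch. This is not Dynamo's implementation: `stream_with_migration`, the `Backend` callable type, and the two toy backends are hypothetical names invented for illustration. The sketch assumes a backend can resume a response when given the tokens generated so far, and that a lost connection surfaces as `ConnectionError`.

```python
from typing import Callable, Iterator, List

# Hypothetical type: a backend takes the prompt plus the tokens generated so
# far and yields further tokens; it may raise ConnectionError mid-stream.
Backend = Callable[[str, List[str]], Iterator[str]]


def stream_with_migration(
    prompt: str, backends: List[Backend], migration_limit: int
) -> List[str]:
    """Collect tokens, moving to another backend when the connection is lost."""
    generated: List[str] = []
    migrations = 0
    index = 0
    while True:
        backend = backends[index % len(backends)]
        try:
            for token in backend(prompt, generated):
                generated.append(token)
            return generated  # the stream finished normally
        except ConnectionError:
            if migrations >= migration_limit:
                raise  # limit exhausted: fail the request
            migrations += 1
            index += 1  # try the next backend, resuming from `generated`


def flaky_backend(prompt: str, so_far: List[str]) -> Iterator[str]:
    # Illustrative backend that drops the connection after one token.
    yield f"tok{len(so_far)}"
    raise ConnectionError("lost connection to backend")


def healthy_backend(prompt: str, so_far: List[str]) -> Iterator[str]:
    # Illustrative backend that completes a 4-token response.
    for i in range(len(so_far), 4):
        yield f"tok{i}"
```

With a limit of 3, `stream_with_migration("hi", [flaky_backend, healthy_backend], 3)` migrates once and returns the full response; with a limit of 0 the same call re-raises the `ConnectionError`, which corresponds to failing the request.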

components/backends/sglang/README.md

Lines changed: 16 additions & 0 deletions
@@ -139,6 +139,22 @@ cd $DYNAMO_ROOT/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```

+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.
+
+For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.
+
+For example,
+```bash
+python3 -m dynamo.sglang ... --migration-limit=3
+```
+indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, if the Frontend detects a connectivity issue with the current Backend.
+
+The migrated request continues responding to the original request, allowing a seamless transition between Backends and a lower overall request failure rate at the Frontend.
+
 ## Advanced Examples

 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

components/backends/trtllm/README.md

Lines changed: 16 additions & 0 deletions
@@ -205,6 +205,22 @@ DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh

 Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).

+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.
+
+For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.
+
+For example,
+```bash
+python3 -m dynamo.trtllm ... --migration-limit=3
+```
+indicates that a request to this model may be migrated to another Backend up to 3 times before the request fails, if the Frontend detects a connectivity issue with the current Backend.
+
+The migrated request continues responding to the original request, allowing a seamless transition between Backends and a lower overall request failure rate at the Frontend.
+
 ## More Example Architectures

 - [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)

components/backends/vllm/README.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -170,3 +170,19 @@ vLLM workers are configured through command-line arguments. Key parameters inclu
170170
See `args.py` for the full list of configuration options and their defaults.
171171

172172
The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
173+
174+
## Request Migration
175+
176+
In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
177+
178+
The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
179+
180+
For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
181+
182+
For example,
183+
```bash
184+
python3 -m dynamo.vllm ... --migration-limit=3
185+
```
186+
indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
187+
188+
The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
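
The other half of the Request Migration text, where the Frontend avoids routing new requests to Backends with known connectivity issues, can be sketched in the same spirit. The `Router` class and its method names below are hypothetical and chosen only for illustration; the sketch assumes simple round-robin selection, which the source does not specify.

```python
import itertools
from typing import Iterable, List, Set


class Router:
    """Round-robin over backends, skipping those marked as unreachable."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        self._unreachable: Set[str] = set()
        self._cycle = itertools.cycle(self._backends)

    def mark_unreachable(self, backend: str) -> None:
        # Called when a connectivity issue to `backend` is detected.
        self._unreachable.add(backend)

    def mark_reachable(self, backend: str) -> None:
        # Called when `backend` is known to be healthy again.
        self._unreachable.discard(backend)

    def pick(self) -> str:
        # Look at each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate not in self._unreachable:
                return candidate
        raise RuntimeError("no reachable backend")
```

For instance, after `mark_unreachable("b")` on a router over `["a", "b", "c"]`, successive `pick()` calls alternate between `"a"` and `"c"` until `"b"` is marked reachable again.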
