docs: Add --migration-limit to vllm trtllm sglang llama_cpp README

kthui · kthui · commit 88f9726f2879 · 2025-07-28T15:04:58.000-07:00
diff --git a/components/backends/llama_cpp/README.md b/components/backends/llama_cpp/README.md
@@ -4,3 +4,18 @@ Usage:
 - `pip install -r requirements.txt` # Need a recent pip, `uv pip` might be too old.
 - `python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]`
 
+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
+
+For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
+
+For example,
+```bash
+python3 -m dynamo.llama_cpp ... --migration-limit=3
+```
+indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
+
+The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md
@@ -139,6 +139,22 @@ cd $DYNAMO_ROOT/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```
 
+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
+
+For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
+
+For example,
+```bash
+python3 -m dynamo.sglang ... --migration-limit=3
+```
+indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
+
+The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
+
 ## Advanced Examples
 
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
@@ -205,6 +205,22 @@ DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
 
 Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).
 
+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
+
+For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
+
+For example,
+```bash
+python3 -m dynamo.trtllm ... --migration-limit=3
+```
+indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
+
+The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
+
 ## More Example Architectures
 
 - [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md
@@ -170,3 +170,19 @@ vLLM workers are configured through command-line arguments. Key parameters inclu
 See `args.py` for the full list of configuration options and their defaults.
 
 The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
+
+## Request Migration
+
+In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+
+The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
+
+For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.
+
+For example,
+```bash
+python3 -m dynamo.vllm ... --migration-limit=3
+```
+indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detects a connectivity issue to the current Backend.
+
+The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.