# Creating Kubernetes Deployments

The scripts in the `components/<backend>/launch` folder, such as [agg.sh](../../../components/backends/vllm/launch/agg.sh), demonstrate how to serve your models locally.
The corresponding YAML files, such as [agg.yaml](../../../components/backends/vllm/deploy/agg.yaml), show how to create a Kubernetes deployment for your inference graph.

This guide explains how to create your own deployment files.

## Step 1: Choose Your Architecture Pattern

Select the architecture pattern that best fits your use case as your template.

For example, when using the vLLM inference backend:

- **Development / Testing**
  Use [`agg.yaml`](../../../components/backends/vllm/deploy/agg.yaml) as the base configuration.

- **Production with Load Balancing**
  Use [`agg_router.yaml`](../../../components/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.

- **High Performance / Disaggregated Deployment**
  Use [`disagg_router.yaml`](../../../components/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.

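Whichever pattern you start from, the deploy files share the same overall shape: a custom resource whose spec lists the services to run. As a rough sketch (at the time of writing the linked examples use a `DynamoGraphDeployment` custom resource; copy the exact `apiVersion`, `kind`, and service names from the example you chose):

```yaml
apiVersion: nvidia.com/v1alpha1 # assumption: take this from the linked examples
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      # HTTP entry point (see Step 2)
    YourWorker:
      # backend-specific worker (see Step 2)
```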

## Step 2: Customize the Template

You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.

It serves the following roles:
1. OpenAI-Compatible HTTP Server
   * Provides the `/v1/chat/completions` endpoint
   * Handles HTTP request/response formatting
   * Supports streaming responses
   * Validates incoming requests

2. Service Discovery and Routing
   * Auto-discovers backend workers via etcd
   * Routes requests to the appropriate Processor/Worker components
   * Handles load balancing between multiple workers

3. Request Preprocessing
   * Initial request validation
   * Model name verification
   * Request format standardization

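A minimal Frontend service entry might look like the following sketch, assuming it follows the same spec shape as the worker template shown below (`your-namespace` and `your-image` are placeholders, and the `--http-port` flag is optional):

```yaml
  Frontend:
    dynamoNamespace: your-namespace
    componentType: frontend
    replicas: 1
    extraPodSpec:
      mainContainer:
        image: your-image
        command:
          - /bin/sh
          - -c
        args:
          - python3 -m dynamo.frontend --http-port 8000
```
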
You should then pick a worker and specialize its config. For example:

```yaml
VllmWorker:   # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true

SglangWorker: # SGLang-specific config
  router-mode: kv
  disagg-mode: true

TrtllmWorker: # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```

Here's a template structure based on the examples:

```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  replicas: N
  envFromSecret: your-secrets # e.g., hf-token-secret
  # Health checks for worker initialization
  readinessProbe:
    exec:
      command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
  resources:
    requests:
      gpu: "1" # GPU allocation
  extraPodSpec:
    mainContainer:
      image: your-image
      command:
        - /bin/sh
        - -c
      args:
        - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
```

Consult the corresponding `.sh` launch file. Each of the Python commands used to launch a component goes into your YAML spec under
`extraPodSpec` -> `mainContainer` -> `args`.

The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker launches a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command.

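For instance, translating the worker line of a launch script such as `agg.sh` into the spec might look like the following sketch (the flags here are illustrative placeholders, not taken from the actual script):

```yaml
  VllmWorker:
    extraPodSpec:
      mainContainer:
        command:
          - /bin/sh
          - -c
        args:
          # assumption: your launch script starts the worker with these flags
          - python -m dynamo.vllm --model YOUR_MODEL --enforce-eager
```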

## Step 3: Key Customization Points

### Model Configuration

```yaml
  args:
    - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
```

### Resource Allocation

```yaml
  resources:
    requests:
      cpu: "N"
      memory: "NGi"
      gpu: "N"
```

### Scaling

```yaml
  replicas: N # Number of worker instances
```
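
In a disaggregated deployment, prefill and decode workers are separate services, so each can be scaled independently. A hypothetical sketch (the service names depend on your deploy file):

```yaml
  VllmPrefillWorker:
    replicas: 1 # scale prefill capacity
  VllmDecodeWorker:
    replicas: 2 # scale decode capacity separately
```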

### Routing Mode
```yaml
  args:
    - --router-mode
    - kv # Enable KV-cache routing
```
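
Note that `--router-mode` is a flag on the frontend launch command (see Step 2). With the `/bin/sh -c` launch style used in the template above, it would be appended to the frontend's single-string args instead:

```yaml
  Frontend:
    extraPodSpec:
      mainContainer:
        args:
          - python3 -m dynamo.frontend --http-port 8000 --router-mode kv
```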

### Worker Specialization

```yaml
  args:
    - --is-prefill-worker # For disaggregated prefill workers
```