feat: example routing between different models using inference gateway #1981
Conversation
Walkthrough

The changes introduce a new example for serving two models with an inference gateway, using Kubernetes resources and documentation. New YAML configuration files and a README are added to demonstrate deploying, configuring, and interacting with two models. Existing instructions for minikube gateway testing are clarified for better usability.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Gateway
    participant GemmaWorker
    participant QwenWorker
    User->>Gateway: REST API request (specifies model)
    alt Model = gemma-3-1b-it
        Gateway->>GemmaWorker: Forward inference request
        GemmaWorker-->>Gateway: Model response
    else Model = Qwen3-0.6B
        Gateway->>QwenWorker: Forward inference request
        QwenWorker-->>Gateway: Model response
    end
    Gateway-->>User: Return inference result
```
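Since routing is keyed off the model name in the request body, a quick illustration of the two paths (a hedged sketch: the endpoint path and payload follow the OpenAI chat-completions convention, which the gateway is assumed to expose; verify against the example README):

```bash
# Request routed to the Gemma worker (the model field selects the backend).
curl "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3-1b-it", "messages": [{"role": "user", "content": "Hello"}]}'

# Same request shape; changing only the model field routes to the Qwen worker.
curl "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'
```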
Actionable comments posted: 3
♻️ Duplicate comments (1)
deploy/inference-gateway/example/two_models/two_models.yaml (1)
136-140: Same shell issue for Qwen worker (see Gemma).
🧹 Nitpick comments (4)
deploy/inference-gateway/example/README.md (1)
141-142: Mention privilege requirement for `minikube tunnel`.
`minikube tunnel` often prompts for sudo/root privileges to create the network route. A quick note avoids user confusion when the command fails silently. Example:

```bash
# start minikube tunnel (may require sudo)
sudo minikube tunnel
```

deploy/inference-gateway/example/two_models/README.md (2)
1-4: Fix typo and kubectl flag misuse.
- “Gatewaty” → “Gateway”.
`kubectl` can only take one `-o`; use `-o jsonpath=…` directly.

```diff
-Get Gatewaty URL
+Get Gateway URL

-GATEWAY_URL=$(kubectl get svc inference-gateway -o yaml -o jsonpath='{.spec.clusterIP}')
+GATEWAY_URL=$(kubectl get svc inference-gateway -o=jsonpath='{.spec.clusterIP}')
```
35-48: Cluster-IP alone is usually unreachable from your shell.
`curl $GATEWAY_URL …` will work only inside the cluster network.
For local testing consider:

```bash
kubectl port-forward svc/inference-gateway 8000:8000
export GATEWAY_URL=http://localhost:8000
```

Add a note so users are not blocked.
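If the gateway exposes an OpenAI-style model listing (an assumption; adjust the path to whatever the gateway actually serves), a quick reachability check through the forward might look like:

```bash
# Assumes the port-forward above is running; /v1/models is hypothetical,
# so substitute the real path if the gateway serves something else.
curl -s http://localhost:8000/v1/models
```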
deploy/inference-gateway/example/two_models/inference-model-qwen.yaml (1)
15-26: Mismatch between pool & model, plus missing newline.
Same pool mismatch as Gemma – update or verify.
YAML-lint also reports a missing trailing newline.

```diff
-    name: dynamo-deepseek
+    name: dynamo-qwen   # or the correct pool
+
+# (add a blank line at EOF)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deploy/inference-gateway/example/README.md (1 hunks)
- deploy/inference-gateway/example/two_models/README.md (1 hunks)
- deploy/inference-gateway/example/two_models/inference-model-gemma.yaml (1 hunks)
- deploy/inference-gateway/example/two_models/inference-model-qwen.yaml (1 hunks)
- deploy/inference-gateway/example/two_models/two_models.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/inference-gateway/example/two_models/inference-model-qwen.yaml
[error] 26-26: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
deploy/inference-gateway/example/two_models/inference-model-gemma.yaml (1)
18-26: Pool name no longer matches model – verify or rename.
`metadata.name` and `modelName` were switched to Gemma, but `poolRef.name` is still `dynamo-deepseek`.
If the pool actually points at a DeepSeek-specific worker, routing Gemma traffic there will fail.

```diff
-    name: dynamo-deepseek
+    # e.g.
+    name: dynamo-gemma
```

Please double-check the pool or update the reference to avoid mis-routing.
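For orientation, a minimal sketch of a consistent model-to-pool binding, assuming the Gateway API Inference Extension's InferenceModel CRD (the apiVersion and pool name shown here are illustrative, not the example's actual values):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # version may differ by release
kind: InferenceModel
metadata:
  name: gemma-3-1b-it
spec:
  modelName: gemma-3-1b-it      # the model name clients put in requests
  poolRef:
    name: dynamo-gemma          # hypothetical pool fronting the Gemma worker
```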
```bash
GATEWAY_URL=$(kubectl get svc inference-gateway -o yaml -o jsonpath='{.spec.clusterIP}')
echo $GATEWAY_URL
```
kubectl get svc command is broken and returns an unusable IP
kubectl only accepts one -o flag, so -o yaml -o jsonpath=... fails.
Even if the command ran, `.spec.clusterIP` yields an internal-only address that is not reachable from the host. For a LoadBalancer service via `minikube tunnel` you need the external IP (`.status.loadBalancer.ingress[0].ip`) or simply use `minikube service … --url`.
Suggested fix:
```diff
-# in a separate terminal
-GATEWAY_URL=$(kubectl get svc inference-gateway -o yaml -o jsonpath='{.spec.clusterIP}')
+# in a separate terminal
+# grab the external IP assigned by the tunnel
+GATEWAY_URL=$(kubectl get svc inference-gateway \
+  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
 echo $GATEWAY_URL
```

or, for simplicity:

```bash
GATEWAY_URL=$(minikube service inference-gateway --url | head -n1)
```

🤖 Prompt for AI Agents
In deploy/inference-gateway/example/README.md around lines 145 to 146, the
kubectl command uses multiple -o flags which is invalid and retrieves an
internal cluster IP that is not accessible externally. Replace the command with
one that fetches the external IP from .status.loadBalancer.ingress[0].ip or,
more simply, use the minikube service inference-gateway --url command to get a
reachable gateway URL.
```yaml
mainContainer:
  image: gitlab-master.nvidia.com:5005/aire/microservices/compoundai/dynamo:1c03404f2624186523529b8d4ca04731b60aa8b9-31776852-vllm_v1-amd64
  workingDir: /workspace/examples/vllm
  args:
    - "python3 components/main.py --model google/gemma-3-1b-it --enforce-eager --endpoint dyn://dynamo.gemma.generate 2>&1 | tee /tmp/vllm.log"
QwenDecodeWorker:
  envFromSecret: hf-token-secret
```
Worker args executed without shell – redirection & pipes break.
The whole string is one argv element; 2>&1 | tee … will never be interpreted.
Either wrap with an explicit shell:
```yaml
command:
  - /bin/sh
  - -c
args:
  - |
    python3 components/main.py --model google/gemma-3-1b-it \
      --enforce-eager --endpoint dyn://dynamo.gemma.generate 2>&1 | tee /tmp/vllm.log
```

or drop the redirection/pipe.
🤖 Prompt for AI Agents
In deploy/inference-gateway/example/two_models/two_models.yaml around lines 97
to 103, the args for mainContainer include shell redirection and piping as a
single argument, which won't work because the command is executed without a
shell. To fix this, replace the args with a command array that runs /bin/sh with
the -c option, and pass the entire python command with redirection and pipe as a
single string argument to the shell. This ensures the shell interprets the
redirection and piping correctly.
```yaml
exec:
  command:
    - /bin/sh
    - -c
    - "exit 0"
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30
failureThreshold: 10
```
🛠️ Refactor suggestion
Readiness probe always succeeds – loses rollout safety.
command: ["sh","-c","exit 0"] marks every pod ready even when the app is down.
Expose a real health endpoint or remove the probe to let liveness handle restarts.
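A minimal sketch of what a real probe could look like, assuming the worker exposes an HTTP health endpoint (the path and port below are hypothetical; check what the container actually serves):

```yaml
readinessProbe:
  httpGet:
    path: /health    # hypothetical health endpoint
    port: 8000       # hypothetical serving port
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 10
```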
🤖 Prompt for AI Agents
In deploy/inference-gateway/example/two_models/two_models.yaml around lines 31
to 39, the readiness probe uses a command that always exits with 0, causing the
pod to be marked ready even if the app is down. To fix this, replace the command
with a real health check that verifies the application's readiness, such as an
HTTP GET to a health endpoint, or remove the readiness probe entirely so that
the liveness probe manages pod restarts.
To test the gateway in minikube, use the following command:

```bash
minikube tunnel &
# start minikube tunnel
```
line 35 has obsolete data:

```bash
export DEPLOYMENT_NAME=llm-agg1
yq eval '
  .metadata.name = env(DEPLOYMENT_NAME) |
  .spec.services[].extraPodSpec.mainContainer.image = env(VLLM_RUNTIME_IMAGE)
' examples/vllm_v0/deploy/agg.yaml > examples/vllm_v0/deploy/agg1.yaml
```
```yaml
extraPodSpec:
  mainContainer:
    image: gitlab-master.nvidia.com:5005/aire/microservices/compoundai/dynamo:1c03404f2624186523529b8d4ca04731b60aa8b9-31776852-vllm_v1-amd64
    workingDir: /workspace/examples/vllm
```
also obsolete paths and "dynamo run"
Overview:
example routing between different models using inference gateway
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)