feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart #2755
Merged
10 commits
- f4420dd feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 3831d59 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 7c65724 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 80fc6bf feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- e266f67 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 1b2fc62 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 13f2454 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 52697d2 feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- b537efb feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
- 7108b6f feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (julienmancuso)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| <!-- | ||
| SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| # dynamo-platform | ||
|
|
||
| A Helm chart for NVIDIA Dynamo Platform. | ||
|
|
||
|   | ||
|
|
||
| ## 🚀 Overview | ||
|
|
||
| The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including: | ||
|
|
||
| - **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments | ||
| - **NATS**: High-performance messaging system for component communication | ||
| - **etcd**: Distributed key-value store for operator state management | ||
| - **Grove**: Multi-node inference orchestration (optional) | ||
| - **Kai Scheduler**: Advanced workload scheduling (optional) | ||
|
|
||
| ## 📋 Prerequisites | ||
|
|
||
| - Kubernetes cluster (v1.20+) | ||
| - Helm 3.8+ | ||
| - Sufficient cluster resources for your deployment scale | ||
| - Container registry access (if using private images) | ||
|
|
||
| ## 🔧 Configuration | ||
|
|
||
| ## Requirements | ||
|
|
||
| | Repository | Name | Version | | ||
| |------------|------|---------| | ||
| | file://components/operator | dynamo-operator | 0.5.0 | | ||
| | https://charts.bitnami.com/bitnami | etcd | 11.1.0 | | ||
| | https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 | | ||
| | oci://ghcr.io/nvidia/grove | grove(grove-charts) | v0.0.0-6e30275 | | ||
| | oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.8.1 | | ||
|
|
||
| ## Values | ||
|
|
||
| | Key | Type | Default | Description | | ||
| |-----|------|---------|-------------| | ||
| | dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment | | ||
| | dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" | | ||
| | dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" | | ||
| | dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces | | ||
| | dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) | | ||
| | dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods | | ||
| | dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository | | ||
| | dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) | | ||
| | dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image | | ||
| | dynamo-operator.controllerManager.manager.args[0] | string | `"--health-probe-bind-address=:8081"` | Health probe endpoint for Kubernetes health checks | | ||
| | dynamo-operator.controllerManager.manager.args[1] | string | `"--metrics-bind-address=127.0.0.1:8080"` | Metrics endpoint for Prometheus scraping (localhost only for security) | | ||
| | dynamo-operator.imagePullSecrets | list | `[]` | Secrets for pulling private container images | | ||
| | dynamo-operator.dynamo.groveTerminationDelay | string | `"15m"` | How long to wait before forcefully terminating Grove instances | | ||
| | dynamo-operator.dynamo.internalImages.debugger | string | `"python:3.12-slim"` | Debugger image for troubleshooting deployments | | ||
| | dynamo-operator.dynamo.enableRestrictedSecurityContext | bool | `false` | Whether to enable restricted security contexts for enhanced security | | ||
| | dynamo-operator.dynamo.dockerRegistry.useKubernetesSecret | bool | `false` | Whether to use Kubernetes secrets for registry authentication | | ||
| | dynamo-operator.dynamo.dockerRegistry.server | string | `nil` | Docker registry server URL | | ||
| | dynamo-operator.dynamo.dockerRegistry.username | string | `nil` | Registry username | | ||
| | dynamo-operator.dynamo.dockerRegistry.password | string | `nil` | Registry password (consider using existingSecretName instead) | | ||
| | dynamo-operator.dynamo.dockerRegistry.existingSecretName | string | `nil` | Name of existing Kubernetes secret containing registry credentials | | ||
| | dynamo-operator.dynamo.dockerRegistry.secure | bool | `true` | Whether the registry uses HTTPS | | ||
| | dynamo-operator.dynamo.ingress.enabled | bool | `false` | Whether to create ingress resources | | ||
| | dynamo-operator.dynamo.ingress.className | string | `nil` | Ingress class name (e.g., "nginx", "traefik") | | ||
| | dynamo-operator.dynamo.ingress.tlsSecretName | string | `"my-tls-secret"` | Secret name containing TLS certificates | | ||
| | dynamo-operator.dynamo.istio.enabled | bool | `false` | Whether to enable Istio integration | | ||
| | dynamo-operator.dynamo.istio.gateway | string | `nil` | Istio gateway name for routing | | ||
| | dynamo-operator.dynamo.ingressHostSuffix | string | `""` | Host suffix for generated ingress hostnames | | ||
| | dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing | | ||
| | grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination. If enabled, the Grove operator is deployed cluster-wide. | | ||
| | kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation. If enabled, the Kai Scheduler operator is deployed cluster-wide. | | ||
| | etcd.enabled | bool | `true` | Whether to deploy the bundled etcd chart. Disable it to use an external etcd instance. | | ||
| | nats.enabled | bool | `true` | Whether to deploy the bundled NATS chart. Disable it to use an external NATS instance. | | ||
|
|
||
| ### NATS Configuration | ||
|
|
||
| For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation: | ||
| **[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)** | ||
|
|
||
| ### etcd Configuration | ||
|
|
||
| For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation: | ||
| **[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)** | ||
|
|
||
| ## 📚 Additional Resources | ||
|
|
||
| - [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) | ||
| - [NATS Documentation](https://docs.nats.io/) | ||
| - [etcd Documentation](https://etcd.io/docs/) | ||
| - [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) | ||
|
|
||
| ---------------------------------------------- | ||
| Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2) |
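
As a minimal sketch of how the values in this README combine, the override file below enables the optional Grove and Kai Scheduler subcharts and points the operator at externally managed NATS and etcd instead of the bundled charts. The keys are taken from the values table; the file name `my-values.yaml` and both hostnames are hypothetical placeholders, not values from this PR.

```yaml
# my-values.yaml (hypothetical file name)
grove:
  enabled: true        # deploys the Grove operator cluster-wide
kai-scheduler:
  enabled: true        # deploys the Kai Scheduler operator cluster-wide
nats:
  enabled: false       # skip the bundled NATS chart
etcd:
  enabled: false       # skip the bundled etcd chart
dynamo-operator:
  # Hypothetical external endpoints; replace with real addresses.
  natsAddr: "nats://nats.example.internal:4222"
  etcdAddr: "http://etcd.example.internal:2379"
```

A file like this would be passed to `helm install` or `helm upgrade` with the `-f` flag. Leaving `natsAddr` and `etcdAddr` empty while keeping `nats.enabled` and `etcd.enabled` at `true` falls back to the bundled charts, as the table describes.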
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| <!-- | ||
| SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| {{ template "chart.header" . }} | ||
|
|
||
| {{ template "chart.description" . }} | ||
|
|
||
| {{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }} | ||
|
|
||
| ## 🚀 Overview | ||
|
|
||
| The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including: | ||
|
|
||
| - **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments | ||
| - **NATS**: High-performance messaging system for component communication | ||
| - **etcd**: Distributed key-value store for operator state management | ||
| - **Grove**: Multi-node inference orchestration (optional) | ||
| - **Kai Scheduler**: Advanced workload scheduling (optional) | ||
|
|
||
| ## 📋 Prerequisites | ||
|
|
||
| - Kubernetes cluster (v1.20+) | ||
| - Helm 3.8+ | ||
| - Sufficient cluster resources for your deployment scale | ||
| - Container registry access (if using private images) | ||
|
|
||
| ## 🔧 Configuration | ||
|
|
||
| {{ template "chart.requirementsSection" . }} | ||
|
|
||
| {{ template "chart.valuesSection" . }} | ||
|
|
||
| ### NATS Configuration | ||
|
|
||
| For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation: | ||
| **[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)** | ||
|
|
||
| ### etcd Configuration | ||
|
|
||
| For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation: | ||
| **[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)** | ||
|
|
||
|
|
||
| ## 📚 Additional Resources | ||
|
|
||
| - [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) | ||
| - [NATS Documentation](https://docs.nats.io/) | ||
| - [etcd Documentation](https://etcd.io/docs/) | ||
| - [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) | ||
|
|
||
| {{ template "helm-docs.versionFooter" . }} | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| --- | ||
| {{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }} | ||
|
|
||
| {{- /* Create parent queue first */ -}} | ||
| {{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }} | ||
| {{- if not $defaultQueue }} | ||
| --- | ||
| apiVersion: scheduling.run.ai/v2 | ||
| kind: Queue | ||
| metadata: | ||
|   name: dynamo-default | ||
|   annotations: | ||
|     "helm.sh/hook": post-install,post-upgrade | ||
|     "helm.sh/hook-weight": "100" | ||
|     "helm.sh/hook-delete-policy": before-hook-creation | ||
| spec: | ||
|   resources: | ||
|     cpu: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
|     gpu: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
|     memory: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
| {{- end }} | ||
|
|
||
| {{- /* Create child queue second */ -}} | ||
| {{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }} | ||
| {{- if not $dynamoQueue }} | ||
| --- | ||
| apiVersion: scheduling.run.ai/v2 | ||
| kind: Queue | ||
| metadata: | ||
|   name: dynamo | ||
|   annotations: | ||
|     "helm.sh/hook": post-install,post-upgrade | ||
|     "helm.sh/hook-weight": "110" | ||
|     "helm.sh/hook-delete-policy": before-hook-creation | ||
| spec: | ||
|   parentQueue: dynamo-default | ||
|   resources: | ||
|     cpu: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
|     gpu: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
|     memory: | ||
|       quota: -1 | ||
|       limit: -1 | ||
|       overQuotaWeight: 1 | ||
| {{- end }} | ||
|
|
||
| {{- end }} |
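
For context, a workload is usually pointed at one of these queues through a pod label. The sketch below assumes the convention documented upstream for recent KAI Scheduler releases: the `kai.scheduler/queue` label and the `kai-scheduler` scheduler name. Neither appears in this diff, so both should be checked against the installed scheduler version. The pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: queue-demo                  # hypothetical name
  labels:
    kai.scheduler/queue: dynamo     # assumption: queue label key used by recent KAI Scheduler releases
spec:
  schedulerName: kai-scheduler      # assumption: scheduler name registered by the KAI chart
  containers:
    - name: main
      image: busybox:1.36           # hypothetical image
      command: ["sleep", "3600"]
```

Because `dynamo` names `dynamo-default` as its `parentQueue`, the template gives the parent the lower hook weight (`100` versus `110`) so that it is created first.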