diff --git a/deploy/cloud/helm/platform/README.md b/deploy/cloud/helm/platform/README.md index 0712953185..b3f794553c 100644 --- a/deploy/cloud/helm/platform/README.md +++ b/deploy/cloud/helm/platform/README.md @@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure - Sufficient cluster resources for your deployment scale - Container registry access (if using private images) +## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment + +### Single Cluster-Wide Operator (Recommended) + +**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.** + +- ✅ **Recommended**: Deploy one cluster-wide operator per cluster +- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster + +### Multiple Namespace-Scoped Operators (Advanced) + +If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment: + +```yaml +# values.yaml +dynamo-operator: + namespaceRestriction: + enabled: true + targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace +``` + +### Validation and Safety + +The chart includes built-in validation to prevent all operator conflicts: + +- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation +- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists +- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists +- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist +- **Safe Defaults**: Leader election uses shared ID for proper coordination + +#### 🚫 **Blocked Conflict Scenarios** + +| Existing Operator | New Operator | Status | Reason | +|-------------------|--------------|---------|--------| +| None | Cluster-wide | ✅ **Allowed** | No conflicts | +| None | Namespace-restricted | ✅ **Allowed** | No conflicts | +| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers | +| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace | +| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators | +| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes | + ## 🔧 Configuration ## Requirements @@ -58,11 +100,13 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure | dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" | | dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" | | dynamo-operator.modelExpressURL | string | `""` | URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true). 
| -| dynamo-operator.namespaceRestriction | object | `{"enabled":true,"targetNamespace":null}` | Namespace access controls for the operator | -| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces | +| dynamo-operator.namespaceRestriction | object | `{"enabled":false,"targetNamespace":null}` | Namespace access controls for the operator | +| dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). | | dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) | | dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods | | dynamo-operator.controllerManager.affinity | list | `[]` | Affinity for controller manager pods | +| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. | +| dynamo-operator.controllerManager.leaderElection.namespace | string | `""` | Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election. | | dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository | | dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) | | dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image | diff --git a/deploy/cloud/helm/platform/README.md.gotmpl b/deploy/cloud/helm/platform/README.md.gotmpl index 93e69facf3..d8b0e66af6 100644 --- a/deploy/cloud/helm/platform/README.md.gotmpl +++ b/deploy/cloud/helm/platform/README.md.gotmpl @@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure - Sufficient cluster resources for your deployment scale - Container registry access (if using private images) +## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment + +### Single Cluster-Wide Operator (Recommended) + +**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.** + +- ✅ **Recommended**: Deploy one cluster-wide operator per cluster +- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster + +### Multiple Namespace-Scoped Operators (Advanced) + +If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment: + +```yaml +# values.yaml +dynamo-operator: + namespaceRestriction: + enabled: true + targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace +``` + +### Validation and Safety + +The chart includes built-in validation to prevent all operator conflicts: + +- **Automatic Detection**: Scans for existing operators (both cluster-wide and 
namespace-restricted) during installation +- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists +- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists +- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist +- **Safe Defaults**: Leader election uses shared ID for proper coordination + +#### 🚫 **Blocked Conflict Scenarios** + +| Existing Operator | New Operator | Status | Reason | +|-------------------|--------------|---------|--------| +| None | Cluster-wide | ✅ **Allowed** | No conflicts | +| None | Namespace-restricted | ✅ **Allowed** | No conflicts | +| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers | +| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace | +| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators | +| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes | + ## 🔧 Configuration {{ template "chart.requirementsSection" . }} diff --git a/deploy/cloud/helm/platform/components/operator/templates/_validation.tpl b/deploy/cloud/helm/platform/components/operator/templates/_validation.tpl new file mode 100644 index 0000000000..0389d233ac --- /dev/null +++ b/deploy/cloud/helm/platform/components/operator/templates/_validation.tpl @@ -0,0 +1,125 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{/* +Validation to prevent operator conflicts +Prevents all conflict scenarios: +1. Multiple cluster-wide operators (multiple cluster managers) +2. Namespace-restricted operator when cluster-wide exists (both would manage same resources) +3. Cluster-wide operator when namespace-restricted exist (both would manage same resources) +*/}} +{{- define "dynamo-operator.validateClusterWideInstallation" -}} +{{- $currentReleaseName := .Release.Name -}} + +{{/* Check for existing namespace-restricted operators (only when installing cluster-wide) */}} +{{- if not .Values.namespaceRestriction.enabled -}} + {{- $allRoles := lookup "rbac.authorization.k8s.io/v1" "Role" "" "" -}} + {{- $namespaceRestrictedOperators := list -}} + + {{- if $allRoles -}} + {{- range $role := $allRoles.items -}} + {{- if and (contains "-dynamo-operator-" $role.metadata.name) (hasSuffix "-manager-role" $role.metadata.name) -}} + {{- $namespaceRestrictedOperators = append $namespaceRestrictedOperators $role.metadata.namespace -}} + {{- end -}} + {{- end -}} + {{- end -}} + + {{- if $namespaceRestrictedOperators -}} + {{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. 
This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}} + {{- end -}} +{{- end -}} + +{{/* Check for existing ClusterRoles that would indicate other cluster-wide installations */}} +{{- $existingClusterRoles := lookup "rbac.authorization.k8s.io/v1" "ClusterRole" "" "" -}} +{{- $foundExistingClusterWideOperator := false -}} +{{- $existingOperatorRelease := "" -}} +{{- $existingOperatorRoleName := "" -}} +{{- $existingOperatorNamespace := "" -}} + +{{- if $existingClusterRoles -}} + {{- range $cr := $existingClusterRoles.items -}} + {{- if and (contains "-dynamo-operator-" $cr.metadata.name) (hasSuffix "-manager-role" $cr.metadata.name) -}} + {{- $currentRoleName := printf "%s-dynamo-operator-manager-role" $currentReleaseName -}} + {{- if ne $cr.metadata.name $currentRoleName -}} + {{- $foundExistingClusterWideOperator = true -}} + {{- $existingOperatorRoleName = $cr.metadata.name -}} + {{- if $cr.metadata.labels -}} + {{- if $cr.metadata.labels.release -}} + {{- $existingOperatorRelease = $cr.metadata.labels.release -}} + {{- else if index $cr.metadata.labels "app.kubernetes.io/instance" -}} + {{- $existingOperatorRelease = index $cr.metadata.labels "app.kubernetes.io/instance" -}} + {{- end -}} + {{- end -}} + + {{/* Find the namespace by looking at ClusterRoleBinding subjects */}} + {{- $clusterRoleBindings := lookup "rbac.authorization.k8s.io/v1" "ClusterRoleBinding" "" "" -}} + {{- if $clusterRoleBindings -}} + {{- range $crb := $clusterRoleBindings.items -}} + {{- if eq $crb.roleRef.name $cr.metadata.name -}} + {{- range $subject := $crb.subjects -}} + {{- if and (eq $subject.kind "ServiceAccount") $subject.namespace -}} + {{- $existingOperatorNamespace = $subject.namespace -}} + {{- end -}} + {{- end -}} + {{- end -}} + {{- end -}} + {{- end -}} + {{- end -}} + {{- end -}} + {{- end -}} +{{- end -}} + +{{- if $foundExistingClusterWideOperator -}} + {{- $uninstallCmd := printf "helm uninstall %s" $existingOperatorRelease -}} + {{- if $existingOperatorNamespace -}} + {{- $uninstallCmd = printf "helm uninstall %s -n %s" $existingOperatorRelease $existingOperatorNamespace -}} + {{- end -}} + + {{- if .Values.namespaceRestriction.enabled -}} + {{- if $existingOperatorNamespace -}} + {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}} + {{- else -}} + {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. 
Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}} + {{- end -}} + {{- else -}} + {{- if $existingOperatorNamespace -}} + {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}} + {{- else -}} + {{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}} + {{- end -}} + {{- end -}} +{{- end -}} + +{{/* Additional validation for cluster-wide mode */}} +{{- if not .Values.namespaceRestriction.enabled -}} + {{/* Reject custom leader election IDs in cluster-wide mode */}} + {{- $leaderElectionId := default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id -}} + {{- if ne $leaderElectionId "dynamo.nvidia.com" -}} + {{- fail (printf "VALIDATION ERROR: Using custom leader election ID '%s' in cluster-wide mode. For proper coordination, all cluster-wide Dynamo operators should use the SAME leader election ID. Different IDs will allow multiple leaders simultaneously (split-brain scenario)." $leaderElectionId) -}} + {{- end -}} +{{- end -}} +{{- end -}} + +{{/* +Validation for configuration consistency +*/}} +{{- define "dynamo-operator.validateConfiguration" -}} +{{/* Validate leader election namespace setting */}} +{{- if and (not .Values.namespaceRestriction.enabled) .Values.controllerManager.leaderElection.namespace -}} + {{- if eq .Values.controllerManager.leaderElection.namespace .Release.Namespace -}} + {{- printf "\nVALIDATION ERROR: Leader election namespace is set to the same as release namespace (%s) in cluster-wide mode. This may prevent proper coordination between multiple releases. Consider using 'kube-system' or leaving empty for default.\n" .Release.Namespace | fail -}} + {{- end -}} +{{- end -}} +{{- end -}} diff --git a/deploy/cloud/helm/platform/components/operator/templates/deployment.yaml b/deploy/cloud/helm/platform/components/operator/templates/deployment.yaml index f6f7b5e3dd..9d1175ce72 100644 --- a/deploy/cloud/helm/platform/components/operator/templates/deployment.yaml +++ b/deploy/cloud/helm/platform/components/operator/templates/deployment.yaml @@ -12,6 +12,11 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +{{/* Validate installation to prevent conflicts */}} +{{- include "dynamo-operator.validateClusterWideInstallation" . -}} +{{- include "dynamo-operator.validateConfiguration" . 
-}} + --- apiVersion: apps/v1 kind: Deployment @@ -76,7 +81,8 @@ spec: - --leader-elect=false {{- else }} - --leader-elect - - --leader-election-id=dynamo.nvidia.com + - --leader-election-id={{ default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id }} + - --leader-election-namespace={{ default "kube-system" .Values.controllerManager.leaderElection.namespace }} {{- end }} {{- if .Values.natsAddr }} - --natsAddr={{ .Values.natsAddr }} diff --git a/deploy/cloud/helm/platform/components/operator/templates/leader-election-rbac.yaml b/deploy/cloud/helm/platform/components/operator/templates/leader-election-rbac.yaml index 20b88ec6e7..f7174ff7cf 100644 --- a/deploy/cloud/helm/platform/components/operator/templates/leader-election-rbac.yaml +++ b/deploy/cloud/helm/platform/components/operator/templates/leader-election-rbac.yaml @@ -12,8 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +{{/* +Only create leader election RBAC when leader election is enabled. +When namespaceRestriction.enabled=true, leader election is disabled (--leader-elect=false), +so these permissions are not needed. +*/}} +{{- if not .Values.namespaceRestriction.enabled }} apiVersion: rbac.authorization.k8s.io/v1 -kind: Role +kind: ClusterRole metadata: name: {{ include "dynamo-operator.fullname" . }}-leader-election-role labels: @@ -55,7 +61,7 @@ rules: - patch --- apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding +kind: ClusterRoleBinding metadata: name: {{ include "dynamo-operator.fullname" . }}-leader-election-rolebinding labels: @@ -65,9 +71,10 @@ metadata: {{- include "dynamo-operator.labels" . | nindent 4 }} roleRef: apiGroup: rbac.authorization.k8s.io - kind: Role + kind: ClusterRole name: '{{ include "dynamo-operator.fullname" . }}-leader-election-role' subjects: - kind: ServiceAccount name: '{{ include "dynamo-operator.fullname" . 
}}-controller-manager' - namespace: '{{ .Release.Namespace }}' \ No newline at end of file + namespace: '{{ .Release.Namespace }}' +{{- end }} \ No newline at end of file diff --git a/deploy/cloud/helm/platform/components/operator/values.yaml b/deploy/cloud/helm/platform/components/operator/values.yaml index e074bc0f35..168283e121 100644 --- a/deploy/cloud/helm/platform/components/operator/values.yaml +++ b/deploy/cloud/helm/platform/components/operator/values.yaml @@ -27,6 +27,19 @@ namespaceRestriction: targetNamespace: "" controllerManager: tolerations: [] + + # Leader election configuration + leaderElection: + # Leader election ID for cluster-wide coordination + # WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain + # Different IDs would allow multiple leaders simultaneously + id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators) + + # Namespace for leader election leases (only used in cluster-wide mode) + # If empty, defaults to kube-system for cluster-wide coordination + # All cluster-wide operators should use the SAME namespace for proper leader election + namespace: "" + kubeRbacProxy: args: - --secure-listen-address=0.0.0.0:8443 diff --git a/deploy/cloud/helm/platform/values.yaml b/deploy/cloud/helm/platform/values.yaml index ce384fe787..2d2cc469be 100644 --- a/deploy/cloud/helm/platform/values.yaml +++ b/deploy/cloud/helm/platform/values.yaml @@ -31,8 +31,8 @@ dynamo-operator: modelExpressURL: "" # -- Namespace access controls for the operator namespaceRestriction: - # -- Whether to restrict operator to specific namespaces - enabled: true + # -- Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). + enabled: false # -- Target namespace for operator deployment (leave empty for current namespace) targetNamespace: @@ -44,6 +44,13 @@ dynamo-operator: # -- Affinity for controller manager pods affinity: [] + # Leader election configuration for cluster-wide coordination + leaderElection: + # -- Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. + id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators) + # -- Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election. 
+ namespace: "" + manager: # Container image configuration for the operator manager image: diff --git a/deploy/cloud/operator/cmd/main.go b/deploy/cloud/operator/cmd/main.go index 7cfd43a0ae..bc55f36eb2 100644 --- a/deploy/cloud/operator/cmd/main.go +++ b/deploy/cloud/operator/cmd/main.go @@ -124,6 +124,7 @@ func main() { var enableHTTP2 bool var restrictedNamespace string var leaderElectionID string + var leaderElectionNamespace string var natsAddr string var etcdAddr string var istioVirtualServiceGateway string @@ -149,6 +150,9 @@ func main() { "Enable resources filtering, only the resources belonging to the given namespace will be handled.") flag.StringVar(&leaderElectionID, "leader-election-id", "", "Leader election id"+ "Id to use for the leader election.") + flag.StringVar(&leaderElectionNamespace, + "leader-election-namespace", "", + "Namespace where the leader election resource will be created (default: same as operator namespace)") flag.StringVar(&natsAddr, "natsAddr", "", "address of the NATS server") flag.StringVar(&etcdAddr, "etcdAddr", "", "address of the etcd server") flag.StringVar(&istioVirtualServiceGateway, "istio-virtual-service-gateway", "", @@ -253,10 +257,11 @@ func main() { SecureServing: secureMetrics, TLSOpts: tlsOpts, }, - WebhookServer: webhookServer, - HealthProbeBindAddress: probeAddr, - LeaderElection: enableLeaderElection, - LeaderElectionID: leaderElectionID, + WebhookServer: webhookServer, + HealthProbeBindAddress: probeAddr, + LeaderElection: enableLeaderElection, + LeaderElectionID: leaderElectionID, + LeaderElectionNamespace: leaderElectionNamespace, // LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily // when the Manager ends. This requires the binary to immediately end when the // Manager is stopped, otherwise, this setting is unsafe. Setting this significantly diff --git a/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go b/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go index d459c4d719..06ad894ca0 100644 --- a/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go +++ b/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go @@ -19,7 +19,7 @@ func (m *MockSimpleDeployer) GetHostNames(serviceName string, numberOfNodes int3 hostnames := make([]string, numberOfNodes) hostnames[0] = m.GetLeaderHostname(serviceName) for i := int32(1); i < numberOfNodes; i++ { - hostnames[i] = "worker" + string(rune('0'+i)) + ".example.com" + hostnames[i] = "worker" + string('0'+i) + ".example.com" } return hostnames } @@ -39,7 +39,7 @@ func (m *MockShellDeployer) GetHostNames(serviceName string, numberOfNodes int32 hostnames := make([]string, numberOfNodes) hostnames[0] = m.GetLeaderHostname(serviceName) for i := int32(1); i < numberOfNodes; i++ { - hostnames[i] = "$(WORKER_" + string(rune('0'+i)) + "_HOST)" + hostnames[i] = "$(WORKER_" + string('0'+i) + "_HOST)" } return hostnames } diff --git a/docs/kubernetes/README.md b/docs/kubernetes/README.md index 22ff95675c..fb9cbed756 100644 --- a/docs/kubernetes/README.md +++ b/docs/kubernetes/README.md @@ -23,7 +23,7 @@ High-level guide to Dynamo Kubernetes deployments. Start here, then dive into sp ```bash # 1. Set environment -export NAMESPACE=dynamo-kubernetes +export NAMESPACE=dynamo-system export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases # 2. Install CRDs @@ -50,8 +50,8 @@ Each backend has deployment examples and configuration options: ## 3. 
Deploy Your First Model ```bash -# Set same namespace from platform install export NAMESPACE=dynamo-cloud +kubectl create namespace ${NAMESPACE} # Deploy any example (this uses vLLM with Qwen model using aggregated serving) kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} diff --git a/docs/kubernetes/installation_guide.md b/docs/kubernetes/installation_guide.md index c581d76b49..d5a9a9e1c3 100644 --- a/docs/kubernetes/installation_guide.md +++ b/docs/kubernetes/installation_guide.md @@ -69,7 +69,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi ```bash # 1. Set environment -export NAMESPACE=dynamo-kubernetes +export NAMESPACE=dynamo-system export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases # 2. Install CRDs @@ -99,6 +99,15 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080" ``` +> [!TIP] +> By default, the Dynamo operator is installed cluster-wide and watches all namespaces. +> To restrict the operator to a single namespace (the Helm release namespace by default), set `dynamo-operator.namespaceRestriction.enabled` to `true`. +> You can also change the restricted namespace by setting the `targetNamespace` property. + +```bash +--set "dynamo-operator.namespaceRestriction.enabled=true" +--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional +``` → [Verify Installation](#verify-installation) @@ -108,7 +117,7 @@ Build and deploy from source for customization. ```bash # 1. Set environment -export NAMESPACE=dynamo-cloud +export NAMESPACE=dynamo-system export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry export DOCKER_USERNAME='$oauthtoken' export DOCKER_PASSWORD= diff --git a/docs/kubernetes/logging.md b/docs/kubernetes/logging.md index cf8e8ed054..ce1767bbc8 100644 --- a/docs/kubernetes/logging.md +++ b/docs/kubernetes/logging.md @@ -31,7 +31,7 @@ The following env variables are set: ```bash export MONITORING_NAMESPACE=monitoring -export DYNAMO_NAMESPACE=dynamo-cloud +export DYNAMO_NAMESPACE=dynamo-system ``` ## Installation Steps
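For reference, a minimal sketch of the namespace-scoped path these changes enable, built from the `--set` flags documented above. The release names and namespaces (`tenant-a`, `tenant-b`) are illustrative only, and the platform chart tarball is assumed to have been fetched as in the installation guide:

```bash
# Two namespace-restricted operator instances in different namespaces.
# Per the conflict table above, this combination is allowed; a later
# cluster-wide install would be blocked by the validation template.
export RELEASE_VERSION=0.x.x   # any published Dynamo platform chart version

helm install tenant-a dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace tenant-a --create-namespace \
  --set "dynamo-operator.namespaceRestriction.enabled=true"   # targetNamespace defaults to the release namespace

helm install tenant-b dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace tenant-b --create-namespace \
  --set "dynamo-operator.namespaceRestriction.enabled=true" \
  --set "dynamo-operator.namespaceRestriction.targetNamespace=tenant-b"
```

In namespace-restricted mode the operator runs with `--leader-elect=false` and the chart skips the leader-election RBAC entirely, so neither install needs the cluster-scoped lease permissions introduced for the cluster-wide case.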