diff --git a/deploy/cloud/helm/platform/Chart.yaml b/deploy/cloud/helm/platform/Chart.yaml index ea00f3189f..6c56a072b8 100644 --- a/deploy/cloud/helm/platform/Chart.yaml +++ b/deploy/cloud/helm/platform/Chart.yaml @@ -34,3 +34,12 @@ dependencies: version: 11.1.0 repository: "https://charts.bitnami.com/bitnami" condition: etcd.enabled + - name: kai-scheduler + version: v0.8.1 + repository: oci://ghcr.io/nvidia/kai-scheduler + condition: kai-scheduler.enabled + - name: grove-charts + alias: grove + version: v0.0.0-6e30275 + repository: oci://ghcr.io/nvidia/grove + condition: grove.enabled diff --git a/deploy/cloud/helm/platform/README.md b/deploy/cloud/helm/platform/README.md new file mode 100644 index 0000000000..bfc821feb3 --- /dev/null +++ b/deploy/cloud/helm/platform/README.md @@ -0,0 +1,108 @@ + + +# dynamo-platform + +A Helm chart for NVIDIA Dynamo Platform. + +![Version: 0.5.0](https://img.shields.io/badge/Version-0.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) + +## 🚀 Overview + +The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including: + +- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments +- **NATS**: High-performance messaging system for component communication +- **etcd**: Distributed key-value store for operator state management +- **Grove**: Multi-node inference orchestration (optional) +- **Kai Scheduler**: Advanced workload scheduling (optional) + +## 📋 Prerequisites + +- Kubernetes cluster (v1.20+) +- Helm 3.8+ +- Sufficient cluster resources for your deployment scale +- Container registry access (if using private images) + +## 🔧 Configuration + +## Requirements + +| Repository | Name | Version | +|------------|------|---------| +| file://components/operator | dynamo-operator | 0.5.0 | +| https://charts.bitnami.com/bitnami | etcd | 11.1.0 | +| 
https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 | +| oci://ghcr.io/nvidia/grove | grove(grove-charts) | v0.0.0-6e30275 | +| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.8.1 | + +## Values + +| Key | Type | Default | Description | +|-----|------|---------|-------------| +| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment | +| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" | +| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" | +| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces | +| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) | +| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods | +| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository | +| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) | +| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image | +| dynamo-operator.controllerManager.manager.args[0] | string | `"--health-probe-bind-address=:8081"` | Health probe endpoint for Kubernetes health checks | +| dynamo-operator.controllerManager.manager.args[1] | string | `"--metrics-bind-address=127.0.0.1:8080"` | Metrics endpoint for Prometheus scraping (localhost only for security) | +| dynamo-operator.imagePullSecrets | list | `[]` | Secrets for pulling 
private container images | +| dynamo-operator.dynamo.groveTerminationDelay | string | `"15m"` | How long to wait before forcefully terminating Grove instances | +| dynamo-operator.dynamo.internalImages.debugger | string | `"python:3.12-slim"` | Debugger image for troubleshooting deployments | +| dynamo-operator.dynamo.enableRestrictedSecurityContext | bool | `false` | Whether to enable restricted security contexts for enhanced security | +| dynamo-operator.dynamo.dockerRegistry.useKubernetesSecret | bool | `false` | Whether to use Kubernetes secrets for registry authentication | +| dynamo-operator.dynamo.dockerRegistry.server | string | `nil` | Docker registry server URL | +| dynamo-operator.dynamo.dockerRegistry.username | string | `nil` | Registry username | +| dynamo-operator.dynamo.dockerRegistry.password | string | `nil` | Registry password (consider using existingSecretName instead) | +| dynamo-operator.dynamo.dockerRegistry.existingSecretName | string | `nil` | Name of existing Kubernetes secret containing registry credentials | +| dynamo-operator.dynamo.dockerRegistry.secure | bool | `true` | Whether the registry uses HTTPS | +| dynamo-operator.dynamo.ingress.enabled | bool | `false` | Whether to create ingress resources | +| dynamo-operator.dynamo.ingress.className | string | `nil` | Ingress class name (e.g., "nginx", "traefik") | +| dynamo-operator.dynamo.ingress.tlsSecretName | string | `"my-tls-secret"` | Secret name containing TLS certificates | +| dynamo-operator.dynamo.istio.enabled | bool | `false` | Whether to enable Istio integration | +| dynamo-operator.dynamo.istio.gateway | string | `nil` | Istio gateway name for routing | +| dynamo-operator.dynamo.ingressHostSuffix | string | `""` | Host suffix for generated ingress hostnames | +| dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing | +| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference 
coordination, if enabled, the Grove operator will be deployed cluster-wide | +| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide | +| etcd.enabled | bool | `true` | Whether to enable etcd deployment, disable if you want to use an external etcd instance | +| nats.enabled | bool | `true` | Whether to enable NATS deployment, disable if you want to use an external NATS instance | + +### NATS Configuration + +For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation: +**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)** + +### etcd Configuration + +For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation: +**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)** + +## 📚 Additional Resources + +- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) +- [NATS Documentation](https://docs.nats.io/) +- [etcd Documentation](https://etcd.io/docs/) +- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) + +---------------------------------------------- +Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2) diff --git a/deploy/cloud/helm/platform/README.md.gotmpl b/deploy/cloud/helm/platform/README.md.gotmpl new file mode 100644 index 0000000000..ba5e8780fc --- /dev/null +++ b/deploy/cloud/helm/platform/README.md.gotmpl @@ -0,0 +1,65 @@ + + +{{ template "chart.header" . }} + +{{ template "chart.description" . }} + +{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . 
}} + +## 🚀 Overview + +The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including: + +- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments +- **NATS**: High-performance messaging system for component communication +- **etcd**: Distributed key-value store for operator state management +- **Grove**: Multi-node inference orchestration (optional) +- **Kai Scheduler**: Advanced workload scheduling (optional) + +## 📋 Prerequisites + +- Kubernetes cluster (v1.20+) +- Helm 3.8+ +- Sufficient cluster resources for your deployment scale +- Container registry access (if using private images) + +## 🔧 Configuration + +{{ template "chart.requirementsSection" . }} + +{{ template "chart.valuesSection" . }} + +### NATS Configuration + +For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation: +**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)** + +### etcd Configuration + +For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation: +**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)** + + +## 📚 Additional Resources + +- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) +- [NATS Documentation](https://docs.nats.io/) +- [etcd Documentation](https://etcd.io/docs/) +- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) + +{{ template "helm-docs.versionFooter" . 
}} diff --git a/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml b/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml index a225b52b13..1f5e6f6d8a 100644 --- a/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml +++ b/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml @@ -491,7 +491,7 @@ subjects: apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: - name: {{ include "dynamo-operator.fullname" . }}-queue-reader + name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader labels: app.kubernetes.io/component: rbac app.kubernetes.io/created-by: dynamo-operator @@ -510,7 +510,7 @@ rules: apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: - name: {{ include "dynamo-operator.fullname" . }}-queue-reader-binding + name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader-binding labels: app.kubernetes.io/component: rbac app.kubernetes.io/created-by: dynamo-operator @@ -519,7 +519,7 @@ metadata: roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole - name: {{ include "dynamo-operator.fullname" . }}-queue-reader + name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader subjects: - kind: ServiceAccount name: '{{ include "dynamo-operator.fullname" . }}-controller-manager' diff --git a/deploy/cloud/helm/platform/templates/kai.yaml b/deploy/cloud/helm/platform/templates/kai.yaml new file mode 100644 index 0000000000..af1a082201 --- /dev/null +++ b/deploy/cloud/helm/platform/templates/kai.yaml @@ -0,0 +1,75 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +--- +{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }} + +{{- /* Create parent queue first */ -}} +{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }} +{{- if not $defaultQueue }} +--- +apiVersion: scheduling.run.ai/v2 +kind: Queue +metadata: + name: dynamo-default + annotations: + "helm.sh/hook": post-install,post-upgrade + "helm.sh/hook-weight": "100" + "helm.sh/hook-delete-policy": before-hook-creation +spec: + resources: + cpu: + quota: -1 + limit: -1 + overQuotaWeight: 1 + gpu: + quota: -1 + limit: -1 + overQuotaWeight: 1 + memory: + quota: -1 + limit: -1 + overQuotaWeight: 1 +{{- end }} + +{{- /* Create child queue second */ -}} +{{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }} +{{- if not $dynamoQueue }} +--- +apiVersion: scheduling.run.ai/v2 +kind: Queue +metadata: + name: dynamo + annotations: + "helm.sh/hook": post-install,post-upgrade + "helm.sh/hook-weight": "110" + "helm.sh/hook-delete-policy": before-hook-creation +spec: + parentQueue: dynamo-default + resources: + cpu: + quota: -1 + limit: -1 + overQuotaWeight: 1 + gpu: + quota: -1 + limit: -1 + overQuotaWeight: 1 + memory: + quota: -1 + limit: -1 + overQuotaWeight: 1 +{{- end }} + +{{- end }} \ No newline at end of file diff --git a/deploy/cloud/helm/platform/values.yaml b/deploy/cloud/helm/platform/values.yaml index fd6cd045a1..7bca7fb28d 100644 --- a/deploy/cloud/helm/platform/values.yaml +++ b/deploy/cloud/helm/platform/values.yaml @@ -14,170 +14,290 @@ # limitations under the License. 
# Used to generate top-level secrets (overridden by custom-values.yaml) -# Subcharts +# Subcharts configuration + +# Dynamo operator configuration dynamo-operator: + # -- Whether to enable the Dynamo Kubernetes operator deployment enabled: true + + # -- NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" natsAddr: "" + + # -- etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" etcdAddr: "" + + # Namespace access controls for the operator namespaceRestriction: + # -- Whether to restrict operator to specific namespaces enabled: true + # -- Target namespace for operator deployment (leave empty for current namespace) targetNamespace: + + # Controller manager configuration controllerManager: + # -- Node tolerations for controller manager pods tolerations: [] + manager: + # Container image configuration for the operator manager image: + # -- Official NVIDIA Dynamo operator image repository repository: "nvcr.io/nvidia/ai-dynamo/kubernetes-operator" + # -- Image tag (leave empty to use chart default) tag: "" + # -- Image pull policy - when to pull the image pullPolicy: IfNotPresent + + # Command line arguments for the operator manager args: + # -- Health probe endpoint for Kubernetes health checks - --health-probe-bind-address=:8081 + # -- Metrics endpoint for Prometheus scraping (localhost only for security) - --metrics-bind-address=127.0.0.1:8080 + + # -- Secrets for pulling private container images imagePullSecrets: [] + + # Core Dynamo platform configuration dynamo: + # -- How long to wait before forcefully terminating Grove instances groveTerminationDelay: 15m + + # Internal utility images used by the platform internalImages: + # -- Debugger image for troubleshooting deployments debugger: python:3.12-slim + + # -- Whether to enable restricted security contexts for enhanced security 
enableRestrictedSecurityContext: false + + # Docker registry configuration for private repositories dockerRegistry: + # -- Whether to use Kubernetes secrets for registry authentication useKubernetesSecret: false + # -- Docker registry server URL server: + # -- Registry username username: + # -- Registry password (consider using existingSecretName instead) password: + # -- Name of existing Kubernetes secret containing registry credentials existingSecretName: + # -- Whether the registry uses HTTPS secure: true + + # Ingress configuration for external access ingress: + # -- Whether to create ingress resources enabled: false + # -- Ingress class name (e.g., "nginx", "traefik") className: + # -- Secret name containing TLS certificates tlsSecretName: my-tls-secret + + # Istio service mesh configuration istio: + # -- Whether to enable Istio integration enabled: false + # -- Istio gateway name for routing gateway: + + # -- Host suffix for generated ingress hostnames ingressHostSuffix: "" + + # -- Whether VirtualServices should support HTTPS routing virtualServiceSupportsHTTPS: false + +# Grove component - distributed inference orchestration +grove: + # -- Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide + enabled: false + +# Kai Scheduler component - advanced workload scheduling +kai-scheduler: + # -- Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide + enabled: false + +# etcd configuration - distributed key-value store for operator state +# For complete configuration options, see: https://github.com/bitnami/charts/tree/main/bitnami/etcd etcd: + # -- Whether to enable etcd deployment, disable if you want to use an external etcd instance enabled: true + + # Persistent storage configuration for etcd data persistence: + # Whether to enable persistent storage (recommended for production) enabled: true # Use the 
cluster default storage-class or override with a named class storageClass: null + # Size of persistent volume for etcd data size: 1Gi + + # Pre-upgrade job configuration preUpgrade: + # Whether to run pre-upgrade validation jobs enabled: false + + # Number of etcd replicas (1 for single-node, 3+ for HA) replicaCount: 1 - # Explicitly remove authentication + + # Authentication and authorization settings + # Explicitly remove authentication for simplified internal communication auth: rbac: + # Whether to create RBAC authentication (disabled for internal use) create: false + # Health check configuration readinessProbe: + # Whether to enable readiness probes (disabled to reduce startup complexity) enabled: false livenessProbe: + # Whether to enable liveness probes (disabled to reduce startup complexity) enabled: false + # Node tolerations for etcd pods (allows scheduling on specific nodes) tolerations: [] +# NATS configuration - messaging system for operator communication +# For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats nats: + # -- Whether to enable NATS deployment, disable if you want to use an external NATS instance enabled: true - # reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts - # note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS + + # TLS Certificate Authority configuration for secure communication + # Reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts + # Note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS tlsCA: + # Whether to enable TLS CA configuration enabled: false + # Core NATS server configuration config: + # NATS clustering for high availability (multiple NATS servers) cluster: + # Whether to enable NATS clustering (disabled for single-node setups) enabled: false - + # JetStream - persistent messaging and streaming 
capabilities jetstream: + # Whether to enable JetStream (recommended for persistent messaging) enabled: true + # File-based storage for JetStream streams and consumers fileStore: + # Whether to enable file storage (persistent across restarts) enabled: true + # Directory path for JetStream file storage dir: /data ############################################################ - # stateful set -> volume claim templates -> jetstream pvc + # Persistent Volume Claim for JetStream file storage ############################################################ pvc: + # Whether to create a PVC for JetStream storage enabled: true + # Size of the persistent volume for JetStream data size: 10Gi + # Storage class name (leave empty for default) storageClassName: - # merge or patch the jetstream pvc + # Advanced PVC configuration (merge additional fields) # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#persistentvolumeclaim-v1-core merge: {} patch: [] - # defaults to "{{ include "nats.fullname" $ }}-js" + # PVC name (defaults to "{{ include "nats.fullname" $ }}-js") name: - # defaults to the PVC size + # Maximum size for JetStream file storage (defaults to PVC size) maxSize: + # Memory-based storage for JetStream (non-persistent) memoryStore: + # Whether to enable memory storage (faster but not persistent) enabled: false - # merge or patch the jetstream config - # https://docs.nats.io/running-a-nats-service/configuration#jetstream + # Advanced JetStream configuration + # For options see: https://docs.nats.io/running-a-nats-service/configuration#jetstream merge: {} patch: [] + # Core NATS server settings nats: + # Port for NATS client connections port: 4222 + + # TLS configuration for encrypted connections tls: + # Whether to enable TLS encryption enabled: false - # merge or patch the tls config - # https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls + # Advanced TLS configuration + # For options see: 
https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls merge: {} patch: [] + # Leaf nodes for creating NATS topologies and remote connections leafnodes: + # Whether to enable leaf node connections enabled: false - + # WebSocket support for browser-based NATS clients websocket: + # Whether to enable WebSocket protocol support enabled: false - + # MQTT protocol bridge for IoT device connectivity mqtt: + # Whether to enable MQTT protocol support enabled: false - + # Gateway connections for multi-cluster NATS deployments gateway: + # Whether to enable gateway connections enabled: false - + # HTTP monitoring endpoint for NATS server metrics monitor: + # Whether to enable HTTP monitoring interface enabled: true + # Port for monitoring HTTP endpoint port: 8222 + + # TLS configuration for monitoring endpoint tls: - # config.nats.tls must be enabled also - # when enabled, monitoring port will use HTTPS with the options from config.nats.tls + # Whether to enable HTTPS for monitoring (requires config.nats.tls enabled) + # When enabled, monitoring port will use HTTPS with the options from config.nats.tls enabled: false + # Go pprof profiling endpoint for performance debugging profiling: + # Whether to enable profiling endpoint (for debugging only) enabled: false + # Port for profiling endpoint port: 65432 + # Account resolver for multi-tenant NATS deployments resolver: + # Whether to enable account resolution (for advanced multi-tenancy) enabled: false - - # adds a prefix to the server name, which defaults to the pod name - # helpful for ensuring server name is unique in a super cluster + # Server naming configuration + # Adds a prefix to the server name, which defaults to the pod name + # Helpful for ensuring server name is unique in a super cluster serverNamePrefix: "" - # merge or patch the nats config - # https://docs.nats.io/running-a-nats-service/configuration - # following special rules apply + # Advanced NATS configuration merging and patching + # 
For complete options see: https://docs.nats.io/running-a-nats-service/configuration + # Special rules apply: # 1. strings that start with << and end with >> will be unquoted # use this for variables and numbers with units # 2. keys ending in $include will be switched to include directives # keys are sorted alphabetically, use prefix before $includes to control includes ordering # paths should be relative to /etc/nats-config/nats.conf - # example: - # + # Example: # merge: # $include: ./my-config.conf # zzz$include: ./my-config-last.conf @@ -186,48 +306,48 @@ nats: # token: << $TOKEN >> # jetstream: # max_memory_store: << 1GB >> - # - # will yield the config: - # { - # include ./my-config.conf; - # "authorization": { - # "token": $TOKEN - # }, - # "jetstream": { - # "max_memory_store": 1GB - # }, - # "server_name": "nats", - # include ./my-config-last.conf; - # } merge: {} patch: [] ############################################################ - # stateful set -> pod template -> nats container + # NATS container configuration in StatefulSet ############################################################ container: + # NATS server container image configuration image: + # Official NATS server repository repository: nats + # NATS server version (Alpine-based for smaller size) tag: 2.10.21-alpine + # Image pull policy (leave empty for chart default) pullPolicy: + # Custom registry URL (leave empty for Docker Hub) registry: - # container port options - # must be enabled in the config section also + # Container port configuration + # Note: Ports must also be enabled in the config section above # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#containerport-v1-core ports: + # Main NATS client connection port nats: {} + # Leaf node connection port leafnodes: {} + # WebSocket connection port websocket: {} + # MQTT protocol port mqtt: {} + # Cluster communication port cluster: {} + # Gateway connection port gateway: {} + # HTTP monitoring port monitor: {} + # 
Go profiling port profiling: {} - # map with key as env var name, value can be string or map - # example: - # + # Environment variables for the NATS container + # Map with key as env var name, value can be string or map + # Example: # env: # GOMEMLIMIT: 7GiB # TOKEN: @@ -237,211 +357,245 @@ nats: # key: token env: {} - # merge or patch the container + # Advanced container configuration merging and patching # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core merge: {} patch: [] ############################################################ - # stateful set -> pod template -> reloader container + # Configuration reloader container for hot config updates ############################################################ reloader: + # Whether to enable the config reloader sidecar container enabled: true + + # Config reloader container image image: + # Official NATS config reloader repository repository: natsio/nats-server-config-reloader + # Config reloader version tag: 0.16.0 + # Image pull policy (leave empty for chart default) pullPolicy: + # Custom registry URL (leave empty for Docker Hub) registry: - # env var map, see nats.env for an example + # Environment variables for the reloader container env: {} - # all nats container volume mounts with the following prefixes - # will be mounted into the reloader container + # Volume mount prefixes from NATS container to share with reloader + # All NATS container volume mounts with these prefixes will be mounted into the reloader natsVolumeMountPrefixes: - /etc/ - # merge or patch the container + # Advanced reloader container configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core merge: {} patch: [] ############################################################ - # stateful set -> pod template -> prom-exporter container + # Prometheus metrics exporter container (optional) ############################################################ - # config.monitor 
must be enabled + # Note: config.monitor must be enabled for this to work promExporter: + # Whether to enable Prometheus metrics exporter sidecar enabled: false - ############################################################ - # service + # Kubernetes Service for NATS access ############################################################ service: + # Whether to create a Kubernetes Service for NATS enabled: true - # service port options - # additional boolean field enable to control whether port is exposed in the service - # must be enabled in the config section also + # Service port configuration + # Additional boolean field 'enabled' controls whether port is exposed in the service + # Note: Ports must also be enabled in the config section above # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#serviceport-v1-core ports: + # Main NATS client connection port nats: enabled: true + # Leaf node connection port leafnodes: enabled: true + # WebSocket connection port websocket: enabled: true + # MQTT protocol port mqtt: enabled: true + # Cluster communication port (typically internal only) cluster: enabled: false + # Gateway connection port (typically internal only) gateway: enabled: false + # HTTP monitoring port (typically internal only) monitor: enabled: false + # Go profiling port (typically internal only) profiling: enabled: false - # merge or patch the service + # Advanced service configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core merge: {} patch: [] - # defaults to "{{ include "nats.fullname" $ }}" + # Service name (defaults to "{{ include "nats.fullname" $ }}") name: ############################################################ - # other nats extension points + # Advanced NATS Kubernetes resource configuration ############################################################ - # stateful set + # StatefulSet configuration for NATS server persistence statefulSet: - # merge or patch the stateful set + # 
Advanced StatefulSet configuration merging and patching # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#statefulset-v1-apps merge: {} patch: [] - # defaults to "{{ include "nats.fullname" $ }}" + # StatefulSet name (defaults to "{{ include "nats.fullname" $ }}") name: - # stateful set -> pod template + # Pod template configuration for NATS StatefulSet podTemplate: - # adds a hash of the ConfigMap as a pod annotation - # this will cause the StatefulSet to roll when the ConfigMap is updated + # Whether to add a hash of the ConfigMap as a pod annotation + # This will cause the StatefulSet to roll when the ConfigMap is updated configChecksumAnnotation: true - # map of topologyKey: topologySpreadConstraint - # labelSelector will be added to match StatefulSet pods - # - # topologySpreadConstraints: - # kubernetes.io/hostname: - # maxSkew: 1 - # + # Pod topology spread constraints for better distribution across nodes + # Map of topologyKey: topologySpreadConstraint + # labelSelector will be added automatically to match StatefulSet pods + # Example: + # topologySpreadConstraints: + # kubernetes.io/hostname: + # maxSkew: 1 topologySpreadConstraints: {} - # merge or patch the pod template + # Advanced pod template configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#pod-v1-core merge: spec: + # Node tolerations for NATS pods (allows scheduling on specific nodes) tolerations: [] patch: [] - # headless service + # Headless service for StatefulSet pod discovery headlessService: - # merge or patch the headless service + # Advanced headless service configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core merge: {} patch: [] - # defaults to "{{ include "nats.fullname" $ }}-headless" + # Headless service name (defaults to "{{ include "nats.fullname" $ }}-headless") name: - # config map + # ConfigMap for NATS server configuration configMap: - # merge or patch the config map + # 
Advanced ConfigMap configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#configmap-v1-core merge: {} patch: [] - # defaults to "{{ include "nats.fullname" $ }}-config" + # ConfigMap name (defaults to "{{ include "nats.fullname" $ }}-config") name: - # pod disruption budget + # Pod Disruption Budget for controlled rolling updates podDisruptionBudget: + # Whether to create a PodDisruptionBudget (recommended for production) enabled: true - - # service account + # Service Account for NATS server pods serviceAccount: + # Whether to create and use a dedicated service account enabled: false - - ############################################################ - # natsBox - # - # NATS Box Deployment and associated resources + # NATS Box - CLI tools and debugging container + # NATS Box provides CLI tools for interacting with NATS server ############################################################ natsBox: + # Whether to deploy NATS Box for CLI access and debugging enabled: true ############################################################ - # NATS contexts + # NATS client contexts for authentication and connection ############################################################ contexts: + # Default context configuration default: + # Credentials-based authentication creds: - # set contents in order to create a secret with the creds file contents + # Inline credentials file contents (base64 encoded) contents: - # set secretName in order to mount an existing secret to dir + # Name of existing secret containing credentials file secretName: - # defaults to /etc/nats-creds/ + # Directory to mount credentials (defaults to /etc/nats-creds/) dir: + # Key name in secret for credentials file key: nats.creds + + # NKey-based authentication (public/private key pairs) nkey: - # set contents in order to create a secret with the nkey file contents + # Inline NKey file contents (base64 encoded) contents: - # set secretName in order to mount an existing secret to dir + 
# Name of existing secret containing NKey file secretName: - # defaults to /etc/nats-nkeys/ + # Directory to mount NKey (defaults to /etc/nats-nkeys/) dir: + # Key name in secret for NKey file key: nats.nk - # used to connect with client certificates + + # TLS client certificate authentication tls: - # set secretName in order to mount an existing secret to dir + # Name of existing secret containing TLS client certificates secretName: - # defaults to /etc/nats-certs/ + # Directory to mount certificates (defaults to /etc/nats-certs/) dir: + # Certificate file name in secret cert: tls.crt + # Private key file name in secret key: tls.key - # merge or patch the context - # https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts + # Advanced context configuration + # For options see: https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts merge: {} patch: [] - # name of context to select by default + # Name of context to select by default for NATS CLI operations defaultContextName: default ############################################################ - # deployment -> pod template -> nats-box container + # NATS Box container configuration ############################################################ container: + # NATS Box container image image: + # Official NATS Box repository with CLI tools repository: natsio/nats-box + # NATS Box version tag: 0.14.5 + # Image pull policy (leave empty for chart default) pullPolicy: + # Custom registry URL (leave empty for Docker Hub) registry: - # env var map, see nats.env for an example + # Environment variables for NATS Box container env: {} - # merge or patch the container + # Advanced container configuration # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core merge: {} patch: [] - # service account + + # Service Account for NATS Box deployment serviceAccount: + # Whether to create and use a dedicated service account for NATS Box enabled: false + # Pod template configuration for 
NATS Box deployment podTemplate: merge: spec: + # Node tolerations for NATS Box pods tolerations: [] patch: [] diff --git a/deploy/cloud/operator/Makefile b/deploy/cloud/operator/Makefile index ef21bf7429..fb7f8d62be 100644 --- a/deploy/cloud/operator/Makefile +++ b/deploy/cloud/operator/Makefile @@ -57,7 +57,7 @@ ensure-yq: fi .PHONY: manifests -manifests: controller-gen ensure-yq ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects. +manifests: controller-gen ensure-yq generate-api-docs ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects. # Use a large maxDescLen to ensure all field comments are included as OpenAPI descriptions $(CONTROLLER_GEN) rbac:roleName=manager-role crd:maxDescLen=100000 webhook paths="./..." output:crd:artifacts:config=config/crd/bases echo "Removing name from mainContainer required fields" @@ -266,6 +266,27 @@ $(HELMIFY): $(LOCALBIN) helm: manifests kustomize helmify $(KUSTOMIZE) build config/default | $(HELMIFY) -image-pull-secrets charts/dynamo-kubernetes-operator +######################### CRD Reference Docs +CRD_REF_DOCS_VERSION ?= v0.0.12 +CRD_REF_DOCS ?= $(LOCALBIN)/crd-ref-docs + +.PHONY: crd-ref-docs +crd-ref-docs: $(CRD_REF_DOCS) ## Download crd-ref-docs locally if necessary. +$(CRD_REF_DOCS): $(LOCALBIN) + @echo "Installing crd-ref-docs $(CRD_REF_DOCS_VERSION)" + @GOBIN=$(LOCALBIN) go install github.com/elastic/crd-ref-docs@$(CRD_REF_DOCS_VERSION) + @echo "✅ crd-ref-docs $(CRD_REF_DOCS_VERSION) installed successfully" + +.PHONY: generate-api-docs +generate-api-docs: crd-ref-docs ## Generate API reference documentation from CRDs + @echo "📚 Generating CRD API reference documentation..." 
+ @mkdir -p docs + @$(CRD_REF_DOCS) \ + --source-path=api \ + --config=docs/crd-ref-docs-config.yaml \ + --renderer=markdown \ + --output-path=docs/api_reference.md + @echo "✅ Generated API reference at docs/api_reference.md" .PHONY: coverage coverage: test diff --git a/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go b/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go index ae6e233696..8eedb0a2d4 100644 --- a/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go +++ b/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go @@ -67,13 +67,13 @@ type DynamoComponentDeploymentSharedSpec struct { // Labels to add to generated Kubernetes resources for this component. Labels map[string]string `json:"labels,omitempty"` - // contains the name of the component + // The name of the component ServiceName string `json:"serviceName,omitempty"` // ComponentType indicates the role of this component (for example, "main"). ComponentType string `json:"componentType,omitempty"` - // dynamo namespace of the service (allows to override the dynamo namespace of the service defined in annotations inside the dynamo archive) + // Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive) DynamoNamespace *string `json:"dynamoNamespace,omitempty"` // Resources requested and limits for this component, including CPU, memory, @@ -99,8 +99,9 @@ type DynamoComponentDeploymentSharedSpec struct { // ExtraPodMetadata adds labels/annotations to the created Pods. ExtraPodMetadata *dynamoCommon.ExtraPodMetadata `json:"extraPodMetadata,omitempty"` // +optional - // ExtraPodSpec merges additional fields into the generated PodSpec for advanced - // customization (tolerations, node selectors, affinity, etc.). + // ExtraPodSpec allows to override the main pod spec configuration. + // It is a k8s standard PodSpec. 
It also contains a MainContainer (standard k8s Container) field + // that allows overriding the main container configuration. ExtraPodSpec *dynamoCommon.ExtraPodSpec `json:"extraPodSpec,omitempty"` // LivenessProbe to detect and restart unhealthy containers. diff --git a/deploy/cloud/operator/api/v1alpha1/groupversion_info.go b/deploy/cloud/operator/api/v1alpha1/groupversion_info.go index 59ac11975a..28e0ce1266 100644 --- a/deploy/cloud/operator/api/v1alpha1/groupversion_info.go +++ b/deploy/cloud/operator/api/v1alpha1/groupversion_info.go @@ -17,7 +17,7 @@ * Modifications Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES */ -// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group +// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group. // +kubebuilder:object:generate=true // +groupName=nvidia.com package v1alpha1 diff --git a/deploy/cloud/operator/docs/api_reference.md b/deploy/cloud/operator/docs/api_reference.md new file mode 100644 index 0000000000..85cef51347 --- /dev/null +++ b/deploy/cloud/operator/docs/api_reference.md @@ -0,0 +1,279 @@ +# API Reference + +## Packages +- [nvidia.com/v1alpha1](#nvidiacomv1alpha1) + + +## nvidia.com/v1alpha1 + +Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group. 
+ +### Resource Types +- [DynamoComponentDeployment](#dynamocomponentdeployment) +- [DynamoGraphDeployment](#dynamographdeployment) + + + +#### Autoscaling + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `enabled` _boolean_ | | | | +| `minReplicas` _integer_ | | | | +| `maxReplicas` _integer_ | | | | +| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ | | | | +| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ | | | | + + + + +#### DynamoComponentDeployment + + + +DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API + + + + + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | | +| `kind` _string_ | `DynamoComponentDeployment` | | | +| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | +| `spec` _[DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)_ | Spec defines the desired state for this Dynamo component deployment. | | | + + +#### DynamoComponentDeploymentSharedSpec + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component
(such as Pod, Service, and Ingress when applicable). | | | +| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | | +| `serviceName` _string_ | The name of the component | | | +| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | | +| `dynamoNamespace` _string_ | Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive) | | | +| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,
GPUs/devices, and any runtime-specific resources. | | | +| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | | +| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | | +| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as
environment variables in the component containers. | | | +| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. | | | +| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | | +| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | | +| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | | +| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.
It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
that allows overriding the main container configuration. | | | +| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | | +| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | | +| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | | +| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | | + + +#### DynamoComponentDeploymentSpec + + + +DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment + + + +_Appears in:_ +- [DynamoComponentDeployment](#dynamocomponentdeployment) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `dynamoComponent` _string_ | DynamoComponent selects the Dynamo component from the archive to deploy.
Typically corresponds to a component defined in the packaged Dynamo artifacts. | | | +| `dynamoTag` _string_ | contains the tag of the DynamoComponent: for example, "my_package:MyService" | | | +| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm") | | Enum: [sglang vllm trtllm]
| +| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component
(such as Pod, Service, and Ingress when applicable). | | | +| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | | +| `serviceName` _string_ | The name of the component | | | +| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | | +| `dynamoNamespace` _string_ | Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive) | | | +| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,
GPUs/devices, and any runtime-specific resources. | | | +| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | | +| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | | +| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as
environment variables in the component containers. | | | +| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. | | | +| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | | +| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | | +| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | | +| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows to override the main pod spec configuration.
It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
that allows overriding the main container configuration. | | | +| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | | +| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | | +| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | | +| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | | + + +#### DynamoGraphDeployment + + + +DynamoGraphDeployment is the Schema for the dynamographdeployments API. + + + + + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | | +| `kind` _string_ | `DynamoGraphDeployment` | | | +| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | +| `spec` _[DynamoGraphDeploymentSpec](#dynamographdeploymentspec)_ | Spec defines the desired state for this graph deployment. | | | +| `status` _[DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)_ | Status reflects the current observed state of this graph deployment. | | | + + +#### DynamoGraphDeploymentSpec + + + +DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment. + + + +_Appears in:_ +- [DynamoGraphDeployment](#dynamographdeployment) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `dynamoGraph` _string_ | DynamoGraph selects the graph (workflow/topology) to deploy. This must match
a graph name packaged with the Dynamo archive. | | | +| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the graph unless
overridden by service-specific configuration. | | Optional: {}
| +| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). | | Enum: [sglang vllm trtllm]
| + + +#### DynamoGraphDeploymentStatus + + + +DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment. + + + +_Appears in:_ +- [DynamoGraphDeployment](#dynamographdeployment) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | | +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.
The slice is merged by type on patch updates. | | | + + +#### IngressSpec + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `enabled` _boolean_ | Enabled exposes the component through an ingress or virtual service when true. | | | +| `host` _string_ | Host is the base host name to route external traffic to this component. | | | +| `useVirtualService` _boolean_ | UseVirtualService indicates whether to configure a service-mesh VirtualService instead of a standard Ingress. | | | +| `virtualServiceGateway` _string_ | VirtualServiceGateway optionally specifies the gateway name to attach the VirtualService to. | | | +| `hostPrefix` _string_ | HostPrefix is an optional prefix added before the host. | | | +| `annotations` _object (keys:string, values:string)_ | Annotations to set on the generated Ingress/VirtualService resources. | | | +| `labels` _object (keys:string, values:string)_ | Labels to set on the generated Ingress/VirtualService resources. | | | +| `tls` _[IngressTLSSpec](#ingresstlsspec)_ | TLS holds the TLS configuration used by the Ingress/VirtualService. | | | +| `hostSuffix` _string_ | HostSuffix is an optional suffix appended after the host. | | | +| `ingressControllerClassName` _string_ | IngressControllerClassName selects the ingress controller class (e.g., "nginx"). | | | + + +#### IngressTLSSpec + + + + + + + +_Appears in:_ +- [IngressSpec](#ingressspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `secretName` _string_ | SecretName is the name of a Kubernetes Secret containing the TLS certificate and key. 
| | | + + +#### MultinodeSpec + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `nodeCount` _integer_ | Indicates the number of nodes to deploy for multinode components.
Total number of GPUs is `nodeCount` * GPU limit.
Must be greater than 1. | 2 | Minimum: 2
| + + +#### PVC + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `create` _boolean_ | Create indicates to create a new PVC | | | +| `name` _string_ | Name is the name of the PVC | | | +| `storageClass` _string_ | StorageClass to be used for PVC creation. Leave it as empty if the PVC is already created. | | | +| `size` _[Quantity](#quantity)_ | Size of the NIM cache in Gi, used during PVC creation | | | +| `volumeAccessMode` _[PersistentVolumeAccessMode](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#persistentvolumeaccessmode-v1-core)_ | VolumeAccessMode is the volume access mode of the PVC | | | +| `mountPoint` _string_ | | | | + + +#### SharedMemorySpec + + + + + + + +_Appears in:_ +- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec) +- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `disabled` _boolean_ | | | | +| `size` _[Quantity](#quantity)_ | | | | + + diff --git a/deploy/cloud/operator/docs/crd-ref-docs-config.yaml b/deploy/cloud/operator/docs/crd-ref-docs-config.yaml new file mode 100644 index 0000000000..9d3e503bac --- /dev/null +++ b/deploy/cloud/operator/docs/crd-ref-docs-config.yaml @@ -0,0 +1,56 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Configuration file for crd-ref-docs +# https://github.com/elastic/crd-ref-docs + +processor: + # Ignore common metadata fields that are not user-configurable + ignoreFields: + - "metadata.creationTimestamp" + - "metadata.generation" + - "metadata.resourceVersion" + - "metadata.selfLink" + - "metadata.uid" + - "status.conditions[*].lastTransitionTime" + - "status.observedGeneration" + - "TypeMeta$" + ignoreTypes: + - "List$" + - "ParseError$" + # Ignore only the override wrapper type to reduce repetition + # Keep SharedSpec so embedded fields are documented once + - "DynamoComponentDeploymentOverridesSpec$" + - "DynamoComponentDeploymentStatus$" + - "BaseStatus$" + +render: + # Output format - use markdown instead of default asciidoc + format: markdown + + # Kubernetes version for API compatibility info + kubernetesVersion: "1.28" + + # Group related resources together + groupByKind: true + + # Include resource descriptions + includeDescription: true + + # Reduce repetition in links and references + allowDangerousTypes: false + + # Sort types alphabetically for better organization + sortByName: true diff --git a/docs/Makefile b/docs/Makefile index 31ddca99f8..ecb8da6db4 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -23,12 +23,68 @@ SPHINXBUILD ?= sphinx-build SOURCEDIR = . BUILDDIR = build +##@ General + # Put it first so that "make" without argument is like "make help". 
-help: +help: ## Display help for all targets @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + @echo "" + @echo "Additional documentation targets:" + @awk 'BEGIN {FS = ":.*##"; printf " \033[36m%-20s\033[0m %s\n", "TARGET", "DESCRIPTION"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST) -clean: +clean: ## Clean build artifacts @rm -fr ${BUILDDIR} + +##@ Helm Documentation + +## Location to install dependencies to +LOCALBIN ?= $(shell pwd)/bin +$(LOCALBIN): + mkdir -p $(LOCALBIN) + +## Tool Versions +HELM_DOCS_VERSION ?= 1.14.2 + +## Tool Binaries +HELM_DOCS ?= $(LOCALBIN)/helm-docs-$(HELM_DOCS_VERSION) + +.PHONY: helm-docs-install +helm-docs-install: $(HELM_DOCS) ## Download helm-docs locally if necessary +$(HELM_DOCS): $(LOCALBIN) + @echo "📥 Downloading helm-docs $(HELM_DOCS_VERSION)..." + @ARCH=$$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/'); \ + OS=$$(uname -s | tr '[:upper:]' '[:lower:]'); \ + curl -sSL "https://github.com/norwoodj/helm-docs/releases/download/v$(HELM_DOCS_VERSION)/helm-docs_$(HELM_DOCS_VERSION)_$${OS}_$${ARCH}.tar.gz" | \ + tar xz -C $(LOCALBIN) helm-docs && \ + mv $(LOCALBIN)/helm-docs $(HELM_DOCS) && \ + echo "✅ helm-docs $(HELM_DOCS_VERSION) installed successfully" + +.PHONY: generate-helm-docs +generate-helm-docs: helm-docs-install ## Generate README.md for Helm charts from values.yaml + @echo "📚 Generating Helm chart documentation..." + @cd ../deploy/cloud/helm/platform && $(realpath $(HELM_DOCS)) \ + --template-files=README.md.gotmpl \ + --output-file=README.md \ + --sort-values-order=file \ + --chart-to-generate=. \ + --ignore-non-descriptions + @echo "✅ Generated documentation at ../deploy/cloud/helm/platform/README.md" + +.PHONY: helm-docs-clean +helm-docs-clean: ## Remove generated helm documentation + @echo "🧹 Cleaning generated helm documentation..." 
+ @rm -f ../deploy/cloud/helm/platform/README.md + @echo "✅ Cleaned helm documentation" + +.PHONY: generate-crd-docs +generate-crd-docs: ## Generate CRD API reference documentation + @echo "📚 Generating CRD API reference documentation..." + @cd ../deploy/cloud/operator && make generate-api-docs + @echo "✅ CRD API reference generated" + +.PHONY: docs-all +docs-all: generate-helm-docs generate-crd-docs html ## Generate all documentation (Sphinx + Helm + CRDs) + .PHONY: help Makefile clean diff --git a/docs/guides/dynamo_deploy/README.md b/docs/guides/dynamo_deploy/README.md index 090c4ed983..3d2b0b8ce6 100644 --- a/docs/guides/dynamo_deploy/README.md +++ b/docs/guides/dynamo_deploy/README.md @@ -59,6 +59,14 @@ It's a Kubernetes Custom Resource that defines your inference pipeline: The scripts in the `components//launch` folder like `agg.sh` demonstrate how you can serve your models locally. The corresponding YAML files like `agg.yaml` show you how you could create a kubernetes deployment for your inference graph. 
+## 📖 API Reference & Documentation + +For detailed technical specifications of Dynamo's Kubernetes resources: + +- **[API Reference](api-reference.md)** - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment` +- **[Operator Guide](dynamo_operator.md)** - Dynamo operator configuration and management +- **[Create Deployment](create_deployment.md)** - Step-by-step deployment creation examples + ### Choosing Your Architecture Pattern When creating a deployment, select the architecture pattern that best fits your use case: diff --git a/docs/guides/dynamo_deploy/api-reference.md b/docs/guides/dynamo_deploy/api-reference.md new file mode 100644 index 0000000000..a9d3653066 --- /dev/null +++ b/docs/guides/dynamo_deploy/api-reference.md @@ -0,0 +1,22 @@ + + +# Dynamo CRD API Reference + +For the complete technical API reference for Dynamo Custom Resource Definitions, see: + +**📖 [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)** diff --git a/docs/guides/dynamo_deploy/dynamo_cloud.md b/docs/guides/dynamo_deploy/dynamo_cloud.md index c45a66a0b7..b46e4f093e 100644 --- a/docs/guides/dynamo_deploy/dynamo_cloud.md +++ b/docs/guides/dynamo_deploy/dynamo_cloud.md @@ -39,7 +39,7 @@ helm version # v3.0+ docker version # Running daemon # Set your inference runtime image -export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0 # Also available: sglang-runtime, tensorrtllm-runtime ``` @@ -53,7 +53,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi ```bash # 1. Set environment export NAMESPACE=dynamo-kubernetes -export RELEASE_VERSION=0.4.1 # any version of Dynamo 0.3.2+ +export RELEASE_VERSION=0.5.0 # any version of Dynamo 0.3.2+ # 2. 
Install CRDs helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz @@ -65,6 +65,15 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} ``` +> [!TIP] +> By default, Grove and Kai Scheduler are NOT installed. You can enable them by setting the following flags in the helm install command: + +```bash +--set "grove.enabled=true" +--set "kai-scheduler.enabled=true" +``` + + → [Verify Installation](#verify-installation) ## Path C: Custom Development @@ -79,7 +88,7 @@ export NAMESPACE=dynamo-cloud export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry export DOCKER_USERNAME='$oauthtoken' export DOCKER_PASSWORD= -export IMAGE_TAG=0.4.1 +export IMAGE_TAG=0.5.0 # 2. Build operator cd deploy/cloud/operator @@ -176,6 +185,7 @@ kubectl create secret generic hf-token-secret \ ## Advanced Options +- [Helm Chart Configuration](../../../deploy/cloud/helm/platform/README.md) - [GKE-specific setup](gke_setup.md) - [Create custom deployments](create_deployment.md) - [Dynamo Operator details](dynamo_operator.md) diff --git a/docs/guides/dynamo_deploy/dynamo_operator.md b/docs/guides/dynamo_deploy/dynamo_operator.md index 960719f3a6..e900d65ac7 100644 --- a/docs/guides/dynamo_deploy/dynamo_operator.md +++ b/docs/guides/dynamo_deploy/dynamo_operator.md @@ -23,50 +23,9 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu ## Custom Resource Definitions (CRDs) -### CRD: `DynamoGraphDeployment` +For the complete technical API reference for Dynamo Custom Resource Definitions, see: - -| Field | Type | Description | Required | Default | -|------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------|---------| -| `services` | map | Map of service names to 
runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | Yes | | -| `envs` | list | list of global environment variables. | No | | - - -**API Version:** `nvidia.com/v1alpha1` -**Scope:** Namespaced - -#### Example -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeployment -metadata: - name: disagg -spec: - envs: - - name: GLOBAL_ENV_VAR - value: some_global_value - services: - Frontend: - replicas: 1 - envs: - - name: SPECIFIC_ENV_VAR - value: some_specific_value - Processor: - replicas: 1 - envs: - - name: SPECIFIC_ENV_VAR - value: some_specific_value - VllmWorker: - replicas: 1 - envs: - - name: SPECIFIC_ENV_VAR - value: some_specific_value - PrefillWorker: - replicas: 1 - envs: - - name: SPECIFIC_ENV_VAR - value: some_specific_value -``` +**๐Ÿ“– [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)** ## Installation @@ -151,25 +110,6 @@ export NAMESPACE= kubectl get dynamographdeployment llm-agg -n $NAMESPACE ``` - -## Reconciliation Logic - -### DynamoGraphDeployment - -- **Actions:** - - Create a DynamoComponent CR to build the docker image - - Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed -- **Status Management:** - - `.status.conditions`: Reflects readiness, failure, progress states - - `.status.state`: overall state of the deployment, based on the state of the DynamoComponentDeployments - -### DynamoComponentDeployment - -- **Actions:** - - Create a Deployment, Service, and Ingress for the service -- **Status Management:** - - `.status.conditions`: Reflects readiness, failure, progress states - ## Configuration diff --git a/docs/guides/dynamo_deploy/grove.md b/docs/guides/dynamo_deploy/grove.md index d6ecd0982f..94ac7ba984 100644 --- a/docs/guides/dynamo_deploy/grove.md +++ b/docs/guides/dynamo_deploy/grove.md @@ -87,10 +87,14 @@ Grove represents a significant advancement in Kubernetes-based 
orchestration for ## Getting Started -> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule. +Grove relies on KAI Scheduler for resource allocation and scheduling. + +For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler). For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios. -For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove). \ No newline at end of file +For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove). + +Dynamo Cloud also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Cloud Deployment Guide](dynamo_cloud.md) for more details. \ No newline at end of file