Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Sep 5, 2025

Overview:

automatically setup and inject prometheus configuration

Summary by CodeRabbit

  • New Features

    • Configurable Prometheus endpoint; operator injects PROMETHEUS_ENDPOINT into managed services.
    • Automatic PodMonitor creation for frontend, worker, and planner with optional namespace scoping.
  • Documentation

    • Simplified install using --create-namespace.
    • Revamped metrics guide: operator-managed PodMonitors, dedicated monitoring namespace, Prometheus endpoint configuration, and Grafana dashboard setup; removed manual PodMonitor steps.
    • Updated etcd image repository guidance.
  • Chores

    • Removed deprecated imageBuilder service account annotations configuration.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 5, 2025

Walkthrough

Adds Prometheus endpoint wiring across Helm and operator: new Helm values, operator flag/Config field, env var injection into managed pods, and conditional PodMonitor templates. Updates docs to reflect operator-managed PodMonitors and streamlined installation, and adjusts etcd image repo notes. Tests updated to assert PROMETHEUS_ENDPOINT propagation.

Changes

Cohort / File(s) Summary
Helm values and docs
deploy/cloud/helm/platform/values.yaml, deploy/cloud/helm/platform/README.md
Introduce dynamo-operator.metrics.prometheusEndpoint value and document it; add/readme note for etcd.image.repository (legacy Bitnami repo) and update comments.
Operator deployment template
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
Add conditional manager arg --prometheus-endpoint={{ .Values.dynamo.metrics.prometheusEndpoint }}.
Operator PodMonitor templates
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml
New conditional template emitting three PodMonitors (frontend/worker/planner) with optional namespace restriction and CRD presence check.
Operator component values
deploy/cloud/helm/platform/components/operator/values.yaml
Add dynamo.metrics.prometheusEndpoint: ""; remove dynamo.imageBuilder.serviceAccount.annotations.
Operator main and config
deploy/cloud/operator/cmd/main.go, deploy/cloud/operator/internal/controller_common/predicate.go
Add --prometheus-endpoint CLI flag; thread PrometheusEndpoint through commonController.Config.
Env var propagation logic
deploy/cloud/operator/internal/dynamo/graph.go
Add PROMETHEUS_ENDPOINT standard env when configured; refactor env assembly to merge with overrides deterministically.
Tests
deploy/cloud/operator/internal/dynamo/graph_test.go
Update fixtures to set PrometheusEndpoint; assert PROMETHEUS_ENDPOINT appears in generated pod envs.
User documentation
docs/guides/dynamo_deploy/installation_guide.md, docs/guides/dynamo_deploy/metrics.md
Streamline install with --create-namespace; shift metrics guide to operator-managed PodMonitors, add Prometheus endpoint config, simplify Grafana setup; remove manual PodMonitor steps.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant H as Helm Chart
  participant K as Kubernetes
  participant OD as Operator Deployment
  participant OP as Operator (Main)
  participant RC as Reconciler/Controllers
  participant S as Managed Service Pods

  U->>H: helm install --set dynamo-operator.metrics.prometheusEndpoint=...
  H->>K: Render Deploy + PodMonitors (if CRD) and apply
  K-->>OD: Create/Update manager Pod with arg --prometheus-endpoint
  OD->>OP: Start with provided flag
  OP->>RC: Build controller Config{ PrometheusEndpoint }
  RC->>S: Generate PodSpecs with env PROMETHEUS_ENDPOINT
  Note over S: Service env can override via Deployment envs
Loading
sequenceDiagram
  autonumber
  participant OP as Operator
  participant T as Templates (prometheus.yaml)
  participant K as Kubernetes API
  participant P as Prometheus
  participant G as Grafana

  OP->>T: On install/upgrade
  T->>K: Create PodMonitor(frontend/worker/planner) if CRD present
  K->>P: PodMonitors discover /metrics on labeled pods
  P-->>G: Expose metrics for dashboards
  Note over K,P: Namespace selection per values (restricted or any)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

I hop through charts with paws so light,
Dropped PodMonitors in the night.
A PROMETHEUS_ENDPOINT, snug and neat,
Now every pod hums metric beat.
Grafana glows, stars align—
Carrots, charts, and signals fine.
Metrics harvested—dashboard time! 🥕📈

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (13)
deploy/cloud/helm/platform/components/operator/values.yaml (1)

99-101: Clarify documentation for prometheusEndpoint (feeds helm-docs).

The README row is generated from these comments; tighten wording here to avoid confusion about “retrieve metrics”.

Apply:

-  metrics:
-    prometheusEndpoint: ""
+  # metrics configuration
+  metrics:
+    # Base URL of your Prometheus server. When set, the operator injects the
+    # PROMETHEUS_ENDPOINT env var into managed pods. Scraping is handled by
+    # PodMonitors; workloads do not need to contact Prometheus unless they
+    # explicitly use this env var.
+    prometheusEndpoint: ""
deploy/cloud/helm/platform/README.md (2)

87-87: Rephrase prometheusEndpoint description at the source (values.yaml).

Since this file is auto-generated by helm-docs, please update the comment in values.yaml (operator chart) to reflect: “Base URL of your Prometheus server. When set, the operator injects PROMETHEUS_ENDPOINT into managed pods; scraping is done via PodMonitors.”

After editing comments, re-run helm-docs to refresh this table.


91-91: Avoid bare URL in generated docs (MD034).

Update the underlying values.yaml description to use a markdown link so helm-docs emits a non-bare URL.

Apply in the platform chart values.yaml (source for this row):

-| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following bitnami announcement for brownout - https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog, we need to use the legacy repository until we migrate to the new "secure" repository |
+| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following Bitnami brownout notice — see [announcement](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog); use the legacy repository until migrating to the new “secure” repository |

Then re-run helm-docs.

deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml (2)

19-19: Make PodMonitor names release-scoped to avoid collisions.

If multiple releases share a namespace, static names can conflict. Include .Release.Name in names.

Apply:

-metadata:
-  name: dynamo-frontend
+metadata:
+  name: {{ printf "dynamo-frontend-%s" .Release.Name | trunc 63 | trimSuffix "-" }}
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+    app.kubernetes.io/managed-by: {{ .Release.Service }}

...

-metadata:
-  name: dynamo-worker
+metadata:
+  name: {{ printf "dynamo-worker-%s" .Release.Name | trunc 63 | trimSuffix "-" }}
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+    app.kubernetes.io/managed-by: {{ .Release.Service }}

...

-metadata:
-  name: dynamo-planner
+metadata:
+  name: {{ printf "dynamo-planner-%s" .Release.Name | trunc 63 | trimSuffix "-" }}
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+    app.kubernetes.io/managed-by: {{ .Release.Service }}

Also applies to: 41-41, 63-63


37-37: Trim trailing spaces and add newline at EOF.

Pre-commit flagged trailing whitespace; also add a final newline.

Apply:

---- 
+---
 ...
---- 
+---
 ...
-{{- end }}
+{{- end }}
+

Also applies to: 59-59, 81-81

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

110-112: Quote the value to avoid YAML/arg parsing pitfalls

If the endpoint contains special chars, whitespace, or colons, quoting is safer.

-          - --prometheus-endpoint={{ .Values.dynamo.metrics.prometheusEndpoint }}
+          - --prometheus-endpoint={{ quote .Values.dynamo.metrics.prometheusEndpoint }}
deploy/cloud/operator/internal/dynamo/graph.go (1)

690-692: Skip merge when there’s nothing to add

Tiny inefficiency: when no standard envs are set, we still sort container.Env. Early-exit to save work.

-    // merge the env vars to allow users to override the standard env vars
-    container.Env = MergeEnvs(standardEnvVars, container.Env)
+    // merge the env vars to allow users to override the standard env vars
+    if len(standardEnvVars) > 0 {
+        container.Env = MergeEnvs(standardEnvVars, container.Env)
+    }
deploy/cloud/helm/platform/values.yaml (1)

114-118: Clarify intent with an example value

Add a concrete example to reduce misconfiguration.

-      # -- Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables
+      # -- Prometheus server base URL to be surfaced to services (e.g., "http://prometheus-server.monitoring.svc:9090").
+      # If set, the operator injects PROMETHEUS_ENDPOINT into managed pods. Service manifests can override it via their env vars.
deploy/cloud/operator/cmd/main.go (1)

135-136: Validate and log --prometheus-endpoint like modelExpressURL

Parity with modelExpressURL helps catch typos early and yields clearer logs.

 var modelExpressURL string
 var prometheusEndpoint string
@@
 flag.StringVar(&prometheusEndpoint, "prometheus-endpoint", "",
 		"URL of the Prometheus endpoint to use for metrics")
@@
 	// Validate modelExpressURL if provided
 	if modelExpressURL != "" {
 		if _, err := url.Parse(modelExpressURL); err != nil {
 			setupLog.Error(err, "invalid model-express-url provided", "url", modelExpressURL)
 			os.Exit(1)
 		}
 		setupLog.Info("Model Express URL configured", "url", modelExpressURL)
 	}
+	// Validate prometheusEndpoint if provided
+	if prometheusEndpoint != "" {
+		if _, err := url.Parse(prometheusEndpoint); err != nil {
+			setupLog.Error(err, "invalid prometheus-endpoint provided", "url", prometheusEndpoint)
+			os.Exit(1)
+		}
+		setupLog.Info("Prometheus endpoint configured", "url", prometheusEndpoint)
+	}
 
 	ctrlConfig := commonController.Config{
@@
-		ModelExpressURL:    modelExpressURL,
-		PrometheusEndpoint: prometheusEndpoint,
+		ModelExpressURL:    modelExpressURL,
+		PrometheusEndpoint: prometheusEndpoint,

Also applies to: 165-167, 202-204

deploy/cloud/operator/internal/dynamo/graph_test.go (1)

1074-1076: Add a test for override precedence of PROMETHEUS_ENDPOINT

Validate that a component’s env overrides the controller-injected value.

Proposed test (new function):

func TestPrometheusEndpointOverridePrecedence(t *testing.T) {
  ctrlCfg := controller_common.Config{ PrometheusEndpoint: "http://cluster-prom:9090" }
  dgd := &v1alpha1.DynamoGraphDeployment{
    ObjectMeta: metav1.ObjectMeta{Name: "t", Namespace: "ns"},
    Spec: v1alpha1.DynamoGraphDeploymentSpec{
      Services: map[string]*v1alpha1.DynamoComponentDeploymentOverridesSpec{
        "Frontend": {
          DynamoComponentDeploymentSharedSpec: v1alpha1.DynamoComponentDeploymentSharedSpec{
            ComponentType: commonconsts.ComponentTypeFrontend,
            Envs: []corev1.EnvVar{{Name: "PROMETHEUS_ENDPOINT", Value: "http://override:9090"}},
          },
        },
      },
    },
  }
  got, err := GenerateGrovePodGangSet(context.Background(), dgd, ctrlCfg, nil)
  if err != nil { t.Fatal(err) }
  envs := got.Spec.Template.Cliques[0].Spec.PodSpec.Containers[0].Env
  found := false
  for _, e := range envs {
    if e.Name == "PROMETHEUS_ENDPOINT" {
      found = true
      if e.Value != "http://override:9090" {
        t.Fatalf("override not applied, got %q", e.Value)
      }
    }
  }
  if !found { t.Fatalf("PROMETHEUS_ENDPOINT not found") }
}
docs/guides/dynamo_deploy/metrics.md (3)

21-24: Confirm Helm flags and quoting; add quoting for the endpoint.

Please verify the chart value key is exactly prometheusEndpoint and not nested (e.g., prometheus.endpoint) in values.yaml. Also, quote the URL to avoid shell parsing edge cases.

-  --set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+  --set prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"

30-37: Operator install note is good; ensure the link path and namespace are correct.

  • The relative link likely resolves, but using ./installation_guide.md is simpler/clearer.
  • Consider showing -n <namespace> explicitly on the Helm install line to match later usage of $NAMESPACE.
-Follow our [Installation Guide](../dynamo_deploy/installation_guide.md)
+Follow our [Installation Guide](./installation_guide.md)

96-99: Pluralize and tighten wording; verify label/annotation keys match operator code.

Grammar fix plus clearer scoping; and please confirm the exact keys and values are what the operator applies/selects on.

-The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates PodMonitor resource and adds these labels to all pods:
+The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates PodMonitor resources and adds the following labels to all Dynamo-managed pods:

Follow-up checks to run after deploying one workload:

  • kubectl get pod -n "$NAMESPACE" -l nvidia.com/metrics-enabled=true -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.nvidia.com/dynamo-component-type}{"\n"}{end}'
  • kubectl get podmonitor -n "$NAMESPACE"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between cb5a657 and 059e589.

📒 Files selected for processing (11)
  • deploy/cloud/helm/platform/README.md (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/values.yaml (1 hunks)
  • deploy/cloud/helm/platform/values.yaml (2 hunks)
  • deploy/cloud/operator/cmd/main.go (3 hunks)
  • deploy/cloud/operator/internal/controller_common/predicate.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (3 hunks)
  • docs/guides/dynamo_deploy/installation_guide.md (1 hunks)
  • docs/guides/dynamo_deploy/metrics.md (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-03T21:02:02.170Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2843
File: deploy/cloud/helm/platform/values.yaml:0-0
Timestamp: 2025-09-03T21:02:02.170Z
Learning: With etcd chart version 12.0.18, the global.security.allowInsecureImages setting is not required when using the bitnamilegacy repository, unlike older chart versions.

Applied to files:

  • deploy/cloud/helm/platform/values.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml

[error] 15-15: syntax error: expected the node content, but found '-'

(syntax)


[error] 37-37: trailing spaces

(trailing-spaces)


[error] 59-59: trailing spaces

(trailing-spaces)


[error] 81-81: no new line character at the end of file

(new-line-at-end-of-file)

deploy/cloud/helm/platform/components/operator/values.yaml

[error] 98-98: trailing spaces

(trailing-spaces)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2912/merge) by julienmancuso.
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml

[error] 1-1: Trailing whitespace detected by pre-commit 'trailing-whitespace' and file modified.

deploy/cloud/helm/platform/components/operator/values.yaml

[error] 1-1: Trailing whitespace detected by pre-commit 'trailing-whitespace' and file modified.

🪛 markdownlint-cli2 (0.17.2)
deploy/cloud/helm/platform/README.md

91-91: Bare URL used

(MD034, no-bare-urls)

🪛 LanguageTool
docs/guides/dynamo_deploy/metrics.md

[grammar] ~96-~96: There might be a mistake here.
Context: ...very, the Dynamo operator automatically creates PodMonitor resource and adds these labe...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
docs/guides/dynamo_deploy/installation_guide.md (1)

67-67: Nice: single-command install with namespace creation.

Using --create-namespace streamlines Path A. No further changes needed here.

deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml (2)

21-28: Namespace selector logic looks good.

The any: true vs matchNames gating aligns with .Values.namespaceRestriction.enabled.

Also applies to: 43-50, 65-72


15-15: FYI: YAMLlint “syntax error” here is a false positive.

The {{- if ... }} Helm directive confuses plain YAML linters; rendering is valid. No change needed.

deploy/cloud/operator/internal/controller_common/predicate.go (1)

61-63: LGTM — end-to-end wiring verified. CLI flag, config field, main assignment, controller injection, and tests are all updated accordingly.

deploy/cloud/operator/internal/dynamo/graph.go (1)

663-692: Good merge strategy for envs; user-defined wins and order is deterministic

The switch to MergeEnvs with a standard list is correct and predictable. LGTM.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso enabled auto-merge (squash) September 8, 2025 23:51
@julienmancuso julienmancuso merged commit 729b5fd into main Sep 9, 2025
13 of 15 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-374 branch September 9, 2025 00:14
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
tedzhouhk pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants