-
Notifications
You must be signed in to change notification settings - Fork 690
feat: automatically setup and inject prometheus configuration #2912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
WalkthroughAdds Prometheus endpoint wiring across Helm and operator: new Helm values, operator flag/Config field, env var injection into managed pods, and conditional PodMonitor templates. Updates docs to reflect operator-managed PodMonitors and streamlined installation, and adjusts etcd image repo notes. Tests updated to assert PROMETHEUS_ENDPOINT propagation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor U as User
participant H as Helm Chart
participant K as Kubernetes
participant OD as Operator Deployment
participant OP as Operator (Main)
participant RC as Reconciler/Controllers
participant S as Managed Service Pods
U->>H: helm install --set dynamo-operator.metrics.prometheusEndpoint=...
H->>K: Render Deploy + PodMonitors (if CRD) and apply
K-->>OD: Create/Update manager Pod with arg --prometheus-endpoint
OD->>OP: Start with provided flag
OP->>RC: Build controller Config{ PrometheusEndpoint }
RC->>S: Generate PodSpecs with env PROMETHEUS_ENDPOINT
Note over S: Service env can override via Deployment envs
sequenceDiagram
autonumber
participant OP as Operator
participant T as Templates (prometheus.yaml)
participant K as Kubernetes API
participant P as Prometheus
participant G as Grafana
OP->>T: On install/upgrade
T->>K: Create PodMonitor(frontend/worker/planner) if CRD present
K->>P: PodMonitors discover /metrics on labeled pods
P-->>G: Expose metrics for dashboards
Note over K,P: Namespace selection per values (restricted or any)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (13)
deploy/cloud/helm/platform/components/operator/values.yaml (1)
99-101: Clarify documentation for prometheusEndpoint (feeds helm-docs).The README row is generated from these comments; tighten wording here to avoid confusion about “retrieve metrics”.
Apply:
- metrics: - prometheusEndpoint: "" + # metrics configuration + metrics: + # Base URL of your Prometheus server. When set, the operator injects the + # PROMETHEUS_ENDPOINT env var into managed pods. Scraping is handled by + # PodMonitors; workloads do not need to contact Prometheus unless they + # explicitly use this env var. + prometheusEndpoint: ""deploy/cloud/helm/platform/README.md (2)
87-87: Rephrase prometheusEndpoint description at the source (values.yaml).Since this file is auto-generated by helm-docs, please update the comment in values.yaml (operator chart) to reflect: “Base URL of your Prometheus server. When set, the operator injects PROMETHEUS_ENDPOINT into managed pods; scraping is done via PodMonitors.”
After editing comments, re-run helm-docs to refresh this table.
91-91: Avoid bare URL in generated docs (MD034).Update the underlying values.yaml description to use a markdown link so helm-docs emits a non-bare URL.
Apply in the platform chart values.yaml (source for this row):
-| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following bitnami announcement for brownout - https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog, we need to use the legacy repository until we migrate to the new "secure" repository | +| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following Bitnami brownout notice — see [announcement](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog); use the legacy repository until migrating to the new “secure” repository |Then re-run helm-docs.
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml (2)
19-19: Make PodMonitor names release-scoped to avoid collisions.If multiple releases share a namespace, static names can conflict. Include .Release.Name in names.
Apply:
-metadata: - name: dynamo-frontend +metadata: + name: {{ printf "dynamo-frontend-%s" .Release.Name | trunc 63 | trimSuffix "-" }} + labels: + app.kubernetes.io/instance: {{ .Release.Name }} + app.kubernetes.io/managed-by: {{ .Release.Service }} ... -metadata: - name: dynamo-worker +metadata: + name: {{ printf "dynamo-worker-%s" .Release.Name | trunc 63 | trimSuffix "-" }} + labels: + app.kubernetes.io/instance: {{ .Release.Name }} + app.kubernetes.io/managed-by: {{ .Release.Service }} ... -metadata: - name: dynamo-planner +metadata: + name: {{ printf "dynamo-planner-%s" .Release.Name | trunc 63 | trimSuffix "-" }} + labels: + app.kubernetes.io/instance: {{ .Release.Name }} + app.kubernetes.io/managed-by: {{ .Release.Service }}Also applies to: 41-41, 63-63
37-37: Trim trailing spaces and add newline at EOF.Pre-commit flagged trailing whitespace; also add a final newline.
Apply:
---- +--- ... ---- +--- ... -{{- end }} +{{- end }} +Also applies to: 59-59, 81-81
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)
110-112: Quote the value to avoid YAML/arg parsing pitfallsIf the endpoint contains special chars, whitespace, or colons, quoting is safer.
- - --prometheus-endpoint={{ .Values.dynamo.metrics.prometheusEndpoint }} + - --prometheus-endpoint={{ quote .Values.dynamo.metrics.prometheusEndpoint }}deploy/cloud/operator/internal/dynamo/graph.go (1)
690-692: Skip merge when there’s nothing to addTiny inefficiency: when no standard envs are set, we still sort container.Env. Early-exit to save work.
- // merge the env vars to allow users to override the standard env vars - container.Env = MergeEnvs(standardEnvVars, container.Env) + // merge the env vars to allow users to override the standard env vars + if len(standardEnvVars) > 0 { + container.Env = MergeEnvs(standardEnvVars, container.Env) + }deploy/cloud/helm/platform/values.yaml (1)
114-118: Clarify intent with an example valueAdd a concrete example to reduce misconfiguration.
- # -- Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables + # -- Prometheus server base URL to be surfaced to services (e.g., "http://prometheus-server.monitoring.svc:9090"). + # If set, the operator injects PROMETHEUS_ENDPOINT into managed pods. Service manifests can override it via their env vars.deploy/cloud/operator/cmd/main.go (1)
135-136: Validate and log --prometheus-endpoint like modelExpressURLParity with modelExpressURL helps catch typos early and yields clearer logs.
var modelExpressURL string var prometheusEndpoint string @@ flag.StringVar(&prometheusEndpoint, "prometheus-endpoint", "", "URL of the Prometheus endpoint to use for metrics") @@ // Validate modelExpressURL if provided if modelExpressURL != "" { if _, err := url.Parse(modelExpressURL); err != nil { setupLog.Error(err, "invalid model-express-url provided", "url", modelExpressURL) os.Exit(1) } setupLog.Info("Model Express URL configured", "url", modelExpressURL) } + // Validate prometheusEndpoint if provided + if prometheusEndpoint != "" { + if _, err := url.Parse(prometheusEndpoint); err != nil { + setupLog.Error(err, "invalid prometheus-endpoint provided", "url", prometheusEndpoint) + os.Exit(1) + } + setupLog.Info("Prometheus endpoint configured", "url", prometheusEndpoint) + } ctrlConfig := commonController.Config{ @@ - ModelExpressURL: modelExpressURL, - PrometheusEndpoint: prometheusEndpoint, + ModelExpressURL: modelExpressURL, + PrometheusEndpoint: prometheusEndpoint,Also applies to: 165-167, 202-204
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
1074-1076: Add a test for override precedence of PROMETHEUS_ENDPOINTValidate that a component’s env overrides the controller-injected value.
Proposed test (new function):
func TestPrometheusEndpointOverridePrecedence(t *testing.T) { ctrlCfg := controller_common.Config{ PrometheusEndpoint: "http://cluster-prom:9090" } dgd := &v1alpha1.DynamoGraphDeployment{ ObjectMeta: metav1.ObjectMeta{Name: "t", Namespace: "ns"}, Spec: v1alpha1.DynamoGraphDeploymentSpec{ Services: map[string]*v1alpha1.DynamoComponentDeploymentOverridesSpec{ "Frontend": { DynamoComponentDeploymentSharedSpec: v1alpha1.DynamoComponentDeploymentSharedSpec{ ComponentType: commonconsts.ComponentTypeFrontend, Envs: []corev1.EnvVar{{Name: "PROMETHEUS_ENDPOINT", Value: "http://override:9090"}}, }, }, }, }, } got, err := GenerateGrovePodGangSet(context.Background(), dgd, ctrlCfg, nil) if err != nil { t.Fatal(err) } envs := got.Spec.Template.Cliques[0].Spec.PodSpec.Containers[0].Env found := false for _, e := range envs { if e.Name == "PROMETHEUS_ENDPOINT" { found = true if e.Value != "http://override:9090" { t.Fatalf("override not applied, got %q", e.Value) } } } if !found { t.Fatalf("PROMETHEUS_ENDPOINT not found") } }docs/guides/dynamo_deploy/metrics.md (3)
21-24: Confirm Helm flags and quoting; add quoting for the endpoint.Please verify the chart value key is exactly
prometheusEndpointand not nested (e.g.,prometheus.endpoint) in values.yaml. Also, quote the URL to avoid shell parsing edge cases.- --set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090 + --set prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
30-37: Operator install note is good; ensure the link path and namespace are correct.
- The relative link likely resolves, but using
./installation_guide.mdis simpler/clearer.- Consider showing
-n <namespace>explicitly on the Helm install line to match later usage of$NAMESPACE.-Follow our [Installation Guide](../dynamo_deploy/installation_guide.md) +Follow our [Installation Guide](./installation_guide.md)
96-99: Pluralize and tighten wording; verify label/annotation keys match operator code.Grammar fix plus clearer scoping; and please confirm the exact keys and values are what the operator applies/selects on.
-The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates PodMonitor resource and adds these labels to all pods: +The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates PodMonitor resources and adds the following labels to all Dynamo-managed pods:Follow-up checks to run after deploying one workload:
- kubectl get pod -n "$NAMESPACE" -l nvidia.com/metrics-enabled=true -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.nvidia.com/dynamo-component-type}{"\n"}{end}'
- kubectl get podmonitor -n "$NAMESPACE"
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
deploy/cloud/helm/platform/README.md(1 hunks)deploy/cloud/helm/platform/components/operator/templates/deployment.yaml(1 hunks)deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml(1 hunks)deploy/cloud/helm/platform/components/operator/values.yaml(1 hunks)deploy/cloud/helm/platform/values.yaml(2 hunks)deploy/cloud/operator/cmd/main.go(3 hunks)deploy/cloud/operator/internal/controller_common/predicate.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph_test.go(3 hunks)docs/guides/dynamo_deploy/installation_guide.md(1 hunks)docs/guides/dynamo_deploy/metrics.md(3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-03T21:02:02.170Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2843
File: deploy/cloud/helm/platform/values.yaml:0-0
Timestamp: 2025-09-03T21:02:02.170Z
Learning: With etcd chart version 12.0.18, the global.security.allowInsecureImages setting is not required when using the bitnamilegacy repository, unlike older chart versions.
Applied to files:
deploy/cloud/helm/platform/values.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml
[error] 15-15: syntax error: expected the node content, but found '-'
(syntax)
[error] 37-37: trailing spaces
(trailing-spaces)
[error] 59-59: trailing spaces
(trailing-spaces)
[error] 81-81: no new line character at the end of file
(new-line-at-end-of-file)
deploy/cloud/helm/platform/components/operator/values.yaml
[error] 98-98: trailing spaces
(trailing-spaces)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2912/merge) by julienmancuso.
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml
[error] 1-1: Trailing whitespace detected by pre-commit 'trailing-whitespace' and file modified.
deploy/cloud/helm/platform/components/operator/values.yaml
[error] 1-1: Trailing whitespace detected by pre-commit 'trailing-whitespace' and file modified.
🪛 markdownlint-cli2 (0.17.2)
deploy/cloud/helm/platform/README.md
91-91: Bare URL used
(MD034, no-bare-urls)
🪛 LanguageTool
docs/guides/dynamo_deploy/metrics.md
[grammar] ~96-~96: There might be a mistake here.
Context: ...very, the Dynamo operator automatically creates PodMonitor resource and adds these labe...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
docs/guides/dynamo_deploy/installation_guide.md (1)
67-67: Nice: single-command install with namespace creation.Using --create-namespace streamlines Path A. No further changes needed here.
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml (2)
21-28: Namespace selector logic looks good.The any: true vs matchNames gating aligns with .Values.namespaceRestriction.enabled.
Also applies to: 43-50, 65-72
15-15: FYI: YAMLlint “syntax error” here is a false positive.The {{- if ... }} Helm directive confuses plain YAML linters; rendering is valid. No change needed.
deploy/cloud/operator/internal/controller_common/predicate.go (1)
61-63: LGTM — end-to-end wiring verified. CLI flag, config field, main assignment, controller injection, and tests are all updated accordingly.deploy/cloud/operator/internal/dynamo/graph.go (1)
663-692: Good merge strategy for envs; user-defined wins and order is deterministicThe switch to MergeEnvs with a standard list is correct and predictable. LGTM.
deploy/cloud/helm/platform/components/operator/templates/prometheus.yaml
Show resolved
Hide resolved
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
b9a205d to
d009162
Compare
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com> Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Overview:
automatically setup and inject prometheus configuration
Summary by CodeRabbit
New Features
Documentation
Chores