Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kube-prometheus-stack] Lost Prometheus targets after upgrade to version 63 #4878

Closed
kallaics opened this issue Sep 27, 2024 · 23 comments · Fixed by #4883
Closed

[kube-prometheus-stack] Lost Prometheus targets after upgrade to version 63 #4878

kallaics opened this issue Sep 27, 2024 · 23 comments · Fixed by #4883
Labels
bug Something isn't working

Comments

@kallaics
Copy link

Describe the bug a clear and concise description of what the bug is.

We have PodMonitor, ServiceMonitor and ScrapeConfig resources and after the upgrade these resources are completely disappeared from the targets list in Prometheus and not collect the metrics. I spent a morning to find any relevant change in the release notes, but I don't found anything. The only solution was the downgrade.
All affected resources are included through Selector.matchLabels

Worked well on version: 62.7.0
Not worked on version: 63.0.0 and 63.0.1

Prometheus configuration

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/whitelist-source-range: <ip ranges>
    labels: {}
    hosts:
      - prometheus.<domain>.<tld>
    paths:
      - /
    pathType: ImplementationSpecific
    tls: []
  prometheusSpec:
    prometheusExternalLabelNameClear: true
    replicaExternalLabelNameClear: true
    podMonitorSelector:
      matchLabels:
        app.kubernetes.io/part-of: flux
    serviceMonitorSelector:
      matchLabels:
        monitoring: prometheus
    scrapeConfigSelector:
      matchLabels:
        monitoring: prometheus
    secrets:
      - ca-certs
    additionalArgs:
      - name: "web.enable-admin-api"
    resources:  
      requests:
        memory: 6Gi

Scrape configuration (I know it is alpha status, so we are testing on non-production environment only)

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: <Kubernetes cluster name>
  namespace: monitoring
  labels:
    monitoring: prometheus-scrape
spec:
  scrapeInterval: 15s
  scrapeTimeout: 12s
  honorLabels: true
  metricsPath: '/federate'
  scheme: HTTPS
  tlsConfig:
    ca:
      secret:
        key: internal-ca.pem
        name: ca-certs
  basicAuth:
    username:
      key: username
      name: monitoring-federate-auth
    password:
      key: password
      name: monitoring-federate-auth
  params:
    'match[]':
      - '{__name__=~".+"}'
  staticConfigs:
    - labels:
        job: <Kubernetes cluster name>
        env: dev
        cluster_name: <Kubernetes cluster name>
      targets:
        - 'prometheus.<domain2>.<tld>'

Thanks guys the help.

What's your helm version?

version.BuildInfo{Version:"v3.16.1", GitCommit:"5a5449dc42be07001fd5771d56429132984ab3ab", GitTreeState:"dirty", GoVersion:"go1.23.1"}

What's your kubectl version?

Client Version: v1.31.1 Kustomize Version: v5.4.2 Server Version: v1.29.7

Which chart?

kube-prometheus-stack

What's the chart version?

63.0.0, 63.1.0

What happened?

After Helm chart upgrade the most of the target disappeared from Prometheus.
Affected resources:

  • PodMonitor
  • ServiceMonitor
  • ScrapeConfig

What you expected to happen?

Not disappear the targets from Prometheus.

How to reproduce it?

Install Prometheus from kube-prometheus-stack version 62.7.0
Install PodMonitor resource in different namespace from the Prometheus
Install ServiceMonitor, ScrapeConfig resources in same namespace with Prometheus
After all target works well, upgrade the cluster to 63.0.0 or 63.1.0

Enter the changed values of values.yaml?

None

Enter the command that you execute and failing/misfunctioning.

helm install prometheus prometheus-community/kube-prometheus-stack --values values.yaml

Anything else we need to know?

The Deployment happened via FluxCD v 2.3.0 (latest stable).
Full Helm chart reinstall from scratch not solved the issue.
All CRDs are on latest version for the Helm chart.

@kallaics kallaics added the bug Something isn't working label Sep 27, 2024
@kallaics kallaics changed the title [kube-prometheus-stack] Lost scrapes after upgrade to version 63 [kube-prometheus-stack] Lost Prometheus targets after upgrade to version 63 Sep 27, 2024
@romankor
Copy link

Same here , i had to revert !

@fandujar
Copy link

@kallaics did you check the matching label? The default values now set release: release_name as default label and I had to insert this label on all service monitors

@kallaics
Copy link
Author

kallaics commented Sep 27, 2024

You are right, I forgot to add this step to the reproducing steps. The custom matching label is a difference from the basic settings.

I double checked the matching labels. My custom matching labels were added to the resources.

Helm chart configuration not changed during the upgrade on my side, and after downgrade everything was okay again.

@jkroepke
Copy link
Member

Instead:

    scrapeConfigSelector:
      matchLabels:
        monitoring: prometheus

use

    scrapeConfigSelector:
      matchLabels:
        release: null
        monitoring: prometheus

Ref:

## To remove the release label from the matchLabels condition, explicit set release to null.

@yellowhat
Copy link
Contributor

yellowhat commented Sep 29, 2024

I would like to remove the remove the default matchLabels.

The following works as expected:

helm template foo kube-prometheus-stack \
    --repo https://prometheus-community.github.io/helm-charts \
    >z0

cat >values.yaml <<EOF
prometheus:
  prometheusSpec:
    ruleSelector:
      matchLabels: null
    serviceMonitorSelector:
      matchLabels: null
EOF

helm template foo kube-prometheus-stack \
    --repo https://prometheus-community.github.io/helm-charts \
    --values values.yaml \
    >z1
$ diff z0 z1
--- z0
+++ z1
@@ -2273,8 +2273,7 @@
   routePrefix: "/"
   serviceAccountName: foo-kube-prometheus-stack-prometheus
   serviceMonitorSelector: 
-    matchLabels:
-      release: 'foo'
+    {}
   serviceMonitorNamespaceSelector: 
     {}
   podMonitorSelector: 
@@ -2297,8 +2296,7 @@
   ruleNamespaceSelector: 
     {}
   ruleSelector: 
-    matchLabels:
-      release: 'foo'
+    {}
   scrapeConfigSelector: 
     matchLabels:
       release: 'foo'

But if I use it as a subchart:

cat >Chart.yaml <<EOF
---
apiVersion: v2
name: system
type: application
version: 0.0.0
appVersion: 0.0.0
dependencies:
  - name: kube-prometheus-stack
    version: 63.1.0
    repository: https://prometheus-community.github.io/helm-charts
EOF

cat >values.yaml <<EOF
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: null
      serviceMonitorSelector:
        matchLabels: null
EOF

helm template foo . \
    --values values.yaml \
    >z1
$ diff z0 z1 | grep matchLabels
<empty>

in other words the matchLabels: null is skipped.

Instead:

cat >values.yaml <<EOF
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: a
      serviceMonitorSelector:
        matchLabels: a
EOF

helm template foo . \
    --values values.yaml \
    >z1
$ diff z0 z1 | grep matchLabels -A2
-    matchLabels:
-      release: 'foo'
+    matchLabels: a
   serviceMonitorNamespaceSelector: 
     {}
--
-    matchLabels:
-      release: 'foo'
+    matchLabels: a
   scrapeConfigSelector: 
     matchLabels:
       release: 'foo'

Maybe due to helm/helm#12637?

@jkroepke jkroepke pinned this issue Sep 29, 2024
@jkroepke
Copy link
Member

@yellowhat Correct, We are aware that there is an helm issue which add an incompatibility with sub-charts. ref: #4818

Could you try this:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalConfig:
        scrapeConfigSelector:
          matchLabels:
            monitoring: prometheus

helm template may looks goofy, but due how duplicate keys on YAML works, the config from additionalConfig should be prefer.

Mention that this is maybe a short-term workaround. We are looking forward into a solution.

@kallaics
Copy link
Author

Hi guys,

I never thought, during the weekend will somebody find the root cause and can provide a workaround as well. Big high five for every participants!

I agree with @yellowhat recommendation to apply the changes. I have no clue, how much user use the release label, but the idea was basically good and useful for Helm chart validation.

@jkroepke It is good catch, but it will cause probably "major" change. I think this also would be great to move these items to there. We will see it can be there in the next version or just the next major release.

@yellowhat
Copy link
Contributor

@yellowhat Correct, We are aware that there is an helm issue which add an incompatibility with sub-charts. ref: #4818

Could you try this:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalConfig:
        scrapeConfigSelector:
          matchLabels:
            monitoring: prometheus

helm template may looks goofy, but due how duplicate keys on YAML works, the config from additionalConfig should be prefer.

Mention that this is maybe a short-term workaround. We are looking forward into a solution.

From what I can see, the default value is already empty:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalConfig: {}

How can I remove the matchLabels options?

@jkroepke
Copy link
Member

If scrapeConfigSelector is configured via additionalConfig , it should fully override the existing scrapeConfigSelector

@rouke-broersma
Copy link
Contributor

rouke-broersma commented Sep 30, 2024

@jkroepke the helm issue with null values to subcharts that would hit in this case is afaik already fixed, and I was able to upgrade to this new version without any changes to my deployment:

https://github.com/broersma-forslund/homelab/pull/570/files
See diff in this comment: broersma-forslund/homelab#570 (comment)

I suspect users with issues are using older versions of helm.

@yellowhat
Copy link
Contributor

yellowhat commented Sep 30, 2024

May I ask which version are you using?

I am on 3.16.1.

@rouke-broersma
Copy link
Contributor

May I ask which version are you using?

I am on 3.16.1.

Mhm weird, I'm on v3.15.2. But it's clearly working for my scenario.

@jkroepke
Copy link
Member

jkroepke commented Sep 30, 2024

It works only, if it's placed in the default values.yaml of the umbrella chart. I see in your diff, thats the case.

If the override comes from additional values, like via -f, the issue still occurs.

@BadLiveware
Copy link

As far as I know helm does not support overriding a non-null value with a null value in a subchart, e.g.

name: monitoring
version: "0.0.1"
appVersion: "0.1.0"
description: Monitoring platform
type: application
apiVersion: v2
dependencies:
  - name: kube-prometheus-stack
    version: 63.1.0
    repository: https://prometheus-community.github.io/helm-charts
kube-prometheus-stack:
  ## Manages Prometheus and Alertmanager components
  ##
  prometheusOperator:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: null # These nulls can never override a non-null value in values.yaml in https://prometheus-community.github.io/helm-charts
      serviceMonitorSelector:
        matchLabels: null 
      podMonitorSelector:
        matchLabels: null
      probeSelector:
        matchLabels: null
      scrapeConfigSelector:
        matchLabels: null

so with this setup it is impossible to override the value in e.g. https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3708-L3710

See various helm issues helm/helm#12741 helm/helm#13222 helm/helm#12511

@rouke-broersma
Copy link
Contributor

As far as I know helm does not support overriding a non-null value with a null value in a subchart, e.g.

As explained above this does work as overrides in the parent chart default values but apparently not in values overrides provided as separate values to the parent chart.

@jkroepke
Copy link
Member

Hi all,

this morning, I discussed with @QuentinBisson possible option to recover from the failed v63 release. Both proposals will result into an other breaking change. But it is what it is.

Proposal 1 - Switch to plain strings (Vote with 🎉 )

All Selectors become plain strings.

Instead:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeConfigSelector:
        matchLabels:
          monitoring: prometheus

This format will be used:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeConfigSelector: |-
        matchLabels:
          monitoring: prometheus

This would avoid the required to set values to zero while giving the full flexibility to fill values. An empty string mean null in that case. If set, the default value "release: {{ $.Release.Name }}" would be overwritten without setting anything else to null.

The plain string would be passed through tpl as well.

On template site, it will be changes from

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector: {{ tpl (toYaml .Values.prometheus.prometheusSpec.ruleNamespaceSelector) . | nindent 4 }}
{{- end }}

to

{{- with .Values.prometheus.prometheusSpec.ruleNamespaceSelector }}
  ruleNamespaceSelector: {{ tpl . $ | nindent 4 }}
{{- end }}

Proposal 2 - Revert to old values (Vote with 🚀 )

Basically revert #4818 without any future improvements.

@kallaics
Copy link
Author

How it looks like to introduce a variable to apply the default label (release: {{ .Release.Name }} ) ?
It is my proposal. It is not breaking the simple yaml structure and probably better to read it.

In my case will ignore therelease label:

Option A:

Added new variable to ruleconfigSelector level.

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleConfigSelector:
        includeReleaseLabel: false    # Default: true and will apply the 'release' label, when matchLabels is null, will ignored
        matchLabels:                         # Allow to configure more additional label too with or without release label.   
          monitoring: prometheus

The template:
This logic can be implemented in _helpers.tpl and just include it.

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector: 
    {{- if .Values.prometheus.prometheusSpec.ruleNamespaceSelector.includeReleaseLabel) }}
    release: {{ .Release.Name }}
  {{- end }}
  {{- if .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabel) }}
  {{ toYaml .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabel . | nindent 4 }}
  {{- end }}
{{- end }}

Option B:

the ruleConfigSelector part will untouched and new variable added to the PrometheusSpec level.

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleConfigWithReleaseLabel: false    # Default: true and will apply the 'release' label, when matchLabels is null, will ignored
      ruleConfigSelector:
        matchLabels:                         # Allow to configure more additional label too with or without release label.   
          monitoring: prometheus

The template:
This logic can be implemented in _helpers.tpl and just include it.

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector: 
    {{- if .Values.prometheus.prometheusSpec.ruleConfigWithReleaseLabel) }}
    release: {{ .Release.Name }}
  {{- end }}
  {{- if .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabel) }}
  {{ toYaml .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabel . | nindent 4 }}
  {{- end }}
{{- end }}

@tobimichael96
Copy link

tobimichael96 commented Sep 30, 2024

As far as I know helm does not support overriding a non-null value with a null value in a subchart, e.g.

name: monitoring
version: "0.0.1"
appVersion: "0.1.0"
description: Monitoring platform
type: application
apiVersion: v2
dependencies:
  - name: kube-prometheus-stack
    version: 63.1.0
    repository: https://prometheus-community.github.io/helm-charts
kube-prometheus-stack:
  ## Manages Prometheus and Alertmanager components
  ##
  prometheusOperator:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: null # These nulls can never override a non-null value in values.yaml in https://prometheus-community.github.io/helm-charts
      serviceMonitorSelector:
        matchLabels: null 
      podMonitorSelector:
        matchLabels: null
      probeSelector:
        matchLabels: null
      scrapeConfigSelector:
        matchLabels: null

so with this setup it is impossible to override the value in e.g. https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3708-L3710

See various helm issues helm/helm#12741 helm/helm#13222 helm/helm#12511

Same problem here. So basically this comment here does not work for us:

## To remove matchLabels from the selector condition, explicitly set matchLabels to null.

@rouke-broersma
Copy link
Contributor

Instead of having to set it to empty string or null I would much prefer something more functional like for example being able to set it to ALL to match all labels or having a boolean property matchAllLabels or something similar. scrapeConfigSelectorNilUsesHelmValues was always a confusing property and setting a selector to "" does not make it clear that this means match ALL instead of match NONE.

It doesn't help that the prometheus operator crd itself is also unusual in that it supports three states (nil, empty and filled). I don't think the current kps chart supports all three states, does it?

@mrBlackhat
Copy link

Same, using special label to discover the targets, had to revert back to 62 cuz everything dissapeared 😃

@kallaics
Copy link
Author

kallaics commented Oct 2, 2024

@rouke-broersma What do you mean about my proposal above? It can keep both option. Only issue you cannot use in "OR" relation the two settings (to keep the default release labels and I want to scrape my custom resources with different label(s) ).

@rouke-broersma
Copy link
Contributor

@rouke-broersma What do you mean about my proposal above? It can keep both option. Only issue you cannot use in "OR" relation the two settings (to keep the default release labels and I want to scrape my custom resources with different label(s) ).

Your proposal does not functionally exactly match the actual scenarios. The actual scenario is not whether or not the release label is included. The actual scenario is one of these three:

  • Match NO Resources (nil value in prometheus-operator)
  • Match subset of resources based on labels (for example default release label, or custom label)
  • Match ALL Resources (empty array/object in prometheus-operator)

@jkroepke
Copy link
Member

jkroepke commented Oct 2, 2024

We decided to perform a rollback #4883

My motivation for using some of the 'cool' features is quite low because there may be other unknown bugs in the Helm ecosystem. I want to avoid another failed release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
9 participants