[prometheus-kube-stack] New cAdvisorMetricRelabelings are a bit too restrictive by default #2279

Closed
dotdc opened this issue Jul 15, 2022 · 7 comments
Labels: bug (Something isn't working)

dotdc (Member) commented Jul 15, 2022

Describe the bug

Hi,

I think two of the new rules introduced in 37.0.0 by @SuperQ are a bit too restrictive to be enabled by default.

Rules concerned:

      # Drop cgroup metrics with no pod.
      - sourceLabels: [id, pod]
        action: drop
        regex: '.+;'
      # Drop cgroup metrics with no container.
      - sourceLabels: [id, container]
        action: drop
        regex: '.+;'

These two rules drop every metric without a pod and/or a container.
This makes some basic queries like container_cpu_usage_seconds_total{id="/"} stop working out of the box.
I would at least comment these two out to limit the number of impacted users.
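
For illustration, a minimal values.yaml sketch (assuming the kubelet.serviceMonitor.cAdvisorMetricRelabelings value exposed by recent kube-prometheus-stack releases) of how a user could opt out of the new drop rules:

```yaml
kubelet:
  serviceMonitor:
    # Overriding this value replaces the chart's default relabeling list, so
    # list only the rules you want to keep. An empty list disables them all.
    cAdvisorMetricRelabelings: []
```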

What do you think?

What's your helm version?

3.9.0

What's your kubectl version?

1.24.2

Which chart?

prometheus-kube-stack

What's the chart version?

37.2.0

What happened?

No response

What did you expect to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you executed that is failing/misfunctioning.

container_cpu_usage_seconds_total{id="/"}

Anything else we need to know?

No response

SuperQ (Contributor) commented Jul 16, 2022

Adding my comments from the PR:

These are mostly generated by systemd cgroups, which can be pretty noisy for Kubernetes users. This is why I added them as the default.

The goal here is to monitor Kubernetes resources.

Also, I'm not sure what the use of monitoring the CPU of the root cgroup is, when node_cpu_seconds_total covers that.
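
As a hedged sketch of that substitution (exact label matchers depend on the setup), cluster-wide CPU usage can come from the node exporter instead of the root cgroup series:

```promql
# Root-cgroup query that the new default rules break:
sum(rate(container_cpu_usage_seconds_total{id="/"}[5m]))

# Roughly equivalent cluster usage from the node exporter:
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```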

dotdc (Member, Author) commented Jul 16, 2022

Good point. I can't remember why I didn't use node_cpu_seconds_total to calculate global cluster usage; I'll check again and update this.

BeckYeh commented Jul 18, 2022

Because this metric is used by prometheus-adapter for the node resource query (kubectl top nodes), I hope this filter won't be enabled by default.
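
For context, a hedged sketch (assuming prometheus-adapter's resource rules config format; the exact query adopted upstream may differ) of pointing the adapter's node CPU query at node_cpu_seconds_total so it no longer depends on the dropped root-cgroup series:

```yaml
resourceRules:
  cpu:
    # Node CPU from the node exporter rather than container_cpu_usage_seconds_total{id="/"}.
    # The node label on node_cpu_seconds_total must match what the adapter expects,
    # which may require relabeling on the node-exporter scrape.
    nodeQuery: |
      sum by (<<.GroupBy>>) (
        rate(node_cpu_seconds_total{mode!="idle",<<.LabelMatchers>>}[4m])
      )
```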

SuperQ (Contributor) commented Jul 18, 2022

Ooof, the prometheus-adapter seems to have a bunch of incorrect queries built in. Those should be fixed.

SuperQ (Contributor) commented Jul 18, 2022

kubernetes-sigs/prometheus-adapter#516 should fix up the prometheus-adapter bug.

dotdc (Member, Author) commented Jul 18, 2022

Thank you @SuperQ, I updated all my affected queries to use node_cpu_seconds_total instead of container_cpu_usage_seconds_total, and LGTM.

Can I close the issue or should we wait to see if other related charts and projects are using similar queries?

SuperQ (Contributor) commented Jul 19, 2022

I think we can close it. Other projects can find this issue for advice on how to fix their query use.

dotdc closed this as completed Jul 19, 2022