Bug 1814840: node-exporter: add systemd enable restart metrics #704
Conversation
Branch force-pushed from 67cfc30 to cf73eb1
weird... an automatic approved label?
Branch force-pushed from cf73eb1 to 7168ad1
This PR looks correct. Not sure why the generate CI failed. /assign @lilic
no concern once green
/unassign @lilic |
/assign @paulfantom |
The generate job failed because you need to pull the latest image or rebase the PR:
Your PR needs a rebase and you need to rerun the generate command. But I'm curious: how many metrics and series does this collector add? We are working hard on removing unnecessary collectors, so I'd like to know whether we need all of these metrics or whether we can drop some from the collector.
and commit all changed files.
@lilic this adds one metric per slice. This is a useful metric for knowing whether services are restarting repeatedly. A good example would be the crio or kubelet service restarting over and over again, or a pod that is crashing multiple times.
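For illustration only (not part of this PR), a hedged sketch of how such a restart counter could back an alert on kubelet or crio restart loops. It assumes the collector exposes the counter as node_systemd_service_restart_total with a name label; the exact metric name and labels come from node_exporter, not from this sketch.

```yaml
# Hypothetical PrometheusRule fragment; metric name and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: systemd-restart-loops
  namespace: openshift-monitoring
spec:
  groups:
  - name: node-systemd
    rules:
    - alert: SystemdServiceRestartingTooOften
      # Fires if kubelet or crio restarted more than 3 times in 15 minutes.
      expr: increase(node_systemd_service_restart_total{name=~"kubelet.service|crio.service"}[15m]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        message: '{{ $labels.name }} on {{ $labels.instance }} has restarted more than 3 times in the last 15 minutes.'
```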
Branch force-pushed from 7168ad1 to 8d426d0
rebased, ready for review
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rphillips. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retitle Bug 1814840: node-exporter: add systemd enable restart metrics
@rphillips: This pull request references Bugzilla bug 1814840, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.4
@rphillips: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest |
Why is it a bug and not an RFE? :)
In total, in a normal cluster, how many series do we have? Do you mind checking? Thanks!
Enabled the option to test this out and retrieve the count for Lili, but found out that the container node_exporter runs in doesn't have permission to pull the systemd information. /hold
/cc @sjenning |
node_exporter is running as nobody with an SCC [source]. We won't be able to get the systemd metrics unless we increase the privileges for node_exporter to run as root.
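Purely as a rough illustration of what "run as root" would mean for the container spec (this is not the change made in this PR; the real DaemonSet is generated by the operator), the escalation being discussed would look roughly like:

```yaml
# Illustrative fragment of a node-exporter DaemonSet container; values are
# placeholders, not the operator's actual configuration.
containers:
- name: node-exporter
  securityContext:
    runAsUser: 0       # run as root instead of nobody
    privileged: true   # would require a more permissive SCC
```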
The systemd collector both needs to be privileged and is resource intensive; I don't think we should enable it just to get a single metric that might be useful. Do we have an incident or anything that describes why we would want this metric? I suspect this is a case of cause- vs. symptom-based alerting.
@smarterclayton was asking for a metric to count when services were crashing.
systemd restarts are probably the #1 non-container failure indicator. I was horrified upon realizing we didn't collect it. If the collector is intensive, we should figure out how to make it not so. If it's high cardinality, again, we should probably have a much more streamlined one. The privileged issue is somewhat of a problem, but if that's a concern we could isolate it by having a sidecar track it (just that) and expose it on the filesystem in a shared volume. I would rather have systemd restarts than 90% of the metrics node_exporter gathers right now :). Agree we should do it the right way.
As said on Slack:
Branch force-pushed from cf9db09 to 51dccce
Note that the node exporter init container has privilege so that we can read dmidecode, so a sidecar is not unreasonable (but should be isolated). If the systemd collector is not the right tool for the job, how do we make it, or an alternative, the right tool for the job? Running a standalone collector won't address cardinality or resource issues. A minimal sidecar loop alongside node_exporter for just the restart metrics would be the simplest option, but it is going to be associated more closely with node exporter than with anything else (the pod for node exporter should contain the things that export metrics for OpenShift). If we have another systemd metric in the future, we would add it to the same place.
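A minimal sketch of the sidecar idea floated above, assuming a hypothetical systemd-restart-exporter image that writes the restart counter into a shared emptyDir for node_exporter's textfile collector to pick up; the image name and paths are invented for illustration and are not part of this PR:

```yaml
# Hypothetical pod fragment: only the small sidecar is privileged, and it
# shares its output with node_exporter through an emptyDir volume.
containers:
- name: node-exporter
  args:
  - --collector.textfile.directory=/var/node_exporter/textfile
  volumeMounts:
  - name: textfile
    mountPath: /var/node_exporter/textfile
    readOnly: true
- name: systemd-restart-exporter                      # hypothetical sidecar
  image: example.org/systemd-restart-exporter:latest  # placeholder image
  securityContext:
    privileged: true   # privilege is confined to this one container
  volumeMounts:
  - name: textfile
    mountPath: /var/node_exporter/textfile
volumes:
- name: textfile
  emptyDir: {}
```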
Not sure that is enough, and this is not my area of expertise, but I think the entire node-exporter container might need to be privileged for this. Can we verify this? @rphillips
Sorry, let me clarify: running a standalone collector can address the privilege issue, yes. From what I heard, the standalone systemd collector is less resource-heavy and would work better than the built-in node-exporter systemd collector. But @pgier might know more about that.
Branch force-pushed from 51dccce to 8d426d0
@lilic yeah, I was testing it out with cluster-bot... the correct directories get propagated into the container, but then we get SELinux denials trying to access the socket. It does look like we need to escalate the privileges to be able to get the metrics.
Closing in favor of the enhancement: openshift/enhancements#255
Enables --collector.systemd.enable-restarts-metrics to add the service_restart_total metric.
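For reference, a hedged sketch of the flags the description refers to, shown as plain DaemonSet container arguments. The actual change in this PR goes through the operator's generated assets rather than a hand-edited manifest, and it is assumed here that --collector.systemd must also be enabled for the restart metrics flag to take effect.

```yaml
# Illustrative only: enabling the systemd restart metrics on a node-exporter
# container. The operator generates the real DaemonSet; these args are shown
# purely to document the flags involved.
containers:
- name: node-exporter
  args:
  - --collector.systemd                          # systemd collector is off by default
  - --collector.systemd.enable-restarts-metrics  # adds the restart counter metric
```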