Bug 1814840: node-exporter: add systemd enable restart metrics #704

rphillips · 2020-03-12T18:49:11Z

Enables --collector.systemd.enable-restarts-metrics to add service_restart_total metric

rphillips · 2020-03-12T19:15:50Z

weird... an automatic approved label?

rphillips · 2020-03-12T19:30:24Z

This PR looks correct. Not sure why the generate CI failed.

/assign @lilic

s-urbaniak · 2020-03-13T06:48:28Z

no concern once green
/cc @paulfantom

s-urbaniak · 2020-03-13T06:49:12Z

/unassign @lilic

s-urbaniak · 2020-03-13T06:50:46Z

/assign @paulfantom

lilic · 2020-03-13T07:55:49Z

The generate job failed because you need to pull the latest image or rebase the PR:
docker pull quay.io/coreos/jsonnet-ci && make clean && make generate-in-docker

lilic

Your PR needs a rebase, and you need to rerun the generate command. I'm also curious how many metrics and series this collector adds. We are working hard on removing unnecessary collectors, so I'd like to know whether we need all of these metrics or can drop some of them from the collector.

paulfantom · 2020-03-18T07:51:24Z

ci/prow/generate was failing due to outdated jsonnetfile.* files. This should already be fixed on the master branch, and the issue should go away after rebasing.
If it doesn't, please run:

docker pull quay.io/coreos/jsonnet-ci && make clean && make generate-in-docker

and commit all changed files.

rphillips · 2020-03-18T18:53:57Z

@lilic this adds one metric per slice

This is a useful metric to know if slices are restarting a number of times. A good example would be the crio or kubelet service restarting over and over again, or a pod that is crashing multiple times.
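To make "restarting over and over again" concrete, here is a minimal sketch of how a restart counter could be turned into a crash-loop signal. It assumes the samples are successive scraped values of a per-service restart counter; the function names and the threshold are hypothetical and not part of this PR.

```python
def counter_increase(samples):
    """Total increase of a counter across samples, tolerating resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter was reset (for example the exporter
            # restarted), so the whole current value is new increase.
            total += cur
    return total


def is_restart_looping(samples, threshold=3):
    """Hypothetical rule: flag a unit that restarted `threshold` or more times."""
    return counter_increase(samples) >= threshold


# Illustrative sample windows, not real data.
kubelet_samples = [0, 0, 1, 2, 4, 5]
crio_samples = [7, 7, 7, 7]

print(is_restart_looping(kubelet_samples))
print(is_restart_looping(crio_samples))
```

The reset handling mirrors how counters are usually interpreted: a value that goes down is treated as a fresh start rather than a negative change.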

rphillips · 2020-03-18T18:54:38Z

rebased, ready for review

openshift-ci-robot · 2020-03-18T18:54:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
To complete the pull request process, please assign paulfantom
You can assign the PR to them by writing /assign @paulfantom in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rphillips · 2020-03-18T19:07:57Z

/retitle Bug 1814840: node-exporter: add systemd enable restart metrics

openshift-ci-robot · 2020-03-18T19:08:05Z

@rphillips: This pull request references Bugzilla bug 1814840, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1814840: node-exporter: add systemd enable restart metrics

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rphillips · 2020-03-18T19:08:19Z

/cherry-pick release-4.4

openshift-cherrypick-robot · 2020-03-18T19:08:20Z

@rphillips: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rphillips · 2020-03-18T21:18:28Z

/retest

lilic · 2020-03-19T11:37:15Z

This is a useful metric to know if slices are restarting a number of times.

Why is it a bug and not an RFE? :)

this adds one metric per slice

In total, across a cluster, how many series would we have? Do you mind checking? Thanks!
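One way to answer the series-count question is to count samples per metric name in a node's /metrics output. The sketch below assumes the Prometheus text exposition format; the sample payload and its metric and label names are illustrative, not captured from a real cluster.

```python
from collections import Counter


def count_series(exposition_text):
    """Count samples (one per series) for each metric name in a scrape."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' or at the first whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts


# Illustrative scrape payload (assumed names and values).
SAMPLE = """\
# HELP node_systemd_service_restart_total Service unit count of Restart triggers
# TYPE node_systemd_service_restart_total counter
node_systemd_service_restart_total{name="crio.service"} 0
node_systemd_service_restart_total{name="kubelet.service"} 2
node_systemd_unit_state{name="kubelet.service",state="active"} 1
"""

for metric, n in sorted(count_series(SAMPLE).items()):
    print(metric, n)
```

Summing these counts over every node would give the cluster-wide total being asked for.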

rphillips · 2020-03-19T22:01:07Z

Enabled the option to test this out and retrieve the count for Lili, but found out the container node_exporter runs in doesn't have the permissions to pull the systemd information: Failed to get D-Bus connection: Operation not permitted. I'll have to figure that out first.

/hold

rphillips · 2020-03-19T22:01:16Z

/cc @sjenning

rphillips · 2020-03-19T23:57:24Z

node_exporter is running as nobody with an SCC [source]. We won't be able to get the systemd metrics unless we increase the privileges for the node_exporter to run as root.

brancz · 2020-03-20T08:21:40Z

The systemd collector both needs to be privileged and is resource intensive. I don't think we should be enabling it to get a single metric that might be useful. Do we have an incident or something that even describes why we would want this metric? I suspect this is a case of cause vs symptom based alerting.

rphillips · 2020-03-20T14:19:09Z

@smarterclayton was asking for a metric to count when services were crashing.

smarterclayton · 2020-03-20T14:25:15Z

systemd restarts are probably the #1 non container failure indicator. I was horrified upon realizing we didn't collect it.

If the collector is intensive, we should figure out how to make it not so. If it's high cardinality, again we should probably have a much more streamlined one. The privileged issue is somewhat of a problem, but if that's a concern we could isolate it by having a sidecar track it (just that) and expose that on filesystem in a shared volume.

I would rather have systemd restart than 90% of the metrics node_exporter gathers right now :). Agree we should do it the right way.
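The sidecar idea above (track only restarts and expose them on a shared volume) could look roughly like the following. This is a sketch under assumptions: it writes a file in the Prometheus text exposition format, as node_exporter's textfile collector would read, and the metric name, label name, and file name are hypothetical.

```python
import os
import tempfile


def escape_label(value):
    """Escape backslash, double quote, and newline for a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_restart_metrics(restarts):
    """Render {unit: restart_count} as exposition text (hypothetical metric name)."""
    lines = [
        "# HELP sidecar_systemd_unit_restarts_total Restarts per systemd unit.",
        "# TYPE sidecar_systemd_unit_restarts_total counter",
    ]
    for unit in sorted(restarts):
        lines.append(
            'sidecar_systemd_unit_restarts_total{unit="%s"} %d'
            % (escape_label(unit), restarts[unit])
        )
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename over `path`.

    The rename keeps a reader in another container from seeing a
    half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        # mkstemp creates the file as 0600; loosen it so an unprivileged
        # reader sharing the volume can open it.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Only the sidecar would need elevated access to systemd; the exporter container would just read the file from the shared volume.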

lilic · 2020-03-20T14:59:38Z

As said on slack:
I understand the importance of this metric and agree we should have it. My main concerns are:

that node-exporter would be privileged. It's not today, and it would be better to keep it that way; maybe we could use the separate systemd exporter (https://github.com/povilasv/systemd_exporter)
the amount of series it would produce
it's resource intensive, which is why it's off by default in node-exporter

I think this needs a bit more thought. Maybe an RFE would be better than a Bugzilla, as not having a metric is not a bug. :)

smarterclayton · 2020-03-20T16:07:07Z

Note that node exporter init has privilege so that we can read dmidecode, so a sidecar is not unreasonable (but should be isolated).

If systemd collector is not the right tool for the job, how do we make it or an alternative the right tool for the job? Running a standalone collector won't address cardinality issues or resource issues. A minimal sidecar loop for node_exporter for just restart metrics would be the simplest condition, but that is going to be associated more closely with node exporter than something else (the pod for node exporter should contain the things that export metrics for openshift). If we have another systemd metric in the future, we would add it to the same place.
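As a rough illustration of what a "minimal sidecar loop" for restart metrics might collect, the sketch below parses the kind of output printed by `systemctl show <units> --property=Id,NRestarts`. It assumes the host's systemd exposes the NRestarts property and that units are separated by blank lines in the output; the function name is hypothetical, and actually running systemctl would still require the D-Bus access discussed in this thread.

```python
def parse_systemctl_show(output):
    """Map unit Id -> NRestarts from `systemctl show` style key=value blocks."""
    restarts = {}
    # Each unit is assumed to be printed as its own block, separated
    # from the next by an empty line.
    for block in output.strip().split("\n\n"):
        props = dict(
            line.split("=", 1) for line in block.splitlines() if "=" in line
        )
        if "Id" in props and "NRestarts" in props:
            restarts[props["Id"]] = int(props["NRestarts"])
    return restarts


# Illustrative output, not captured from a real node.
EXAMPLE_OUTPUT = """\
Id=kubelet.service
NRestarts=2

Id=crio.service
NRestarts=0
"""

print(parse_systemctl_show(EXAMPLE_OUTPUT))
```

Restricting the query to a short, fixed list of units is one way such a loop could keep both cardinality and cost low.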

lilic · 2020-03-20T16:35:25Z

Note that node exporter init has privilege so that we can read dmidecode, so a sidecar is not unreasonable (but should be isolated).

I'm not sure that is enough. This is not my area of expertise, but I think the entire node-exporter container might need to be privileged for this. Can we verify this? @rphillips

Running a standalone collector won't address cardinality issues or resource issues.

Sorry, let me clarify: running a standalone collector can address the privilege issue, yes. From what I heard, the standalone systemd collector is less resource heavy and would work better than the built-in node-exporter systemd collector. But @pgier might know more about that.

rphillips · 2020-03-20T18:01:41Z

@lilic yeah, I was testing out with cluster-bot... the correct directories get propagated into the container, but then we get selinux denials trying to access the socket. It does look like we need to escalate the privileges to be able to get the metrics.

rphillips · 2020-03-23T16:13:20Z

Closing in favor of the enhancement: openshift/enhancements#255

openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 12, 2020

openshift-ci-robot requested review from metalmatze and paulfantom March 12, 2020 18:50

rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 67cfc30 to cf73eb1 Compare March 12, 2020 19:13

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2020

rphillips force-pushed the fixes/add_systemd_restart_metrics branch from cf73eb1 to 7168ad1 Compare March 12, 2020 19:19

openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2020

openshift-ci-robot assigned lilic Mar 12, 2020

openshift-ci-robot unassigned lilic Mar 13, 2020

openshift-ci-robot assigned paulfantom Mar 13, 2020

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2020

lilic reviewed Mar 18, 2020

node-exporter: add systemd enable restart metrics

c1bf1c2

regenerate

8d426d0

rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 7168ad1 to 8d426d0 Compare March 18, 2020 18:54

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2020

openshift-ci-robot changed the title ~~node-exporter: add systemd enable restart metrics~~ Bug 1814840: node-exporter: add systemd enable restart metrics Mar 18, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 18, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2020

openshift-ci-robot requested a review from sjenning March 19, 2020 22:01

rphillips force-pushed the fixes/add_systemd_restart_metrics branch 2 times, most recently from cf9db09 to 51dccce Compare March 20, 2020 15:57

rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 51dccce to 8d426d0 Compare March 20, 2020 17:59

rphillips closed this Mar 23, 2020
