Bug 1814840: node-exporter: add systemd enable restart metrics #704

Closed


rphillips
Contributor

@rphillips rphillips commented Mar 12, 2020

Enables --collector.systemd.enable-restarts-metrics to add the service_restart_total metric

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 12, 2020
@rphillips rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 67cfc30 to cf73eb1 Compare March 12, 2020 19:13
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2020
@rphillips
Contributor Author

weird... an automatic approved label?

@rphillips rphillips force-pushed the fixes/add_systemd_restart_metrics branch from cf73eb1 to 7168ad1 Compare March 12, 2020 19:19
@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2020
@rphillips
Contributor Author

This PR looks correct. Not sure why the generate CI failed.

/assign @lilic

@s-urbaniak
Contributor

no concern once green
/cc @paulfantom

@s-urbaniak
Contributor

/unassign @lilic

@s-urbaniak
Contributor

/assign @paulfantom

@lilic
Contributor

lilic commented Mar 13, 2020

The generate job failed because you need to pull the latest image or rebase the PR:
docker pull quay.io/coreos/jsonnet-ci && make clean && make generate-in-docker

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2020
Contributor

@lilic lilic left a comment

Your PR needs a rebase, and you need to rerun the generate command. I'm also curious how many metrics and series this collector adds. We are working hard on removing unnecessary collectors, so I'd like to know whether we need all of the metrics from this collector or whether we can drop some of them.

@paulfantom
Contributor

paulfantom commented Mar 18, 2020

ci/prow/generate was failing due to outdated jsonnetfile.* files. This should already be fixed on the master branch, and the issue should go away after rebasing.
If it doesn't, please run:

docker pull quay.io/coreos/jsonnet-ci && make clean && make generate-in-docker

and commit all changed files.

@rphillips
Contributor Author

rphillips commented Mar 18, 2020

@lilic this adds one metric per slice

This is a useful metric for knowing whether slices are restarting repeatedly. A good example would be the crio or kubelet service restarting over and over again, or a pod that is crashing multiple times.
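
To illustrate the use case described above, here is a minimal sketch that compares two scrapes of node_exporter output and reports units whose restart counter went up. It assumes the metric is exposed as `node_systemd_service_restart_total` with a `name` label (the PR description only says `service_restart_total`); the function names are hypothetical and not part of this PR.

```python
import re

# Assumed full metric name and label; the PR text only mentions
# "service_restart_total". Example sample line:
#   node_systemd_service_restart_total{name="kubelet.service"} 3
SAMPLE_RE = re.compile(
    r'^node_systemd_service_restart_total\{[^}]*name="([^"]+)"[^}]*\}\s+(\S+)'
)


def parse_restarts(exposition_text):
    """Return {unit_name: restart_count} from Prometheus text-format output."""
    counts = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def restarted_units(previous, current, threshold=1):
    """Return units whose counter rose by at least `threshold` between scrapes.

    Units missing from `previous` are treated as unchanged, and a counter
    that went down (for example after a reboot) is not flagged.
    """
    flagged = {}
    for unit, count in current.items():
        delta = count - previous.get(unit, count)
        if delta >= threshold:
            flagged[unit] = delta
    return flagged
```

In practice this comparison would more likely be written as an alerting rule on the counter's increase; the sketch only shows what "restarting over and over again" looks like in the raw samples.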

@rphillips rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 7168ad1 to 8d426d0 Compare March 18, 2020 18:54
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2020
@rphillips
Contributor Author

rphillips commented Mar 18, 2020

rebased, ready for review

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
To complete the pull request process, please assign paulfantom
You can assign the PR to them by writing /assign @paulfantom in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rphillips
Contributor Author

/retitle Bug 1814840: node-exporter: add systemd enable restart metrics

@openshift-ci-robot openshift-ci-robot changed the title node-exporter: add systemd enable restart metrics Bug 1814840: node-exporter: add systemd enable restart metrics Mar 18, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 18, 2020
@openshift-ci-robot
Contributor

@rphillips: This pull request references Bugzilla bug 1814840, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1814840: node-exporter: add systemd enable restart metrics

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Contributor Author

/cherry-pick release-4.4

@openshift-cherrypick-robot

@rphillips: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Contributor Author

/retest

@lilic
Contributor

lilic commented Mar 19, 2020

This is a useful metric to know if slices are restarting a number of times.

Why is it a bug and not an RFE? :)

this adds one metric per slice

In total, in a normal cluster, how many series do we have? Do you mind checking? Thanks!
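
One way to answer the series-count question is to count sample lines per metric name in the exporter's text output. The sketch below assumes the output has been saved from the `/metrics` endpoint; the helper name `series_per_metric` is hypothetical.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count series (sample lines) per metric name in Prometheus text format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or at whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts
```

Summing the counts for all names that start with `node_systemd_` on one node, then multiplying by the number of nodes, gives a rough estimate of the series the collector would add across a cluster.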

@rphillips
Contributor Author

I enabled the option to test this out and retrieve the count for Lili, but found that the container node_exporter runs in doesn't have permission to read the systemd information: Failed to get D-Bus connection: Operation not permitted. I'll have to figure that out first.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2020
@rphillips
Contributor Author

/cc @sjenning

@rphillips
Contributor Author

node_exporter is running as nobody with an SCC [source]. We won't be able to get the systemd metrics unless we increase the privileges for the node_exporter to run as root.

@brancz
Contributor

brancz commented Mar 20, 2020

The systemd collector both needs to be privileged and is resource intensive, so I don't think we should enable it to get a single metric that might be useful. Do we have an incident or something similar that describes why we would want this metric? I suspect this is a case of cause-based vs symptom-based alerting.

@rphillips
Contributor Author

@smarterclayton was asking for a metric to count when services were crashing.

@smarterclayton
Contributor

systemd restarts are probably the #1 non-container failure indicator. I was horrified upon realizing we didn't collect it.

If the collector is intensive, we should figure out how to make it not so. If it's high cardinality, again we should probably have a much more streamlined one. The privileged issue is somewhat of a problem, but if that's a concern we could isolate it by having a sidecar track it (just that) and expose that on filesystem in a shared volume.
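
The sidecar idea above could look roughly like the following sketch: read each unit's `NRestarts` property through `systemctl show` and write the values to a file in a shared volume, where node_exporter's textfile collector could pick them up. Everything here is an assumption for illustration: the metric name `systemd_unit_restarts_total`, the function names, and the choice of `systemctl` rather than a direct D-Bus call. The sidecar itself would still need access to systemd's D-Bus socket.

```python
import os
import subprocess
import tempfile


def parse_nrestarts(show_output):
    """Parse `NRestarts=<n>` from `systemctl show --property=NRestarts` output."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "NRestarts" and value.strip().isdigit():
            return int(value.strip())
    return None


def read_nrestarts(unit):
    """Ask systemd for a unit's restart count (requires system bus access)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True, text=True, check=True,
    )
    return parse_nrestarts(result.stdout)


def render(counts):
    """Render {unit: restarts} as Prometheus text format (hypothetical name)."""
    lines = ["# TYPE systemd_unit_restarts_total counter"]
    for unit in sorted(counts):
        lines.append(
            'systemd_unit_restarts_total{name="%s"} %d' % (unit, counts[unit])
        )
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write via a temp file and rename so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; make it readable by the exporter user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A loop in the sidecar would call `read_nrestarts` for a short, fixed list of units (for example kubelet and crio), then `write_atomically` to a `.prom` file in the shared directory. Restricting the unit list is what keeps cardinality low in this sketch.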

I would rather have systemd restart than 90% of the metrics node_exporter gathers right now :). Agree we should do it the right way.

@lilic
Contributor

lilic commented Mar 20, 2020

As said on Slack:
I understand the importance of this metric and agree we should have it. My main concerns are:

  1. node-exporter would be privileged. It's not today, and it would be better to keep it that way; maybe we could use the separate systemd exporter (https://github.com/povilasv/systemd_exporter)
  2. the number of series it would produce
  3. it's resource intensive, which is why it's off by default in node-exporter

I think this needs a bit more thought. Maybe an RFE would be better than a Bugzilla, as not having a metric is not a bug. :)

@rphillips rphillips force-pushed the fixes/add_systemd_restart_metrics branch 2 times, most recently from cf9db09 to 51dccce Compare March 20, 2020 15:57
@smarterclayton
Contributor

Note that node exporter init has privilege so that we can read dmidecode, so a sidecar is not unreasonable (but should be isolated).

If systemd collector is not the right tool for the job, how do we make it or an alternative the right tool for the job? Running a standalone collector won't address cardinality issues or resource issues. A minimal sidecar loop for node_exporter for just restart metrics would be the simplest condition, but that is going to be associated more closely with node exporter than something else (the pod for node exporter should contain the things that export metrics for openshift). If we have another systemd metric in the future, we would add it to the same place.

@lilic
Contributor

lilic commented Mar 20, 2020

Note that node exporter init has privilege so that we can read dmidecode, so a sidecar is not unreasonable (but should be isolated).

I'm not sure that is enough (this is not my area of expertise), but I think the entire node-exporter container might need to be privileged for this. Can we verify this? @rphillips

Running a standalone collector won't address cardinality issues or resource issues.

Sorry, let me clarify: running a standalone collector can address the privilege issue, yes. From what I heard, the standalone systemd collector is less resource heavy and would work better than the built-in node-exporter systemd collector. But @pgier might know more about that.

@rphillips rphillips force-pushed the fixes/add_systemd_restart_metrics branch from 51dccce to 8d426d0 Compare March 20, 2020 17:59
@rphillips
Contributor Author

@lilic yeah, I was testing this out with cluster-bot... the correct directories get propagated into the container, but then we get SELinux denials when trying to access the socket. It does look like we need to escalate the privileges to be able to get the metrics.

@rphillips
Contributor Author

Closing in favor of the enhancement: openshift/enhancements#255

@rphillips rphillips closed this Mar 23, 2020
Labels
bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.