add restart metrics enhancement #255

Closed

Conversation

rphillips
Contributor

@rphillips commented Mar 23, 2020

This enhancement enables metrics for certain systemd units to track restarts.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
To complete the pull request process, please assign jwmatthews
You can assign the PR to them by writing /assign @jwmatthews in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@lilic left a comment


If it's our goal to monitor restarts/lifetime of crio and kubelet (from what I can tell), why do we not just use the metrics those two components already expose directly, for example process_start_time_seconds, which comes from the Go process collector and which both crio and kubelet already have?
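For illustration only, a restart signal can be derived from that existing metric with PromQL's `changes()` function; the Prometheus address and the job label values below are assumptions, not anything specified in this thread:

```sh
# Hedged sketch: count restarts of kubelet/CRI-O over the last hour purely from
# the Go process collector's start-time metric, which both components already
# expose. The Prometheus address and the job label values are illustrative.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=changes(process_start_time_seconds{job=~"kubelet|crio"}[1h])'
```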

@smarterclayton
Contributor

Note that I commented that this is about monitoring restarts of all services on the box, not just kubelet and crio. We are responsible for knowing whether all services shipped as part of the OS or installed are restarting.

@rphillips force-pushed the feat/add_systemd_restart_metrics branch from ddaa06e to c9e3d86 on March 23, 2020 at 21:00
@rphillips
Contributor Author

@lilic @smarterclayton I updated the proposal with everything we discussed. I think we could defer the alert definitions until after we see some sample clusters. Ready for re-review.

@rphillips
Contributor Author

I also added a list of possible core services we would want to start with.

Contributor

@lilic left a comment


Looking good, a couple of questions.

cc @openshift/openshift-team-monitoring


### Non-Goals

- Defining alerts is not in the scope of this proposal. Alerts can be defined
Contributor


I assume the node team would define those alerts; we are of course happy to shepherd this! :)

Contributor Author


Yes!


### Test Plan

- Validate we see system dbus metrics from the whitelisted services
Contributor


Where do you want these tests? I am assuming in the origin Prometheus tests?

Contributor Author


Yes, if we do metric validation there, that would be great.


### Graduation Criteria

Delivered in 4.5.
Contributor


We do agree that this is a feature and not a bug(zilla) and there is no need to backport this feature, right? :) cc @smarterclayton

text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the
following core services:


I feel like at least the services listed here deserve to be monitored in a whitebox fashion, not just indirectly through the proposed mechanism. Can we ensure that, as a follow-up to this "generic" monitoring, we also add whitebox monitoring to the node team's and/or CoreOS team's backlogs? kubelet and crio already are monitored this way, and chrony would have been helpful multiple times already; I imagine the same holds true for all the other services listed here (I believe I've talked about NetworkManager with @lucab before as well).

Contributor Author


@brancz I can add a section on what we would like to add later... Can you provide me some examples of the whitebox testing you would like to see?



Essentially each of these components should either have a metrics endpoint of its own or we should write an exporter for it. kubelet and crio already do, so that's good, but all the others in this list should as well, or at least have an exporter.



Sorry I can't seem to find where this is documented. Can you point me at it?

Contributor Author


Most of these other components are system components, and the restart metric is likely a good indicator of failure.

Contributor

@lilic left a comment


One concern left from me; otherwise this looks good to me.


- This proposal is to add a privileged sidecar running
[systemd_exporter](https://github.com/povilasv/systemd_exporter) to the
node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
Contributor


I would still prefer this DaemonSet to be separate from the node-exporter one; any reasons against this?

Contributor Author


Clayton specifically mentioned a sidecar, which is typically deployed with the dependent component. Since the systemd_exporter depends on node_exporter, I think it makes sense to deploy it in the same pod.

@smarterclayton ?


I don't see any dependency between those two components. Could someone describe what the dependency is here and why it couldn't be deployed as its own DaemonSet?
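For concreteness, the coupling being debated here, systemd_exporter restricted to a few core units and feeding node_exporter's textfile collector, might look roughly like the sketch below. The listen port, paths, and unit regex are assumptions; the `--collector.unit-whitelist` flag name comes from the linked systemd_exporter source, and `--collector.textfile.directory` is node_exporter's standard flag for the textfile collector.

```sh
# Hedged sketch of the handoff described in the proposal: run systemd_exporter
# next to node_exporter (the proposal suggests a sidecar in the same pod),
# limited to a small set of core units, and periodically snapshot its output
# into the directory node_exporter reads via --collector.textfile.directory.
# The listen port, textfile path, and unit regex are assumptions.
systemd_exporter --collector.unit-whitelist='(kubelet|crio|chronyd|sshd)\.service' &

sleep 2  # give the exporter a moment to start (sketch only)
curl -s http://localhost:9558/metrics > /var/node_exporter/textfile/systemd.prom
```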

@rphillips
Contributor Author

@lilic does this enhancement sound ok?

@rphillips
Contributor Author

@brancz ?

@lilic
Contributor

lilic commented Apr 8, 2020

To me this #255 (comment) is still not resolved. When we agree on that, it looks good to me 👍

@brancz

brancz commented Apr 14, 2020

I'm maybe just missing it, but I also don't see this: #255 (comment)

@rphillips
Contributor Author

The other processes are system level processes. I doubt the maintainers will want to include metrics or http endpoints. Are we ok with 'monitoring' their restart counts?

@brancz

brancz commented May 12, 2020

It's not about native integrations necessarily. Many things in the Prometheus world are monitored using "exporters". Exporters convert the format a project chooses into the Prometheus format. A project that implements a long-running daemon but doesn't care about the ability to monitor it is arguably careless (that's like saying you don't log because it's not necessary).

The order of preference, depending on the respective project's willingness, is roughly like this to me:

  • native integration with http endpoint
  • native integration with unix domain socket (then exposed to Prometheus via something like local_exporter)
  • @lucab is currently trying to see if there is a possibility to have something via dbus for components that already make use of it
  • implement exporter if project uses bespoke mechanism to expose metrics

If none of these are feasible, then, and only then, should we be thinking about this kind of blackbox monitoring, to have any insight at all.

@rphillips
Contributor Author

I think this proposal has gotten a tad off track. The proposal is to track system process restart counts using the restart tracking systemd already provides, so we can see if an underlying system process is continuously crashing. I am a bit confused as to where we are in this approval process.
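For reference, systemd already keeps a per-unit restart counter that an exporter can read; a quick way to see it on a node (the unit name is illustrative):

```sh
# Recent systemd versions track restarts per service unit via the NRestarts
# property; this is the sort of counter the proposal wants surfaced as a
# metric. kubelet.service is used here purely as an example unit.
systemctl show kubelet.service --property=NRestarts
# => NRestarts=0
```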

@brancz

brancz commented May 20, 2020

What I'm asking for is for the node team to plan an effort to whitebox-monitor these components; that would make me comfortable with this tactical solution of blackbox monitoring them.

@lilic
Contributor

lilic commented May 20, 2020

I think we are fine with merging this once my question is answered, as long as we open an RFE for the node team to add whitebox monitoring for these components.

- systemd_exporter will be configured to write metrics to node_exporter's
text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the


It seems we should find an alternative to the term 'whitelist'.

Contributor


Thank you, yes, that sounds like a great idea! We use allowlist/denylist in our project, so that would be good.

@rphillips
Contributor Author

@lilic could you add lgtm / approval labels to this enhancement?

@brancz

brancz commented Jul 3, 2020

Could you link the RFE? Then I'm happy to lgtm as well.


- A whitelist will be enabled in systemd_exporter to enable metrics from the
following core services, using the built-in command-line argument
([collector.unit-whitelist](https://github.com/povilasv/systemd_exporter/blob/master/systemd/systemd.go#L25)):
Member


I think we should treat this problem more holistically; there are various OpenShift components which end up running code on the host, sometimes via systemd units. An allowlist like this is going to be hard to maintain.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Nov 16, 2020
@lilic
Contributor

lilic commented Nov 18, 2020

I believe this is still very valid and would love to see it; it has only one comment left to address. @rphillips any updates on it? Thanks!

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 18, 2020
@ehashman
Contributor

/remove-lifecycle rotten

My understanding is that we just need an approval on this and it should be good to go. @mrunalp @sjenning @smarterclayton could you PTAL when you get the chance?

@openshift-ci-robot removed the lifecycle/rotten label on Dec 21, 2020
@ehashman
Contributor

Also waiting for an LGTM from @lilic I think :)

@lilic
Contributor

lilic commented Jan 4, 2021

Thanks Elana, we are just waiting for this comment to be answered, as we would prefer to deploy it separately from node_exporter. #255 (comment)

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Apr 4, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 4, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this on Jun 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 4, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
