Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add restart metrics enhancement #255

Closed
Closed
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
127 changes: 127 additions & 0 deletions enhancements/monitoring/restart-metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
---
title: restart-metrics
authors:
- "@rphillips"
reviewers:
- "@smarterclayton"
- "@sjenning"
- "@mrunalp"
approvers:
- "@brancz"
- "@bparees"
- "@squat"
- "@s-urbaniak"
- "@metalmatze"
- "@paulfantom"
- "@LiliC"
- "@pgier"
- "@simonpasquier"
creation-date: 2020-03-23
last-updated: 2020-03-23
status: implementable
see-also:
replaces:
superseded-by:
---

# Service Restart Metrics

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

There is not any insight with systemd units restarting. Metrics of this sort are
rphillips marked this conversation as resolved.
Show resolved Hide resolved
important to know if Kubelet, crio, or other system services are crashing and
restarting. Restart metrics are going to be extremely important since the
Kubelet is going to be changed to exit on a crash, thus restarting. Previous
behavior of the Kubelet is to recover() from the crash and not exit.

## Motivation

Systemd unit restart metrics of a unit are vitally important for system
administrators to see the health of the Kubelet, crio, and other vitally important
system services.

### Goals

- Systemd unit restart metrics need to be propogated through to monitoring, and
the alerting system to alert system administrators to problems within the
rphillips marked this conversation as resolved.
Show resolved Hide resolved
system.

### Non-Goals

- Defining alerts is not in the scope of this proposal. Alerts can be defined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the node team would define those alerts, we are of course happy to sheppard this! :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

once we see some sample cluster data.

### Current Behavior

node_exporter is running in a non-privileged container - as user `nobody` - and
rphillips marked this conversation as resolved.
Show resolved Hide resolved
cannot connect to the DBUS socket to communicate with systemd. Monitoring has
concerns the underlying [systemd
collector](https://github.com/prometheus/node_exporter/blob/master/collector/systemd_linux.go)
is not performant enough.

## Proposal

- This proposal is to add a privileged sidecar running systemd_exporter
[systemd_exporter](https://github.com/povilasv/systemd_exporter]) to the
rphillips marked this conversation as resolved.
Show resolved Hide resolved
node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still would prefer this DaemonSet to be separate of the node-exporter one, any reasons against this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clayton specifically mentioned a sidecar, which is typically deployed with the dependent component. Since the systemd_exporter depends on node_exporter, I think it makes sense to deploy it in the same pod.

@smarterclayton ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any dependency between those two components. Could someone describe what is the dependency here and why it couldn't be deployed as DaemonSet?


- systemd_exporter will be configured to write metrics to node_exporters
rphillips marked this conversation as resolved.
Show resolved Hide resolved
text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the
rphillips marked this conversation as resolved.
Show resolved Hide resolved
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we should find alternative to the term 'whitelist'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, yes that sounds great idea! We use allowlist/denylist in our project so that would be good.

following core services, using the built in command-line argument
([collector.unit-whitelist](https://github.com/povilasv/systemd_exporter/blob/master/systemd/systemd.go#L25)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should treat this problem more holistically; there are various OpenShift components which end up running code on the host, sometimes via systemd units. An allowlist like this is going to be hard to maintain.

- kubelet
- crio
- sshd
- chronyd
- dbus
- getty
- irqbalance
- NetworkManager
- rpc-statd
- rpcbind
- sssd
- systemd-hostnamed
- systemd-journald
- systemd-logind
- systemd-udevd

### Risks and Mitigations

- Node team will own the systemd_exporter image. The daemonset that configures
and deploys the image will be owned by the cluster-monitoring-operator.

## Design Details

- Explained above

### Test Plan

- Validate we see system dbus metrics from the whitelisted services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you want this tests, I am assuming in origin prometheus tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we do metric validation there, that would be great.


### Graduation Criteria

Delivered in 4.5.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do agree that this is a feature and not a bug(zilla) and there is no need to backport this feature, right? :) cc @smarterclayton


### Upgrade / Downgrade Strategy

N/A

### Version Skew Strategy

N/A

## Implementation History

## Drawbacks

## Alternatives