From c9e3d86822b213ce4fcd066a222a99c11ec969d3 Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Mon, 23 Mar 2020 11:06:09 -0500
Subject: [PATCH 1/6] add restart metrics proposal

---
 enhancements/monitoring/restart-metrics.md | 122 +++++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 enhancements/monitoring/restart-metrics.md

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
new file mode 100644
index 0000000000..00a56f4682
--- /dev/null
+++ b/enhancements/monitoring/restart-metrics.md
@@ -0,0 +1,122 @@
+---
+title: restart-metrics
+authors:
+  - "@rphillips"
+reviewers:
+  - "@smarterclayton"
+  - "@sjenning"
+  - "@mrunalp"
+approvers:
+  - "@brancz"
+  - "@bparees"
+  - "@squat"
+  - "@s-urbaniak"
+  - "@metalmatze"
+  - "@paulfantom"
+  - "@LiliC"
+  - "@pgier"
+  - "@simonpasquier"
+creation-date: 2020-03-23
+last-updated: 2020-03-23
+status: implementable
+see-also:
+replaces:
+superseded-by:
+---
+
+# Service Restart Metrics
+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+There is no insight into systemd units restarting. Metrics of this sort are
+important to know if Kubelet, crio, or other system services are crashing and
+restarting. Restart metrics are going to be extremely important since the
+Kubelet is going to be changed to exit on a crash, thus restarting. Previous
+behavior of the Kubelet is to recover() from the crash and not exit.
+
+## Motivation
+
+Systemd unit restart metrics are vitally important for system
+administrators to see the health of the Kubelet, crio, and other critical
+system services.
+
+### Goals
+
+- Systemd unit restart metrics need to be propagated through to monitoring and
+  the alerting system to alert system administrators to problems within the
+  system.
+
+### Non-Goals
+
+- Defining alerts is not in the scope of this proposal. Alerts can be defined
+  once we see some sample cluster data.
+
+### Current Behavior
+
+node_exporter is running in a non-privileged container - as user `nobody` - and
+cannot connect to the DBUS socket to communicate with systemd. Monitoring has
+concerns the underlying [systemd
+collector](https://github.com/prometheus/node_exporter/blob/master/collector/systemd_linux.go)
+is not performant enough.
+
+## Proposal
+
+- This proposal is to add a privileged sidecar to node_exporter running
+  [systemd_exporter](https://github.com/povilasv/systemd_exporter])
+
+- systemd_exporter will be configured to write metrics to node_exporters
+  text_collector.
+
+- A whitelist will be enabled in systemd_exporter to enable metrics from the
+  following core services:
+  - kubelet
+  - crio
+  - sshd
+  - chronyd
+  - dbus
+  - getty
+  - irqbalance
+  - NetworkManager
+  - rpc-statd
+  - rpcbind
+  - sssd
+  - systemd-hostnamed
+  - systemd-journald
+  - systemd-logind
+  - systemd-udevd
+
+### Risks and Mitigations
+
+## Design Details
+
+- Explained above
+
+### Test Plan
+
+- Validate we see system dbus metrics from the whitelisted services
+
+### Graduation Criteria
+
+Delivered in 4.5.
+
+### Upgrade / Downgrade Strategy
+
+N/A
+
+### Version Skew Strategy
+
+N/A
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives

From 9cef5126bb7e6cd3b49db4b980e0992040cdfb4b Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Tue, 24 Mar 2020 16:57:10 -0500
Subject: [PATCH 2/6] add command-line argument for collector.unit-whitelist

---
 enhancements/monitoring/restart-metrics.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
index 00a56f4682..1a843e972b 100644
--- a/enhancements/monitoring/restart-metrics.md
+++ b/enhancements/monitoring/restart-metrics.md
@@ -76,7 +76,8 @@ is not performant enough.
   text_collector.
 
 - A whitelist will be enabled in systemd_exporter to enable metrics from the
-  following core services:
+  following core services, using the built in command-line argument
+  ([collector.unit-whitelist](https://github.com/povilasv/systemd_exporter/blob/master/systemd/systemd.go#L25)):
   - kubelet
   - crio
   - sshd

From 96cf786ba111d48569f65065a7e2d2be73d3ee2e Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Tue, 24 Mar 2020 17:06:53 -0500
Subject: [PATCH 3/6] add blurb where the systemd_exporter would be added

---
 enhancements/monitoring/restart-metrics.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
index 1a843e972b..e1295e39b6 100644
--- a/enhancements/monitoring/restart-metrics.md
+++ b/enhancements/monitoring/restart-metrics.md
@@ -69,8 +69,9 @@ is not performant enough.
 
 ## Proposal
 
-- This proposal is to add a privileged sidecar to node_exporter running
-  [systemd_exporter](https://github.com/povilasv/systemd_exporter])
+- This proposal is to add a privileged sidecar running systemd_exporter
+  [systemd_exporter](https://github.com/povilasv/systemd_exporter]) to the
+  node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
 
 - systemd_exporter will be configured to write metrics to node_exporters
   text_collector.

From 7c6aa3e16efb1464800cac020a0a409db1ad72b0 Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Tue, 24 Mar 2020 17:15:37 -0500
Subject: [PATCH 4/6] add ownership info

---
 enhancements/monitoring/restart-metrics.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
index e1295e39b6..2e580c3601 100644
--- a/enhancements/monitoring/restart-metrics.md
+++ b/enhancements/monitoring/restart-metrics.md
@@ -97,6 +97,9 @@ is not performant enough.
 
 ### Risks and Mitigations
 
+- Node team will own the systemd_exporter image. The daemonset that configures
+  and deploys the image will be owned by the cluster-monitoring-operator.
+
 ## Design Details
 
 - Explained above

From 24be960c347b82fdf2a0e951f6de2c93163bc9c0 Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Wed, 25 Mar 2020 10:04:41 -0500
Subject: [PATCH 5/6] remove extra ]

---
 enhancements/monitoring/restart-metrics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
index 2e580c3601..8f15ab491f 100644
--- a/enhancements/monitoring/restart-metrics.md
+++ b/enhancements/monitoring/restart-metrics.md
@@ -70,7 +70,7 @@ is not performant enough.
 ## Proposal
 
 - This proposal is to add a privileged sidecar running systemd_exporter
-  [systemd_exporter](https://github.com/povilasv/systemd_exporter]) to the
+  [systemd_exporter](https://github.com/povilasv/systemd_exporter) to the
   node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
 
 - systemd_exporter will be configured to write metrics to node_exporters

From 9fc3590f2826d61074bb54ac1b6a1f22313b58b1 Mon Sep 17 00:00:00 2001
From: Ryan Phillips
Date: Wed, 25 Mar 2020 10:17:50 -0500
Subject: [PATCH 6/6] tweak wording due to selinux denial

---
 enhancements/monitoring/restart-metrics.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/enhancements/monitoring/restart-metrics.md b/enhancements/monitoring/restart-metrics.md
index 8f15ab491f..87a99712f9 100644
--- a/enhancements/monitoring/restart-metrics.md
+++ b/enhancements/monitoring/restart-metrics.md
@@ -61,9 +61,9 @@ system services.
 
 ### Current Behavior
 
-node_exporter is running in a non-privileged container - as user `nobody` - and
-cannot connect to the DBUS socket to communicate with systemd. Monitoring has
-concerns the underlying [systemd
+node_exporter is running in a non-privileged container and cannot connect to the
+DBUS socket to communicate with systemd because of an selinux denial. Monitoring
+has concerns the underlying [systemd
 collector](https://github.com/prometheus/node_exporter/blob/master/collector/systemd_linux.go)
 is not performant enough.
 
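
The whitelist introduced in patch 2/6 could be expressed as a single regular expression handed to `collector.unit-whitelist`. The following is a minimal sketch, not the configuration the proposal ships. It assumes that the flag takes a regular expression matched against the full unit name, that the exporter anchors the pattern at both ends, that each service is a `<name>.service` unit, and that templated units such as `getty@tty1.service` should match too. The helper names `build_unit_whitelist` and `is_whitelisted` are hypothetical.

```python
import re

# Service names copied from the whitelist in the proposal.
CORE_SERVICES = [
    "kubelet", "crio", "sshd", "chronyd", "dbus", "getty", "irqbalance",
    "NetworkManager", "rpc-statd", "rpcbind", "sssd", "systemd-hostnamed",
    "systemd-journald", "systemd-logind", "systemd-udevd",
]


def build_unit_whitelist(services):
    """Build an unanchored regex for '<name>.service' and '<name>@<instance>.service'.

    The names above contain only letters and hyphens, so they are joined
    without escaping; names with regex metacharacters would need escaping.
    """
    names = "|".join(services)
    return rf"(?:{names})(?:@.+)?\.service"


PATTERN = build_unit_whitelist(CORE_SERVICES)

# Assumption: the exporter wraps the flag value as ^(?:...)$ before matching.
ANCHORED = re.compile(f"^(?:{PATTERN})$")


def is_whitelisted(unit_name):
    """Return True if the unit name matches the anchored whitelist."""
    return ANCHORED.match(unit_name) is not None


if __name__ == "__main__":
    # Assumed flag spelling; check it against the systemd_exporter version in use.
    print(f"--collector.unit-whitelist={PATTERN}")
```

Generating the pattern from the list keeps the proposal's service list as the single source of truth, so adding a service means editing one list rather than a hand-written regular expression.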