Enhancement proposal for monitoring Windows Nodes
Enhancement proposal for enabling monitoring on
Windows nodes created by Windows Machine Config Operator(WMCO).
VaishnaviHire committed Feb 16, 2021
1 parent 9eb5f69 commit 09013b7
170 changes: 170 additions & 0 deletions enhancements/windows-containers/monitoring-windows-nodes.md
---
title: monitoring-windows-nodes
authors:
- "@VaishnaviHire"
- "@PratikMahajan"
reviewers:
- "@openshift/openshift-team-windows-containers"
- "@simonpasquier"
- "@spadgett"
approvers:
- "@sdodson"
- "@simonpasquier"
- "@spadgett"
creation-date: 2021-02-08
last-updated: 2021-02-12
status: implementable
---

# Monitoring Windows Nodes

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Operational readiness criteria is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The intent of this enhancement is to enable performance monitoring on Windows
nodes created by the Windows Machine Config Operator (WMCO) in an OpenShift
cluster.

## Motivation

Monitoring is critical for identifying issues with nodes and with the containers
running on them. The main motivation behind this enhancement is to enable
monitoring on the Windows nodes.

### Goals

As part of this enhancement, we plan to do the following:
* Run [windows_exporter](https://github.com/prometheus-community/windows_exporter)
as a service on Windows nodes
* Upgrade the windows_exporter on the Windows Nodes
* Leverage the cluster monitoring operator, which sets up Prometheus,
Alertmanager and other components

### Non-Goals

As part of this enhancement, we do not plan to do the following:
* Integrate windows_exporter with the cluster monitoring operator
* Ship Grafana dashboards for Windows Nodes

## Proposal

The main idea here is to run windows_exporter as a Windows service and let
the Prometheus instance that was provisioned as part of the OpenShift install
collect data from it. The metrics exposed by windows_exporter will be used to
display node graphs and workload graphs for Windows nodes.

## Justification

Unlike [Node exporter](https://github.com/prometheus/node_exporter) on Linux
nodes, windows_exporter cannot run as a container on the Windows nodes: Windows
container images contain a Windows kernel, and Red Hat has a policy not to ship
third-party kernels for support reasons. Please refer to the [WMCO
enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/windows-machine-config-operator.md#justification)
for more details. Because windows_exporter runs as a Windows service rather than
as a pod, we need to ensure that the Prometheus operator watches over the
endpoint, `<internal-ip>:9182/metrics`, exposed by windows_exporter on every
Windows node.
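
As a minimal sketch of how that per-node endpoint could be derived from a node's internal IP, consider the helper below. The port `9182` and the `/metrics` path come from this proposal; the `http` scheme, the function name `metrics_url`, and the IPv6 bracket handling are assumptions made for illustration and are not taken from WMCO.

```python
import ipaddress

# Default windows_exporter port and path, as stated in this proposal.
WINDOWS_EXPORTER_PORT = 9182
METRICS_PATH = "/metrics"


def metrics_url(internal_ip: str, port: int = WINDOWS_EXPORTER_PORT) -> str:
    """Return the scrape URL for one Windows node (hypothetical helper).

    The http scheme is an assumption; the proposal only names
    <internal-ip>:9182/metrics.
    """
    # ip_address() raises ValueError for anything that is not a valid IP.
    addr = ipaddress.ip_address(internal_ip)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}:{port}{METRICS_PATH}"
```

Validating the address up front means that a malformed node address fails loudly instead of producing an unreachable scrape target.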

### Implementation Details/Notes/Constraints

#### Current Plans

To enable basic monitoring support for Windows nodes, WMCO does the following:

* Build and add the windows_exporter binary to the WMCO payload.
* Install windows_exporter on the Windows nodes and ensure
that it runs as a Windows service.
* Add the `openshift.io/cluster-monitoring=true` label to the
`openshift-windows-machine-config-operator` namespace so that the cluster
monitoring stack will pick up the Service Monitor created by WMCO.
* Add privileges to WMCO to create Service, Endpoints and Service Monitor
objects in the `openshift-windows-machine-config-operator` namespace.
* Create a Service and an Endpoints object in the
`openshift-windows-machine-config-operator` namespace that point to the
windows_exporter endpoint. WMCO will use default values to define the metrics
endpoint, `<internal-ip>:9182/metrics`, exposed by windows_exporter on every
Windows node. The Endpoints object created in the namespace will consist of
subsets of endpoints from all the Windows nodes.
* Create a Service Monitor in the `openshift-windows-machine-config-operator`
namespace for the Service created above.
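
The objects in the list above can be sketched as plain manifests. The namespace, the port and the `/metrics` path come from this proposal. The object name `windows-exporter`, the `name` label, the port name `metrics` and the `30s` interval are hypothetical values chosen for illustration. The sketch also groups all node addresses into a single Endpoints subset, which is one possible layout and not necessarily the one WMCO uses.

```python
NAMESPACE = "openshift-windows-machine-config-operator"
NAME = "windows-exporter"   # hypothetical object name
PORT_NAME = "metrics"       # hypothetical port name
PORT = 9182                 # default windows_exporter port from the proposal


def _metadata() -> dict:
    # The same name and label on every object let the Service Monitor
    # select the Service, and let Kubernetes pair the Service with the
    # manually managed Endpoints object of the same name.
    return {"name": NAME, "namespace": NAMESPACE, "labels": {"name": NAME}}


def build_service() -> dict:
    """Service without a pod selector, since no pod backs the exporter."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": _metadata(),
        "spec": {
            "ports": [{"name": PORT_NAME, "port": PORT,
                       "targetPort": PORT, "protocol": "TCP"}],
        },
    }


def build_endpoints(node_ips: list) -> dict:
    """Endpoints object listing the internal IP of every Windows node."""
    subsets = []
    if node_ips:
        subsets.append({
            "addresses": [{"ip": ip} for ip in sorted(node_ips)],
            "ports": [{"name": PORT_NAME, "port": PORT, "protocol": "TCP"}],
        })
    return {"apiVersion": "v1", "kind": "Endpoints",
            "metadata": _metadata(), "subsets": subsets}


def build_service_monitor() -> dict:
    """Service Monitor that selects the Service above by its label."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": _metadata(),
        "spec": {
            "selector": {"matchLabels": {"name": NAME}},
            "endpoints": [{"port": PORT_NAME, "path": "/metrics",
                           "interval": "30s"}],
        },
    }
```

Because the Service has no selector, Kubernetes does not manage its Endpoints automatically, so the operator has to rewrite the Endpoints object whenever Windows nodes are added or removed.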

To display node and workload graphs, WMCO does the following:

* Add custom Prometheus rules in the `openshift-windows-machine-config-operator`
namespace. The custom recording rules are built from the Windows metrics
exposed by windows_exporter and have the same names as the Linux
recording rules, so that the console can reuse its existing Linux queries.
* Note that WMCO is unable to display workload graphs for the Windows nodes
with the current implementation. See [Risks and Mitigations](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#risks-and-mitigations)
for details.
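
To illustrate the renaming idea, the sketch below builds a `PrometheusRule` manifest with one recording rule whose name matches a Linux rule but whose expression reads a Windows metric. The rule name `instance:node_cpu_utilisation:rate1m`, the object and group names, and the PromQL expression are assumptions picked for illustration. The proposal does not list the actual rules that WMCO creates.

```python
NAMESPACE = "openshift-windows-machine-config-operator"

# Assumed Linux-side recording rule name that console queries refer to.
LINUX_RULE_NAME = "instance:node_cpu_utilisation:rate1m"

# Assumed expression: the non-idle CPU fraction per instance, computed
# from the windows_exporter metric windows_cpu_time_total.
WINDOWS_EXPR = (
    '1 - avg by (instance) '
    '(rate(windows_cpu_time_total{mode="idle"}[1m]))'
)


def build_prometheus_rule() -> dict:
    """PrometheusRule exposing Windows data under a Linux rule name."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "windows-exporter-rules",   # hypothetical name
            "namespace": NAMESPACE,
        },
        "spec": {
            "groups": [{
                "name": "windows.rules",          # hypothetical group name
                "rules": [
                    {"record": LINUX_RULE_NAME, "expr": WINDOWS_EXPR},
                ],
            }],
        },
    }
```

A console query written against the Linux rule name would then also return series for Windows instances, which is the effect the recording rules described above aim for.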

#### Future Plans

As we move forward, our plan for displaying monitoring graphs is to create a
common interface for Windows and Linux recording rules that is managed by the
cluster monitoring operator. WMCO would no longer be responsible for creating
custom Prometheus rules to display node graphs. The unified interface for
monitoring will return results for both Linux and Windows nodes for a single
query. The console queries built on the unified interface will differ from the
existing queries, so the existing queries will need to be updated. This
will ensure that we have a consistent user experience for monitoring across
Linux and Windows.

### Risks and Mitigations

The main risk with this proposal lies in renaming Windows metrics to display
workload graphs. The pod metrics for Linux come from cAdvisor. However, we do
not get the same metrics from cAdvisor for Windows nodes. As a result, custom
recording rules alone are not enough to display pod graphs with the same
console queries as Linux workloads.

Mitigation: Use metrics exposed by windows_exporter to display pod graphs, as
described in [Future Plans](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#future-plans).
This also requires changes to the console queries so that they support
OS-specific metrics.

## Design Details

### Test Plan

The current tests verify that:
* The operator namespace, `openshift-windows-machine-config-operator`, has the
`openshift.io/cluster-monitoring=true` label.
* The Service, Endpoints and Service Monitor objects are created as expected.
* Prometheus is able to collect data from windows_exporter.
* The custom Prometheus rules return Windows data.
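
The first check in the list above could look like the sketch below, which inspects a namespace manifest that has already been fetched and decoded into a dictionary. The function name and the dictionary-based input are assumptions for illustration and do not reflect how the WMCO tests are written.

```python
MONITORING_LABEL = "openshift.io/cluster-monitoring"


def has_monitoring_label(namespace: dict) -> bool:
    """Return True if the namespace carries the cluster-monitoring label."""
    # metadata.labels may be absent or null on a namespace without labels.
    labels = (namespace.get("metadata") or {}).get("labels") or {}
    # Label values are strings in Kubernetes, so compare against "true".
    return labels.get(MONITORING_LABEL) == "true"
```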

The test plan for the [future implementation](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#future-plans)
will reuse the existing tests to verify creation of the windows_exporter Windows
service and of the metrics Service, Endpoints and Service Monitor objects.
However, WMCO will not be responsible for testing the Prometheus rules created
for Windows. The tests written for the unified interface will ensure that
console queries return results for both Windows and Linux nodes.

### Graduation Criteria

This enhancement will start as GA.

### Upgrade / Downgrade Strategy

* WMCO is responsible for upgrading the [windows_exporter](https://github.com/prometheus-community/windows_exporter/tags)
binary to the latest release. Downgrades are [not supported](https://github.com/operator-framework/operator-lifecycle-manager/issues/1177)
by OLM.

## Implementation History

v1: Initial Proposal

## Drawbacks

Running windows_exporter as a Windows service instead of as a DaemonSet pod
makes it harder for Prometheus to monitor Windows nodes and workloads. We cannot
run windows_exporter as a pod on Windows nodes for the support reasons described
in the [WMCO enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/windows-machine-config-operator.md#justification).
