Enhancement proposal for monitoring Windows Nodes
Enhancement proposal for enabling monitoring on Windows nodes created by the Windows Machine Config Operator (WMCO).
1 parent 9eb5f69, commit 09013b7
Showing 1 changed file with 170 additions and 0 deletions.
enhancements/windows-containers/monitoring-windows-nodes.md
---
title: monitoring-windows-nodes
authors:
- "@VaishnaviHire"
- "@PratikMahajan"
reviewers:
- "@openshift/openshift-team-windows-containers"
- "@simonpasquier"
- "@spadgett"
approvers:
- "@sdodson"
- "@simonpasquier"
- "@spadgett"
creation-date: 2021-02-08
last-updated: 2021-02-12
status: implementable
---

# Monitoring Windows Nodes

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Operational readiness criteria are defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The intent of this enhancement is to enable performance monitoring on Windows
nodes created by the Windows Machine Config Operator (WMCO) in an OpenShift
cluster.

## Motivation

Monitoring is critical for identifying issues with nodes and with the
containers running on them. The main motivation behind this enhancement is to
enable monitoring on Windows nodes.

### Goals

As part of this enhancement, we plan to do the following:
* Run [windows_exporter](https://github.com/prometheus-community/windows_exporter)
  as a service on Windows nodes
* Upgrade windows_exporter on the Windows nodes
* Leverage the cluster monitoring operator, which sets up Prometheus,
  Alertmanager, and other components

### Non-Goals

As part of this enhancement, we do not plan to do the following:
* Integrate windows_exporter with the cluster monitoring operator
* Ship Grafana dashboards for Windows nodes

## Proposal

The main idea is to run windows_exporter as a Windows service and let the
Prometheus instance provisioned as part of the OpenShift installation collect
data from it. The metrics exposed by windows_exporter will be used to display
node graphs and workload graphs for Windows nodes.

## Justification

Unlike [Node exporter](https://github.com/prometheus/node_exporter) on Linux
nodes, windows_exporter cannot run as a container on the Windows nodes, since
Windows container images contain a Windows kernel and Red Hat has a policy not
to ship third-party kernels for support reasons. Please refer to the [WMCO
enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/windows-machine-config-operator.md#justification)
for more details. Hence, we need to ensure that the Prometheus operator watches
the endpoint, `<internal-ip>:9182/metrics`, exposed by windows_exporter on
every Windows node.

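As a rough illustration of how the per-node endpoints could be derived, the
following Python sketch takes Node objects in the JSON shape returned by the
Kubernetes API and builds a `<internal-ip>:9182/metrics` target for each
Windows node. The helper names (`internal_ip`, `windows_exporter_targets`) are
hypothetical and not part of WMCO, and selecting nodes by the
`kubernetes.io/os` label is an assumption about how Windows nodes would be
identified.

```python
EXPORTER_PORT = 9182  # default windows_exporter port named in this proposal


def internal_ip(node):
    """Return the first InternalIP address of a Node dict, or None."""
    for addr in node.get("status", {}).get("addresses", []):
        if addr.get("type") == "InternalIP":
            return addr.get("address")
    return None


def windows_exporter_targets(nodes):
    """Build '<internal-ip>:9182/metrics' targets for Windows nodes only."""
    targets = []
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        # Assumption: Windows nodes carry the well-known OS label.
        if labels.get("kubernetes.io/os") != "windows":
            continue
        ip = internal_ip(node)
        if ip is not None:
            targets.append(f"{ip}:{EXPORTER_PORT}/metrics")
    return targets
```

Nodes without an `InternalIP` address are skipped rather than treated as an
error, which keeps the sketch short; a real implementation would likely report
them.
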
### Implementation Details/Notes/Constraints

#### Current Plans

To enable basic monitoring support for Windows nodes, WMCO does the following:

* Build and add the windows_exporter binary to the WMCO payload.
* Install windows_exporter on the Windows nodes and ensure that it runs as a
  Windows service.
* Add the `openshift.io/cluster-monitoring=true` label to the
  `openshift-windows-machine-config-operator` namespace so that the cluster
  monitoring stack picks up the Service Monitor created by WMCO.
* Add privileges to WMCO to create Service, Endpoints, and Service Monitor
  objects in the `openshift-windows-machine-config-operator` namespace.
* Create a Service and an Endpoints object in the
  `openshift-windows-machine-config-operator` namespace that point to the
  windows_exporter endpoint. WMCO uses default values to define the metrics
  endpoint, `<internal-ip>:9182/metrics`, exposed by windows_exporter on every
  Windows node. The Endpoints object created in the namespace consists of
  subsets of endpoints from all the Windows nodes.
* Create a Service Monitor in the `openshift-windows-machine-config-operator`
  namespace for the Service created above.

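The Service, Endpoints, and Service Monitor objects described above could look
roughly like the manifests built by this Python sketch. It is not WMCO code:
the object name `windows-exporter`, the port name `metrics`, the `name` label,
and the 30s scrape interval are all assumptions chosen for illustration, and
the sketch places every node address in a single subset for brevity.

```python
NAMESPACE = "openshift-windows-machine-config-operator"
NAME = "windows-exporter"   # hypothetical object name
PORT_NAME = "metrics"       # hypothetical port name
EXPORTER_PORT = 9182        # default windows_exporter port from this proposal


def build_monitoring_objects(node_ips):
    """Return (Service, Endpoints, ServiceMonitor) manifests as dicts."""
    labels = {"name": NAME}
    meta = {"name": NAME, "namespace": NAMESPACE, "labels": labels}
    port = {"name": PORT_NAME, "port": EXPORTER_PORT, "protocol": "TCP"}

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": dict(meta),
        # No selector: windows_exporter runs as a Windows service, not a pod,
        # so the matching Endpoints object has to be maintained explicitly.
        "spec": {"ports": [dict(port, targetPort=EXPORTER_PORT)]},
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # Endpoints are associated with a Service by sharing its name.
        "metadata": dict(meta),
        "subsets": (
            [{"addresses": [{"ip": ip} for ip in node_ips], "ports": [port]}]
            if node_ips else []
        ),
    }
    service_monitor = {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": dict(meta),
        "spec": {
            "selector": {"matchLabels": labels},
            "endpoints": [
                {"port": PORT_NAME, "path": "/metrics", "interval": "30s"}
            ],
        },
    }
    return service, endpoints, service_monitor
```

Serializing the returned dictionaries with `json.dumps` yields documents that
`oc apply -f -` would accept, since Kubernetes manifests may be written as
JSON as well as YAML.
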
To display node and workload graphs, WMCO does the following:

* Add custom Prometheus rules in the `openshift-windows-machine-config-operator`
  namespace. The custom recording rules are built from the Windows metrics
  exposed by windows_exporter and have the same names as the Linux recording
  rules, so that the console can reuse the queries it already uses for Linux.
* Note that WMCO is unable to display workload graphs for Windows nodes with
  the current implementation. See [Risks and Mitigations](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#risks-and-mitigations)
  for details.

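To make the recording-rule idea concrete, here is a minimal Python sketch that
builds a `PrometheusRule` manifest recording a Windows CPU expression under a
Linux-style rule name. The rule name `instance:node_cpu_utilisation:rate1m`,
the group name `windows.rules`, the object name, and the expression itself are
illustrative assumptions, not the rules WMCO actually ships; only the
`windows_cpu_time_total` metric and its `mode` label come from
windows_exporter.

```python
NAMESPACE = "openshift-windows-machine-config-operator"


def recording_rule(record, expr):
    """Return one recording rule entry for a PrometheusRule group."""
    return {"record": record, "expr": expr}


def build_windows_rules():
    """Build a PrometheusRule that exposes Windows data under Linux names."""
    rules = [
        recording_rule(
            # Assumed Linux-side rule name that console queries would use.
            "instance:node_cpu_utilisation:rate1m",
            # Non-idle CPU fraction, averaged over all cores of a node.
            '1 - avg without (core, mode) '
            '(rate(windows_cpu_time_total{mode="idle"}[1m]))',
        ),
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "windows-rules", "namespace": NAMESPACE},
        "spec": {"groups": [{"name": "windows.rules", "rules": rules}]},
    }
```

Because the recorded series share a name with their Linux counterparts and
differ only in labels such as `instance`, a console query written for Linux
nodes would return Windows nodes as well.
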
#### Future Plans

As we move forward, our plan for displaying monitoring graphs is to create a
common interface for Windows and Linux recording rules that is managed by the
cluster monitoring operator. WMCO would no longer be responsible for creating
custom Prometheus rules to display node graphs. The unified monitoring
interface will return results for both Linux and Windows nodes from a single
query. The console queries built on the unified interface will differ from the
existing queries, which will need to be updated. This will ensure a consistent
monitoring user experience across Linux and Windows.

### Risks and Mitigations

The main risk with this proposal is renaming Windows metrics to display
workload graphs. The pod metrics for Linux come from cAdvisor. However, we do
not get the same metrics from cAdvisor for Windows nodes. This makes it
difficult to display pod graphs by creating custom recording rules that reuse
the console queries written for Linux workloads.

Mitigation:
Use metrics exposed by windows_exporter to display pod graphs, as described in
[Future Plans](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#future-plans).
This also requires changes to the console queries so that they support
OS-specific metrics.

## Design Details

### Test Plan

The current tests check that:
* The operator namespace, `openshift-windows-machine-config-operator`, has the
  `openshift.io/cluster-monitoring=true` label.
* The Service, Endpoints, and Service Monitor objects are created as expected.
* Prometheus is able to collect data from windows_exporter.
* Custom Prometheus rules return Windows data.

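One way to approximate the data-collection check offline is to inspect the
text that windows_exporter serves on its metrics endpoint. The Python sketch
below extracts the `windows_`-prefixed metric names from a payload in the
Prometheus text exposition format. The function name is hypothetical and this
is not how the WMCO tests are implemented; it only shows the kind of assertion
such a test could make.

```python
def windows_metric_names(exposition_text):
    """Return the set of metric names starting with 'windows_'."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample looks like 'name{labels} value' or 'name value'.
        name = line.split("{", 1)[0].split()[0]
        if name.startswith("windows_"):
            names.add(name)
    return names
```

A test could then fail when the returned set is empty for a node that is
expected to be scraped.
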
The test plan for the [future implementation](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/monitoring-windows-nodes.md#future-plans)
will reuse the existing tests to verify creation of the windows_exporter
service and of the metrics Service, Endpoints, and Service Monitor objects.
However, WMCO will not be responsible for testing the Prometheus rules created
for Windows. The tests written for the unified interface will ensure that
console queries return results for both Windows and Linux nodes.

### Graduation Criteria

This enhancement will start as GA.

### Upgrade / Downgrade Strategy

* WMCO is responsible for upgrading the [windows_exporter](https://github.com/prometheus-community/windows_exporter/tags)
  binary to the latest release. Downgrades are [not supported](https://github.com/operator-framework/operator-lifecycle-manager/issues/1177)
  by OLM.

## Implementation History

v1: Initial Proposal

### Drawbacks

Running windows_exporter as a Windows service instead of as a DaemonSet pod
makes it harder for Prometheus to monitor Windows nodes and workloads.
windows_exporter cannot run as a pod on Windows nodes for the support reasons
described in the [WMCO enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/windows-containers/windows-machine-config-operator.md#justification).