WINC-544: Enhancement proposal for monitoring Windows Nodes #647

VaishnaviHire · 2021-02-11T22:22:55Z

Enhancement proposal for enabling monitoring on Windows nodes created by the Windows Machine Config Operator (WMCO).

russellb · 2021-02-15T17:18:54Z

I would suggest updating the PR title (and commit message) to reflect that this is about monitoring Windows nodes, so that it is clear at a glance what the PR covers.

VaishnaviHire · 2021-02-16T22:35:17Z

/cc @sdodson

VaishnaviHire · 2021-02-16T22:36:00Z

/cc @openshift/openshift-team-windows-containers

VaishnaviHire · 2021-02-16T22:41:02Z

/cc @simonpasquier

This enhancement describes how to create a unified monitoring interface for Windows and Linux.

VaishnaviHire · 2021-02-16T22:42:30Z

/cc @spadgett

enhancements/windows-containers/monitoring-windows-nodes.md

simonpasquier

Overall the approach of maintaining a custom endpoint object sounds ok to me. It can be noted that it is similar to what the Prometheus operator is doing with the kube-system/kubelet object.

@s-urbaniak for awareness.

enhancements/windows-containers/monitoring-windows-nodes.md

simonpasquier · 2021-02-22T16:16:26Z

enhancements/windows-containers/monitoring-windows-nodes.md

+
+### Future Plans
+
+As we move forward, our plan to display monitoring graphs is to create a [common


I'm not sure we want to offload the recording rules to the cluster-monitoring-operator (CMO) repository. My thinking was more that the Linux-specific rules would live in CMO while the Windows-specific rules would live in WMCO. Otherwise I fear that ownership will be diluted.
Another option would be to move all recording rules to the console operator but again my intuition is that the rules should stay close to the things that expose the metrics.

I am not sure how distributed ownership of recording rules will work if we are thinking of creating a unified interface. My opinion is that WMCO will be responsible for making sure that we receive metrics from Windows nodes, and CMO handles the complexity of grouping these (Linux + Windows) metrics to make the queries platform-independent. This will help reduce the complexity on the Console side as well. See

The issue with pushing down all recording rules to CMO is that CMO doesn't have a way to validate that Windows-specific rules are correct. We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.

Also what is the general testing strategy to ensure that the console works properly for Windows?
@spadgett, I think that there are e2e console tests validating that the dashboards display correct results?

> We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.

@VaishnaviHire please look into this.

> Also what is the general testing strategy to ensure that the console works properly for Windows?

We were discussing this as a console change broke the filesystem graphs for Windows nodes: BZ 1930347. The approach we were thinking about was to add a Windows e2e to the console repo or add console tests for validating the dashboards to the WMCO repo depending on whatever is easier. We will defer to @spadgett on what the better approach is.

@simonpasquier I have included a list of metrics used for displaying console graphs and the corresponding Windows metrics in this doc. Please take a look. I have also added the cases where the existing queries won't work for Windows.

Updated to reflect that Windows-specific rules will be added in WMCO.

@simonpasquier Updated the future plans section to reflect the discussion from our meeting. Please review.

enhancements/windows-containers/monitoring-windows-nodes.md

simonpasquier · 2021-03-09T17:02:45Z

enhancements/windows-containers/monitoring-windows-nodes.md

+* Since the windows-exporter is not running as a [pod](#justification), the
+  endpoint is not secure. The reason for this is when running inside a pod, we
+  can use CA signer for providing TLS cert/key to the service for
+  authentication. We plan to leverage windows_exporter's support for `https`
+  configuration. WMCO will be responsible for adding [web config](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md)
+  for TLS. This will ensure that the metrics Endpoint will be able to
+  authenticate the requests.


While the exporter toolkit supports TLS for client authentication, it would only validate that the certificate presented by the client is signed by the trusted certificate authority. For instance, it won't be able to validate the certificate subject. This is something I plan to bring upstream because we might need it for node_exporter too.
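To make the gap concrete: CA-only validation accepts any client whose certificate chains to the trusted CA, while subject validation also checks which client it is. Below is a minimal sketch of such a subject check, assuming the peer certificate is available in the dictionary form returned by Python's `ssl.SSLSocket.getpeercert()`. The function names and the allowed common name are hypothetical and are not part of the exporter toolkit.

```python
def subject_common_names(peer_cert):
    """Collect commonName values from a cert dict shaped like ssl.SSLSocket.getpeercert()."""
    names = []
    # "subject" is a tuple of RDNs, each RDN being a tuple of (key, value) pairs.
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                names.append(value)
    return names


def is_client_allowed(peer_cert, allowed_common_names):
    """Return True only if the certificate subject names an allowed client.

    Assumes the TLS handshake has already verified the CA signature;
    this adds the subject check that CA-only validation does not perform.
    """
    return any(cn in allowed_common_names for cn in subject_common_names(peer_cert))
```

With CA-only validation both checks below would pass the handshake; only the subject check tells the two clients apart.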

simonpasquier · 2021-03-24T09:03:53Z

enhancements/windows-containers/monitoring-windows-nodes.md

+  [support](https://github.com/prometheus-operator/prometheus-operator/issues/3862)
+  for EndpointSlices object.
+
+#### Securing windows_exporter endpoint


It doesn't say how the windows exporter would get a certificate and key materials if it is configured with TLS.

simonpasquier · 2021-03-24T09:25:40Z

enhancements/windows-containers/monitoring-windows-nodes.md

+  we have a consistent user experience for monitoring across Linux and Windows.
+* In the cases where `metric labels` are equivalent, we plan to relabel the
+  Windows metrics to align with the Linux metrics.
+


There's still a gap with the "USE Method / Node" dashboard in the monitoring menu.

There's already a dashboard for Windows in kubernetes-monitoring/kubernetes-mixin. IIUC it would be technically possible for WMCO to ship this dashboard as a configmap in the openshift-config-managed namespace with the console.openshift.io/dashboard=true label (the logging operator already does that) but it would require additional permissions for WMCO.
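As a rough illustration of what shipping such a dashboard could look like, the sketch below builds a ConfigMap manifest with the namespace and label mentioned above. The ConfigMap name, the data key, and the dashboard body are placeholders (assumptions), not what WMCO actually ships.

```python
import json


def dashboard_configmap(name, dashboard):
    """Build a ConfigMap manifest carrying a dashboard definition for the console.

    The namespace and label come from the discussion above; the name and
    the "<name>.json" data key are hypothetical choices for illustration.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": "openshift-config-managed",
            "labels": {"console.openshift.io/dashboard": "true"},
        },
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


manifest = dashboard_configmap("windows-node-dashboard", {"title": "Windows Nodes", "rows": []})
print(json.dumps(manifest, indent=2))
```

Creating the object in that namespace is what would require the additional WMCO permissions mentioned above.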

cc @spadgett to assess whether adding dashboards in the openshift-config-managed namespace is supported in this case.

@spadgett could you please take that?

@simonpasquier Yes, you can definitely add dashboards to openshift-config-managed here. I think it makes a lot of sense, particularly if the Home -> Overview page isn't including metrics for Windows nodes.

Added a sub-section to address the grafana dashboard in the future plans section, @simonpasquier @spadgett PTAL.

simonpasquier · 2021-03-24T09:59:24Z

enhancements/windows-containers/monitoring-windows-nodes.md

+  Windows metrics to align with the Linux metrics.
+
+##### Node Metrics
+


I would add another column explaining how each metric is going to be used by the console eventually.

In detail:

node_memory_MemTotal_bytes / windows_cs_physical_memory_bytes => WMCO renames the Windows metric to node_memory_MemTotal_bytes because there are no label differences

node_memory_MemAvailable_bytes / windows_memory_available_bytes => WMCO renames the Windows metric to node_memory_MemAvailable_bytes because there are no label differences

node_filesystem_size_bytes & node_filesystem_free_bytes / windows_logical_disk_size_bytes & windows_logical_disk_free_bytes => today the console uses these metrics to report the storage capacity (example). Two new recording rules should be created by CMO and WMCO to aggregate the metrics per node/instance. Once they are available, the console can use them instead of the custom queries:

instance:filesystem_size_bytes:sum

instance:filesystem_used_bytes:sum

node_cpu_seconds_total => console uses metrics from recording rules (e.g. instance:node_cpu:rate:sum), WMCO needs to produce similar metrics.
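The two treatments in the list above (plain renames versus per-instance aggregation) can be sketched in a few lines. This is only an illustration of the intended semantics, not actual recording-rule or relabeling configuration. It assumes samples are `(metric_name, labels, value)` tuples, that `windows_memory_available_bytes` maps to `node_memory_MemAvailable_bytes` (its counterpart by name), and that "used" is computed as size minus free; the function names are hypothetical.

```python
from collections import defaultdict

# Renames for metrics whose labels already match their Linux counterparts.
RENAMES = {
    "windows_cs_physical_memory_bytes": "node_memory_MemTotal_bytes",
    "windows_memory_available_bytes": "node_memory_MemAvailable_bytes",
}


def rename_equivalent(samples):
    """Rename Windows metrics to their Linux names, leaving other samples untouched."""
    return [(RENAMES.get(name, name), labels, value) for name, labels, value in samples]


def filesystem_sums(samples, size_metric, free_metric):
    """Aggregate filesystem metrics per instance, dropping per-volume/per-device labels."""
    size = defaultdict(float)
    free = defaultdict(float)
    for name, labels, value in samples:
        if name == size_metric:
            size[labels["instance"]] += value
        elif name == free_metric:
            free[labels["instance"]] += value
    return {
        "instance:filesystem_size_bytes:sum": dict(size),
        # Assumption: used bytes = total size - free bytes, summed per instance.
        "instance:filesystem_used_bytes:sum": {i: size[i] - free[i] for i in size},
    }
```

Because the aggregated series carry only the instance, the differing `volume` and `device`/`mountpoint`/`fstype` labels no longer matter to the console query.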

simonpasquier · 2021-03-24T13:40:11Z

enhancements/windows-containers/monitoring-windows-nodes.md

+| node_filesystem_free_bytes     | windows_logical_disk_free_bytes  | Missing Label: (device, mountpoint, fstype) Additional label: (volume)   |
+| node_cpu_seconds_total         | windows_cpu_time_total           | Missing Label : cpu Additional Label: core                               |
+
+##### Pod Metrics


Same as above: we need to identify where recording rules would need to be created or updated, and where WMCO can modify the metrics to look like metrics from Linux nodes.

mansikulkarni96 · 2021-03-26T21:52:28Z

@simonpasquier updated the PR, PTAL.

aravindhp · 2021-04-16T20:03:59Z

@simonpasquier @spadgett please review

spadgett

Thanks! The console changes LGTM. I'll let @simonpasquier and others weigh in on the other areas.

/approve

spadgett · 2021-04-21T14:42:32Z

enhancements/windows-containers/monitoring-windows-nodes.md

+
+As part of this enhancement, we do not plan to do the following:
+* Integrating windows_exporter with cluster monitoring operator
+* Ship Grafana dashboards for Windows Nodes


It looks like we do plan to do this now? (Using console Monitoring -> Dashboards.)

openshift-ci-robot · 2021-04-21T14:45:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spadgett

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [spadgett]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aravindhp · 2021-04-26T15:41:38Z

@simonpasquier we are hoping to close this out at the end of this sprint. Please let us know if more changes are needed.

aravindhp · 2021-04-30T18:00:48Z

/lgtm

@simonpasquier it looks like all open items have been addressed by @mansikulkarni96. Please comment if you want any changes and we will handle them in a follow-up PR.

Enhancement proposal for enabling monitoring on Windows nodes created by Windows Machine Config Operator(WMCO).

aravindhp · 2021-05-03T14:28:40Z

/lgtm

openshift-ci-robot requested review from enxebre and joelanford February 11, 2021 22:23

VaishnaviHire changed the title ~~Enhancement proposal for monitoring~~ [WIP]Enhancement proposal for monitoring Feb 11, 2021

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2021

VaishnaviHire force-pushed the windows_monitoring branch 5 times, most recently from 3550806 to 852b693 Compare February 12, 2021 21:26

VaishnaviHire force-pushed the windows_monitoring branch from 852b693 to 09013b7 Compare February 16, 2021 15:48

VaishnaviHire changed the title ~~[WIP]Enhancement proposal for monitoring~~ [WIP]Enhancement proposal for monitoring Windows Nodes Feb 16, 2021

VaishnaviHire force-pushed the windows_monitoring branch 5 times, most recently from 946e3dc to d3cbd51 Compare February 16, 2021 21:30

VaishnaviHire changed the title ~~[WIP]Enhancement proposal for monitoring Windows Nodes~~ WINC-544: Enhancement proposal for monitoring Windows Nodes Feb 16, 2021

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2021

openshift-ci-robot requested a review from sdodson February 16, 2021 22:35

openshift-ci-robot requested a review from simonpasquier February 16, 2021 22:41

openshift-ci-robot requested a review from spadgett February 16, 2021 22:42

spadgett reviewed Feb 19, 2021

VaishnaviHire force-pushed the windows_monitoring branch from d3cbd51 to c1d6ba4 Compare February 22, 2021 15:27

simonpasquier reviewed Feb 22, 2021

VaishnaviHire force-pushed the windows_monitoring branch from c1d6ba4 to 64260d0 Compare February 22, 2021 23:27

VaishnaviHire force-pushed the windows_monitoring branch from 639cf00 to 1fa886c Compare March 3, 2021 18:19

aravindhp reviewed Mar 3, 2021

VaishnaviHire force-pushed the windows_monitoring branch from 1fa886c to 4fe5cf2 Compare March 4, 2021 23:44

simonpasquier reviewed Mar 9, 2021

VaishnaviHire force-pushed the windows_monitoring branch 3 times, most recently from 86f3541 to fdda97f Compare March 10, 2021 05:11

simonpasquier reviewed Mar 24, 2021

mansikulkarni96 force-pushed the windows_monitoring branch from 1286d1e to e0eb5c8 Compare March 26, 2021 21:43

mansikulkarni96 force-pushed the windows_monitoring branch 3 times, most recently from 543cebd to 12004c8 Compare March 29, 2021 17:44

mansikulkarni96 force-pushed the windows_monitoring branch 2 times, most recently from 19af7be to ce41d4a Compare April 5, 2021 20:33

spadgett approved these changes Apr 21, 2021

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2021

mansikulkarni96 force-pushed the windows_monitoring branch from ce41d4a to 8695014 Compare April 22, 2021 15:18

openshift-ci-robot assigned aravindhp Apr 30, 2021

openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Apr 30, 2021

Enhancement proposal for monitoring Windows Nodes

258dc80

Enhancement proposal for enabling monitoring on Windows nodes created by Windows Machine Config Operator(WMCO).

mansikulkarni96 force-pushed the windows_monitoring branch from b441e4b to 258dc80 Compare May 2, 2021 14:26

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2021

openshift-merge-robot merged commit 0e2f247 into openshift:master May 3, 2021
