-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WINC-544: Enhancement proposal for monitoring Windows Nodes #647
WINC-544: Enhancement proposal for monitoring Windows Nodes #647
Conversation
3550806
to
852b693
Compare
I would suggest updating the PR title (and commit message) to reflect that this is about monitoring windows nodes to help make it more clear at a glance what this is about |
852b693
to
09013b7
Compare
946e3dc
to
d3cbd51
Compare
/cc @sdodson |
/cc @openshift/openshift-team-windows-containers |
/cc @simonpasquier This enhancement contains details regarding creating unified monitoring interface for Windows and Linux. |
/cc @spadgett |
d3cbd51
to
c1d6ba4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the approach of maintaining a custom endpoint object sounds ok to me. It can be noted that it is similar to what the Prometheus operator is doing with the kube-system/kubelet
object.
@s-urbaniak for awareness.
|
||
### Future Plans | ||
|
||
As we move forward, our plan to display monitoring graphs is to create a [common |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to offload the recording rules to the cluster-monitoring-operator (CMO) repository. My thinking was more that the Linux-specific rules would live in CMO while the Windows-specific rules would live in WMCO. Otherwise I fear that ownership will be diluted.
Another option would be to move all recording rules to the console operator but again my intuition is that the rules should stay close to the things that expose the metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how distributed ownership of recording rules will work if we are thinking of creating a unified interface. My opinion on this is that WMCO will be responsible for making sure that we receive metrics from windows nodes and CMO handles the complexity of grouping these(Linux + Windows) metrics to make the queries platform-independent. This will help in reducing the complexity on the Console side as well. See
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with pushing down all recording rules to CMO is that CMO doesn't have a way to validate that Windows-specific rules are correct. We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.
Also what is the general testing strategy to ensure that the console works properly for Windows?
@spadgett, I think that there are e2e console tests validating that the dashboards display correct result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.
@VaishnaviHire please look into this.
Also what is the general testing strategy to ensure that the console works properly for Windows?
We were discussing this as a console change broke the filesystem graphs for Windows nodes: BZ 1930347. The approach we were thinking about was to add a Windows e2e to the console repo or add console tests for validating the dashboards to the WMCO repo depending on whatever is easier. We will defer to @spadgett on what the better approach is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonpasquier I have included a list of metrics used for displaying console graphs and the corresponding Windows metrics in this doc. Please take a look. I have also added the cases where the existing queries won`t work for Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to reflect Windows-specific
rules will be added in WMCO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonpasquier Updated the future plans
section to reflect the discussion from our meeting. Please review.
c1d6ba4
to
64260d0
Compare
639cf00
to
1fa886c
Compare
1fa886c
to
4fe5cf2
Compare
* Since the windows-exporter is not running as a [pod](#justification), the | ||
endpoint is not secure. The reason for this is when running inside a pod, we | ||
can use CA signer for providing TLS cert/key to the service for | ||
authentication. We plan to leverage windows_exporter's support for `https` | ||
configuration. WMCO will be responsible for adding [web config](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md) | ||
for TLS. This will ensure that the metrics Endpoint will be able to | ||
authenticate the requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the exporter toolkit supports TLS for client authentication, it would only validate that the certificate presented by the client is signed by the trusted certificate authority. For instance, it won't be able to validate the certificate subject. This is something I plan to bring upstream because we might need it for node_exporter too.
86f3541
to
fdda97f
Compare
[support](https://github.com/prometheus-operator/prometheus-operator/issues/3862) | ||
for EndpointSlices object. | ||
|
||
#### Securing windows_exporter endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't say how the windows exporter would get a certificate and key materials if it is configured with TLS.
we have a consistent user experience for monitoring across Linux and Windows. | ||
* In the cases where `metric labels` are equivalent, we plan to relabel the | ||
Windows metrics to align with the Linux metrics. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a gap with the "USE Method / Node" dashboard in the monitoring menu.
There's already a dashboard for Windows in kubernetes-monitoring/kubernetes-mixin. IIUC it would be technically possible for WMCO to ship this dashboard as a configmap in the openshift-config-managed
namespace with the console.openshift.io/dashboard=true
label (the logging operator already does that) but it would require additional permissions for WMCO.
cc @spadgett to assess whether adding dashboards in the openshift-config-managed
namespace is supported in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spadgett could you please take that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonpasquier Yes, you can definitely add dashboards to openshift-config-managed
here. I think it makes a lot of sense, particularly if the Home -> Overview page isn't including metrics for Windows nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a sub-section to address the grafana dashboard in the future plans section, @simonpasquier @spadgett PTAL.
Windows metrics to align with the Linux metrics. | ||
|
||
##### Node Metrics | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add another column explaining how each metric is going to be used by the console eventually.
In details:
- node_memory_MemTotal_bytes / windows_cs_physical_memory_bytes => WMCO renames the Windows metric to node_memory_MemTotal_bytes because there are no label differences
- node_memory_MemAvailable_bytes / windows_memory_available_bytes => WMCO renames the Windows metric to node_memory_MemTotal_bytes because there are no label differences
- node_filesystem_size_bytes & node_filesystem_free_bytes / windows_logical_disk_size_bytes & windows_logical_disk_free_bytes => today the console uses these metrics to report the storage capacity (example), 2 new recording rules should be created by CMO and WMCO which would aggregate metrics per node/instance. Once available the console can use them instead of the custom queries.
instance:filesystem_size_bytes:sum
instance:filesystem_used_bytes:sum
- node_cpu_seconds_total => console uses metrics from recording rules (e.g.
instance:node_cpu:rate:sum
), WMCO needs to produce similar metrics.
| node_filesystem_free_bytes | windows_logical_disk_free_bytes | Missing Label: device, mountpoint, fstype) Additional label : (volume) | | ||
| node_cpu_seconds_total | windows_cpu_time_total | Missing Label : cpu Additional Label: core | | ||
|
||
##### Pod Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, we need to identify where recording rules would need to be created/updated vs. where WMCO can modify the metrics to look like metrics from Linux nodes.
1286d1e
to
e0eb5c8
Compare
@simonpasquier updated PR PTAL. |
543cebd
to
12004c8
Compare
19af7be
to
ce41d4a
Compare
@simonpasquier @spadgett please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The console changes LGTM. I'll let @simonpasquier and others weigh in on the other areas.
/approve
|
||
As part of this enhancement, we do not plan to do the following: | ||
* Integrating windows_exporter with cluster monitoring operator | ||
* Ship Grafana dashboards for Windows Nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we do plan to do this now? (Using console Monitoring -> Dashboards.)
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: spadgett The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ce41d4a
to
8695014
Compare
@simonpasquier we are hoping to close this out at the end of this sprint. Please let us know if more changes are needed. |
/lgtm @simonpasquier looks all open items have been addressed by @mansikulkarni96. Please comment if you want any changes and we will handle that with a follow up PR. |
Enhancement proposal for enabling monitoring on Windows nodes created by Windows Machine Config Operator(WMCO).
b441e4b
to
258dc80
Compare
/lgtm |
Enhancement proposal for enabling monitoring on
Windows nodes created by Windows Machine Config Operator(WMCO).