add restart metrics enhancement #255
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: rphillips. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
If it's our goal to monitor restarts/lifetime of crio and kubelet (from what I can tell), why don't we just use the metrics those two components already expose directly, for example process_start_time_seconds, which comes from the Go process collector that both crio and kubelet already have?
Note that I commented that this is about monitoring restarts of all services on the box, not just kubelet and crio. We are responsible for knowing whether all services shipped as part of the OS or installed are restarting.
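For reference, a minimal sketch of how a restart signal could be derived from that metric alone, assuming a plain Prometheus rules file and a job="kubelet" scrape label (both are assumptions about the cluster's configuration, not something specified here):

```yaml
groups:
  - name: example-restart-signals          # hypothetical group name
    rules:
      - record: example:kubelet_restarts:changes1h
        # changes() counts how often the start-time gauge changed in the
        # window, i.e. how many times the kubelet process was restarted.
        expr: changes(process_start_time_seconds{job="kubelet"}[1h])
```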
Force-pushed from ddaa06e to c9e3d86.
@lilic @smarterclayton I updated the proposal with everything we discussed. I think we could defer the alarm definitions until after we see some sample clusters. Ready for re-review.
I also added a list of possible core services we would want to start with.
Looking good, couple of questions.
cc @openshift/openshift-team-monitoring
### Non-Goals

- Defining alerts is not in the scope of this proposal. Alerts can be defined
I assume the node team would define those alerts; we are of course happy to shepherd this! :)
Yes!
### Test Plan

- Validate we see system dbus metrics from the whitelisted services
Where do you want these tests? I am assuming in the origin Prometheus tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we do metric validation there, that would be great.
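As a rough illustration only, an e2e metric-validation test could assert that a query such as the one in this recording-rule fragment returns data for each whitelisted unit; the systemd_unit_state metric name is an assumption about what systemd_exporter exposes, not something this proposal guarantees:

```yaml
# Hypothetical presence check; the metric and unit names are illustrative.
- record: example:whitelisted_units_reporting
  expr: count by (name) (systemd_unit_state{name=~"kubelet.service|crio.service"})
```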
### Graduation Criteria

Delivered in 4.5.
We do agree that this is a feature and not a bug(zilla) and there is no need to backport this feature, right? :) cc @smarterclayton
text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the
  following core services:
I feel like at least the services listed here deserve to be monitored in a whitebox fashion, not just indirectly through the proposed mechanism. Can we ensure that, as a follow-up of this "generic" monitoring, we also add whitebox monitoring to the node team's and/or CoreOS team's backlogs? kubelet and crio already are, chrony would have been helpful multiple times already, and I imagine the same holds true for all the other services listed here (I believe I've talked about NetworkManager with @lucab before as well).
@brancz I can add a section on what we would like to add later... Can you provide me some examples of the whitebox testing you would like to see?
Essentially, each of these components should either have a metrics endpoint of their own or we should write an exporter for it. kubelet and crio already do, so that's good, but all the others in this list should as well, or at least have an exporter.
Sorry I can't seem to find where this is documented. Can you point me at it?
Most of these other components are system components, and the restart metric is likely a good indicator of failure.
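To make that concrete, a recording rule over the exporter's restart counter might look like the fragment below; systemd_service_restart_total (fed by systemd's NRestarts property) is an assumed metric name, and alerting on it stays out of scope per the Non-Goals section:

```yaml
# Sketch only: the metric name and the chosen unit/window are illustrative.
- record: example:systemd_service_restarts:increase30m
  expr: increase(systemd_service_restart_total{name="crio.service"}[30m])
```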
One concern left from me, otherwise looks good to me.
- This proposal is to add a privileged sidecar running
  [systemd_exporter](https://github.com/povilasv/systemd_exporter) to the
  node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
Still would prefer this DaemonSet to be separate from the node-exporter one; any reasons against this?
Clayton specifically mentioned a sidecar, which is typically deployed with the dependent component. Since the systemd_exporter depends on node_exporter, I think it makes sense to deploy it in the same pod.
I don't see any dependency between those two components. Could someone describe what the dependency is here and why it couldn't be deployed as its own DaemonSet?
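For context, a minimal sketch of what the sidecar variant could look like when appended to the node-exporter DaemonSet's containers list; the image, flag, port, and volume wiring below are placeholders rather than the actual manifest:

```yaml
# Hypothetical sidecar entry; everything here is illustrative.
- name: systemd-exporter
  image: quay.io/example/systemd_exporter:latest   # placeholder image
  args:
    - --web.listen-address=127.0.0.1:9558          # assumed flag and port
  securityContext:
    privileged: true                               # host systemd/dbus access
  volumeMounts:
    - name: dbus                                   # hypothetical volume name
      mountPath: /var/run/dbus
      readOnly: true
```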
@lilic does this enhancement sound ok?
@brancz?
To me this #255 (comment) is still not resolved. When we agree on that, it looks good to me 👍
I'm maybe just missing it, but I also don't see this: #255 (comment)
The other processes are system level processes. I doubt the maintainers will want to include metrics or http endpoints. Are we ok with 'monitoring' their restart counts?
It's not about native integrations necessarily. Many things in the Prometheus world are monitored using "exporters". Exporters convert the format a project chooses to the Prometheus format. A project that implements a long-running daemon but doesn't care about the ability to monitor it is arguably careless (that's like saying you don't log because it's not necessary). The order of preference, depending on the respective project's willingness, is roughly like this to me:
If none of these are feasible, then, and only then, should we be thinking about this kind of blackbox monitoring, to have any insight at all.
I think this proposal has gotten a tad off track. The proposal is to track system process restart counts using the restart counts systemd already keeps, so we can see if an underlying system process is continuously crashing. I am a bit confused as to where we are in this approval process.
What I'm asking for is for the node team to plan an effort to whitebox monitor these components, for me to be comfortable with this tactical solution of blackbox monitoring them.
I think we are fine with merging this after my question is answered, as long as we open an RFE for the node team to whitebox monitor the components.
- systemd_exporter will be configured to write metrics to node_exporter's
  text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the
It seems we should find an alternative to the term 'whitelist'.
Thank you, yes, that sounds like a great idea! We use allowlist/denylist in our project, so that would be good.
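For what it's worth, the node_exporter side of that text-collector hand-off could be wired roughly as below; --collector.textfile.directory is an existing node_exporter flag, while the path and the idea of periodically dropping a .prom file into it are assumptions about how this proposal would be implemented:

```yaml
# Fragment of the node-exporter container spec (illustrative only).
- name: node-exporter
  args:
    - --collector.textfile.directory=/var/node_exporter/textfile  # example path
# systemd_exporter (or a small helper) would write files such as
# /var/node_exporter/textfile/systemd_restarts.prom into that directory,
# and node_exporter then re-exposes their contents on its /metrics endpoint.
```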
@lilic could you add lgtm / approval labels to this enhancement?
Could you link the RFE? Then I'm happy to lgtm as well.
- A whitelist will be enabled in systemd_exporter to enable metrics from the
  following core services, using the built-in command-line argument
  ([collector.unit-whitelist](https://github.com/povilasv/systemd_exporter/blob/master/systemd/systemd.go#L25)):
I think we should treat this problem more holistically; there are various OpenShift components which end up running code on the host, sometimes via systemd units. An allowlist like this is going to be hard to maintain.
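For illustration, the linked collector.unit-whitelist flag appears to take a regular expression over unit names, so the allowlist would presumably be passed as a container argument roughly like this (the exact unit list and regex are examples only):

```yaml
# Hypothetical argument wiring for the exporter container.
args:
  - --collector.unit-whitelist=^(kubelet|crio|chronyd|NetworkManager|sshd)\.service$
```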
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
I believe this is still very valid and would love to see it, it has only one comment left to address. @rphillips any updates on it? Thanks!
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten My understanding is that we just need an approval on this and it should be good to go. @mrunalp @sjenning @smarterclayton could you PTAL when you get the chance?
Also waiting for an LGTM from @lilic I think :)
Thanks Elana, we are just waiting for this comment to be answered, as we would prefer to deploy it separately from node_exporter. #255 (comment)
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This enhancement enables metrics for certain systemd units to track restarts.