add restart metrics enhancement #255

Closed

Conversation

rphillips
Contributor

@rphillips commented Mar 23, 2020

This enhancement enables metrics for certain systemd units to track restarts.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
To complete the pull request process, please assign jwmatthews
You can assign the PR to them by writing /assign @jwmatthews in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@lilic left a comment


If it's our goal to monitor restarts/lifetime of crio and kubelet (from what I can tell), why do we not just use the metrics those two components already expose directly, for example process_start_time_seconds, which comes from the Go process collector and which both crio and kubelet already have?
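For illustration only, a restart signal can be derived from that existing metric with PromQL's `changes()` function; the Prometheus address and the job label values below are assumptions, not anything specified in this thread:

```sh
# Hedged sketch: count restarts of kubelet/CRI-O over the last hour purely from
# the Go process collector's start-time metric, which both components already
# expose. The Prometheus address and the job label values are illustrative.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=changes(process_start_time_seconds{job=~"kubelet|crio"}[1h])'
```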

@smarterclayton
Contributor

Note that I commented that this is about monitoring restarts of all services on the box, not just kubelet and crio. We are responsible for knowing whether all services shipped as part of the OS or installed are restarting.

@rphillips force-pushed the feat/add_systemd_restart_metrics branch from ddaa06e to c9e3d86 on March 23, 2020 at 21:00
@rphillips
Contributor Author

@lilic @smarterclayton I updated the proposal with everything we discussed. I think we could defer the alert definitions until after we see some sample clusters. Ready for re-review.

@rphillips
Contributor Author

I also added a list of possible core services we would want to start with.

Contributor

@lilic left a comment


Looking good, a couple of questions.

cc @openshift/openshift-team-monitoring


### Non-Goals

- Defining alerts is not in the scope of this proposal. Alerts can be defined
Contributor


I assume the node team would define those alerts; we are of course happy to shepherd this! :)

Contributor Author


Yes!


### Test Plan

- Validate we see system dbus metrics from the whitelisted services
Contributor


Where do you want these tests? I am assuming in the origin Prometheus tests?

Contributor Author


Yes, if we do metric validation there, that would be great.


### Graduation Criteria

Delivered in 4.5.
Contributor


We do agree that this is a feature and not a bug(zilla) and there is no need to backport this feature, right? :) cc @smarterclayton

text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the
following core services:


I feel like at least the services listed here deserve to be monitored in a whitebox fashion, not just indirectly through the proposed mechanism. Can we ensure that, as a follow-up to this "generic" monitoring, we also add whitebox monitoring to the node team's and/or CoreOS team's backlogs? kubelet and crio already are monitored this way, and chrony would have been helpful multiple times already; I imagine the same holds true for all the other services listed here (I believe I've talked about NetworkManager with @lucab before as well).

Contributor Author


@brancz I can add a section on what we would like to add later... Can you provide me some examples of the whitebox testing you would like to see?



Essentially each of these components should either have a metrics endpoint of its own or we should write an exporter for it. kubelet and crio already do, so that's good, but all the others in this list should as well, or at least have an exporter.



Sorry I can't seem to find where this is documented. Can you point me at it?

Contributor Author


Most of these other components are system components, and the restart metric is likely a good indicator of failure.

Contributor

@lilic left a comment


One concern left from me; otherwise this looks good to me.


- This proposal is to add a privileged sidecar running
[systemd_exporter](https://github.com/povilasv/systemd_exporter) to the
node_exporter [daemonset](https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/node-exporter/daemonset.yaml#L17).
Contributor


I would still prefer this DaemonSet to be separate from the node-exporter one; any reasons against this?

Contributor Author


Clayton specifically mentioned a sidecar, which is typically deployed with the dependent component. Since the systemd_exporter depends on node_exporter, I think it makes sense to deploy it in the same pod.

@smarterclayton ?


I don't see any dependency between those two components. Could someone describe what the dependency is here and why it couldn't be deployed as its own DaemonSet?
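For concreteness, the coupling being debated here, systemd_exporter restricted to a few core units and feeding node_exporter's textfile collector, might look roughly like the sketch below. The listen port, paths, and unit regex are assumptions; the `--collector.unit-whitelist` flag name comes from the linked systemd_exporter source, and `--collector.textfile.directory` is node_exporter's standard flag for the textfile collector.

```sh
# Hedged sketch of the handoff described in the proposal: run systemd_exporter
# next to node_exporter (the proposal suggests a sidecar in the same pod),
# limited to a small set of core units, and periodically snapshot its output
# into the directory node_exporter reads via --collector.textfile.directory.
# The listen port, textfile path, and unit regex are assumptions.
systemd_exporter --collector.unit-whitelist='(kubelet|crio|chronyd|sshd)\.service' &

sleep 2  # give the exporter a moment to start (sketch only)
curl -s http://localhost:9558/metrics > /var/node_exporter/textfile/systemd.prom
```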

@rphillips
Contributor Author

@lilic does this enhancement sound ok?

@rphillips
Contributor Author

@brancz ?

@lilic
Contributor

lilic commented Apr 8, 2020

To me this #255 (comment) is still not resolved. When we agree on that, it looks good to me 👍

@brancz

brancz commented Apr 14, 2020

I'm maybe just missing it, but I also don't see this: #255 (comment)

@rphillips
Contributor Author

The other processes are system level processes. I doubt the maintainers will want to include metrics or http endpoints. Are we ok with 'monitoring' their restart counts?

@brancz

brancz commented May 12, 2020

It's not about native integrations necessarily. Many things in the Prometheus world are monitored using "exporters". Exporters convert the format a project chooses into the Prometheus format. A project that implements a long-running daemon but doesn't care about the ability to monitor it is arguably careless (that's like saying you don't log because it's not necessary).

The order of preference, depending on the respective project's willingness, is roughly like this to me:

  • native integration with http endpoint
  • native integration with unix domain socket (then exposed to Prometheus via something like local_exporter)
  • @lucab is currently trying to see if there is a possibility to have something via dbus for components that already make use of it
  • implement exporter if project uses bespoke mechanism to expose metrics

If none of these are feasible, then, and only then, should we be thinking about this kind of blackbox monitoring, to have any insight at all.

@rphillips
Contributor Author

I think this proposal has gotten a tad off track. The proposal is to track system process restart counts using the restart tracking systemd already provides, so we can see if an underlying system process is continuously crashing. I am a bit confused as to where we are in this approval process.
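For reference, systemd already keeps a per-unit restart counter that an exporter can read; a quick way to see it on a node (the unit name is illustrative):

```sh
# Recent systemd versions track restarts per service unit via the NRestarts
# property; this is the sort of counter the proposal wants surfaced as a
# metric. kubelet.service is used here purely as an example unit.
systemctl show kubelet.service --property=NRestarts
# => NRestarts=0
```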

@brancz

brancz commented May 20, 2020

What I'm asking for is for the node team to plan an effort to whitebox-monitor these components; that would make me comfortable with this tactical solution of blackbox monitoring them.

@lilic
Contributor

lilic commented May 20, 2020

I think we are fine with merging this once my question is answered, as long as we open an RFE for the node team to add whitebox monitoring for these components.

- systemd_exporter will be configured to write metrics to node_exporter's
text_collector.

- A whitelist will be enabled in systemd_exporter to enable metrics from the


It seems we should find an alternative to the term 'whitelist'.

Contributor


Thank you, yes, that sounds like a great idea! We use allowlist/denylist in our project, so that would be good.

@rphillips
Contributor Author

@lilic could you add lgtm / approval labels to this enhancement?

@brancz

brancz commented Jul 3, 2020

Could you link the RFE? Then I'm happy to lgtm as well.


- A whitelist will be enabled in systemd_exporter to enable metrics from the
following core services, using the built-in command-line argument
([collector.unit-whitelist](https://github.com/povilasv/systemd_exporter/blob/master/systemd/systemd.go#L25)):
Member


I think we should treat this problem more holistically; there are various OpenShift components which end up running code on the host, sometimes via systemd units. An allowlist like this is going to be hard to maintain.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Nov 16, 2020
@lilic
Contributor

lilic commented Nov 18, 2020

I believe this is still very valid and would love to see it; it has only one comment left to address. @rphillips any updates on it? Thanks!

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 18, 2020
@ehashman
Contributor

/remove-lifecycle rotten

My understanding is that we just need an approval on this and it should be good to go. @mrunalp @sjenning @smarterclayton could you PTAL when you get the chance?

@openshift-ci-robot removed the lifecycle/rotten label on Dec 21, 2020
@ehashman
Contributor

Also waiting for an LGTM from @lilic I think :)

@lilic
Contributor

lilic commented Jan 4, 2021

Thanks Elana, we are just waiting for this comment to be answered, as we would prefer to deploy it separately from node_exporter. #255 (comment)

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Apr 4, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 4, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this on Jun 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 4, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
