
Capture and report systemd unit failures by default to the MCS #1365

Open
cgwalters opened this issue Jan 10, 2020 · 5 comments
Labels
jira lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@cgwalters
Member

We've debated this a few times: who watches host failures? I think we should do something like have a small systemd unit, machine-config-daemon-host-monitor.service, which watches systemd and, if any unit fails, does a POST to the MCS.

Yes, this implies an MCS write endpoint, not just read. We could start by just dumping the data into the pod logs.

This is also related to coreos/ignition#585, which we'd also want an MCS endpoint for.
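To make the idea concrete, here's a rough sketch in Go of what such a monitor could look like, assuming a hypothetical MCS report endpoint (the URL, JSON payload, and polling interval are all made up) and using the go-systemd D-Bus bindings to watch for units entering the failed state:

```go
// machine-config-daemon-host-monitor: illustrative sketch only.
// Watches systemd over D-Bus and POSTs any unit failure to an assumed
// MCS write endpoint; the URL and JSON shape below are hypothetical.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	sdbus "github.com/coreos/go-systemd/v22/dbus"
)

type unitFailure struct {
	Node string `json:"node"`
	Unit string `json:"unit"`
	Time string `json:"time"`
}

func main() {
	conn, err := sdbus.NewSystemConnectionContext(context.Background())
	if err != nil {
		log.Fatalf("connecting to systemd: %v", err)
	}
	defer conn.Close()

	// Poll for unit state changes; deleted units arrive as nil entries.
	updates, errs := conn.SubscribeUnits(5 * time.Second)
	for {
		select {
		case changed := <-updates:
			for name, status := range changed {
				if status != nil && status.ActiveState == "failed" {
					report(name)
				}
			}
		case err := <-errs:
			log.Printf("subscription error: %v", err)
		}
	}
}

// report POSTs a failed unit to the MCS; the endpoint is a placeholder.
func report(unit string) {
	node, _ := os.Hostname()
	body, _ := json.Marshal(unitFailure{
		Node: node,
		Unit: unit,
		Time: time.Now().UTC().Format(time.RFC3339),
	})
	resp, err := http.Post("https://machine-config-server.internal/unit-failures",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Printf("reporting %s: %v", unit, err)
		return
	}
	resp.Body.Close()
}
```

The real thing would also need to authenticate to the MCS and deduplicate repeated failures, but starting by just dumping the payloads into the pod logs (as above) would already be an improvement.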

@cgwalters
Member Author

Also related to this, something in the platform really must start watching for kernel oopses.

Also xref https://github.com/kubernetes/node-problem-detector

cgwalters added a commit to cgwalters/must-gather that referenced this issue Jun 5, 2020
Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1842906
and many prior bugs.  Also added a link to
openshift/machine-config-operator#1365
which I think would be a better fix for early cluster bringup.

Also xref openshift/machine-config-operator#1790
@cgwalters
Member Author

Now that #1766 has landed, we can make this part of the MCD itself quite easily.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 23, 2020
@cgwalters
Member Author

/lifecycle frozen
We really need this. See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1906057, where the real problem is lost in a black hole somewhere in Ignition or firstboot.

I'd been hoping we could reuse openshift/enhancements#443 to authenticate the MCS endpoint for failure-report submission.

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Dec 15, 2020