Capture and report systemd unit failures by default to the MCS #1365
Comments
Also related to this, something in the platform really must start watching for kernel oopses. Also xref https://github.com/kubernetes/node-problem-detector
Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1842906 and many prior bugs. Also added a link to openshift/machine-config-operator#1365 which I think would be a better fix for early cluster bringup. Also xref openshift/machine-config-operator#1790
Now that #1766 landed - we can make this be part of the MCD itself quite easily.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten
/lifecycle frozen
I'd been hoping we could reuse openshift/enhancements#443 to authenticate the MCS endpoint for failure report submission.
We've debated this a few times - who watches host failures? I think we should do something like have a small systemd unit, machine-config-daemon-host-monitor.service, which watches systemd and, if any unit fails, does a POST to the MCS. Yes, this implies an MCS write endpoint, not just read. We could start by just dumping the data into the pod logs.
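To make the proposal concrete, here is a rough sketch of what such a host monitor could do on each run: ask systemd for failed units and POST a small JSON report. Everything specific here is an assumption for illustration only: the MCS has no write endpoint today, so `REPORT_URL`, the `/failure-report` path, and the JSON payload shape are all hypothetical, and a real implementation would more likely live in the MCD and subscribe to systemd events rather than poll `systemctl`.

```python
import json
import socket
import subprocess
import urllib.request

# Hypothetical endpoint: the MCS does not expose a write API today.
REPORT_URL = "https://mcs.example.invalid:22623/failure-report"


def parse_failed_units(systemctl_output: str) -> list[str]:
    """Return unit names from `systemctl list-units --plain --no-legend` output."""
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            units.append(fields[0])  # first column is the unit name
    return units


def list_failed_units() -> list[str]:
    """Ask systemd which units are currently in the failed state."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed",
         "--no-legend", "--plain", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return parse_failed_units(result.stdout)


def build_report(units: list[str], hostname: str) -> bytes:
    """Encode the failure report as JSON (payload shape is an assumption)."""
    report = {"hostname": hostname, "failed_units": sorted(units)}
    return json.dumps(report).encode("utf-8")


def post_report(url: str, body: bytes) -> int:
    """POST the report and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    failed = list_failed_units()
    if failed:
        post_report(REPORT_URL, build_report(failed, socket.gethostname()))
```

This is a single check, so it would need to be driven by something like a systemd timer to approximate "watching". As a first step matching the "dump into the pod logs" idea, the receiving side could simply log the request body.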
This is also related to coreos/ignition#585 which we'd also want a MCS endpoint for.