Capture and report systemd unit failures by default to the MCS #1365

cgwalters · 2020-01-10T17:47:30Z

We've debated this a few times: who watches for host failures? I think we should add a small systemd unit, machine-config-daemon-host-monitor.service, that watches systemd and, if any unit fails, sends a POST to the MCS.

Yes, this implies an MCS write endpoint, not just a read one. We could start by just dumping the data into the pod logs.
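
As a rough illustration of the host side of this idea, here is a minimal sketch in Python. Everything beyond the proposal itself is an assumption: the MCS has no write endpoint today, so `REPORT_URL` is a hypothetical placeholder, the JSON payload shape is made up, and the sketch polls `systemctl list-units --state=failed` instead of subscribing to systemd over D-Bus, which a real monitor would more likely do.

```python
import json
import socket
import subprocess
import urllib.request
from typing import List

# Hypothetical URL: no MCS write endpoint exists; this is a placeholder only.
REPORT_URL = "https://mcs.example.invalid:22623/report/failed-units"


def parse_failed_units(listing: str) -> List[str]:
    """Extract unit names from `systemctl list-units --state=failed` output.

    Expects one unit per line, as printed with --plain --no-legend.
    A leading status bullet, if systemctl prints one, is skipped.
    """
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0] == "●":
            fields = fields[1:]
        if fields:
            units.append(fields[0])
    return units


def build_report(hostname: str, units: List[str]) -> bytes:
    """Serialize a failure report. The payload shape is an assumption."""
    report = {"host": hostname, "failed_units": sorted(units)}
    return json.dumps(report).encode("utf-8")


def list_failed_units() -> List[str]:
    """Ask systemd for the currently failed units."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_failed_units(result.stdout)


def report_once(url: str = REPORT_URL) -> None:
    """POST the current failed units, if any, to the (hypothetical) endpoint."""
    units = list_failed_units()
    if not units:
        return
    request = urllib.request.Request(
        url,
        data=build_report(socket.gethostname(), units),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10):
        pass
```

A unit or timer on the host could call `report_once()` periodically. On the receiving side, a first iteration could simply write the request body to the pod logs, as suggested above.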

This is also related to coreos/ignition#585, for which we'd also want an MCS endpoint.

cgwalters · 2020-05-04T12:18:24Z

Also related to this, something in the platform really must start watching for kernel oopses.

Also xref https://github.com/kubernetes/node-problem-detector

Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1842906 and many prior bugs. Also added a link to openshift/machine-config-operator#1365 which I think would be a better fix for early cluster bringup. Also xref openshift/machine-config-operator#1790

cgwalters · 2020-07-01T17:24:39Z

Now that #1766 has landed, we can make this part of the MCD itself quite easily.

openshift-bot · 2020-10-24T13:10:39Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2020-11-23T15:02:38Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

cgwalters · 2020-12-15T18:01:44Z

/lifecycle frozen
We really need this. See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1906057 where the real problem is lost in the black hole somewhere in Ignition or firstboot.

I'd been hoping we could reuse openshift/enhancements#443 to authenticate the MCS endpoint for failure report submission.

runcom assigned sinnykumari Jan 22, 2020

sinnykumari added the jira label Feb 20, 2020

cgwalters mentioned this issue Apr 8, 2020

[WIP] Start host openvswitch using systemctl openshift/cluster-network-operator#477

Closed

cgwalters mentioned this issue Jun 5, 2020

gather_service_logs: Gather MCO host services openshift/must-gather#158

Closed

openshift-ci-robot added the lifecycle/stale label Oct 24, 2020

openshift-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Nov 23, 2020

openshift-ci-robot added lifecycle/frozen and removed lifecycle/rotten labels Dec 15, 2020

cgwalters mentioned this issue May 31, 2022

Bug 1928581: validate the proxy by trying oc image info #2539

Closed
