
supporting factory reset #2520

Closed
cgwalters opened this issue Apr 9, 2021 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@cgwalters
Member

This is an extension of coreos/fedora-coreos-tracker#399

I think at a basic level, to "light reprovision" a node it should work to do:

$ oc debug node $x -- /bin/bash -c 'mount -o remount,rw /host/boot && touch /host/boot/ignition.firstboot'

Then, oc edit node/$x and forcibly update the currentConfig+desiredConfig annotations to match the current rendered config value for the pool.

Then, oc debug node $x -- chroot /host reboot
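
Putting those three steps together, a rough end-to-end sketch might look like the following (the machineconfiguration.openshift.io/currentConfig and machineconfiguration.openshift.io/desiredConfig annotation keys and the worker pool lookup are illustrative, shown as one way to do the "forcibly update the annotations" step):

$ x=<node name>
$ oc debug node/$x -- /bin/bash -c 'mount -o remount,rw /host/boot && touch /host/boot/ignition.firstboot'
$ # look up the pool's current rendered config and force both annotations to it
$ rendered=$(oc get machineconfigpool worker -o jsonpath='{.spec.configuration.name}')
$ oc annotate node $x --overwrite \
    machineconfiguration.openshift.io/currentConfig=$rendered \
    machineconfiguration.openshift.io/desiredConfig=$rendered
$ oc debug node/$x -- chroot /host reboot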

@cgwalters
Member Author

OK, there's been a lot of internal chat about this; the desire is to be able to un-wedge e.g. a control plane node if a bad MC rolled out. But it clearly generalizes to workers too.

So my strawman is:

  • Add /usr/libexec/machine-config-daemon reset-config written in Go that does the above touch plus also reconciles the annotations on the node
  • Change the default templates to inject /usr/local/bin/openshift-node-reset that is just an exec /run/bin/machine-config-daemon reset-config (but allows us to change the implementation in the future)
  • Add CI coverage for this in the MCO repo

Now... hmm, actually /run/bin will (intentionally) go away on reboot; the idea there is that we want to sync the MCD's binary. Perhaps instead we should dynamically write /usr/local/bin/machine-config-daemon (persistent), but update it based on the contents from the pod when the pod runs.
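
For illustration, a minimal sketch of the injected wrapper under that revised layout (the reset-config subcommand and the persistent /usr/local/bin/machine-config-daemon copy are the proposal above, not something that exists today):

#!/bin/bash
# /usr/local/bin/openshift-node-reset (sketch): a stable entry point for admins.
# The MCD pod would be responsible for keeping /usr/local/bin/machine-config-daemon
# fresh, e.g. by copying its own binary onto the host each time the pod starts.
exec /usr/local/bin/machine-config-daemon reset-config "$@"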

@cgwalters
Member Author

cgwalters commented Apr 9, 2021

And a further thing we can do once we have this functionality is add support for cluster-initiated resets: the admin adds an annotation io.openshift.machineconfig.reset="true" or whatever to the node object, and the MCD says "OK" and does this for you.
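
For example (the annotation key here is just the strawman name above, not an implemented API):

$ oc annotate node $x io.openshift.machineconfig.reset=true

and the MCD watching its node object would then run the same reset-config path and clear the annotation when done.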

@deads2k
Contributor

deads2k commented Apr 9, 2021

So my strawman is:

  • Add /usr/libexec/machine-config-daemon reset-config written in Go that does the above touch plus also reconciles the annotations on the node
  • Change the default templates to inject /usr/local/bin/openshift-node-reset that is just an exec /run/bin/machine-config-daemon reset-config (but allows us to change the implementation in the future)
  • Add CI coverage for this in the MCO repo

I like the idea. I think it covers the three different failures we've seen on actual clusters that we can resolve:

  1. without a running kubelet
  2. without a running crio
  3. without the ability to pull images

I don't think we can handle the "without a functional network" case in any reasonable way.

Trying to get creds for your step 1 is a thing I think @sttts can work out on masters.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 7, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Sep 7, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Sep 7, 2021