
supporting factory reset #2520

Closed
cgwalters opened this issue Apr 9, 2021 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@cgwalters
Member

This is an extension of coreos/fedora-coreos-tracker#399

I think at a basic level, to "light reprovision" a node it should work to do:

$ oc debug node $x -- /bin/bash -c 'mount -o remount,rw /host/boot && touch /host/boot/ignition.firstboot'

Then, oc edit node/$x and forcibly update the currentConfig+desiredConfig annotations to match the current rendered config value for the pool.

Then, oc debug node $x -- chroot /host reboot
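
Putting those three steps together, a rough end-to-end sketch might look like the following (the machineconfiguration.openshift.io/currentConfig and machineconfiguration.openshift.io/desiredConfig annotation keys and the worker pool lookup are illustrative, shown as one way to do the "forcibly update the annotations" step):

$ x=<node name>
$ oc debug node/$x -- /bin/bash -c 'mount -o remount,rw /host/boot && touch /host/boot/ignition.firstboot'
$ # look up the pool's current rendered config and force both annotations to it
$ rendered=$(oc get machineconfigpool worker -o jsonpath='{.spec.configuration.name}')
$ oc annotate node $x --overwrite \
    machineconfiguration.openshift.io/currentConfig=$rendered \
    machineconfiguration.openshift.io/desiredConfig=$rendered
$ oc debug node/$x -- chroot /host reboot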

@cgwalters
Member Author

OK, there's been a lot of internal chat about this; the desire is to be able to un-wedge e.g. a control plane node if a bad MC rolled out. But it clearly generalizes to workers too.

So my strawman is:

  • Add /usr/libexec/machine-config-daemon reset-config written in Go that does the above touch plus also reconciles the annotations on the node
  • Change the default templates to inject /usr/local/bin/openshift-node-reset that is just an exec /run/bin/machine-config-daemon reset-config (but allows us to change the implementation in the future)
  • Add CI coverage for this in the MCO repo

Now... hmm, actually /run/bin will (intentionally) go away on reboot; the idea there is that we want to sync the MCD's binary. Perhaps instead we should dynamically write /usr/local/bin/machine-config-daemon (persistent), but update it based on the contents from the pod when the pod runs.
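
For illustration, a minimal sketch of the injected wrapper under that revised layout (the reset-config subcommand and the persistent /usr/local/bin/machine-config-daemon copy are the proposal above, not something that exists today):

#!/bin/bash
# /usr/local/bin/openshift-node-reset (sketch): a stable entry point for admins.
# The MCD pod would be responsible for keeping /usr/local/bin/machine-config-daemon
# fresh, e.g. by copying its own binary onto the host each time the pod starts.
exec /usr/local/bin/machine-config-daemon reset-config "$@"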

@cgwalters
Member Author

cgwalters commented Apr 9, 2021

And a further thing we can do once we have this functionality is add support for cluster-initiated resets: the admin adds an annotation io.openshift.machineconfig.reset="true" or whatever to the node object, and the MCD says "OK" and does this for you.
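
For example (the annotation key here is just the strawman name above, not an implemented API):

$ oc annotate node $x io.openshift.machineconfig.reset=true

and the MCD watching its node object would then run the same reset-config path and clear the annotation when done.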

@deads2k
Contributor

deads2k commented Apr 9, 2021

So my strawman is:

  • Add /usr/libexec/machine-config-daemon reset-config written in Go that does the above touch plus also reconciles the annotations on the node
  • Change the default templates to inject /usr/local/bin/openshift-node-reset that is just an exec /run/bin/machine-config-daemon reset-config (but allows us to change the implementation in the future)
  • Add CI coverage for this in the MCO repo

I like the idea. I think it covers the three different failures we've seen on actual clusters that we can resolve:

  1. without a running kubelet
  2. without a running crio
  3. without the ability to pull images

I don't think we can handle the "without a functional network" case in any reasonable way.

Trying to get creds for your step 1 is a thing I think @sttts can work out on masters.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 7, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Sep 7, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Sep 7, 2021