MCO-356: daemon: Do a secondary in-place update if rpm-ostree is too old #3310
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters.
pkg/daemon/daemon.go (Outdated)
```go
// If the host isn't new enough to understand the new container model natively, run as a privileged container and reboot.
// See https://github.com/coreos/rpm-ostree/pull/3961 and https://issues.redhat.com/browse/MCO-356
if !newEnough {
	err := runCmdSync("podman", "run", "--privileged", "--pid=host", "--net=host", "--rm", "-v", "/:/run/host", mc.Spec.OSImageURL, "rpm-ostree", "ex", "deploy-from-self", "/run/host", "--reboot")
	// (diff excerpt truncated here; error handling follows in the PR)
}
```
Just curious, this is basically updating the on-host rpm-ostree from the updated rpm-ostree in the updated (4.12) oscontainer. Why is the reboot needed vs, say, restarting the rpm-ostree daemon?
The apply-live path was only stabilized in 2022.13; earlier versions do have `ex apply-live`, but really old versions don't.
We could indeed optimize this in the future to skip a reboot for sufficiently new versions; but I think ultimately what we want to do is to enhance this code to do all the setup from the container too, to avoid the next update doing a reboot.
IOW our firstboot path then always runs rpm-ostree from a container, and not the host.
On the other hand, we have some code for updating bootimages, and once that lands this code is going to be hit much less often.
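A minimal sketch of the version gate discussed above, assuming a year.minor rpm-ostree version scheme with apply-live stabilized in 2022.13; the helper name and parsing are hypothetical (the real daemon inspects `rpm-ostree --version` output):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// newEnoughForApplyLive reports whether an rpm-ostree version string
// (e.g. "2022.13") is at least the release that stabilized apply-live.
// Hypothetical helper for illustration only.
func newEnoughForApplyLive(version string) bool {
	parts := strings.SplitN(version, ".", 2)
	if len(parts) != 2 {
		return false
	}
	year, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	if year != 2022 {
		return year > 2022
	}
	return minor >= 13
}

func main() {
	fmt.Println(newEnoughForApplyLive("2022.13")) // true
	fmt.Println(newEnoughForApplyLive("2021.5"))  // false
}
```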
We decided baseOSContainerImage was clearer than baseOperatingSystemContainer (and the same for the extensions container), so this renames everything throughout to match. I'm doing this first, before the rest of the "new image by default" code, so that if we have to revert one of those commits we don't have to worry about the names. This renames all instances of the BaseOperatingSystem variables throughout to BaseOS, and favors the clearer "ContainerImage" suffix over the vaguer "Container", since "Container" typically signifies something that's running.
Now that we have the extensions container, we can apply extensions and switch kernels again. This just re-enables that functionality for the 'new image' path during daemon updates, and removes the "not supported" messages.
Both rpm-ostree rebase and update require access to the pull secrets in order to pull manifests and images. This changes the behavior such that we symlink the kubelet config.json secrets to /run/ostree/auth.json when we initialize our "NodeUpdaterClient" (rather than only in rebase) so any of our rpm-ostree calls can use it.
Until we figure out if/how we want to use the extensions container as a service, we're going to have to extract it to disk to use it. In order for that to happen, the daemon needs to know where to get it, so it needs to be present in machineconfig. This adds the extensions container to the machineconfig type, the crd, and the resourcemerge library.
This passes the extensions container from controller config through to machineconfig.
Since we can't use the extensions container as a service right now (due to bootstrap/DNS/complexity), for now we're going to extract it to disk, as we did with the extensions in machine-os-content. This pulls and extracts the extensions container to /run and adds it as a yum repo if we are on a new-format image and there are extensions present in the MachineConfig.
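As a rough illustration of the add-as-yum-repo step, here is a hedged sketch that renders a .repo stanza for content extracted to disk; the repo id, name, and directory are invented for the example, not the MCO's actual ones:

```go
package main

import "fmt"

// extensionsRepoFile renders a yum .repo stanza pointing at extensions
// content extracted under extractDir. All identifiers are illustrative.
func extensionsRepoFile(extractDir string) string {
	return fmt.Sprintf(
		"[coreos-extensions]\nname=CoreOS extensions\nbaseurl=file://%s\nenabled=1\ngpgcheck=0\n",
		extractDir)
}

func main() {
	fmt.Print(extensionsRepoFile("/run/machine-os-extensions"))
}
```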
Currently if someone overrides OSImageURL on their master pool, the required pools sync in the operator will fail because it checks to make sure the OSImageURL matches, and when it's overridden it doesn't. (This results in degradation and timeouts). The check was originally put in to fix a bug around proper update completion reporting. This allows the check to be skipped if the user has "taken the wheel" by overriding OSImageURL.
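The "taken the wheel" skip logic can be sketched as a pure predicate; the function name and signature are hypothetical:

```go
package main

import "fmt"

// requiredPoolsSynced mirrors the operator's check: normally the rendered
// OSImageURL must match the expected payload image, but when the user has
// overridden OSImageURL the check is skipped. Illustrative only.
func requiredPoolsSynced(renderedURL, expectedURL string, userOverrode bool) bool {
	if userOverrode {
		// User has taken the wheel; don't degrade on a mismatch.
		return true
	}
	return renderedURL == expectedURL
}

func main() {
	fmt.Println(requiredPoolsSynced("quay.io/custom/os:1", "quay.io/openshift/os:4.12", true))  // true
	fmt.Println(requiredPoolsSynced("quay.io/custom/os:1", "quay.io/openshift/os:4.12", false)) // false
}
```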
This adds an "image selection function" to the controller helpers and a boolean variable in helpers.go that determines which image it will use by default. The intent is that it be set to false by default here, rendering the "new image format by default" functionality inert, and that a later commit will enable it. This logic can be taken out later once we get rid of machine-os-content.
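A minimal sketch of the gating boolean and selection function (names are illustrative, not the actual helpers.go identifiers):

```go
package main

import "fmt"

// useNewFormatImage is the "gating boolean": false keeps the legacy
// machine-os-content behavior; a later commit flips it to true.
var useNewFormatImage = false

// selectOSImage returns the image a rendered config should use by default.
func selectOSImage(machineOSContent, baseOSContainerImage string) string {
	if useNewFormatImage {
		return baseOSContainerImage
	}
	return machineOSContent
}

func main() {
	fmt.Println(selectOSImage("machine-os-content@sha256:aaa", "rhel-coreos-8@sha256:bbb"))
	useNewFormatImage = true
	fmt.Println(selectOSImage("machine-os-content@sha256:aaa", "rhel-coreos-8@sha256:bbb"))
}
```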
This just flips our "gating boolean" to true so that the MCO will use the new format image by default.
This excludes all non-base images (right now, only extensions) from image-references if the build is using FCOS, because we currently don't ship FCOS extensions images, and including them would break the payload. This also removes baseOSExtensionsContainerImage from the osimageurl configmap so the placeholder value doesn't get passed through (oc won't rewrite it if it's not in image-references).
Force-pushed from 7225bd4 to 9a7cea4.
OK I've
/test e2e-agnostic-upgrade
@cgwalters looks like tests are not running on this PR because it is WIP (not sure, could be Prow as well)? Should we remove WIP?
I believe leaving the WIP is fine; the "draft" GitHub status is what is preventing it.
Also, the commit message needs to be updated for 9a7cea4.
Force-pushed from 9a7cea4 to f3f7daf.
/test e2e-agnostic-upgrade
Oops, sorry, I hadn't noticed you had a test run going here; it would have been nice to see the results! But in any case we really need to land a PR as an atomic unit that has everything, so we might as well do it in the other PR.
xref coreos/rpm-ostree@89f5802 Change the logic for doing OS updates to detect if rpm-ostree is too old to natively fetch a container image on its own. If so, we use `ex deploy-from-self` via podman. Introduce a new `/etc/machine-config-daemon-force-once` file that will cause us to skip validation, and hence we should re-reconcile and attempt to apply the OS upgrade again, this time natively via rpm-ostree. This is needed for scaleup of old nodes, as well as temporarily for 4.11 upgrades.
Force-pushed from 3973062 to 361808c.
rpm-ostree deploy-from-self does get us to the right image, but the MCO doesn't know that, because the OSImageURL/container name/custom deployment name doesn't get set when it rebases; it just uses the commit hash. This tells the MCO to:
- inspect the current image specified by OSImageURL,
- grab its commit hash from the labels, and
- compare it to the base commit of what we have on-disk.
If they match, we know we are on the correct image, even if we did not get there the usual way.
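The matching step in this commit can be sketched as a pure comparison; the label key `ostree.commit` is an assumption for illustration:

```go
package main

import "fmt"

// onExpectedImage checks whether the booted deployment's base commit
// matches the ostree commit recorded in the target image's labels.
// The "ostree.commit" label key is an assumption in this sketch.
func onExpectedImage(imageLabels map[string]string, bootedBaseCommit string) bool {
	want, ok := imageLabels["ostree.commit"]
	return ok && want == bootedBaseCommit
}

func main() {
	labels := map[string]string{"ostree.commit": "abc123"}
	fmt.Println(onExpectedImage(labels, "abc123")) // true
	fmt.Println(onExpectedImage(labels, "def456")) // false
}
```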
xref coreos/rpm-ostree@89f5802 Change the firstboot logic to detect whether rpm-ostree is too old to natively fetch a container image on its own. If so, we use `ex deploy-from-self` via podman. This is needed for scaleup of old nodes, because we won't be able to natively fetch the new-format base image (e.g. `rhel-coreos-8`) from the old rpm-ostree.
Force-pushed from 361808c to 5f47ed3.
This was merged in #3317.
xref coreos/rpm-ostree@89f5802
On firstboot, the rpm-ostree (and skopeo) running in the system
may be too old to do the native flow. This hacky bit runs
the new OS image as a container, and reboots another time before
doing anything else.