MCO-356: daemon: Do a secondary in-place update if rpm-ostree is too old #3310
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters.
pkg/daemon/daemon.go (Outdated)
```go
// If the host isn't new enough to understand the new container model natively, run as a privileged container and reboot.
// See https://github.com/coreos/rpm-ostree/pull/3961 and https://issues.redhat.com/browse/MCO-356
if !newEnough {
	err := runCmdSync("podman", "run", "--privileged", "--pid=host", "--net=host", "--rm", "-v", "/:/run/host", mc.Spec.OSImageURL, "rpm-ostree", "ex", "deploy-from-self", "/run/host", "--reboot")
	// (diff excerpt truncated here; error handling follows in the PR)
}
```
Just curious, this is basically updating the on-host rpm-ostree from the updated rpm-ostree in the updated (4.12) oscontainer. Why is the reboot needed vs, say, restarting the rpm-ostree daemon?
The apply-live path was only stabilized in 2022.13; earlier versions do have `ex apply-live`, but really old versions don't.
We could indeed optimize this in the future to skip a reboot for sufficiently new versions; but I think ultimately what we want to do is to enhance this code to do all the setup from the container too, to avoid the next update doing a reboot.
IOW our firstboot path then always runs rpm-ostree from a container, and not the host.
On the other hand, we have some code for updating bootimages, and once that lands this code is going to be hit much less often.
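A minimal sketch of the version gate discussed above, assuming a year.minor rpm-ostree version scheme with apply-live stabilized in 2022.13; the helper name and parsing are hypothetical (the real daemon inspects `rpm-ostree --version` output):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// newEnoughForApplyLive reports whether an rpm-ostree version string
// (e.g. "2022.13") is at least the release that stabilized apply-live.
// Hypothetical helper for illustration only.
func newEnoughForApplyLive(version string) bool {
	parts := strings.SplitN(version, ".", 2)
	if len(parts) != 2 {
		return false
	}
	year, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	if year != 2022 {
		return year > 2022
	}
	return minor >= 13
}

func main() {
	fmt.Println(newEnoughForApplyLive("2022.13")) // true
	fmt.Println(newEnoughForApplyLive("2021.5"))  // false
}
```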
We decided baseOSContainerImage was clearer than baseOperatingSystemContainer (and the same for the extensions container), so this renames everything throughout to match. I'm doing this first, before the rest of the "new image by default" code, so that if we have to revert one of those commits we don't have to worry about the names. This renames all instances of the BaseOperatingSystem variables throughout to BaseOS, and favors the clearer "ContainerImage" suffix over the vaguer "Container", since "Container" typically signifies something that's running.
Now that we have the extensions container, we can apply extensions and switch kernels again. This just re-enables that functionality for the 'new image' path during daemon updates, and removes the "not supported" messages.
Both rpm-ostree rebase and update require access to the pull secrets in order to pull manifests and images. This changes the behavior such that we symlink the kubelet config.json secrets to /run/ostree/auth.json when we initialize our "NodeUpdaterClient" (rather than only in rebase) so any of our rpm-ostree calls can use it.
Until we figure out if/how we want to use the extensions container as a service, we're going to have to extract it to disk to use it. In order for that to happen, the daemon needs to know where to get it, so it needs to be present in machineconfig. This adds the extensions container to the machineconfig type, the crd, and the resourcemerge library.
This passes the extensions container from controller config through to machineconfig.
Since we can't use the extensions container as a service right now (due to bootstrap/DNS/complexity), for now we're going to extract it to disk, as we did with the extensions in machine-os-content. This pulls and extracts the extensions container to /run and adds it as a yum repo if we are on a new-format image and there are extensions present in the MachineConfig.
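As a rough illustration of the add-as-yum-repo step, here is a hedged sketch that renders a .repo stanza for content extracted to disk; the repo id, name, and directory are invented for the example, not the MCO's actual ones:

```go
package main

import "fmt"

// extensionsRepoFile renders a yum .repo stanza pointing at extensions
// content extracted under extractDir. All identifiers are illustrative.
func extensionsRepoFile(extractDir string) string {
	return fmt.Sprintf(
		"[coreos-extensions]\nname=CoreOS extensions\nbaseurl=file://%s\nenabled=1\ngpgcheck=0\n",
		extractDir)
}

func main() {
	fmt.Print(extensionsRepoFile("/run/machine-os-extensions"))
}
```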
Currently if someone overrides OSImageURL on their master pool, the required pools sync in the operator will fail because it checks to make sure the OSImageURL matches, and when it's overridden it doesn't. (This results in degradation and timeouts). The check was originally put in to fix a bug around proper update completion reporting. This allows the check to be skipped if the user has "taken the wheel" by overriding OSImageURL.
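The "taken the wheel" skip logic can be sketched as a pure predicate; the function name and signature are hypothetical:

```go
package main

import "fmt"

// requiredPoolsSynced mirrors the operator's check: normally the rendered
// OSImageURL must match the expected payload image, but when the user has
// overridden OSImageURL the check is skipped. Illustrative only.
func requiredPoolsSynced(renderedURL, expectedURL string, userOverrode bool) bool {
	if userOverrode {
		// User has taken the wheel; don't degrade on a mismatch.
		return true
	}
	return renderedURL == expectedURL
}

func main() {
	fmt.Println(requiredPoolsSynced("quay.io/custom/os:1", "quay.io/openshift/os:4.12", true))  // true
	fmt.Println(requiredPoolsSynced("quay.io/custom/os:1", "quay.io/openshift/os:4.12", false)) // false
}
```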
This adds an "image selection function" to the controller helpers and a boolean variable in helpers.go that determines which image it will use by default. The intent is that it be set to false by default here, rendering the "new image format by default" functionality inert, and that a later commit will enable it. This logic can be taken out later once we get rid of machine-os-content.
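A minimal sketch of the gating boolean and selection function (names are illustrative, not the actual helpers.go identifiers):

```go
package main

import "fmt"

// useNewFormatImage is the "gating boolean": false keeps the legacy
// machine-os-content behavior; a later commit flips it to true.
var useNewFormatImage = false

// selectOSImage returns the image a rendered config should use by default.
func selectOSImage(machineOSContent, baseOSContainerImage string) string {
	if useNewFormatImage {
		return baseOSContainerImage
	}
	return machineOSContent
}

func main() {
	fmt.Println(selectOSImage("machine-os-content@sha256:aaa", "rhel-coreos-8@sha256:bbb"))
	useNewFormatImage = true
	fmt.Println(selectOSImage("machine-os-content@sha256:aaa", "rhel-coreos-8@sha256:bbb"))
}
```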
This just flips our "gating boolean" to true so that the MCO will use the new format image by default.
This excludes all non-base images (right now, only extensions) from image-references if the build is using FCOS, because we currently don't ship FCOS extensions images, and including them would break the payload. This also removes baseOSExtensionsContainerImage from the osimageurl configmap so the placeholder value doesn't get passed through (oc won't rewrite it if it's not in image-references).
Force-pushed from 7225bd4 to 9a7cea4.
OK I've
/test e2e-agnostic-upgrade
@cgwalters looks like tests are not running on this PR because it is WIP (not sure, could be Prow as well)? Should we remove WIP?
I believe leaving the WIP is fine; the "draft" GitHub status is what is preventing it.
Also, the commit message needs to be updated for 9a7cea4.
Force-pushed from 9a7cea4 to f3f7daf.
/test e2e-agnostic-upgrade
Oops, sorry, I hadn't noticed you had a test run going here; it would have been nice to see the results! But in any case we really need to land a PR as an atomic unit that has everything, so we might as well do it in the other PR.
xref coreos/rpm-ostree@89f5802 Change the logic for doing OS updates to detect if rpm-ostree is too old to natively fetch a container image on its own. If so, we use `ex deploy-from-self` via podman. Introduce a new `/etc/machine-config-daemon-force-once` file that will cause us to skip validation, and hence we should re-reconcile and attempt to apply the OS upgrade again, this time natively via rpm-ostree. This is needed for scaleup of old nodes, as well as temporarily for 4.11 upgrades.
Force-pushed from 3973062 to 361808c.
rpm-ostree deploy-from-self does get us to the right image, but the MCO doesn't know that, because the OSImageURL/container name/custom deployment name doesn't get set when it rebases; it just uses the commit hash. This tells the MCO to:
- inspect the current image specified by OSImageURL,
- grab its commit hash from the labels, and
- compare it to the base commit of what we have on-disk.
If they match, we know we are on the correct image, even if we did not get there the usual way.
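The matching step in this commit can be sketched as a pure comparison; the label key `ostree.commit` is an assumption for illustration:

```go
package main

import "fmt"

// onExpectedImage checks whether the booted deployment's base commit
// matches the ostree commit recorded in the target image's labels.
// The "ostree.commit" label key is an assumption in this sketch.
func onExpectedImage(imageLabels map[string]string, bootedBaseCommit string) bool {
	want, ok := imageLabels["ostree.commit"]
	return ok && want == bootedBaseCommit
}

func main() {
	labels := map[string]string{"ostree.commit": "abc123"}
	fmt.Println(onExpectedImage(labels, "abc123")) // true
	fmt.Println(onExpectedImage(labels, "def456")) // false
}
```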
xref coreos/rpm-ostree@89f5802 Change the firstboot logic to detect whether rpm-ostree is too old to natively fetch a container image on its own. If so, we use `ex deploy-from-self` via podman. This is needed for scaleup of old nodes, because we won't be able to natively fetch the new-format base image (e.g. `rhel-coreos-8`) from the old rpm-ostree.
Force-pushed from 361808c to 5f47ed3.
This was merged in #3317.
xref coreos/rpm-ostree@89f5802
On firstboot, the rpm-ostree (and skopeo) running in the system
may be too old to do the native flow. This hacky bit runs
the new OS image as a container, and reboots another time before
doing anything else.