Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-356: daemon: Do a secondary in-place update if rpm-ostree is too old #3310

Closed

Conversation

cgwalters
Copy link
Member

@cgwalters cgwalters commented Aug 24, 2022

xref coreos/rpm-ostree@89f5802

On firstboot, the rpm-ostree (and skopeo) running in the system
may be too old to do the native flow. This hacky bit runs
the new OS image as a container, and reboots another time before
doing anything else.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 24, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 24, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2022
// If the host isn't new enough to understand the new container model natively, run as a privileged container and reboot.
// See https://github.com/coreos/rpm-ostree/pull/3961 and https://issues.redhat.com/browse/MCO-356
if !newEnough {
err := runCmdSync("podman", "run", "--privileged", "--pid=host", "--net=host", "--rm", "-v", "/:/run/host", mc.Spec.OSImageURL, "rpm-ostree", "ex", "deploy-from-self", "/run/host", "--reboot")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, this is basically updating the on-host rpm-ostree from the updated rpm-ostree in the updated (4.12) oscontainer. Why is the reboot needed vs, say, restarting the rpm-ostree daemon?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The apply-live path was only stabilized in 2022.13...earlier versions do have ex apply-live, but really old versions don't.

We could indeed optimize this in the future to skip a reboot for sufficiently new versions; but I think ultimately what we want to do is to enhance this code to do all the setup from the container too, to avoid the next update doing a reboot.

IOW our firstboot path then always runs rpm-ostree from a container, and not the host.

On the other hand, we have some code for updating bootimages, and once that lands this code is going to be hit much less often.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2022
We decided baseOSContainerImage was clearer than
baseOperatingSystemContainer (and the same for the extensions container)
so this renames all of this throughout to match.

I'm doing this first before all of the rest of the "new image by
default" code so if we have to revert one of those commits, we don't
have to worry about the names.

This just renames all instances of the BaseOperatingSystem variables
througout to BaseOS and favors the more clear "ContainerImage" suffix
rather than the less clear "Container", as "Container" typically
signifies something that's running.
Now that we have the extensions container, we can apply extensions and
switch kernels again.

This just re-enables that functionality for the 'new image' path during
daemon updates, and removes the "not supported" messages.
Both rpm-ostree rebase and update require access to the pull secrets in
order to pull manifests and images.

This changes the behavior such that we symlink the kubelet config.json
secrets to /run/ostree/auth.json when we initialize our
"NodeUpdaterClient" (rather than only in rebase) so any of our
rpm-ostree calls can use it.
Until we figure out if/how we want to use the extensions container as a
service, we're going to have to extract it to disk to use it.

In order for that to happen, the daemon needs to know where to get it,
so it needs to be present in machineconfig.

This adds the extensions container to the machineconfig type, the crd,
and the resourcemerge library.
This passes the extensions container from controller config through to
machineconfig.
Since we can't use the extensions container as a service right now due
to boostrap/dns/complexity, for now, we're going to extract them to
disk like we did the extensions in machine-os-contanet.

This pulls and extracts the extensions container to /run and adds it as
a yum repo if we are on a new format image and there are extensions
present in MachineConfig.
Currently if someone overrides OSImageURL on their master pool, the
required pools sync in the operator will fail because it checks to make
sure the OSImageURL matches, and when it's overridden it doesn't. (This
results in degradation and timeouts).

The check was originally put in to fix a bug around proper update
completion reporting.

This allows the check to be skipped if the user has "taken the wheel" by
overriding OSImageURL.
This adds an "image selection function" to the controller helpers and a
boolean variable in helpers.go that determines which image it will use
by default.

The intent is that it is that it be set to false by default here,
rendering the "new image format by default" functionality inert, but
that a later commit will enable it.

This logic can be taken out later once we get rid of machine-os-content.
This just flips our "gating boolean" to true so that the MCO will use
the new format image by default.
This excludes all non-base (right now, only extensions) images from
image-references if the build is using fcos, because we currently don't
ship fcos extensions images, and including them would break the
payload.

This also removes the baseOSExtensionsContainerImage from the osimagurl
configmap so the placeholder value doesn't get passed through. (oc won't
rewrite it if it's not in image-references).
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 14, 2022
@cgwalters
Copy link
Member Author

@sinnykumari
Copy link
Contributor

@cgwalters looks like tests are not running on this PR because it is in WIP (not sure, could be prow as well)? should we remove WIP?

@yuqi-zhang
Copy link
Contributor

I believe leaving the WIP is fine, the "draft" github status is what is preventing it

@sinnykumari
Copy link
Contributor

also commit message needs to be updated for 9a7cea4

@cgwalters cgwalters changed the title WIP: MCO-356: Add an extra reboot for old bootimages MCO-356: daemon: Do a secondary in-place update if rpm-ostree is too old Sep 14, 2022
@sinnykumari
Copy link
Contributor

/test e2e-agnostic-upgrade

@cgwalters cgwalters closed this Sep 15, 2022
@cgwalters
Copy link
Member Author

/test e2e-agnostic-upgrade

Oops sorry I hadn't noticed you had a test run going here, would have been nice to see the results! But in any case we really need to land a PR as an atomic unit that has everything, so might as well do it in the other PR.

xref coreos/rpm-ostree@89f5802

Change the logic for doing OS updates to detect if rpm-ostree is too
old to natively fetch a container image on its own.  If so,
we use `ex deploy-from-self` via podman.

Introduce a new `/etc/machine-config-daemon-force-once` file that
will cause us to skip validation, and hence we should re-reconcile
and attempt to apply the OS upgrade again, this time natively
via rpm-ostree.

This is needed for scaleup of old nodes, as well as temporarily
for 4.11 upgrades.
@cgwalters cgwalters reopened this Sep 19, 2022
@cgwalters cgwalters force-pushed the double-rebase-for-old-versions branch 3 times, most recently from 3973062 to 361808c Compare September 20, 2022 00:25
jkyros and others added 2 commits September 20, 2022 00:09
rpm-ostree deploy-from-self does get us to the right image, but the MCO
doesn't know that because the OSImageURL/Container Name/Custom
Deployment name doesn't get set when it rebases, it just uses the
commit hash.

This tells the MCO to:
- inspect the current image specified by OSImageURL,
- grab its commit hash from the labels
- compare it to the base commit of what we have on-disk.
If they match, we know we are on the correct image, even if we
did not get there the usual way.
xref coreos/rpm-ostree@89f5802

Change the logic for firstboot detect if rpm-ostree is too
old to natively fetch a container image on its own.  If so,
we use `ex deploy-from-self` via podman.

This is needed for scaleup of old nodes because we won't be able
to natively fetch the new-format base image (e.g. `rhel-coreos-8`)
from the old rpm-ostree.
@cgwalters
Copy link
Member Author

This was merged in #3317

@cgwalters cgwalters closed this Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants