Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

baremetal: Add strategy for upgrading CoreOS-based deploy image #879

Closed
wants to merge 4 commits into from

Conversation

zaneb
Copy link
Member

@zaneb zaneb commented Aug 26, 2021

To ensure that ironic-python-agent runs on top of an up-to-date OS, we will
update the CoreOS image URLs in the baremetal Provisioning CR to the latest
specified by the release metadata. For users running disconnected installs, we
will require them to make the latest versions available and block further
upgrades until they do so.

@zaneb
Copy link
Member Author

zaneb commented Aug 26, 2021

Copy link
Member

@dtantsur dtantsur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Left some comments inline regarding the flow of the text, but the idea seems correct and matches what we have discussed.

will update itself to the version matching the cluster it is to join. It
remains suboptimal because new Machines will take longer and longer (and more
and more bandwidth) to boot as the cluster ages, and also because support for
particular hardware may theoretically require a particular version of CoreOS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's not a theoretical problem, and it's much worse for baremetal than for other platforms because real hardware evolves much quicker than virtual environments.


## Summary

To ensure that ironic-python-agent runs on top of an up-to-date OS, we will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe a footnote explaining for ironic-python-agent in case somebody from the rest of openshift reads this proposal.


Once this is in place, we no longer need the QCOW2 image at all, since we can
‘provision’ by asking CoreOS in the deploy image to install itself (using
custom deploy steps in Ironic, exposed as a custom deploy method in Metal³).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The information in brackets is probably too detailed.


### Non-Goals

* Automatically switch pre-existing MachineSets to deploy with `coreos-install`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: coreos-install is our internal concept that is not defined above. Maybe "... away from using qcow2 images"?

The cluster-baremetal-operator will verify the image URLs as part of
reconciling the Provisioning CR.

If any of the `PreprovisioningOSDownloadURLs` are not set and the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This implies new fields that are not introduced in this spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an API change proposed here, it would be good to spell out the new field structure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are already present in 4.9.

or just require the user to manually set them?
* Is it acceptable to require users of disconnected installs to take action on
every upgrade? Is that better or worse than leaving them with an out-of-date
OS to run IPA on (since it will *not* be updated until actually provisioned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From support perspective, it introduces a version of CoreOS that is not current and is not known. If it causes a bug in IPA (e.g. unsupported hardware), debugging it will be problematic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change the API to make it less necessary to guess the URLs? For example, could we ask for a URL base, to point to a directory in which files with names we control should be stored? So the user would give us https://my.private.server/openshift and we would add something like /4.y.z/coreos.kernel? That would avoid them having to update the URL on every upgrade, but would still let the operator report that it is Degraded if the necessary files do not exist on the server after the upgrade.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the installer's API and I think it's common to every platform, so I'm not sure that changing it would be welcomed.
In any event, that wouldn't help us with clusters that were installed in the past.

That would avoid them having to update the URL on every upgrade

Updating the URL ourselves is easy enough, it's the user actually putting the file where we need it that's the blocker. Being slightly more strict about renaming doesn't really move the needle much imho.

Ideally the CoreOS images would be provided automatically by the same tooling that mirrors the release payload (actually ideally the image would be in the release payload), but that's not where we are.

require only that working images have been specified at least once.

Instead of upgrading, have CoreOS somehow try to update itself in place before
running IPA. (This is likely to be slow, and it's not clear that it is even
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very slow and memory-greedy. Won't work for disconnected installs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wonder how disconnected installs update CoreOS (which is supposed to update itself before joining the cluster) now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow RHCOS gets written to disk, and then the system boots it from disk for the first time. At that point it upgrades the same way RHCOS gets updated in on an ongoing basis; it retrieves a particular container image from the payload that contains the current RHCOS ostree, applies that ostree to the filesystem on disk, then reboots again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was not aware that the ostree is already in the payload.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2021
enhancements/baremetal/upgrade-coreos-deploy-image.md Outdated Show resolved Hide resolved
enhancements/baremetal/upgrade-coreos-deploy-image.md Outdated Show resolved Hide resolved
mechanism to update the image to match the current Ironic version, or fix
bugs, including security bugs.
* We have no way of knowing the URLs for the new deploy image, because they can
only be supplied at install time by the installer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 85-94 sound like they are describing a rejected alternative, and should go down in that section of the doc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context here is that all of the stuff described above was ready to land in 4.9 (and in fact a lot of it did, we just didn't flip the switch in the installer or remove ironic-ipa-downloader yet), but for the fact that we don't have a way to safely upgrade existing clusters.

So from the perspective of this enhancement, the lack of an upgrade path to the work we've already done is the problem, and the proposal is the solution. We could rewrite it so that the existing work is incorporated into the proposal and the actual stuff we need input on is bumped to the 'Upgrade Strategy' section, but that doesn't seem conducive to getting the best feedback.

enhancements/baremetal/upgrade-coreos-deploy-image.md Outdated Show resolved Hide resolved
enhancements/baremetal/upgrade-coreos-deploy-image.md Outdated Show resolved Hide resolved

* Will marking the cluster-baremetal-operator as incomplete cause an upgrade to
be rolled back? Is there a better form of alert that will prevent a future
upgrade without rolling back the current one?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing triggers an automated rollback, IIUC. The CVO just keeps trying the upgrade until it succeeds. @sdodson can you verify that?

or just require the user to manually set them?
* Is it acceptable to require users of disconnected installs to take action on
every upgrade? Is that better or worse than leaving them with an out-of-date
OS to run IPA on (since it will *not* be updated until actually provisioned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change the API to make it less necessary to guess the URLs? For example, could we ask for a URL base, to point to a directory in which files with names we control should be stored? So the user would give us https://my.private.server/openshift and we would add something like /4.y.z/coreos.kernel? That would avoid them having to update the URL on every upgrade, but would still let the operator report that it is Degraded if the necessary files do not exist on the server after the upgrade.

release payload.

Furthermore, we will need to test each of these scenarios both with the default
image URLs and with custom URLs for disconnected installs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ought to be able to automate those as unit tests for the operator, right? We don't need end-to-end tests for all of those scenarios?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously we do need unit tests, but I think it would also be nice to know that all the moving parts actually work together the way we expect.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 30, 2021

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 30, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from dtantsur after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Limit ourselves to finding an image URL when there is none; do not
attempt to keep it up to date in perpetuity.
@zaneb
Copy link
Member Author

zaneb commented Sep 9, 2021

I reduced the scope of this so that rather than trying to always keep the image up to date, we just try to ensure that the image URLs are set once (so that we can upgrade to this new method and not have to maintain the old one as well).

@zaneb
Copy link
Member Author

zaneb commented Sep 23, 2021

After discussion with the CoreOS team and other stakeholders, I'm withdrawing this in favour of #909.
/close

@openshift-ci openshift-ci bot closed this Sep 23, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 23, 2021

@zaneb: Closed this PR.

In response to this:

After discussion with the CoreOS team and other stakeholders, I'm withdrawing this in favour of #909.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants