# baremetal: Add strategy for upgrading CoreOS-based deploy image #879

## Conversation
/assign @hardys
/lgtm
Left some comments inline regarding the flow of the text, but the idea seems correct and matches what we have discussed.
> … will update itself to the version matching the cluster it is to join. It remains suboptimal because new Machines will take longer and longer (and more and more bandwidth) to boot as the cluster ages, and also because support for particular hardware may theoretically require a particular version of CoreOS.
nit: It's not a theoretical problem, and it's much worse for baremetal than for other platforms because real hardware evolves much quicker than virtual environments.
> ## Summary
>
> To ensure that ironic-python-agent runs on top of an up-to-date OS, we will …
nit: Maybe add a footnote explaining ironic-python-agent, in case somebody from the rest of OpenShift reads this proposal.
> Once this is in place, we no longer need the QCOW2 image at all, since we can ‘provision’ by asking CoreOS in the deploy image to install itself (using custom deploy steps in Ironic, exposed as a custom deploy method in Metal³).
nit: The information in brackets is probably too detailed.
> ### Non-Goals
>
> * Automatically switch pre-existing MachineSets to deploy with `coreos-install` …
nit: `coreos-install` is our internal concept that is not defined above. Maybe "… away from using qcow2 images"?
> The cluster-baremetal-operator will verify the image URLs as part of reconciling the Provisioning CR.
>
> If any of the `PreprovisioningOSDownloadURLs` are not set and the …
nit: This implies new fields that are not introduced in this spec.
If there's an API change proposed here, it would be good to spell out the new field structure.
These fields are already present in 4.9.
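Since the thread above asks for the field structure to be spelled out, a rough illustration may help. The sketch below shows the approximate shape of the 4.9 `PreprovisioningOSDownloadURLs` fields on the Provisioning CR; field and JSON tag names are reproduced from memory and should be treated as an assumption, not an API reference.

```go
package v1alpha1

// PreProvisioningOSDownloadURLs is an approximate sketch of the fields
// discussed above. Exact names may differ from the shipped 4.9 API.
type PreProvisioningOSDownloadURLs struct {
	// IsoURL points to a live ISO that boots directly into the deploy agent.
	IsoURL string `json:"isoURL,omitempty"`
	// KernelURL, InitramfsURL and RootfsURL describe the same live image
	// split into PXE-bootable artifacts.
	KernelURL    string `json:"kernelURL,omitempty"`
	InitramfsURL string `json:"initramfsURL,omitempty"`
	RootfsURL    string `json:"rootfsURL,omitempty"`
}

// ProvisioningSpec shows only where the new field hangs off the existing CR.
type ProvisioningSpec struct {
	// ... existing fields elided ...
	PreProvisioningOSDownloadURLs PreProvisioningOSDownloadURLs `json:"preProvisioningOSDownloadURLs,omitempty"`
}
```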
> … or just require the user to manually set them?
> * Is it acceptable to require users of disconnected installs to take action on every upgrade? Is that better or worse than leaving them with an out-of-date OS to run IPA on (since it will *not* be updated until actually provisioned …
From a support perspective, it introduces a version of CoreOS that is not current and is not known. If it causes a bug in IPA (e.g. unsupported hardware), debugging it will be problematic.
Could we change the API to make it less necessary to guess the URLs? For example, could we ask for a URL base, to point to a directory in which files with names we control should be stored? So the user would give us https://my.private.server/openshift and we would add something like /4.y.z/coreos.kernel? That would avoid them having to update the URL on every upgrade, but would still let the operator report that it is Degraded if the necessary files do not exist on the server after the upgrade.
It's the installer's API and I think it's common to every platform, so I'm not sure that changing it would be welcomed. In any event, that wouldn't help us with clusters that were installed in the past.

> That would avoid them having to update the URL on every upgrade

Updating the URL ourselves is easy enough, it's the user actually putting the file where we need it that's the blocker. Being slightly more strict about renaming doesn't really move the needle much imho.

Ideally the CoreOS images would be provided automatically by the same tooling that mirrors the release payload (actually, ideally the image would be in the release payload), but that's not where we are.
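To make the base-URL naming convention floated in this thread concrete, here is a small hypothetical helper (not part of any existing API or operator) showing how per-version artifact URLs could be derived from a single user-supplied directory URL; the file names and layout are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// deriveURLs sketches the hypothetical convention discussed above: the user
// supplies a base directory URL once, and the operator appends version- and
// artifact-specific names that it controls.
func deriveURLs(baseURL, version string) map[string]string {
	base := strings.TrimRight(baseURL, "/")
	return map[string]string{
		"kernel":    fmt.Sprintf("%s/%s/coreos.kernel", base, version),
		"initramfs": fmt.Sprintf("%s/%s/coreos.initramfs", base, version),
		"rootfs":    fmt.Sprintf("%s/%s/coreos.rootfs", base, version),
	}
}

func main() {
	// Example values only; the server name comes from the comment above.
	for name, url := range deriveURLs("https://my.private.server/openshift", "4.9.0") {
		fmt.Println(name, url)
	}
}
```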
> … require only that working images have been specified at least once.
>
> Instead of upgrading, have CoreOS somehow try to update itself in place before running IPA. (This is likely to be slow, and it's not clear that it is even …
Very slow and memory-greedy. Won't work for disconnected installs.
This makes me wonder how disconnected installs update CoreOS (which is supposed to update itself before joining the cluster) now?
Somehow RHCOS gets written to disk, and then the system boots it from disk for the first time. At that point it upgrades the same way RHCOS gets updated on an ongoing basis: it retrieves a particular container image from the payload that contains the current RHCOS ostree, applies that ostree to the filesystem on disk, then reboots again.
Ah, I was not aware that the ostree is already in the payload.
> … mechanism to update the image to match the current Ironic version, or fix bugs, including security bugs.
> * We have no way of knowing the URLs for the new deploy image, because they can only be supplied at install time by the installer.
Lines 85-94 sound like they are describing a rejected alternative, and should go down in that section of the doc.
The context here is that all of the stuff described above was ready to land in 4.9 (and in fact a lot of it did, we just didn't flip the switch in the installer or remove ironic-ipa-downloader yet), but for the fact that we don't have a way to safely upgrade existing clusters.
So from the perspective of this enhancement, the lack of an upgrade path to the work we've already done is the problem, and the proposal is the solution. We could rewrite it so that the existing work is incorporated into the proposal and the actual stuff we need input on is bumped to the 'Upgrade Strategy' section, but that doesn't seem conducive to getting the best feedback.
> * Will marking the cluster-baremetal-operator as incomplete cause an upgrade to be rolled back? Is there a better form of alert that will prevent a future upgrade without rolling back the current one?
Nothing triggers an automated rollback, IIUC. The CVO just keeps trying the upgrade until it succeeds. @sdodson can you verify that?
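As a rough illustration of the kind of alert being asked about, the operator could probe the configured image URLs during reconciliation and report a condition on its ClusterOperator rather than failing outright. The sketch below is an assumed approach, not the actual cluster-baremetal-operator code; the condition reason and message are invented.

```go
package imagecheck

import (
	"fmt"
	"net/http"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// checkImageURLs probes each configured deploy-image URL with a HEAD request
// and returns a Degraded=True condition if any of them is unreachable.
// A real implementation might instead set Upgradeable=False so that the
// current upgrade completes but the next one is blocked.
func checkImageURLs(urls []string) configv1.ClusterOperatorStatusCondition {
	for _, u := range urls {
		resp, err := http.Head(u)
		if err == nil {
			resp.Body.Close()
		}
		if err != nil || resp.StatusCode >= 400 {
			return configv1.ClusterOperatorStatusCondition{
				Type:               configv1.OperatorDegraded,
				Status:             configv1.ConditionTrue,
				Reason:             "PreprovisioningImageUnavailable", // invented reason
				Message:            fmt.Sprintf("deploy image %q is not reachable", u),
				LastTransitionTime: metav1.Now(),
			}
		}
	}
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
	}
}
```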
> … release payload.
>
> Furthermore, we will need to test each of these scenarios both with the default image URLs and with custom URLs for disconnected installs.
We ought to be able to automate those as unit tests for the operator, right? We don't need end-to-end tests for all of those scenarios?
Obviously we do need unit tests, but I think it would also be nice to know that all the moving parts actually work together the way we expect.
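For the unit-test side of that, a minimal sketch (building on the hypothetical `checkImageURLs` helper shown earlier; everything here is illustrative, not existing operator code) might drive the URL verification against a fake, unpopulated mirror:

```go
package imagecheck

import (
	"net/http"
	"net/http/httptest"
	"testing"

	configv1 "github.com/openshift/api/config/v1"
)

// TestMissingCustomImageDegrades simulates a disconnected mirror that has not
// been populated after an upgrade and expects the verification logic to flag it.
func TestMissingCustomImageDegrades(t *testing.T) {
	mirror := httptest.NewServer(http.NotFoundHandler())
	defer mirror.Close()

	cond := checkImageURLs([]string{mirror.URL + "/coreos.kernel"})
	if cond.Status != configv1.ConditionTrue {
		t.Fatalf("expected Degraded=True when the mirror is missing images, got %v", cond.Status)
	}
}
```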
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Limit ourselves to finding an image URL when there is none; do not attempt to keep it up to date in perpetuity.
I reduced the scope of this so that rather than trying to always keep the image up to date, we just try to ensure that the image URLs are set once (so that we can upgrade to this new method and not have to maintain the old one as well).
After discussion with the CoreOS team and other stakeholders, I'm withdrawing this in favour of #909.
@zaneb: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> To ensure that ironic-python-agent runs on top of an up-to-date OS, we will update the CoreOS image URLs in the baremetal Provisioning CR to the latest specified by the release metadata. For users running disconnected installs, we will require them to make the latest versions available and block further upgrades until they do so.