-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow overriding OSImageURL with a layered image #3272
Allow overriding OSImageURL with a layered image #3272
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking good! Some questions/nits below
Also as a general concern, none of this would affect RHEL since the gating would prevent us going into any OS update code paths, right?
pkg/controller/common/helpers.go
Outdated
// For layering, we want to let the user override OSImageURL again | ||
for _, cfg := range configs { | ||
if cfg.Spec.OSImageURL != "" { | ||
osImageURL = cfg.Spec.OSImageURL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is our current plan to override the base OSImageURL as opposed to a new field?
minor nit: there is a comment at the start of the function
// It uses only the OSImageURL provided by the CVO and ignores any MC provided OSImageURL.
we should probably remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I let Colin twist my arm too hard, but yes, the current plan was to let the user "take the wheel" and override OSImageURL. 😄
I updated the comment to match this PR's behavior.
@@ -148,6 +149,15 @@ func (r *RpmOstreeClient) GetBootedOSImageURL() (string, string, error) { | |||
} | |||
} | |||
|
|||
// we have container images now, make sure we can parse those too | |||
if bootedDeployment.ContainerImageReference != "" { | |||
// right now they start with "ostree-unverified-registry:", so scrape that off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this always be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at least for the time being we know there will be a prefix, but I don't know that it will always be explicitly the "ostree-unverified-registry" prefix.
I can just scrape the prefix of "ostree-unverified-registry" and move on, but then if the prefix changes, we're broken. This here, however, will break if the prefix is ever completely absent, so that's not 100% future-proof either.
We talked about this when we put this in the layering branch -- we were trying to avoid having to regex this, but I can?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So...well, yes this is a bit messy.
It actually does open up the question of how the user specifies security-related settings for this image. (Not just for the ostree-level signatures, but also e.g. TLS)
For the moment, I have to say that we hardcode injecting ostree-unverified-registry:
as a prefix on the URL they provide.
We're planning to introduce a CRD, and that's probably the right place to have explicit settings for this.
That said, there is an argument that the interface to this containers-policy.json
- and we shouldn't try to treat the OS image container specially at all.
But I dunno...it is different because we're going to boot it.
Anyways yeah my instinct right now is to inject ostree-unverified-registry:
on the container image reference they provide.
Parsing the value by splitting on the first :
is certainly something we can rely on right now.
But yes, this is all still technically experimental and change-able. But we wouldn't change it without getting the MCO to work with the new way first!
pkg/daemon/update.go
Outdated
func (dn *CoreOSDaemon) applyOSChanges(mcDiff machineConfigDiff, oldConfig, newConfig *mcfgv1.MachineConfig) (retErr error) { | ||
// Extract image and add coreos-extensions repo if we have either OS update or package layering to perform | ||
|
||
if dn.nodeWriter != nil { | ||
dn.nodeWriter.Eventf(corev1.EventTypeNormal, "OSUpdateStarted", mcDiff.osChangesString()) | ||
} | ||
|
||
// TODO(jkyros): both paths (layering/non-layering) have different pre/update/post steps, but the events are the same. I hate to | ||
// duplicate the eventing code, but I feel like I should split this function into like "applyLegacyOSChanges" and "applyLayeredOSChanges" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense
pkg/daemon/update.go
Outdated
// TODO(jkyros): both paths (layering/non-layering) have different pre/update/post steps, but the events are the same. I hate to | ||
// duplicate the eventing code, but I feel like I should split this function into like "applyLegacyOSChanges" and "applyLayeredOSChanges" | ||
|
||
client := NewNodeUpdaterClient() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, shouldn't there already be a client that's been initialized somewhere? Would we be able to use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there is one in the daemon. using it now.
pkg/daemon/update.go
Outdated
if osImageContentDir, err = ExtractOSImage(newConfig.Spec.OSImageURL); err != nil { | ||
|
||
if isLayeredImage, err = client.IsBootableImage(newConfig.Spec.OSImageURL); err != nil { | ||
// TODO(jkyros): Maybe this shouldn't be an error for now, so we don't impact any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this should only return err if the image itself can't be pulled right? Which would be a failure anyways?
Probably should wrap the error with some context though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, I was just paranoid about adding additional error possibility for people who weren't using the feature. I added some context to the error.
I was able to boot OKD with machine-os-content layered using this PR, excellent job! |
1f88386
to
c404bcd
Compare
From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3272/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1554512424382500864/artifacts/e2e-gcp-op/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-gnpvb_machine-config-daemon.log the previous MCD appears to have not applied any kargs, which, hmm, should have been applied as far as the code is concerned? Doesn't look like the previous update failed either. Unfortunately no previous MCD logs |
c404bcd
to
6a37ec0
Compare
I totally broke it because I nested my layering check too high in |
No waste 😄 just took a 5 minute look at the logs |
Looked like aws/terraform issues the first time, trying again. |
pkg/controller/common/helpers.go
Outdated
if overriddenOSImageURL != "" && overriddenOSImageURL != osImageURL { | ||
|
||
glog.Infof("OSImageURL has been overridden via machineconfig with %s (was: %s)", overriddenOSImageURL, osImageURL) | ||
osImageURL = overriddenOSImageURL | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably where I'd safety check/feature gate it -- just prevent the merge to keep the old behavior.
If we aren't going to feature gate it, do we want a safety check in here that's:
- check to make sure it's a "new format" image
- only allow it to be overridden if we were able to check and it was a "new format" image?
So that no existing clusters would be impacted if they had, say, an old MachineConfig
with an old image in it that previously would have been ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, this does introduce the additional behaviour of having the controller pull the image as well (or at least a remote inspect), which isn't necessarily a bad thing, since if this errors, the daemon would likely have errored anyways, so we are surfacing this earlier in the upgrade pipeline.
To summarize our options then are:
- add a featuregate to enable osImageURL overrides in 4.12, and then remove this in future versions once this is more mature (does featuregates prevent upgrades at all?)
- add an additional image check in the controller with additional error
- no additional safety at all. the daemon will error if you have an override image that isn't in the correct format, which would only trip up users in one scenario (the user had a osImageurl override from a previous version incorrectly, but since we didn't error, didn't even notice)
I think... I would lean towards 2 personally, but don't have a strong opinion. Essentially 2 and 3 are the "same" in terms of what would fail, but 2 surfaces it earlier, and 1 would just be "make sure you absolutely know what you are doing" which I would also be ok with so long as we have a plan out of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked on Monday and given:
- the complexity of adding the safety
- some of the container/registry library constraints (potentially having to parse/manage
config.json
entries ourselves in the controller, or writing secrets out to disk) and - the likelihood of someone actually having an old OSImageURL in their machineconfig
We opted not to put the safety in at this time, but we can later if we think we need to.
This only works with public images right now because I didn't do anything with pull secrets, so Minimally, letting ostree have at the kubelet
but there might be reasons we don't want to do that? |
Yep, this is what we should do by default. (But, I think it does probably long term make sense to expose a knob for this, it's one thing that would naturally be done via the "host as pod" model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! Looking generally good to me
pkg/daemon/rpm-ostree.go
Outdated
args := []string{"rebase", "--experimental", "ostree-unverified-registry:" + imgURL} | ||
|
||
return runRpmOstree(args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this simpler as
args := []string{"rebase", "--experimental", "ostree-unverified-registry:" + imgURL} | |
return runRpmOstree(args...) | |
return runRpmOstree("rebase", "--experimental", "ostree-unverified-registry:" + imgURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, simplified. Thank you.
@@ -303,3 +348,23 @@ func runGetOut(command string, args ...string) ([]byte, error) { | |||
} | |||
return rawOut, nil | |||
} | |||
|
|||
func (state *rpmOstreeState) getBootedDeployment() (*RpmOstreeDeployment, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Tangentially related to this I was kind of thinking about splitting out a github.com/coreos/rpmostree-client-go repository)
pkg/daemon/update.go
Outdated
// Switch to real time kernel | ||
if err := dn.switchKernel(oldConfig, newConfig); err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one can't work yet either...I'm thinking of creating rhel-coreos-8-rt
as a separate image, but some prep work needs to happen for that.
So I think for now we should just error out if we detect kernelType != "default"
pkg/daemon/update.go
Outdated
} | ||
|
||
// TODO(jkyros): This is where we will handle Joseph's extensions container | ||
glog.Infof("Can't apply extensions, not supported yet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say we should error out instead if any are set.
Previously our machine-os-content images were just "wrappers" around ostree commits. With our new "layering" work, we have a new image format that is actually a working OCI container, but this type of image needs to be applied differently by rpm-ostree. This extends the RpmOstreeClient to support new layered images by: - capturing more metadata from rpm-ostree status - adding an additional rebase function specifically for the layered case
We changed the interface for RpmOstreeClient, so we need to make sure the mocks are also up to date with the new signatures
cd1b9d3
to
3bfc9b9
Compare
pkg/controller/common/helpers.go
Outdated
} | ||
// Make sure it's obvious in the logs that it was overridden | ||
if overriddenOSImageURL != "" && overriddenOSImageURL != osImageURL { | ||
glog.Infof("OSImageURL has been overridden via machineconfig with %s (was: %s)", overriddenOSImageURL, osImageURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should also send an event here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the logging out a level so the name of the config was available to be logged, and also added an event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor nits, otherwise this looks good to me. Nice work John.
Previously the MCO was arranged so that any user-supplied values for OSImageURL were ignored, and only the value from controllerconfig was used. This allows OSImageURL to be specified in a machineconfig, using the same precedence rules as everything else ( the last machineconfig with a populated OSImageURL "wins" the merge). If there is no user-specified OSImageURL (the field is blank in all machineconfigs) the MCO will fall back to using the controllerconfig OSImageURL like it did previousy. The purpose of this is for the user to be able to override the OS image using a custom/hotfix/layered image instead of being forced to use the one in the release payload.
Previously we were testing to make sure that OSImageURL overrides were prevented, but now we want to allow them again. This updates the test to make sure that OSImageURL gets merged properly as part of a machineconfig merge.
Previously there were a bunch of if conditions deciding what work needed to be done to the OS based on the mcDiff. We want to handle new format images, but adding an additional branch adds additional complexity to those ifs. This breaks the old and new paths out into separate functions for readability, ease of maintenance, and ease of retirement once we drop the old format image.
There is a place in the daemon where the image can get updated as part of the bootstrap flow. For completeness, we should make sure that it can use a "new format" image, if it were to ever get one. This makes the bootstrap flow check to see if the image it got is a new format image, so it can apply it properly.
For public images, everything works great, rpm-ostree can pull them with no problem, but when trying to use images that are in a private repository and require a pull secret, rpm-ostree needs to have that pull secret in order to use it. ostree by default looks in `/run/ostree/auth.json` for pull secrets, , and the pull secrets are already in `/var/lib/kubelet/config/json`, so this just links to them from `/run/ostree` so that ostree can access them.
3bfc9b9
to
6796633
Compare
Fixed nits, added event per Sinny's request. |
@jkyros: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, jkyros, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This allows
OSImageURL
inMachineConfig
to be overridden with a layered image.This isn't entirely complete, I'm still working on cleaning it up, but:
In #3258 I was more worried about dealing with the image in the payload, but since that seems like it might be blocked for a little while, we'll start with this and then we can go back and connect those tubes later.
Credit where credit is due, @yuqi-zhang started doing this against the layering branch over here back in February: #2939
This is in service to: MCO-291