Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow overriding OSImageURL with a layered image #3272

Merged

Conversation

jkyros
Copy link
Contributor

@jkyros jkyros commented Jul 28, 2022

This allows OSImageURLin MachineConfig to be overridden with a layered image.

This isn't entirely complete, I'm still working on cleaning it up, but:

  • the functionality does work
  • I'd like to see what e2e tests this breaks 😄

In #3258 I was more worried about dealing with the image in the payload, but since that seems like it might be blocked for a little while, we'll start with this and then we can go back and connect those tubes later.

Credit where credit is due, @yuqi-zhang started doing this against the layering branch over here back in February: #2939

This is in service to: MCO-291

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2022
Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking good! Some questions/nits below

Also as a general concern, none of this would affect RHEL since the gating would prevent us going into any OS update code paths, right?

// For layering, we want to let the user override OSImageURL again
for _, cfg := range configs {
if cfg.Spec.OSImageURL != "" {
osImageURL = cfg.Spec.OSImageURL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is our current plan to override the base OSImageURL as opposed to a new field?

minor nit: there is a comment at the start of the function
// It uses only the OSImageURL provided by the CVO and ignores any MC provided OSImageURL.
we should probably remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I let Colin twist my arm too hard, but yes, the current plan was to let the user "take the wheel" and override OSImageURL. 😄

I updated the comment to match this PR's behavior.

@@ -148,6 +149,15 @@ func (r *RpmOstreeClient) GetBootedOSImageURL() (string, string, error) {
}
}

// we have container images now, make sure we can parse those too
if bootedDeployment.ContainerImageReference != "" {
// right now they start with "ostree-unverified-registry:", so scrape that off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this always be the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at least for the time being we know there will be a prefix, but I don't know that it will always be explicitly the "ostree-unverified-registry" prefix.

I can just scrape the prefix of "ostree-unverified-registry" and move on, but then if the prefix changes, we're broken. This here, however, will break if the prefix is ever completely absent, so that's not 100% future-proof either.

We talked about this when we put this in the layering branch -- we were trying to avoid having to regex this, but I can?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So...well, yes this is a bit messy.

It actually does open up the question of how the user specifies security-related settings for this image. (Not just for the ostree-level signatures, but also e.g. TLS)

For the moment, I have to say that we hardcode injecting ostree-unverified-registry: as a prefix on the URL they provide.

We're planning to introduce a CRD, and that's probably the right place to have explicit settings for this.

That said, there is an argument that the interface to this containers-policy.json - and we shouldn't try to treat the OS image container specially at all.

But I dunno...it is different because we're going to boot it.

Anyways yeah my instinct right now is to inject ostree-unverified-registry: on the container image reference they provide.

Parsing the value by splitting on the first : is certainly something we can rely on right now.

But yes, this is all still technically experimental and change-able. But we wouldn't change it without getting the MCO to work with the new way first!

func (dn *CoreOSDaemon) applyOSChanges(mcDiff machineConfigDiff, oldConfig, newConfig *mcfgv1.MachineConfig) (retErr error) {
// Extract image and add coreos-extensions repo if we have either OS update or package layering to perform

if dn.nodeWriter != nil {
dn.nodeWriter.Eventf(corev1.EventTypeNormal, "OSUpdateStarted", mcDiff.osChangesString())
}

// TODO(jkyros): both paths (layering/non-layering) have different pre/update/post steps, but the events are the same. I hate to
// duplicate the eventing code, but I feel like I should split this function into like "applyLegacyOSChanges" and "applyLayeredOSChanges"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense

// TODO(jkyros): both paths (layering/non-layering) have different pre/update/post steps, but the events are the same. I hate to
// duplicate the eventing code, but I feel like I should split this function into like "applyLegacyOSChanges" and "applyLayeredOSChanges"

client := NewNodeUpdaterClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, shouldn't there already be a client that's been initialized somewhere? Would we be able to use that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there is one in the daemon. using it now.

if osImageContentDir, err = ExtractOSImage(newConfig.Spec.OSImageURL); err != nil {

if isLayeredImage, err = client.IsBootableImage(newConfig.Spec.OSImageURL); err != nil {
// TODO(jkyros): Maybe this shouldn't be an error for now, so we don't impact any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this should only return err if the image itself can't be pulled right? Which would be a failure anyways?

Probably should wrap the error with some context though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, I was just paranoid about adding additional error possibility for people who weren't using the feature. I added some context to the error.

@vrutkovs
Copy link
Member

I was able to boot OKD with machine-os-content layered using this PR, excellent job!

@yuqi-zhang
Copy link
Contributor

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3272/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1554512424382500864/artifacts/e2e-gcp-op/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-gnpvb_machine-config-daemon.log the previous MCD appears to have not applied any kargs, which, hmm, should have been applied as far as the code is concerned? Doesn't look like the previous update failed either. Unfortunately no previous MCD logs

@jkyros
Copy link
Contributor Author

jkyros commented Aug 2, 2022

I totally broke it because I nested my layering check too high in applyOSChanges() and it was skipping kargs when nothing else had changed. Sorry for the trouble, wasn't trying to waste your time 😄

@yuqi-zhang
Copy link
Contributor

No waste 😄 just took a 5 minute look at the logs

@jkyros
Copy link
Contributor Author

jkyros commented Aug 3, 2022

Looked like aws/terraform issues the first time, trying again.
/test e2e-agnostic-upgrade

@jkyros jkyros changed the title [ WIP ] Allow overriding OSImageURL with a layered image Allow overriding OSImageURL with a layered image Aug 3, 2022
Comment on lines 136 to 139
if overriddenOSImageURL != "" && overriddenOSImageURL != osImageURL {

glog.Infof("OSImageURL has been overridden via machineconfig with %s (was: %s)", overriddenOSImageURL, osImageURL)
osImageURL = overriddenOSImageURL
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably where I'd safety check/feature gate it -- just prevent the merge to keep the old behavior.

If we aren't going to feature gate it, do we want a safety check in here that's:

  1. check to make sure it's a "new format" image
  2. only allow it to be overridden if we were able to check and it was a "new format" image?

So that no existing clusters would be impacted if they had, say, an old MachineConfig with an old image in it that previously would have been ignored?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, this does introduce the additional behaviour of having the controller pull the image as well (or at least a remote inspect), which isn't necessarily a bad thing, since if this errors, the daemon would likely have errored anyways, so we are surfacing this earlier in the upgrade pipeline.

To summarize our options then are:

  1. add a featuregate to enable osImageURL overrides in 4.12, and then remove this in future versions once this is more mature (does featuregates prevent upgrades at all?)
  2. add an additional image check in the controller with additional error
  3. no additional safety at all. the daemon will error if you have an override image that isn't in the correct format, which would only trip up users in one scenario (the user had a osImageurl override from a previous version incorrectly, but since we didn't error, didn't even notice)

I think... I would lean towards 2 personally, but don't have a strong opinion. Essentially 2 and 3 are the "same" in terms of what would fail, but 2 surfaces it earlier, and 1 would just be "make sure you absolutely know what you are doing" which I would also be ok with so long as we have a plan out of it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked on Monday and given:

  • the complexity of adding the safety
  • some of the container/registry library constraints (potentially having to parse/manage config.json entries ourselves in the controller, or writing secrets out to disk) and
  • the likelihood of someone actually having an old OSImageURL in their machineconfig

We opted not to put the safety in at this time, but we can later if we think we need to.

@jkyros
Copy link
Contributor Author

jkyros commented Aug 4, 2022

This only works with public images right now because I didn't do anything with pull secrets, so registry.ci.openshift.org/rhcos-devel/rhel-coreos:latest can't be used directly.

Minimally, letting ostree have at the kubelet config.json makes it work:

mkdir /run/ostree
ln -s /var/lib/kubelet/config.json /run/ostree/auth.json 

but there might be reasons we don't want to do that?

@cgwalters
Copy link
Member

ln -s /var/lib/kubelet/config.json /run/ostree/auth.json

Yep, this is what we should do by default.

(But, I think it does probably long term make sense to expose a knob for this, it's one thing that would naturally be done via the "host as pod" model)

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Looking generally good to me

Comment on lines 318 to 320
args := []string{"rebase", "--experimental", "ostree-unverified-registry:" + imgURL}

return runRpmOstree(args...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this simpler as

Suggested change
args := []string{"rebase", "--experimental", "ostree-unverified-registry:" + imgURL}
return runRpmOstree(args...)
return runRpmOstree("rebase", "--experimental", "ostree-unverified-registry:" + imgURL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, simplified. Thank you.

@@ -303,3 +348,23 @@ func runGetOut(command string, args ...string) ([]byte, error) {
}
return rawOut, nil
}

func (state *rpmOstreeState) getBootedDeployment() (*RpmOstreeDeployment, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Tangentially related to this I was kind of thinking about splitting out a github.com/coreos/rpmostree-client-go repository)

// Switch to real time kernel
if err := dn.switchKernel(oldConfig, newConfig); err != nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can't work yet either...I'm thinking of creating rhel-coreos-8-rt as a separate image, but some prep work needs to happen for that.

So I think for now we should just error out if we detect kernelType != "default"

}

// TODO(jkyros): This is where we will handle Joseph's extensions container
glog.Infof("Can't apply extensions, not supported yet")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we should error out instead if any are set.

Previously our machine-os-content images were just "wrappers" around
ostree commits. With our new "layering" work, we have a new image format
that is actually a working OCI container, but this type of image needs
to be applied differently by rpm-ostree.

This extends the RpmOstreeClient to support new layered images by:
- capturing more metadata from rpm-ostree status
- adding an additional rebase function specifically for the layered case
We changed the interface for RpmOstreeClient, so we need to make sure
the mocks are also up to date with the new signatures
@jkyros jkyros force-pushed the new-image-format-minimal branch 4 times, most recently from cd1b9d3 to 3bfc9b9 Compare August 10, 2022 08:24
}
// Make sure it's obvious in the logs that it was overridden
if overriddenOSImageURL != "" && overriddenOSImageURL != osImageURL {
glog.Infof("OSImageURL has been overridden via machineconfig with %s (was: %s)", overriddenOSImageURL, osImageURL)
Copy link
Contributor

@sinnykumari sinnykumari Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should also send an event here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the logging out a level so the name of the config was available to be logged, and also added an event.

pkg/daemon/update.go Outdated Show resolved Hide resolved
pkg/daemon/update.go Outdated Show resolved Hide resolved
Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor nits, otherwise this looks good to me. Nice work John.

Previously the MCO was arranged so that any user-supplied values for
OSImageURL were ignored, and only the value from controllerconfig was
used.

This allows OSImageURL to be specified in a machineconfig, using the
same precedence rules as everything else ( the last machineconfig with a
populated OSImageURL "wins" the merge). If there is no user-specified
OSImageURL (the field is blank in all machineconfigs) the MCO will fall
back to using the controllerconfig OSImageURL like it did previousy.

The purpose of this is for the user to be able to override the OS image
using a custom/hotfix/layered image instead of being forced to use the
one in the release payload.
Previously we were testing to make sure that OSImageURL overrides were
prevented, but now we want to allow them again.

This updates the test to make sure that OSImageURL gets merged properly
as part of a machineconfig merge.
Previously there were a bunch of if conditions deciding what work needed
to be done to the OS based on the mcDiff. We want to handle new format
images, but adding an additional branch adds additional complexity to
those ifs.

This breaks the old and new paths out into separate functions for
readability, ease of maintenance, and ease of retirement once we drop
the old format image.
There is a place in the daemon where the image can get updated as part
of the bootstrap flow. For completeness, we should make sure that it can
use a "new format" image, if it were to ever get one.

This makes the bootstrap flow check to see if the image it got is a new
format image, so it can apply it properly.
For public images, everything works great, rpm-ostree can pull them with
no problem, but when trying to use images that are in a private
repository and require a pull secret, rpm-ostree needs to have that pull
secret in order to use it.

ostree by default looks in `/run/ostree/auth.json` for pull secrets,
, and the pull secrets are already in `/var/lib/kubelet/config/json`,
 so this just links to them from `/run/ostree` so that ostree can access
them.
@jkyros
Copy link
Contributor Author

jkyros commented Aug 15, 2022

Fixed nits, added event per Sinny's request.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 15, 2022

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 15, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 15, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jkyros, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,jkyros,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 0d563c3 into openshift:master Aug 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants