pkg/rhcos/release: Extract RHCOS build from release image #1286
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
An additional benefit of this approach is that invalid pull-secrets will result in installer-host errors like:

For previous discussion of that issue, see #901. And this reminds me that I need to wire in the user-supplied pull-secret for these fetches ;).
I only skimmed this but it looks good so far! One thing I'd add to this...for release payloads let's be prepared to bake in the AMI metadata - don't fetch it over HTTP. It'd probably be just changing code around here to parse a constant JSON blob embedded in the code or so.
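A minimal sketch of what baking the metadata in could look like (the amis array shape follows the meta.json output quoted later in this thread, but the hvm field name and the AMI IDs are placeholder assumptions, not the real data):

```go
package rhcos

import (
	"encoding/json"
	"fmt"
)

// builtinAMIs would be pinned at release-cut time instead of fetched
// over HTTP. The hvm field name and the IDs below are illustrative only.
const builtinAMIs = `[
	{"name": "us-east-1", "hvm": "ami-placeholder1"},
	{"name": "us-west-2", "hvm": "ami-placeholder2"}
]`

type ami struct {
	Name string `json:"name"` // AWS region
	HVM  string `json:"hvm"`  // AMI ID for HVM virtualization
}

// AMI looks up the baked-in AMI ID for an AWS region.
func AMI(region string) (string, error) {
	var amis []ami
	if err := json.Unmarshal([]byte(builtinAMIs), &amis); err != nil {
		return "", err
	}
	for _, a := range amis {
		if a.Name == region {
			return a.HVM, nil
		}
	}
	return "", fmt.Errorf("no built-in AMI for region %q", region)
}
```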
/hold This change is too big to be included after freeze.
Fair enough, but let's at least switch to hardcoding the AMI data for releases to drop the dependency on the "RHCOS build metadata" server?

We'd still rely on that server for libvirt QCOW2 and whatever OpenStack uses, right? Can we put that JSON blob in the machine-os-content?
	return nil, nil, err
}
defer func() {
	if err != nil {
this is confusing.. which err are we trying to check...?
If FromUnparsedImage (or the coming signature validation) fails, I want to be sure we close imageSource. So this closure looks at the err that will be returned by pull, and, if it's not nil, closes imageSource for us. That way pull callers can just return the error without having to worry about checking imageSource for nil-ness and possibly closing it themselves; they can just return pull's err.
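For reference, the general shape of that pattern (a minimal sketch against the public c/image API; the function body in this PR differs):

```go
package release

import (
	"context"

	"github.com/containers/image/image"
	"github.com/containers/image/types"
)

// pull opens an image source and parses the image, making sure the
// source gets closed on any failure path. The deferred closure watches
// the same err variable that every return statement assigns, so callers
// can simply propagate pull's error without closing imageSource
// themselves.
func pull(ctx context.Context, ref types.ImageReference, sys *types.SystemContext) (types.Image, types.ImageSource, error) {
	imageSource, err := ref.NewImageSource(ctx, sys)
	if err != nil {
		return nil, nil, err
	}
	defer func() {
		if err != nil { // err as set by whichever return ran
			imageSource.Close()
		}
	}()

	img, err := image.FromUnparsedImage(ctx, sys, image.UnparsedInstance(imageSource, nil))
	if err != nil {
		return nil, nil, err // the deferred closure closes imageSource
	}
	return img, imageSource, nil
}
```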
… image A temporary workaround while [1] is in flight. [1]: openshift#1286
Force-pushed from e189de1 to c1d35cc.
Force-pushed from 3a99f85 to 327976e.
(Just a few comments on the use of c/image API.)
defer blob.Close()

// fmt.Println(layer.MediaType) // TODO: blank from quay? Just assume this is a gzipped tar?
gzipReader, err := gzip.NewReader(blob)
Consider c/image/pkg/compression.DetectCompression. Yes, there is a MIME type in the manifest, but historically it's not been used to differentiate between compressed/uncompressed in practice.
> Yes, there is a MIME type in the manifest...

Fix the broken servers! ;) In practice, these pulls are going to be from release images that we cut with oc adm release new ... and the layer is going to be gzipped, but sure, I can reroll to use peek-inside detection.
Sadly, it’s not just the servers; the schema2 spec does not officially specify an “uncompressed tar” MIME type, that’s new in OCI (although there is a de-facto schema2 definition in containerd).
Actually, …/compression.AutoDecompress would probably be even more convenient.
I agree that in practice pretty much every image is compressed, and this may not be worth worrying about. Only noting that this exists because I saw that TODO.
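A sketch of what the AutoDecompress variant might look like (assuming the c/image/pkg/compression API described above; the helper name is invented for illustration):

```go
package release

import (
	"archive/tar"
	"io"

	"github.com/containers/image/pkg/compression"
)

// untarLayer wraps a layer blob in a tar reader, sniffing the stream so
// gzip (or a future algorithm like zstd) is handled transparently
// instead of hard-coding gzip.NewReader.
func untarLayer(blob io.Reader) (*tar.Reader, io.Closer, error) {
	decompressed, _, err := compression.AutoDecompress(blob)
	if err != nil {
		return nil, nil, err
	}
	// The caller must Close the returned io.Closer after reading.
	return tar.NewReader(decompressed), decompressed, nil
}
```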
FWIW there is work ongoing to add new compression types like Zstd, so hard-coding gzip here will eventually break.
Since e2b31b2 (bootkube: Supply machine-os-content to MCO, 2019-01-29, openshift#1149), we have been using the machine-os-content image to seed the machine-config operator. With this commit, use the RHCOS build ID from that image's annotations to calculate our AMI, etc. as well. This gives one less degree of freedom for breaking things ;). Users who want to test clusters based on a different RHCOS build should bump the value in their release image, just like users testing operator updates and other changes.

The new pkg/asset/release subpackage allows users to continue using pkg/rhcos without pulling in all of containers/image as a dependency.

The pull-secret handling is a bit of a hack, leaning on the fact that users are likely providing clean secrets from [1]. Hopefully soon containers/image will grow an API for injecting in-memory bytes into their more-robust Docker-auth-config parser, but my attempt at that [2] is unlikely to land in the next few days, so I've kludged together a minimal implementation here.

[1]: https://cloud.openshift.com/clusters/install#pull-secret
[2]: containers/image#588
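The heart of the extraction is just reading a label from the machine-os-content image's config. A rough sketch (the version label name matches the oc image info output quoted later in the thread; the function name and error text are illustrative):

```go
package release

import (
	"context"
	"fmt"

	"github.com/containers/image/types"
)

// rhcosBuild pulls the RHCOS build ID out of the image's config labels.
// img is the parsed image, e.g. the result of image.FromUnparsedImage.
func rhcosBuild(ctx context.Context, img types.Image) (string, error) {
	info, err := img.Inspect(ctx)
	if err != nil {
		return "", err
	}
	build, ok := info.Labels["version"]
	if !ok {
		return "", fmt.Errorf("machine-os-content image has no version label")
	}
	return build, nil // e.g. "47.330"
}
```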
And zounds of other dependencies. Use Gopkg.toml's 'ignored' [1] to exclude BoltDB, which comes in via:

github.com/containers/image/pkg/blobinfocache
github.com/boltdb/bolt

and github.com/docker/distribution/registry/storage/cache, which comes in via:

github.com/containers/image/docker
github.com/docker/distribution/registry/client
github.com/docker/distribution/registry/storage/cache

because we don't use either the storage backend or the BoltDB blob-info cache, but dep isn't checking at that level of granularity. Ideally the upstream repositories would restructure to split these out into separate packages (and/or dep will grow support for pruning by build tag [2]), but until then, we can manually prune via 'ignored'.

Also ignore mtrmac/gpgme, because we only need signature verification and we don't want to use CGO (because we want to be portable to other operating systems, and we only need verification support, not signing support [3]).

There may be more fat we can prune as well, but I got tired of looking and gave up, even though ~50k lines of new code is pretty embarrassing for what is effectively just a handful of HTTPS requests and an OpenPGP check.

Generated (after manually editing Gopkg.toml) with:

$ dep ensure

using:

$ dep version
dep:
 version     : v0.5.0-31-g73b3afe
 build date  : 2019-02-08
 git hash    : 73b3afe
 go version  : go1.10.3
 go compiler : gc
 platform    : linux/amd64
 features    : ImportDuringSolve=false

[1]: https://golang.github.io/dep/docs/Gopkg.toml.html#ignored
[2]: golang/dep#291
[3]: containers/image#268
Force-pushed from c6be68c to c41e524.
I've pushed c6be68c -> c41e524, rebasing onto master and dropping the runc ignore, because we apparently do need runc's homedir library.
Clayton pushed release:4.0.0-0.nightly-2019-03-04-234414 to quay.io/openshift-release-dev/ocp-release:4.0.0-0.7. Extracting the associated RHCOS build:

$ oc adm release info --pullspecs quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 | grep machine-os-content
  machine-os-content quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 | grep version
  version=47.330

I'd like to be pulling that version out of the release image at install-time [1], but that PR isn't green yet, so hard-coding here is a quick fix.

[1]: openshift#1286
This is the most recent RHCOS:

$ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/builds.json | jq '{"latest": .builds[0], timestamp}'
{
  "latest": "400.7.20190306.0",
  "timestamp": "2019-03-06T22:24:53Z"
}

I'd prefer to pull this from the release image:

$ oc adm release info --pullspecs quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 | grep machine-os-content
  machine-os-content quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 | grep version
  version=47.330

But [1] isn't green yet, so hard-coding here is a quick fix.

Supported regions:

$ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/400.7.20190306.0/meta.json | jq -r '.amis[] | .name'
ap-northeast-1
ap-northeast-2
ap-south-1
ap-southeast-1
ap-southeast-2
ca-central-1
eu-central-1
eu-west-1
eu-west-2
eu-west-3
sa-east-1
us-east-1
us-east-2
us-west-1
us-west-2

[1]: openshift#1286
@wking: PR needs rebase.
I think this PR is a good idea - it seems like we made an architectural decision not to do it right now, though. One thing I would say is that it seems quite likely at some point we'll make a stronger push for "cluster managed bootimages" - https://url.corp.redhat.com/6f27cd4 "Lifecycle of RHCOS image source". If we start to implement that more ambitious path, it may potentially involve something like having the bootstrap node set up the bootimage (i.e. the bootimage for the bootstrap would be different from master/worker). That may lead us down a different architectural path.
One potential blocker here is that going forward it's not clear to me that we'll ship new bootimages for every release. We also know that we cannot support updating the bootimage in every environment right now; we need to deal with pivoting. The argument is that it's better to always pivot so it's tested. These two combined say that even if we extract the RHCOS metadata from the release image, we should support a separate "bootimage" annotation or so.
That's fine, just don't bump the embedded bootimage labels.
I think prioritizing RHCOS-build-speed (affects devs) over cluster-install-speed (affects users) is slanting things the wrong way ;).
We definitely want this to work. But surely we can get sufficient soak time on pivots in CI without pushing them onto users too?
Yeah, and since AMI IDs, etc., are only used for boot images and …
That's currently true, but we are thinking about changing it - it would speed up the builds rather dramatically, particularly since we keep adding new bootimage types.

> I think prioritizing RHCOS-build-speed (affects devs) over cluster-install-speed (affects users) is slanting things the wrong way ;).

But don't forget that faster builds have compounding returns :)

faster builds -> shorter iteration times -> quicker bug fixes -> less time blocked on bugs -> more time on features and improvements (incl -> faster builds)

faster builds -> shorter iteration times -> faster feature development -> more user-impacting improvements
Right, and if you just want to iterate on the post-pivot OSTree, you can do that now by skipping AMI, etc. builds and just tweaking …
It's not just about speeding up builds - let's say image uploads for 1 region in a public cloud like AWS/Azure are broken - do we stall everything on that? No, we shouldn't need to.
layers := image.LayerInfos()
for i := len(layers) - 1; i >= 0; i-- {
	layer := layers[i]
	blob, _, err := imageSource.GetBlob(ctx, layer, blobinfocache.NoCache)
If this is expected to enforce/verify signatures (as the code does by calling IsRunningImageAllowed in pull), this code must enforce that the blob stream matches the digest in layer.Digest; otherwise the registry could substitute a malicious blob not matching the manifest.

(IsRunningImageAllowed can't do that for you: it would have to download the blobs, verify them, and cache them somewhere, so that later calls to GetBlob can read the cached version to ensure it has not been changed; the layering/responsibilities are allocated the opposite way, which allows the caller to stream the data directly with no storage requirements, as long as it handles the verification.

GetBlob also does not do digest verification; without, again, requiring extra storage space to cache the data, it could in principle return an error at EOF, but that would only help consumers that don't consume the data as it streams, unlike the way this code does. Also, GetBlob is implemented in all the different transports, and having every one of them copy&paste code to verify digests would be hard to maintain and enforce; centralizing this in the consumer, typically c/image/copy.Image, is more practical.)
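A sketch of the consumer-side check being described, using go-digest's Verifier so the blob is validated as it streams through (placeholder wiring with an invented helper name, not the code from this PR):

```go
package release

import (
	"fmt"
	"io"

	digest "github.com/opencontainers/go-digest"
)

// verifyingReader tees the blob through a digest verifier. Read the
// returned reader to EOF, then call check() to confirm the stream
// matched the manifest's digest; no extra storage is required.
func verifyingReader(blob io.Reader, expected digest.Digest) (io.Reader, func() error) {
	verifier := expected.Verifier()
	check := func() error {
		if !verifier.Verified() {
			return fmt.Errorf("blob does not match manifest digest %s", expected)
		}
		return nil
	}
	return io.TeeReader(blob, verifier), check
}
```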
@wking: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard.
Closing due to inactivity. Please reopen if needed. /close
@abhinavdahiya: Closed this PR. In response to this:
Since e2b31b2 (#1149), we have been using the machine-os-content image to seed the machine-config operator. With this commit, use the RHCOS build ID from that image's annotations to calculate our AMI, etc. as well. This gives one less degree of freedom for breaking things ;). Users who want to test clusters based on a different RHCOS build should bump the value in their update payload, just like users testing operator updates and other changes.
The vendor commit is depressingly large, but I don't think there's much we can do about that from this repository. See the e189de1 commit message for notes.
This is a WIP until I look into signature verification.
I'm not sure where this stands with openshift/origin#21998 deferred, but I was looking into ways to simplify the release process and this was one of them. CC @cgwalters
Fixes #987.