
legacy backfill attempt #2 #632

Merged: 2 commits into kubernetes:master on Mar 5, 2020
Conversation

@listx (Contributor) commented Mar 5, 2020

Linus Arver added 2 commits March 4, 2020 19:29
This reverts commit 0ec285c.

This is the second time we are trying to backfill with legacy images.
The first attempt failed, as detailed here:
kubernetes/kubernetes#88553
These images were pushed in the last 24 hours. I expect such changes to
continue until gcr.io/google-containers becomes frozen as read-only.
@k8s-ci-robot k8s-ci-robot requested review from dims and thockin March 5, 2020 03:32
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/k8s-infra labels Mar 5, 2020
@listx (Contributor, author) commented Mar 5, 2020

There are 2 commits this time --- the original PR (revert of #629), as well as a second commit to capture the delta of google-containers between yesterday and today.

@thockin (Member) left a comment

What are the untagged ones about?

/lgtm
/approve
/hold

unhold at your discretion.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 5, 2020
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: listx, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2020
@listx (Contributor, author) commented Mar 5, 2020

> What are the untagged ones about?

They should be the architecture-specific images, which are not tagged. For example, at https://console.cloud.google.com/gcr/images/google-containers/GLOBAL/debian-iptables@sha256:25c4396386a2d3f2f4785da473d3428bc542a6f774feee830e8b6bc6b053d11b/details?tab=info, the digest 1e531753f00f944e1f9d8b0ce63cf85b18afbe8e4d4608f5af974759ecb27ad1 (one of the untagged images) is the amd64 child of the tagged manifest list 25c4396386a2d3f2f4785da473d3428bc542a6f774feee830e8b6bc6b053d11b.

I recall a conversation with @detiber and others where it was desirable to not tag individual architecture images and only tag the manifest lists.
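
For reference, the manifest list described above has roughly this shape (a trimmed Docker v2.2 manifest list; the arm64 digest below is fabricated for the example):

```python
import json

# A trimmed Docker v2.2 manifest list, of the kind GCR serves for the tagged
# debian-iptables image above. The arm64 digest is a placeholder, not real.
manifest_list = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:1e531753f00f944e1f9d8b0ce63cf85b18afbe8e4d4608f5af974759ecb27ad1",
      "platform": {"architecture": "amd64", "os": "linux"}
    },
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
      "platform": {"architecture": "arm64", "os": "linux"}
    }
  ]
}
""")

# Each child manifest is addressable only by digest; none of them carries a
# tag, which is exactly why they show up as "untagged" images in GCR.
children = {m["platform"]["architecture"]: m["digest"]
            for m in manifest_list["manifests"]}
print(children["amd64"])
```

Only the manifest list itself carries the tag; pulling the per-arch children requires their sha256 digests.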

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2020
@k8s-ci-robot k8s-ci-robot merged commit dc6ff32 into kubernetes:master Mar 5, 2020
@thockin (Member) commented Mar 5, 2020 via email

@detiber (Member) commented Mar 5, 2020

> When pushing a manifest-list image (multi-arch) do we have to manually add all of the untagged per-arch images to the promoter manifest, or is it smart enough to drag those in as deps? I have never specified anything but the manifest-list itself on the internal promoter.

It handles the specified dependencies automatically; we are doing this with the Cluster API images today :)

@detiber (Member) commented Mar 5, 2020

That said, if you wanted the arch/platform-specific images to be pullable by other tag(s), they would need to be specified separately; if only the manifest list image is specified, the other images are only available via their sha256 digests.

@thockin (Member) commented Mar 5, 2020 via email

@detiber (Member) commented Mar 5, 2020

> IMO that's what we want - we should normalize on manifest lists and curtail per-arch tags.

+1 from me, I just wanted to make sure that I had clarified my earlier statement in case someone assumed otherwise.

@thockin (Member) commented Mar 5, 2020 via email

@listx (Contributor, author) commented Mar 5, 2020

> So for the purpose of 1:1 backfill, untagged sha256s are OK, but I think we generally want all "living" manifests to only have tagged images, ideally manifest lists. Does that sound right @listx ?

Yes.

That being said, I want tooling to automatically (and correctly) inject images into the correct promoter manifest, without asking contributors to edit the YAML by hand. Maybe having untagged images recorded in the YAML makes sense (to track everything) --- but we'll cross that bridge when we get there.

I will now post some numbers for the backfill. As expected, the backfill on its own did not finish within the 2hr prow job timeout. We are now sitting at roughly 7k of the 30k images copied in the EU region (US and ASIA show similar numbers) --- for 4 hours worth of running the promoter (this is attempt 2 of 2). If we consider that it's a 3-region thing, we actually copied 21k images in 4 hours, or about ~5k images per hour. We have about 90k - 21k or ~70k images left to copy. This means we have to run the promoter at least another ~13hrs, or about 6 or 7 more 2-hr invocations. We are now running the promoter every 4 hours, so it will take about another 24 hours or so, assuming no additional PRs, before the backfill is complete.
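
The arithmetic in the paragraph above can be double-checked with a quick sketch (the inputs are the rounded figures from this comment, so the outputs are estimates, not exact counts):

```python
# Back-of-the-envelope backfill math, using the rounded figures above.
regions = 3
copied_per_region = 7_000    # images copied so far in each region (EU/US/ASIA)
total_per_region = 30_000    # images to copy per region
hours_elapsed = 4            # promoter runtime so far (attempt 2 of 2)

copied = copied_per_region * regions        # images done across all regions
total = total_per_region * regions          # overall backfill size
rate = copied / hours_elapsed               # images copied per hour
remaining = total - copied                  # images left to copy
hours_left = remaining / rate               # promoter-hours still needed

print(copied, remaining, round(hours_left, 1))  # → 21000 69000 13.1
```

With 2-hour invocations running every 4 hours, ~13 promoter-hours is where the "about another 24 hours" wall-clock estimate comes from.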

Meanwhile, I have not observed any TRANSACTION REJECTED messages in the logs. Curiously, the older messages that were getting rejected have not resurfaced since ~18 hrs ago. It would appear that they were dropped by Pub/Sub because it's been over 7 days since they were originally published --- I see similar language in the Resource limits docs here. I think this is fine --- in a real-world situation we would have 7 days to respond to alerts, which is plenty of time. In other words, @thockin there is no need to purge the Pub/Sub messages for the dummy images that were pushed/deleted last week because they've been purged already by the system.

listx pushed a commit to listx/test-infra that referenced this pull request Mar 5, 2020
We are trying to run the backfill [1] mostly continuously to minimize
the promoter's downtime for new promotions.

Even without the backfill, we would still want this to run more
frequently as a sanity check anyway.

[1]: kubernetes/k8s.io#632
@listx (Contributor, author) commented Mar 6, 2020

Now that kubernetes/test-infra#16640 has merged, the periodic job is running every hour. It is running right now. The cool thing I noticed is that Cloud Run is autoscaling automatically to handle all requests. I've done some grepping through the logs from the UI (because gcloud logging read ... does not work for me for some reason...) and see that there are 19 instances of the cip-auditor chugging along to process all the incoming Pub/Sub messages. All of these are TRANSACTION VERIFIED as expected (there are no TRANSACTION REJECTED logs for the past 3 hours).

@listx (Contributor, author) commented Mar 6, 2020

To get a live view of the backfill doing its thing, check out https://prow.k8s.io/?job=ci-k8sio-cip. I expect these jobs to continuously fail for the next several hours going into the evening (each picking up where the last one left off), until the backfill is complete.

@listx (Contributor, author) commented Mar 6, 2020

Strange, I have detected an invalid push of an image:

(a4d11200-0544-425d-b007-368581514406) TRANSACTION REJECTED: could not determine source registry: &{INSERT asia.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-server@sha256:101863ca4affcb2121b785ab34f87321a3f954dd2146ae7069eea87ce1bd7059 }

Digging deeper, this should not have raised an alarm because the image is part of a valid promotion here: #633

This is the case of a fat manifest. I'll work on fixing this bug tomorrow.

The good news is that we can still manually verify all images after the backfill is done, by comparing the totality of images in the prod repos vs what we have defined in the promoter manifests.
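
For illustration only, the kind of reference parsing the auditor has to get right can be sketched in Python. The real promoter/auditor is written in Go, and this regex is an assumption for the sketch, not its actual implementation:

```python
import re

# Parse a digest-pinned image reference into (registry, path, digest).
# Hypothetical sketch of the parsing the auditor must perform; the child
# image of a fat manifest arrives in exactly this digest-only, tagless shape.
REF_RE = re.compile(
    r"^(?P<registry>[^/]+)/(?P<path>.+)@(?P<digest>sha256:[0-9a-f]{64})$"
)

ref = ("asia.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-server"
       "@sha256:101863ca4affcb2121b785ab34f87321a3f954dd2146ae7069eea87ce1bd7059")

m = REF_RE.match(ref)
print(m.group("registry"))  # → asia.gcr.io
print(m.group("path"))      # → k8s-artifacts-prod/kas-network-proxy/proxy-server
```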

@listx (Contributor, author) commented Mar 7, 2020

OK so the backfill finished with this Prow job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-k8sio-cip/1235800257359515648 at 2020-03-06 07:12:01 UTC (basically ~11PM PST on March 5).

Ever since then, the same job (which runs periodically: https://prow.k8s.io/?job=ci-k8sio-cip) has finished successfully.

I have noticed a bug with the auditor in parsing certain references. I will work on fixing this today.

After this is fixed, I have to:

(1) promote this image to prod
(2) deploy it as the next Cloud Run revision

Meanwhile, I'm doing a manual verification of what is in GCR (k8s-artifacts-prod) vs. what we have defined in our promoter manifests.

@listx (Contributor, author) commented Mar 7, 2020

I've finished manually verifying the images. If I run the following script from the promoter's root repo:

#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

t=$(date +'%Y-%m-%d--%H-%M-%S')

d="snapshots/$t"
mkdir -p "$d"

# Snapshot the three production GCR regions directly.
for region in asia eu us; do
  bazel run \
      --workspace_status_command="$PWD/workspace_status.sh" \
      --host_force_python=PY2 //:cip -- \
      -snapshot="${region}.gcr.io/k8s-artifacts-prod" -minimal-snapshot \
      -output-format=CSV > "$d/${region}.csv" &
done

# Snapshot what the promoter manifests on disk say each region should hold.
for region in asia eu us; do
  bazel run \
      --workspace_status_command="$PWD/workspace_status.sh" \
      --host_force_python=PY2 //:cip -- \
      -manifest-based-snapshot-of="${region}.gcr.io/k8s-artifacts-prod" -minimal-snapshot \
      -thin-manifest-dir="$HOME/go/src/github.com/kubernetes/k8s.io/k8s.gcr.io" \
      -output-format=CSV > "$d/${region}.disk.csv" &
done

# Wait for all six backgrounded snapshots to finish.
wait

I essentially get identical outputs for all 6 snapshots (3 from the 3 GCR regions, the other 3 from the aggregated promoter manifest YAMLs on disk). I've opened a PR to make the promoter manifests match what's in production: #642

NOTE: It's important to pass in the -minimal-snapshot flag so that we don't compare the child manifests of untagged manifest-list images.
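
The comparison step itself can be sketched in Python. The helper names here are made up, and treating each CSV row as an opaque tuple is an assumption about cip's CSV output (set comparison works regardless of the exact column layout):

```python
import csv

def load_snapshot(path):
    # Read one cip CSV snapshot into a set of rows for order-insensitive
    # comparison. Rows are kept as opaque tuples.
    with open(path, newline="") as f:
        return {tuple(row) for row in csv.reader(f)}

def compare(live_csv, disk_csv):
    """Return (rows only in prod, rows only in the on-disk manifests)."""
    live = load_snapshot(live_csv)
    disk = load_snapshot(disk_csv)
    return live - disk, disk - live
```

Two empty sets for each region (e.g. comparing asia.csv against asia.disk.csv) means production and the promoter manifests agree.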

@listx (Contributor, author) commented Mar 9, 2020

I'm still working on fixing kubernetes-sigs/promo-tools#191. It will take a little effort, but I should be able to fix it today.

@bartsmykla (Contributor) commented

Thanks for the update @listx! :-)

@listx (Contributor, author) commented Mar 11, 2020

> I'm still working on fixing kubernetes-sigs/k8s-container-image-promoter#191. It will take a little effort, but I should be able to fix it today.

I was overly optimistic about this; more likely timeline is today/tomorrow.

@listx (Contributor, author) commented Mar 14, 2020

Update on #191 here: kubernetes-sigs/promo-tools#191 (comment)
