Sunset for k8s.gcr.io repository #4872

Closed
dims opened this issue Mar 4, 2023 · 47 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dims (Member) commented Mar 4, 2023

Here are the community blogs and announcements so far around k8s.gcr.io

However, we are finding that the numbers don't add up: we will exhaust the GCP cloud credits that make up our budget well before Dec 31, 2023. So we need to do something more drastic than just the freeze. Please see the thread in #sig-k8s-infra:
https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793138667629?thread_ts=1677709804.935919&cid=CCK68P2Q2

We will need to start by enumerating some of the images that carry the biggest cost (storage + network) and removing them from k8s.gcr.io right away (possibly by the freeze date, April 3rd). Some data is in the thread, but we will need to revisit the logs, come up with a clear set of images based on some criteria, and announce their deletion as well. Note that this specific set of images will still be available in the new registry, registry.k8s.io, so folks will have to fix their Kubernetes manifests / Helm charts etc. as we mentioned in the 3 URLs above.
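For a plain YAML manifest, the fix can be as small as rewriting the registry host. A minimal sketch, assuming GNU or BSD sed and a hypothetical file name (Helm charts would need their values updated instead):

```bash
# Rewrite the old registry host in place, keeping a backup of the original.
# deployment.yaml is a placeholder for whatever manifest references the image.
sed -i.bak 's#k8s.gcr.io#registry.k8s.io#g' deployment.yaml
```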

Thoughts on a deadline for deletion of k8s.gcr.io:
Since the freeze is on April 3rd, 2023 (10 days before 1.27 is released) and we expect to send comms out at KubeCon EU (April 18–21), how about we put the marker at the end of June? (That gets us six months of cost savings.)

Risk: we will end up interrupting clusters that are working right now. Given the traffic patterns, a bunch of these will be in AWS, but this is very likely to affect anyone who has an older working cluster that they haven't touched in a while.

What I have enumerated above is just the beginning of the discussion. Please feel free to add your thoughts below, so we can then draft a KEP around it.

@dims (Member, Author) commented Mar 4, 2023

cc @kubernetes/sig-architecture-leads @kubernetes/sig-release-leads

@enj (Member) commented Mar 4, 2023

@dims can we start by having brownouts of the old registry (they should start immediately)?

@sftim (Contributor) commented Mar 5, 2023

Let's aim to very clearly communicate a recommended approach (e.g., mirror the images that you depend on, or use a pull-through cache, or...) and consider the lead time on those comms when we pick a date.
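As one hedged illustration of the "mirror what you depend on" option, using crane from go-containerregistry (the destination registry below is hypothetical):

```bash
# Copy an upstream image into a registry you control, so your clusters
# no longer depend on the upstream host at pull time.
crane copy registry.k8s.io/pause:3.9 registry.example.com/mirror/pause:3.9
```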

The comms plan does not have to be perfect, it just has to be good enough.

@dims (Member, Author) commented Mar 5, 2023

@sftim agree. Recommended approach, so far:

@dims (Member, Author) commented Mar 5, 2023

@dims can we start by having brownouts of the old registry (they should start immediately)?

@enj yep, agree. The brownout we had in mind was as Arnaud mentioned here:
https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793564552829?thread_ts=1677709804.935919&cid=CCK68P2Q2

@enj (Member) commented Mar 5, 2023

@enj yep, agree. The brownout we had in mind was as Arnaud mentioned here: https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793564552829?thread_ts=1677709804.935919&cid=CCK68P2Q2

@dims I suppose deleting images is one form of brownout... I was thinking more that we could have the old registry return 429 errors every day at noon for a few hours. The transient service disruption will get people's attention.

@dims (Member, Author) commented Mar 5, 2023

@enj k8s.gcr.io is GCR-based and has only a few folks left to take care of it. Last year some helpful folks tried to set up automatic redirects from k8s.gcr.io to registry.k8s.io for a small portion of traffic and ran into snags, so we can't do much over there other than delete images.

Details are in this thread: https://kubernetes.slack.com/archives/CCK68P2Q2/p1666725317568709

@enj (Member) commented Mar 5, 2023

@dims makes sense. One suggestion that may also not be implementable would be to temporarily delete and then recreate image tags to cause pull failures (another form of brownout).

@dims (Member, Author) commented Mar 5, 2023

Year to date GCP Billing data, please see here:
GCP_Billing_Report-year-to-date.pdf

($682,683.81 year-to-date / 62 days from Jan 1 to Mar 4) × 365 = $4,019,025.65 per year (our budget in credits is ~$3M).
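For reference, the projection is a straight-line annualization of the year-to-date spend; checking the arithmetic in a shell:

```bash
# (YTD spend / days elapsed) * 365, using the 62-day window above
echo "scale=6; 682683.81 / 62 * 365" | bc
# => 4019025.655445  (vs. roughly $3M in credits)
```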

@sftim (Contributor) commented Mar 6, 2023 (edited)

One option we have is to actually delete some images - and then optionally reinstate them per #4872 (comment). A 429 is subject to Google's say-so, but deleting an image is something we can Just Do™, so long as the comms are in place to explain why.

@dims (Member, Author) commented Mar 6, 2023

@sftim yes, we will have a limited set of images that we will delete ASAP (and will NOT reinstate)! @hh and folks are coming up with the high-traffic / costly image list as the first step. Our comms will depend on what's in that list.

@dims (Member, Author) commented Mar 6, 2023

xref: #4738

@dims (Member, Author) commented Mar 6, 2023

An energetic discussion with @thockin here: https://kubernetes.slack.com/archives/CCK68P2Q2/p1678118252030639

  • Weighing risk of project shutdown vs k8s.gcr.io shutdown
  • Input of GCR customers in what happens here
  • Risks of breaking old versions of k8s clusters that seem to contribute highly to the cost
  • What is the right amount of push needed for us to influence change
  • Will we ever be able to get rid of this old repository?
  • Are we able to say as a project that folks who run their clusters in production should have their own repositories?

@BenTheElder (Member) commented:

I think we can do broad brownouts ahead of any final sunset by toggling the access controls on the 3 backing GCR instances. To make the images publicly readable, we set the backing GCS bucket to grant read permission to allUsers; we could probably invert that and put it back on a schedule, gradually increasing the period of total non-availability.
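A rough sketch of what that toggle might look like with gsutil; the bucket name below is a placeholder, not the project's actual backing bucket:

```bash
# Brownout: revoke anonymous read on the GCS bucket backing a GCR instance
gsutil iam ch -d allUsers:objectViewer gs://artifacts.example-project.appspot.com

# End of brownout window: restore public read access
gsutil iam ch allUsers:objectViewer gs://artifacts.example-project.appspot.com
```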

Doing this is a big deal, and I'm not sure what the time frame should be. We know that users are very slow to migrate, and that doing this will disrupt their base "cloud-native" infrastructure. (E.g., I saw some recent data that Kubernetes 1.11, from 2018, is still reasonably popular!)

@dims (Member, Author) commented Mar 7, 2023

Some data from @justinsb:

[two charts: registry traffic data]

@dims (Member, Author) commented Mar 7, 2023

Some good discussion with @TheFoxAtWork here: https://cloud-native.slack.com/archives/CSCPTLTPE/p1678219030800149 in the #tag-chairs channel on CNCF Slack

This will likely break a lot of clusters and organizations, but it is certainly a good wake-up call to the world that even open source has its costs. I know this is drastic, but we've broken the internet before; this one at least is better coordinated, with plenty of advance warning. We can't reach everyone personally, so we do our best with the time and energy we have available to us as open source volunteers and community members. Side note: eliminating older versions and forcing upgrades is a huge global security uplift.

I would also recommend (though this is likely already done) to work with the Ambassadors, Marketing Team, and other Foundations.

@TheFoxAtWork commented:

@dims I want to confirm what I'm looking at from the chart (I understand there is a new one in the works): can you confirm that each colored bar is who/what is primarily requesting the images? If so, has AWS/Amazon been engaged to redirect requests they field to registry.k8s.io? Have we done this with other cloud providers? (I know I'm late to the party, trying to understand what has already been completed.)

@chris-short (Contributor) commented:

@dims @rothgar and I are engaging folks on the AWS side.

@dims (Member, Author) commented Mar 8, 2023

@TheFoxAtWork yep, there has been a bunch of back and forth.

@chris-short (Contributor) commented:

Has anyone pinged Microsoft? I don't know where Azure stands at the moment.

@dims (Member, Author) commented Mar 8, 2023

A single-line kubectl command to find images from the old registry (see the sketch after this list):

A Kyverno and Gatekeeper policy to help folks!

A kubectl/krew plugin:
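The exact command isn't reproduced above, but a hedged sketch along those lines (it only inspects regular containers; initContainers would need a similar query):

```bash
# List every image in use across the cluster and count the ones
# still being pulled from the old registry
kubectl get pods --all-namespaces \
  -o jsonpath="{.items[*].spec.containers[*].image}" \
  | tr -s '[[:space:]]' '\n' | sort | uniq -c | grep 'k8s.gcr.io'
```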

@dims (Member, Author) commented Mar 8, 2023

FAQ(s) we are getting asked:

@TheFoxAtWork commented:

Attempted to pull a lot of the details from this ticket into a single LinkedIn post for sharing, in case it helps: https://www.linkedin.com/posts/themoxiefox_action-required-update-references-from-activity-7039245748525256704-IrES

@dims (Member, Author) commented Mar 8, 2023

Some good news from @BenTheElder here - https://kubernetes.slack.com/archives/CCK68P2Q2/p1678299674725429

[screenshot]

@chris-short (Contributor) commented:

AWS just posted a bulletin on its Stack Overflow collective: https://stackoverflow.com/collectives/aws/bulletins/75676424/important-kubernetes-registry-changes

@chris-short (Contributor) commented:

I chatted with @jeremyrickard at Microsoft. They are all over this.

@dims (Member, Author) commented Mar 9, 2023

Question: when the new k8s.gcr.io->registry.k8s.io redirection takes effect, what is likely to fail?

  • folks using old versions of kubelet/docker/containerd are likely to see problems; newer kubelet/containerd have better back-off and retry in place, so they will fare better
  • folks with non-typical network configurations needing explicit whitelisting of URL(s) etc. are likely to get hit by the HTTP redirection (to the new S3 buckets)

@thockin (Member) commented Mar 9, 2023

Touching on the topic of network-level firewalls or other things causing impact:

This is fairly easily tested: run a pod which uses a "registry.k8s.io" image in your cluster(s). If it is able to pull that image, you're almost certainly OK. If not, debug now, before the redirect goes live (next week, we hope).
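A minimal version of that test, assuming a cluster you can create pods in (the image tag is just an example):

```bash
# Launch a throwaway pod from the new registry and wait for it to come up
kubectl run registry-test --image=registry.k8s.io/pause:3.9 --restart=Never
kubectl wait --for=condition=Ready pod/registry-test --timeout=120s

# Clean up
kubectl delete pod registry-test
```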

@recollir commented Mar 9, 2023

How will the redirect work? Just at the DNS level? I have tried this locally myself, but containerd/Docker, obviously and for the right reasons, complains about the certificate mismatch between k8s.gcr.io and registry.k8s.io. I solved it by downloading the ca.crt and installing it locally for containerd/Docker.

@tuapuikia commented:

Some good news from @BenTheElder here - https://kubernetes.slack.com/archives/CCK68P2Q2/p1678299674725429

[screenshot]

Do we have enough bandwidth on registry.k8s.io?

@BenTheElder (Member) commented:

How will the redirect work? Just on DNS level? I have tried this locally myself, but containerd/Docker, obviously and for the right reasons, complains about certificate mismatch between k8s.gcr.io and registry.k8s.io. I solved it then by downloading the ca.crt and installing it locally for containerd/Docker.

HTTP 3XX redirect, not DNS. No cert changes.

You can test by taking any image you would pull and substituting registry.k8s.io instead of k8s.gcr.io. All images in k8s.gcr.io are in registry.k8s.io.
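For example (the tag is illustrative):

```bash
# If your manifests normally pull this...
docker pull k8s.gcr.io/kube-apiserver:v1.26.2
# ...the identical image is available under the new host:
docker pull registry.k8s.io/kube-apiserver:v1.26.2
```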

The only difference between doing this test and the redirect will be your client reaching k8s.gcr.io first and then following the redirect, but presumably k8s.gcr.io was already reachable for you if you're switching, and all production-grade registry clients follow HTTP redirects.

The same existing GCR endpoint will serve the redirect instead of the usual response. Existing GCR image pulls already involve redirects to backing storage, just not redirects to registry.k8s.io.

Do we have enough bandwidth on registry.k8s.io?

We should have more than enough capacity on https://registry.k8s.io; we've looked at traffic levels for k8s.gcr.io and planned accordingly.
We aren't hitting bandwidth limits on GCR either; the problem is the impractical cost of serving ever-increasing cross-cloud bandwidth.

registry.k8s.io gives us the ability to securely offload bandwidth-intensive image-layer serving to additional hosts.
We're doing that on GCP (Artifact Registry, Cloud Run) and now AWS (S3) thanks to additional funding from Amazon, and we will be serving substantially less expensive egress traffic. In the future this might include additional hosts / sponsors (https://registry.k8s.io#stability).

Just serving AWS traffic (which is the majority) from region-local AWS storage should bring us back within our budgets.

We have a lot more context in the docs (https://registry.k8s.io) and in this talk: https://www.youtube.com/watch?v=9CdzisDQkjE

@recollir commented Mar 9, 2023

@BenTheElder 👍

@dims (Member, Author) commented Mar 9, 2023

Experiment results from the k8s.gcr.io->registry.k8s.io redirect trial last October:
https://kubernetes.slack.com/archives/CCK68P2Q2/p1666725317568709

@dims (Member, Author) commented Mar 10, 2023

xref: kubernetes/website#39887

@dims (Member, Author) commented Mar 10, 2023

this text may get dropped from the blog post being drafted for the automatic redirects, so I am saving it here:

Technical Details

The new registry.k8s.io is a secure blob redirector that allows the Kubernetes project to direct traffic, based on request IP, to the best possible blob storage for the user. If a user makes a request from an AWS region network and pulls a Kubernetes container image, for example, that user will be automatically redirected to pull the image layers from the closest S3 bucket. For the current decision tree, refer to the request-handling documentation [2]. To be clear, the new registry.k8s.io implementation allows the upstream project to host registries on more clouds in the future, not just GCP and AWS, which will increase stability, reduce cost, and accelerate both downloads and deployments. Please do not rely on the internal implementation details of the new image registry, as these can be changed without notice.

Please note that the upstream Kubernetes teams are working to provide additional communication, and that how long the old registry remains available is still being discussed.

[1]: https://kubernetes.io/blog/2023/02/06/k8s-gcr-io-freeze-announcement/
[2]: https://github.com/kubernetes/registry.k8s.io/blob/main/cmd/archeio/docs/request-handling.md
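A hedged way to observe the redirect behavior from a shell; the image path is just an example, and the Location target is exactly the kind of internal detail the paragraph above says not to rely on:

```bash
# Ask for a manifest without downloading it; expect an HTTP 3XX response
# whose Location header points at a backing store chosen for your IP.
curl -sI "https://registry.k8s.io/v2/pause/manifests/3.9" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json"
```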

@afbjorklund commented Mar 10, 2023

The first step for minikube will be to start adding --image-repository=registry.k8s.io to the old kubeadm commands.

Probably add it to all kubeadm versions before 1.25.0; it shouldn't hurt anything if it is already the default registry...

The second step is to retag all the older preloads with the new registry, so they work air-gapped (though the preload is a rather small download anyway).

Some mirrors might still use a "k8s.gcr.io" subdirectory, which is fine, so this change is only for the default registry.
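For instance (the Kubernetes version is pinned only as an illustration):

```bash
# Point an older cluster at the new registry explicitly
minikube start --kubernetes-version=v1.24.0 --image-repository=registry.k8s.io
```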


The main issue is that the people pulling those older Kubernetes releases also tend to use older versions of minikube.

Alternatively, we could invalidate old caches and have people pull "new" versions of the same images, but under a different name:

~/.minikube/cache/images/amd64 : k8s.gcr.io/pause_3.6 -> registry.k8s.io/pause_3.6

That would be somewhat counter-productive, so we are instead trying to "upgrade" those old caches in place (by re-tagging the images).
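Conceptually, the in-place upgrade is just a re-tag; an illustrative sketch with docker (minikube's cache stores image archives on disk, so its real implementation differs):

```bash
# Give an already-present image its new name without re-downloading anything
docker tag k8s.gcr.io/pause:3.6 registry.k8s.io/pause:3.6
```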

@BenTheElder (Member) commented:

kubeadm had the default changed in patch releases back to 1.23 (older releases were not accepting any patches), when we published https://kubernetes.io/blog/2022/11/28/registry-k8s-io-faster-cheaper-ga/

@dims (Member, Author) commented Mar 10, 2023

So on March 20 we'll be turning on redirects from k8s.gcr.io to registry.k8s.io for almost everyone; details here:
https://kubernetes.io/blog/2023/03/10/image-registry-redirect/

So the next question will be: how many folks are still using the underlying content of k8s.gcr.io in other ways?

  • directly using the underlying storage (us.artifacts / eu.artifacts / asia.artifacts?)
  • folks using k8s.gcr.io may still be getting pointed back to the underlying storage

Then we'll have to watch how much savings we get over time. Assuming about a week of rollout starting March 20, we'll get some concrete data a week or so after that (let's say Monday, April 3rd, given we have a sawtooth pattern of usage over the week, with lows on Saturday and Sunday).

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jun 9, 2023
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 9, 2023
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 19, 2024
@sftim (Contributor) commented Jan 19, 2024

/reopen

@sftim (Contributor) commented Jan 19, 2024

We did this
/close

@k8s-ci-robot k8s-ci-robot reopened this Jan 19, 2024
@k8s-ci-robot (Contributor) commented:

@sftim: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor) commented:

@sftim: Closing this issue.

In response to this:

We did this
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sftim (Contributor) commented Jan 19, 2024

(but feel free to reopen if needed)
