Sunset for k8s.gcr.io repository #4872
Comments
cc @kubernetes/sig-architecture-leads @kubernetes/sig-release-leads |
Let's aim to very clearly communicate a recommended approach (e.g. mirror the images that you depend on, or use a pull-through cache, or...) and consider the lead time on those comms when we pick a date. The comms plan does not have to be perfect, it just has to be good enough. |
@sftim agree. Recommended approach, so far:
|
@enj yep, agree. The brownout we had in mind was as Arnaud mentioned here: |
@dims I suppose deleting images is one form of brownout... I was more thinking that we have the old registry return |
@enj k8s.gcr.io is GCR-based and has only a few folks left to take care of it. Last year some helpful folks tried to set up redirects (automatic from k8s.gcr.io to registry.k8s.io) in a small portion and ran into snags, so we can't do much over there other than delete images. Details are in this thread: https://kubernetes.slack.com/archives/CCK68P2Q2/p1666725317568709 |
@dims makes sense. One suggestion that may also not be implementable would be to temporarily delete and then recreate image tags to cause pull failures (another form of brownout). |
Year to date GCP Billing data, please see here: ($682,683.81 year-to-date / 62 days from Jan 1 to March 4) * 365 = $4,019,025.65 (our budget/credits is $3m ) |
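The extrapolation in the comment above can be reproduced with a short Python sketch. It uses only the three figures quoted there; the function names are illustrative, and it assumes a constant average daily spend, which is a simplification.

```python
# Figures quoted in the comment above.
SPEND_YEAR_TO_DATE = 682_683.81  # USD, Jan 1 to March 4
DAYS_ELAPSED = 62
ANNUAL_CREDITS = 3_000_000       # USD, the "$3m" budget/credits


def project_annual_spend(spend_so_far: float, days_elapsed: int,
                         days_in_year: int = 365) -> float:
    """Linear extrapolation: average daily spend times days in the year."""
    return spend_so_far / days_elapsed * days_in_year


def days_until_exhausted(budget: float, spend_so_far: float,
                         days_elapsed: int) -> float:
    """Days after Jan 1 at which the budget runs out at the same daily rate."""
    daily_rate = spend_so_far / days_elapsed
    return budget / daily_rate


projected = project_annual_spend(SPEND_YEAR_TO_DATE, DAYS_ELAPSED)
overrun = projected - ANNUAL_CREDITS
runway_days = days_until_exhausted(ANNUAL_CREDITS, SPEND_YEAR_TO_DATE, DAYS_ELAPSED)

print(f"projected annual spend: ${projected:,.2f}")
print(f"projected overrun:      ${overrun:,.2f}")
print(f"credits last about {runway_days:.0f} days from Jan 1")
```

At that rate the credits would run out well before the end of the year, which is the concern the thread keeps returning to.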
One option we have is to actually delete some images - and then optionally reinstate them per #4872 (comment). A 429 is subject to Google's say-so, but deleting an image is something we can Just Do™. So long as the comms are in place to explain why. |
xref: #4738 |
An energetic discussion with @thockin here https://kubernetes.slack.com/archives/CCK68P2Q2/p1678118252030639
|
I think we can do broad brownouts ahead of any final sunset by toggling the access controls on the 3 backing GCR instances. To make the images public read we set the backing GCS bucket to have read permission for Doing this is a big deal, and I'm not sure what the time frame should be. We know that users are very slow to migrate, and that doing this will disrupt their base "cloud-native" infrastructure. (E.g. I saw some recent data that Kubernetes 1.11 from 2018 is still reasonably popular (!)) |
Some data from @justinsb: |
Some good discussion with @TheFoxAtWork here: https://cloud-native.slack.com/archives/CSCPTLTPE/p1678219030800149 on #tag-chairs channel on CNCF slack
|
@dims I want to confirm what I'm looking at in the chart (I understand there is a new one in the works): can you confirm that each colored bar is who/what is primarily requesting the images? If this is the case, has AWS/Amazon been engaged to redirect requests they field to |
@TheFoxAtWork yep, there has been a bunch of back and forth. |
Has anyone pinged Microsoft? I don't know where Azure stands at the moment. |
A single line kubectl command to find images from the old registry: A Kyverno and Gatekeeper policy to help folks! A kubectl/krew plugin: |
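The kubectl one-liner itself did not survive in the comment above, so here is a rough Python sketch of the same idea rather than a reconstruction of it. It scans a pod list (the JSON shape produced by `kubectl get pods --all-namespaces -o json`) for images that still reference the old registry. The function name and the sample data are made up for illustration.

```python
from __future__ import annotations

OLD_REGISTRY = "k8s.gcr.io"


def find_old_registry_images(pod_list: dict,
                             old_registry: str = OLD_REGISTRY) -> dict[str, set[str]]:
    """Return {"namespace/pod": {images}} for pods using the old registry."""
    hits: dict[str, set[str]] = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        # Look at every kind of container a pod spec can carry.
        containers = []
        for key in ("initContainers", "containers", "ephemeralContainers"):
            containers.extend(spec.get(key) or [])
        images = {
            c["image"] for c in containers
            if c.get("image", "").startswith(old_registry + "/")
        }
        if images:
            pod_key = f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}"
            hits[pod_key] = images
    return hits


# Illustrative sample only; real input would come from kubectl's JSON output.
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "kube-system", "name": "coredns-abc"},
            "spec": {"containers": [
                {"name": "coredns", "image": "k8s.gcr.io/coredns/coredns:v1.8.6"},
            ]},
        },
        {
            "metadata": {"namespace": "default", "name": "web"},
            "spec": {"containers": [
                {"name": "pause", "image": "registry.k8s.io/pause:3.9"},
            ]},
        },
    ]
}

old_images = find_old_registry_images(sample_pods)
for pod_key, images in sorted(old_images.items()):
    print(pod_key, sorted(images))
```

To run it against a live cluster, one option is to save the kubectl output to a file and load it with `json.load` before calling the function.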
FAQ(s) we are getting asked:
|
Attempted to pull a lot of the details from this ticket into a single LinkedIn post for sharing, in case it helps: https://www.linkedin.com/posts/themoxiefox_action-required-update-references-from-activity-7039245748525256704-IrES |
Some good news from @BenTheElder here - https://kubernetes.slack.com/archives/CCK68P2Q2/p1678299674725429 |
AWS just posted a bulletin in its StackOverflow: https://stackoverflow.com/collectives/aws/bulletins/75676424/important-kubernetes-registry-changes |
I chatted with @jeremyrickard at Microsoft. They are all over this. |
Question: when the new k8s.gcr.io->registry.k8s.io redirection takes effect, what is likely to fail?
|
Touching on the topic of network level firewalls or other things causing impact: This is fairly easily tested - run a pod which uses a "registry.k8s.io" image in your cluster(s). If it is able to pull that image, you're almost certainly OK. If not, debug now before the redirect goes live (next week, we hope). |
How will the redirect work? Just on DNS level? I have tried this locally myself, but containerd/Docker, obviously and for the right reasons, complains about certificate mismatch between k8s.gcr.io and registry.k8s.io. I solved it then by downloading the ca.crt and installing it locally for containerd/Docker. |
Do we have enough bandwidth on registry.k8s.io ? |
HTTP 3XX redirect, not DNS. No cert changes. You can test by taking any image you would pull and substituting The only difference between doing this test and the redirect will be your client reaching k8s.gcr.io first and then following the redirect, but presumably k8s.gcr.io was already reachable for you if you're switching, and all production-grade registry clients follow HTTP redirects. The same existing GCR endpoint will serve the redirect instead of the usual response. Existing GCR image pulls already involve redirects to backing storage, just not redirects to registry.k8s.io
We should have more than enough capacity on https://registry.k8s.io, we've looked at traffic levels for k8s.gcr.io and planned accordingly. registry.k8s.io gives us the ability to offload bandwidth-intensive image layer serving to additional hosts securely. Just serving AWS traffic (which is the majority) from region-local AWS storage should bring us back within our budgets. We have a lot more context in the docs (https://registry.k8s.io) and this talk https://www.youtube.com/watch?v=9CdzisDQkjE |
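The substitution test described above lost its exact wording in the comment. Assuming it means swapping the k8s.gcr.io host for registry.k8s.io in an image reference (the migration discussed throughout this thread), a minimal Python sketch of that rewrite could look like the following; `to_new_registry` is an illustrative name, not an existing tool.

```python
OLD_REGISTRY = "k8s.gcr.io"
NEW_REGISTRY = "registry.k8s.io"


def to_new_registry(image: str) -> str:
    """Swap the registry host if the reference points at the old registry.

    Only the host component before the first "/" is compared, so a
    repository path that merely contains "k8s.gcr.io" (as some mirrors
    use) is left untouched.
    """
    host, sep, rest = image.partition("/")
    if sep and host == OLD_REGISTRY:
        return f"{NEW_REGISTRY}/{rest}"
    return image


# Example references; substitute the images you actually pull.
for ref in (
    "k8s.gcr.io/pause:3.9",
    "k8s.gcr.io/coredns/coredns:v1.8.6",
    "mirror.example.com/k8s.gcr.io/pause:3.9",
):
    print(ref, "->", to_new_registry(ref))
```

Pulling the rewritten reference from inside the cluster is then the reachability check suggested earlier in the thread: if that pull works, the redirect should as well.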
Experiment results for the k8s.gcr.io->registry.k8s.io redirect last October: |
xref: kubernetes/website#39887 |
This text may get dropped from the blog post being drafted for automatic redirects, so saving it here: Technical Details The new registry.k8s.io is a secure blob redirector that allows the Kubernetes project to direct traffic based on request IP to the best possible blob storage for the user. If a user makes a request from an AWS region network and pulls a Kubernetes container image, for example, that user will be automatically redirected to pull an image from the closest S3 bucket image layer store. For the current decision tree, refer to this architecture decision tree [1]. To be clear, the new registry.k8s.io implementation allows the upstream project to host registries on more clouds in the future, not just GCP and AWS, which will increase stability, reduce cost, and speed up both downloads and deployments. Please do not rely on the internal implementation details of the new image registry as these can be changed without notice. Please note the upstream Kubernetes teams are working to provide additional communication, and the situation around how long the old registry remains is still being discussed. [1]: https://kubernetes.io/blog/2023/02/06/k8s-gcr-io-freeze-announcement/ |
The first step for minikube will be to start adding Probably add it to all kubeadm versions before 1.25.0, shouldn't hurt anything if it is already the default registry... The second step is to retag all the older preloads with the new registry, to work air-gapped (but rather small download) Some mirrors might still use a "k8s.gcr.io" subdirectory, which is fine, so this change is only for the default registry. Main issue is that the people who are pulling those older Kubernetes releases also use older versions of minikube. Or if we invalidate old caches, and have people pull "new" versions of the same images - but with a different name... ~/.minikube/cache/images/amd64 : That would be somewhat counterproductive, so trying to "upgrade" those old caches in place (by re-tagging images) |
kubeadm had the default changed in patch releases back to 1.23 (older releases were not accepting any patches), when we published https://kubernetes.io/blog/2022/11/28/registry-k8s-io-faster-cheaper-ga/ |
So on March 20, we'll be turning on redirects for almost everyone from k8s.gcr.io to registry.k8s.io, details here: So the next question will be, how many folks are still using the underlying content of k8s.gcr.io in other ways:
So we'll then have to watch how much savings we get over time. Assuming about a week of rollout starting March 20, we'll get some concrete data a week or so after that (let's say Monday, April 3rd, given we have a sawtooth pattern of usage over the week with lows on Saturday and Sunday) |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
We did this |
@sftim: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@sftim: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
(but feel free to reopen if needed) |
Here are the community blogs and announcements so far around k8s.gcr.io
However, we are finding out that the numbers don't add up and we will end up using all the budget we have as our GCP cloud credits well before Dec 31, 2023. So we need to do something more drastic than just the freeze. Please see the thread in #sig-k8s-infra: https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793138667629?thread_ts=1677709804.935919&cid=CCK68P2Q2
We will need to start by enumerating some of the images that carry the biggest cost (storage + network) and removing them from k8s.gcr.io right away (possibly by the freeze date, April 3rd). Some data is in the thread, but we will need to revisit the logs, come up with a clear set of images based on some criteria, and announce their deletion as well. Note that this specific set of images will still be available in the new registry, registry.k8s.io. So folks will have to fix their Kubernetes manifests / Helm charts etc., as we mentioned in the 3 URLs above.
Thoughts about a deadline for deletion of k8s.gcr.io:
Since the freeze is on April 3rd, 2023 (10 days before 1.27 is released) and we expect to send comms out at KubeCon EU (April 18–21), how about we put the marker at the end of June? (So we get 6 months of cost savings.)
Risk: We will end up interrupting clusters that are working right now. Given the traffic patterns, a bunch of these will be in AWS, but this is very likely to affect anyone who has an older working cluster that they haven't touched in a while.
What I have enumerated above is just the beginning of the discussion. Please feel free to add your thoughts below, so we can then draft a KEP around it.