Configurable grace period for TaskRun pods in ImagePullBackOff #5987
Comments
Hi, may I try implementing TEP-0092? 👋
Hello @shuheiktgw, thank you for offering to implement this, that would be great.
@shuheiktgw thank you for offering to implement -- @EmmaMunley has also started looking into implementing TEP-0092, maybe you can collaborate?
Sure, I'm happy to collaborate 🙂 Hi @EmmaMunley, how is the implementation going so far? I'd appreciate it if you could push any WIP changes so that I can see if there is anything I can help with!
Hi @shuheiktgw! Sure! I am working on implementing the scheduling timeout feature first as part of this issue: #4078.
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue in response to the /close command above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks @dibyom for reopening this issue. We are running into this issue and looking for a potential solution. The problem is that the node where the pod is scheduled often experiences registry rate limits. An image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff) as any other pull failure. We have this issue reported by multiple end users: #7184. The problem statement for this issue is to be able to continue waiting when a step or sidecar container reports that it is waiting with ImagePullBackOff. TEP-0092 was proposed to solve this issue.
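For reference, the waiting state described above is exposed through the standard Kubernetes pod status. Below is a minimal sketch of such a check, using the upstream k8s.io/api types rather than Tekton's actual helpers:

```go
package backoff

import (
	corev1 "k8s.io/api/core/v1"
)

// isImagePullBackOff reports whether any container in the pod is waiting
// because an image pull is backing off. Note that from the waiting reason
// alone, a transient registry rate limit is indistinguishable from a
// permanently unpullable image.
func isImagePullBackOff(pod *corev1.Pod) bool {
	for _, status := range pod.Status.ContainerStatuses {
		if w := status.State.Waiting; w != nil && w.Reason == "ImagePullBackOff" {
			return true
		}
	}
	return false
}
```

This ambiguity is exactly why a fail-fast policy is lossy: the controller cannot tell from the status whether waiting longer would help.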
Revisiting TEP-0132, it might not be able to resolve this particular issue, as the taskrun controller treats ImagePullBackOff as a fail-fast, permanent failure.
We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often experiences registry rate limits. An image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff) as any other pull failure. The pod can potentially recover after waiting long enough for the rate-limit cap to expire; Kubernetes can then successfully pull the image and bring the pod up. This introduces a default configuration to specify a cluster-level timeout that allows the imagePullBackOff to retry for a given duration. Once that duration has passed, a permanent failure is returned. tektoncd#5987 tektoncd#7184

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
This is a manual cherry-pick of tektoncd#7666. Signed-off-by: Priti Desai <pdesai@us.ibm.com>
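The cluster-level setting described in the commit message above would live in Tekton's config-defaults ConfigMap. Here is a sketch of what it could look like; the key name default-imagepullbackoff-timeout is illustrative, inferred from the direction of the linked PRs rather than quoted from released documentation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # Illustrative key: how long a TaskRun pod may stay in ImagePullBackOff
  # before the controller treats it as a permanent failure. An unset or
  # zero value would preserve the current fail-fast behavior.
  default-imagepullbackoff-timeout: "5m"
```

With a value like 5m, a pull that fails only because a registry rate limit is temporarily exhausted gets a window to recover before the TaskRun is failed.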
Feature request
Introduce a configurable grace period for TaskRun pods to be in ImagePullBackOff without failing the TaskRun.
Use case
We are using Tekton to execute TaskRuns. In our use case, images for TaskRun pods cannot be pulled directly because container registry credentials must not exist in the namespace for security reasons. Instead, another component in the cluster pulls images for TaskRun pods. In case of delays in image pulling, a TaskRun's pod may go into ImagePullBackOff and recover after the image has been provisioned on the respective node.
With PR #4921 (fail TaskRuns on ImagePullBackOff) we now see sporadically failing TaskRuns.
We propose to introduce a configurable grace period where ImagePullBackOff is tolerated and does not fail a TaskRun.
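A minimal sketch of the proposed grace-period semantics, assuming the controller tracks when the pod first reported ImagePullBackOff; the function name and parameters here are illustrative, not Tekton's actual API:

```go
package backoff

import "time"

// shouldFailTaskRun decides whether a TaskRun whose pod first reported
// ImagePullBackOff at backoffSince should be failed now, or tolerated and
// re-checked later in the hope that the pull eventually succeeds.
func shouldFailTaskRun(backoffSince time.Time, gracePeriod time.Duration, now time.Time) bool {
	if gracePeriod <= 0 {
		// No grace period configured: keep today's fail-fast behavior.
		return true
	}
	// Tolerate ImagePullBackOff until the grace period elapses; the image
	// may still be provisioned on the node within that window.
	return now.Sub(backoffSince) >= gracePeriod
}
```

The controller would requeue the TaskRun while this returns false, so a pod that recovers within the grace period continues as if the backoff never happened.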