Tekton shouldn't fail pipelinerun/taskrun for Kubernetes container start warnings #7184
Comments
Thanks for the bug report.
From the issue description, it sounds like this is not what you're experiencing. About the "Failed to create subPath directory for volumeMount" error - do you have an example?
@afrittoli sure, a small snippet for the taskrun (an illustrative sketch follows below).
$(context.taskRun.name) is a generated name that is different every time, and the subPath is created when the pod starts. Most of the time the subPath creation succeeds, but when it does fail Kubernetes handles it and recreates the path; the Tekton taskrun, however, still fails.
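A minimal illustrative sketch of the kind of TaskRun described above (not the reporter's original snippet; the image, script, and PVC name are placeholders). The subPath uses $(context.taskRun.name), so a fresh directory has to be created on the NFS-backed volume every time the pod starts:

```yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: subpath-example-
spec:
  taskSpec:
    steps:
      - name: write
        image: busybox
        script: |
          echo "hello" > /data/out.txt
        volumeMounts:
          - name: shared-data
            mountPath: /data
            # Tekton substitutes the generated TaskRun name here, so the
            # directory on the volume is new for every run.
            subPath: $(context.taskRun.name)
    volumes:
      - name: shared-data
        persistentVolumeClaim:
          claimName: nfs-pvc   # placeholder for an NFS-backed PVC
```

If creating that directory fails transiently, the kubelet retries and the container eventually starts, but by then the TaskRun has already been marked failed.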
/help-wanted @afrittoli to determine the exact error and document here, thanks!
Thanks @sauravdey for sharing the snippet.
@afrittoli
@afrittoli any update on this? Do you need more information?
Hey @afrittoli, we are running into this problem in our infrastructure (see comment in issue #5987). How about introducing an opt-in functionality to avoid treating these warnings as permanent failures?
We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often hits registry rate limits. An image pull failure caused by the rate limit returns the same warning (reason: Failed, message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the cap to expire; Kubernetes can then successfully pull the image and bring the pod up. This change introduces a default configuration to specify a cluster-level timeout that lets the imagePullBackOff retry for a given duration; once that duration has passed, a permanent failure is returned. tektoncd#5987 tektoncd#7184
Signed-off-by: Priti Desai <pdesai@us.ibm.com>
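For reference, this is roughly what that cluster-level setting looks like as a config-defaults entry. It is a sketch based on the change description above; the key name default-imagepullbackoff-timeout and the example value are assumptions to confirm against the Tekton Pipelines release you run.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # Keep retrying a step stuck in ImagePullBackOff for up to 5 minutes
  # before the TaskRun is marked as permanently failed.
  default-imagepullbackoff-timeout: "5m"
```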
Expected Behavior
Intermediate failures during Kubernetes container start should not fail the pipelinerun/taskrun.
The failures in question are:
- ImagePullBackOff
- Failed to create subPath directory for volumeMount
With both warnings the pod/container eventually starts and reaches the running state, but Tekton fails the task immediately. The pod then keeps running, taking up resources even though the run has been marked failed, and most of the time completes successfully.
Actual Behavior
Right now, on an ImagePullBackOff the pipelinerun fails immediately, even though the pod eventually starts.
Likewise, after a "Failed to create subPath directory for volumeMount" warning the container eventually starts successfully, but the pipelinerun is marked as failed.
Steps to Reproduce the Problem
The issue is hit when a lot of pods/containers start at the same time on a Kubernetes node:
1. Set the kubelet's registryPullQPS and registryBurst to a low number (a kubelet configuration sketch follows this list) and start multiple pipelineruns/taskruns whose image should already be present locally on the node. The pull eventually succeeds, but the pipelinerun/taskrun fails.
2. Use an NFS-backed PVC and create multiple subPaths by running a lot of pipelineruns. Most of the time there is no issue, but sometimes subPath creation fails for one container and the taskrun fails, even though Kubernetes handles it and recreates the subPath.
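To make the first scenario easier to hit, the kubelet's pull throttling can be lowered. A rough sketch, with arbitrary values chosen only to make registry throttling likely under load:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registryPullQPS: 1   # max image pull queries per second
registryBurst: 2     # max burst of image pulls allowed
```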
Additional Info
If these cases cannot be handled automatically, there should at least be some configuration to ignore these warnings.
Kubernetes version: output of kubectl version:
Tekton Pipeline version: output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}':