Tekton shouldn't fail PipelineRun/TaskRun for Kubernetes container start warnings. #7184

Closed
sauravdey opened this issue Oct 8, 2023 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sauravdey

Expected Behavior

Intermittent failures during Kubernetes container start are failing the PipelineRun/TaskRun; Tekton should tolerate them. The failures are listed below:

  1. Failed to create subPath directory for volumeMount.
  2. ImagePullBackOff

In both cases the Pod/container will eventually start and reach a running state, but Tekton fails the task immediately. The Pod still runs, takes up resources, and most of the time completes successfully.

Actual Behavior

Right now, on ImagePullBackOff the pipeline fails immediately, even though the Pod eventually starts.
Similarly, "Failed to create subPath directory for volumeMount" eventually succeeds, but the pipeline is marked as failed.

Steps to Reproduce the Problem

The issue shows up when many Pods/containers start at the same time on a Kubernetes node.
Set the kubelet's registryPullQPS and registryBurst to a low number and start multiple PipelineRuns/TaskRuns whose image is already present locally on the node. The pull eventually succeeds, but the PipelineRun/TaskRun fails (see the kubelet config sketch after these steps).

Alternatively, use an NFS-backed PVC and create multiple subPaths by running a lot of PipelineRuns. Most of the time there is no issue, but occasionally subPath creation fails for one container; Kubernetes handles this and recreates the subPath, yet the TaskRun still fails.
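
For reference, a minimal kubelet configuration sketch for the rate-limit setup described above. registryPullQPS and registryBurst are standard KubeletConfiguration fields; the low values here are only illustrative, chosen to make the ImagePullBackOff easier to reproduce, not the values from any real cluster.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Throttle image pulls on the node so concurrent TaskRuns hit the limit.
registryPullQPS: 1   # max image pulls per second (kubelet default is 5)
registryBurst: 2     # max burst size for pulls (kubelet default is 10)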

Additional Info

If we cannot handle these use cases automatically, there should be some config to ignore these warnings.

  • Kubernetes version:

    Output of kubectl version:

Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.11", GitCommit:"8cfcba0b15c343a8dc48567a74c29ec4844e0b9e", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:38Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.47.2
@sauravdey added the kind/bug label Oct 8, 2023
@afrittoli
Member

Thanks for the bug report.
Since v0.51 we handle InvalidImageName as a permanent error, which causes Tekton to delete the Pod.

ImagePullBackOff is treated the same way: because Tekton workloads are not restartable (unlike Pods), users cannot edit the image in a step and restart the same TaskRun; they must create a new one.
In some cases ImagePullBackOff may be caused by a temporary infrastructure issue (network, rate limiting or similar), but since we have no way to distinguish those cases, we always fail the TaskRun and kill the Pod on ImagePullBackOff.

From the issue description, it sounds like this is not what you're experiencing; it could be that the behaviour for ImagePullBackOff was introduced later than v0.47.2. I need to check the code base.

About the "Failed to create subPath directory for volumeMount" error - do you have an example Pod YAML with the failure that you can share?

@sauravdey
Author

sauravdey commented Oct 10, 2023

@afrittoli sure, here is a small snippet of the TaskRun:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: clone-test
spec:
  params:
  - name: repo-url
    value: git@github.com:test/test.git
  taskRef:
    name: tekton-git-clone
  podTemplate:
    nodeSelector:
      kubernetes.io/role: tekton-test
  serviceAccountName: tekton-test-svc
  timeout: 12h0m0s
  workspaces:
  - name: output
    persistentVolumeClaim:
      claimName: tekton-test-pvc
    subPath: builds/test/$(context.taskRun.name)

$(context.taskRun.name) is a generated name which is different every time, and this path is created when the Pod starts.
tekton-test-pvc is a PVC backed by NFS.

Most of the time the subPath creation succeeds. But when it fails, Kubernetes handles it and recreates the path, yet the Tekton TaskRun fails with failed to create subPath directory for volumeMount "ws-24dfd" of container "test". An illustrative sketch of the generated Pod mount follows below.
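
For illustration only (not the actual failing Pod): a minimal sketch of what the workspace binding above roughly translates to on the TaskRun's Pod. The Pod, container, and image names are hypothetical; the volume name comes from the error message and the claim/subPath from the TaskRun above.

apiVersion: v1
kind: Pod
metadata:
  name: clone-test-abc12-pod                        # hypothetical generated Pod name
spec:
  containers:
  - name: step-clone                                # hypothetical step container
    image: registry.example.com/git-init:latest     # placeholder image
    volumeMounts:
    - name: ws-24dfd                                # workspace volume from the error message
      mountPath: /workspace/output
      subPath: builds/test/clone-test-abc12         # resolved $(context.taskRun.name)
  volumes:
  - name: ws-24dfd
    persistentVolumeClaim:
      claimName: tekton-test-pvc

The kubelet creates the subPath directory on the NFS-backed volume when it starts the container, which is the point where the event above is emitted.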

@pritidesai
Member

/help-wanted

@afrittoli to determine the exact error and document here, thanks!

@afrittoli
Member

Thanks @sauravdey for sharing the TaskRun.
I was hoping to see the exact error message the Pod actually has; that would help with the fix.

@sauravdey
Author

@afrittoli
There is only an event when the container starts:
failed to create subPath directory for volumeMount <volume-name> of container <container-name>

@sauravdey
Author

@afrittoli any update on this? Do you need more information?

@pritidesai
Member

pritidesai commented Feb 5, 2024

In some cases ImagePullBackOff may be caused by a temporary infrastructure issue (network, rate limiting or so), but since we have no way to distinguish those cases, we will always fail the TaskRun and kill the Pod on ImagePullBackOff.

Hey @afrittoli, we are running into this problem in our infrastructure (see comment in issue #5987). How about introducing opt-in functionality to avoid treating ImagePullBackOff as a permanent error?

pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 13, 2024
We have implemented imagePullBackOff as fail-fast. The issue with this approach
is that the node where the pod is scheduled often experiences registry rate limiting.
The image pull failure caused by the rate limit returns the same warning
(reason: Failed and message: ImagePullBackOff). The pod can potentially recover
after waiting long enough for the cap to expire. Kubernetes can then
successfully pull the image and bring the pod up.

Introducing a default configuration to specify a cluster-level timeout that allows
the imagePullBackOff to retry for a given duration. Once that duration has
passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

Signed-off-by: Priti Desai <pdesai@us.ibm.com>

wait for a given duration in case of imagePullBackOff

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
@vdemeester
Member

Given that #7666 is merged (see docs), I'll go ahead and close this.
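
For anyone landing here later, the fix adds a cluster-level setting in Tekton's config-defaults ConfigMap. A minimal sketch, assuming the field name introduced by #7666 is default-imagepullbackoff-timeout (check the docs linked above for the exact name and format in your release):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # Assumed field from #7666: how long the controller tolerates
  # ImagePullBackOff before treating it as a permanent failure.
  default-imagepullbackoff-timeout: "5m"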
