The TaskRun did not fail promptly after the Pod experienced OOM #8170

Closed
l-qing opened this issue Aug 2, 2024 · 0 comments · Fixed by #8171
Labels
kind/bug Categorizes issue or PR as related to a bug.

l-qing commented Aug 2, 2024

Expected Behavior

After a step container in the Pod is OOM-killed, the TaskRun should fail promptly.

Actual Behavior

The TaskRun stays in the Running state for a long time and only fails once it hits TaskRunTimeout.

Steps to Reproduce the Problem

  1. Execute TaskRun
cat <<'EOF' | kubectl replace --force -f -
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: taskrun-oom
spec:
  taskSpec:
    steps:
      - name: step-1
        image: ubuntu
        computeResources:
          limits:
            memory: "10Mi"
            cpu: "1"
          requests:
            memory: "10Mi"
            cpu: "1"
        script: |
          #!/bin/bash
          echo "Hello, World!"
          str="-"
          while true; do
              str="${str} ${str} |"
          done

      - name: step-2
        image: ubuntu
        script: |
          #!/bin/bash
          echo "Hello, World!"
EOF
  2. TaskRun & Pod
$ kubectl get taskrun -w
taskrun-oom                      Unknown     Running            0s

taskrun-oom                      False       TaskRunTimeout     133m        73m

$ kubectl get pods -w
taskrun-oom-pod                            1/2     OOMKilled   0          5m45s
  3. Pod YAML
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 834202389edeba60f4115048b6eef7fbf61357d3d28b26ffad55ae3190a1471f
    cni.projectcalico.org/podIP: 172.16.4.249/32
    cni.projectcalico.org/podIPs: 172.16.4.249/32
    pipeline.tekton.dev/release: 95fbf31
    tekton.dev/ready: READY
  creationTimestamp: "2024-08-02T05:47:44Z"
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
    tekton.dev/taskRun: taskrun-oom
  name: taskrun-oom-pod
  namespace: default
  ownerReferences:
    - apiVersion: tekton.dev/v1
      blockOwnerDeletion: true
      controller: true
      kind: TaskRun
      name: taskrun-oom
      uid: 353c03d2-0440-4790-a25d-60ee00a4b730
  resourceVersion: "13493752"
  uid: bc8f6968-bfd1-4297-9b68-1fba307b62c3
spec:
  activeDeadlineSeconds: 5400
  containers:
    - args:
        - -wait_file
        - /tekton/downward/ready
        - -wait_file_content
        - -post_file
        - /tekton/run/0/out
        - -termination_path
        - /tekton/termination
        - -step_metadata_dir
        - /tekton/run/0/status
        - -entrypoint
        - /tekton/scripts/script-0-8c4q6
        - --
      command:
        - /tekton/bin/entrypoint
      image: ubuntu
      imagePullPolicy: Always
      name: step-step-1
      resources:
        limits:
          cpu: "1"
          memory: 10Mi
        requests:
          cpu: "1"
          memory: 10Mi
      terminationMessagePath: /tekton/termination
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /tekton/scripts
          name: tekton-internal-scripts
          readOnly: true
        - mountPath: /tekton/downward
          name: tekton-internal-downward
          readOnly: true
        - mountPath: /tekton/creds
          name: tekton-creds-init-home-0
        - mountPath: /tekton/run/0
          name: tekton-internal-run-0
        - mountPath: /tekton/run/1
          name: tekton-internal-run-1
          readOnly: true
        - mountPath: /tekton/bin
          name: tekton-internal-bin
          readOnly: true
        - mountPath: /workspace
          name: tekton-internal-workspace
        - mountPath: /tekton/home
          name: tekton-internal-home
        - mountPath: /tekton/results
          name: tekton-internal-results
        - mountPath: /tekton/steps
          name: tekton-internal-steps
          readOnly: true
        - mountPath: /tekton/artifacts
          name: tekton-internal-artifacts
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-gkgh2
          readOnly: true
    - args:
        - -wait_file
        - /tekton/run/0/out
        - -post_file
        - /tekton/run/1/out
        - -termination_path
        - /tekton/termination
        - -step_metadata_dir
        - /tekton/run/1/status
        - -entrypoint
        - /tekton/scripts/script-1-xvplg
        - --
      command:
        - /tekton/bin/entrypoint
      image: ubuntu
      imagePullPolicy: Always
      name: step-step-2
      resources: {}
      terminationMessagePath: /tekton/termination
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /tekton/scripts
          name: tekton-internal-scripts
          readOnly: true
        - mountPath: /tekton/creds
          name: tekton-creds-init-home-1
        - mountPath: /tekton/run/0
          name: tekton-internal-run-0
          readOnly: true
        - mountPath: /tekton/run/1
          name: tekton-internal-run-1
        - mountPath: /tekton/bin
          name: tekton-internal-bin
          readOnly: true
        - mountPath: /workspace
          name: tekton-internal-workspace
        - mountPath: /tekton/home
          name: tekton-internal-home
        - mountPath: /tekton/results
          name: tekton-internal-results
        - mountPath: /tekton/steps
          name: tekton-internal-steps
          readOnly: true
        - mountPath: /tekton/artifacts
          name: tekton-internal-artifacts
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-gkgh2
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
    - command:
        - /ko-app/entrypoint
        - init
        - /ko-app/entrypoint
        - /tekton/bin/entrypoint
        - step-step-1
        - step-step-2
      image: gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/entrypoint:v0.62.0@sha256:dd24ff7543eaea98ae735820675f1a696956e19a9de4c3a960c8be44959aa930
      imagePullPolicy: IfNotPresent
      name: prepare
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /tekton/bin
          name: tekton-internal-bin
        - mountPath: /tekton/steps
          name: tekton-internal-steps
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-gkgh2
          readOnly: true
      workingDir: /
    - args:
        - -c
        - |
          scriptfile="/tekton/scripts/script-0-8c4q6"
          touch ${scriptfile} && chmod +x ${scriptfile}
          cat > ${scriptfile} << '_EOF_'
          IyEvYmluL2Jhc2gKZWNobyAiSGVsbG8sIFdvcmxkISIKc3RyPSItIgp3aGlsZSB0cnVlOyBkbwogICAgc3RyPSIke3N0cn0gJHtzdHJ9IHwiCmRvbmUK
          _EOF_
          /tekton/bin/entrypoint decode-script "${scriptfile}"
          scriptfile="/tekton/scripts/script-1-xvplg"
          touch ${scriptfile} && chmod +x ${scriptfile}
          cat > ${scriptfile} << '_EOF_'
          IyEvYmluL2Jhc2gKZWNobyAiSGVsbG8sIFdvcmxkISIK
          _EOF_
          /tekton/bin/entrypoint decode-script "${scriptfile}"
      command:
        - sh
      image: cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791
      imagePullPolicy: IfNotPresent
      name: place-scripts
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /tekton/scripts
          name: tekton-internal-scripts
        - mountPath: /tekton/bin
          name: tekton-internal-bin
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-gkgh2
          readOnly: true
  nodeName: 192.168.1.242
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - emptyDir: {}
      name: tekton-internal-workspace
    - emptyDir: {}
      name: tekton-internal-home
    - emptyDir: {}
      name: tekton-internal-results
    - emptyDir: {}
      name: tekton-internal-steps
    - emptyDir: {}
      name: tekton-internal-artifacts
    - emptyDir: {}
      name: tekton-internal-scripts
    - emptyDir: {}
      name: tekton-internal-bin
    - downwardAPI:
        defaultMode: 420
        items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations['tekton.dev/ready']
            path: ready
      name: tekton-internal-downward
    - emptyDir:
        medium: Memory
      name: tekton-creds-init-home-0
    - emptyDir: {}
      name: tekton-internal-run-0
    - emptyDir:
        medium: Memory
      name: tekton-creds-init-home-1
    - emptyDir: {}
      name: tekton-internal-run-1
    - name: kube-api-access-gkgh2
      projected:
        defaultMode: 420
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
                - key: ca.crt
                  path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-08-02T05:47:58Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2024-08-02T05:48:03Z"
      message: "containers with unready status: [step-step-1]"
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-08-02T05:48:03Z"
      message: "containers with unready status: [step-step-1]"
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2024-08-02T05:47:44Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - containerID: containerd://acd789ee5173528aded95d16dc40eb70d7e92e199729b74496cabe588a0d58b4
      image: docker.io/library/ubuntu:latest
      imageID: docker.io/library/ubuntu@sha256:2e863c44b718727c860746568e1d54afd13b2fa71b160f5cd9058fc436217b30
      lastState: {}
      name: step-step-1
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://acd789ee5173528aded95d16dc40eb70d7e92e199729b74496cabe588a0d58b4
          exitCode: 137
          finishedAt: "2024-08-02T05:48:02Z"
          reason: OOMKilled
          startedAt: "2024-08-02T05:47:59Z"
    - containerID: containerd://d35107e6eb2724e9dcd9fee46338e516a2595f3d1b5dc3d9228505332f77005e
      image: docker.io/library/ubuntu:latest
      imageID: docker.io/library/ubuntu@sha256:2e863c44b718727c860746568e1d54afd13b2fa71b160f5cd9058fc436217b30
      lastState: {}
      name: step-step-2
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2024-08-02T05:48:01Z"
  hostIP: 192.168.1.242
  initContainerStatuses:
    - containerID: containerd://ee79f93fbd72f86add0726f9b1ce9c800201d3a2b5a0c45870c0395953a2d674
      image: sha256:925d21d9a65a4a22ec8f4ebb8502127f8528669a78358572496f37bc12b45d0f
      imageID: gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/entrypoint@sha256:dd24ff7543eaea98ae735820675f1a696956e19a9de4c3a960c8be44959aa930
      lastState: {}
      name: prepare
      ready: true
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://ee79f93fbd72f86add0726f9b1ce9c800201d3a2b5a0c45870c0395953a2d674
          exitCode: 0
          finishedAt: "2024-08-02T05:47:51Z"
          reason: Completed
          startedAt: "2024-08-02T05:47:51Z"
    - containerID: containerd://164231046c1255c1ae760ada87685878583eccc311809a98fc462477e3d583f7
      image: sha256:ae017bc447b9b6c5a7451e9da3d74e489263327a55a71d73a121193a583e1a16
      imageID: cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791
      lastState: {}
      name: place-scripts
      ready: true
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://164231046c1255c1ae760ada87685878583eccc311809a98fc462477e3d583f7
          exitCode: 0
          finishedAt: "2024-08-02T05:47:57Z"
          reason: Completed
          startedAt: "2024-08-02T05:47:57Z"
  phase: Running
  podIP: 172.16.4.249
  podIPs:
    - ip: 172.16.4.249
  qosClass: Burstable
  startTime: "2024-08-02T05:47:44Z"
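
The Pod status above shows why the failure is easy to miss: `phase` is still `Running` because `step-step-2` is running (its entrypoint is configured to wait for `/tekton/run/0/out`), while `step-step-1` has already terminated with `reason: OOMKilled` and exit code 137. The following is a minimal sketch, in Python for brevity, of the kind of check that would surface this condition from the container statuses. It is not the Tekton controller's code; the function names `oom_killed_steps` and `should_fail_now` are hypothetical, and the `pod` dictionary is a trimmed copy of the status shown above.

```python
from typing import Any


def oom_killed_steps(pod: dict[str, Any]) -> list[str]:
    """Return the names of containers that terminated with reason OOMKilled.

    Hypothetical helper: it only inspects status.containerStatuses and
    ignores the Pod phase, which is still "Running" in this issue.
    """
    killed = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            killed.append(cs["name"])
    return killed


def should_fail_now(pod: dict[str, Any]) -> bool:
    """Hypothetical policy: fail as soon as any container was OOM-killed."""
    return bool(oom_killed_steps(pod))


# Trimmed copy of the Pod status from this issue.
pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "step-step-1",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            },
            {
                "name": "step-step-2",
                "state": {"running": {"startedAt": "2024-08-02T05:48:01Z"}},
            },
        ],
    }
}

print(pod["status"]["phase"], oom_killed_steps(pod), should_fail_now(pod))
# prints: Running ['step-step-1'] True
```

A check based only on the Pod phase would report nothing wrong here, which matches the observed behavior of the TaskRun staying in Running until the timeout.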

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.37.0
Pipeline version: v0.62.0
Triggers version: v0.27.0
Dashboard version: v0.48.0

This issue has been present in several recent LTS versions, including v0.53, v0.56, v0.59, and v0.62.

l-qing added the kind/bug label Aug 2, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Aug 2, 2024
fix tektoncd#8170

When an OOM occurs in a Pod related to TaskRun, the TaskRun should be marked
as failed immediately instead of waiting for it to timeout.