This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Encounter NIL Error when job in error stage with TTL value set #170

Open
mirocody opened this issue Oct 26, 2021 · 0 comments

mirocody commented Oct 26, 2021

Hi community,
I am trying to deploy a simple task using a PyTorchJob with the following YAML:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob
  namespace: abc
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch

  runPolicy:
    ttlSecondsAfterFinished: 864000

The script exception.py does nothing but throw an exception so that the container goes into the error status. The training operator pod then logs the following:

E1026 03:50:23.343541       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 560 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000e89320, 0xc000703618, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000e89320, 0x18987c0, 0xc000703500, 0xc0008183c0, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000e89320, 0x1b88fa0, 0xc000818270, 0xc000624f60, 0x13, 0xc000a1b590, 0x28, 0xc000818270, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x1750a40, 0xc000348340)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc000d26400)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00026c750)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001121f50, 0x1b46440, 0xc000818180, 0xc000d26401, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00026c750, 0x3b9aca00, 0x0, 0x1, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00, 0x0, 0x1986d01)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]

It looks like the assumption in this line does not hold: the completion time is not set when the cleanup starts, so CleanupJob (job.go:401 in the trace above) dereferences a nil pointer.
