Trial stuck running - job failed backoff limit reached #845

Closed
jlewi opened this issue Oct 3, 2019 · 3 comments · Fixed by #847

@jlewi
Contributor

jlewi commented Oct 3, 2019

/kind bug

I submitted a Katib job. For some of the resulting trials, the corresponding training job reached its backoff limit and thus will never succeed.

Yet the trial remains in the running state.

Here's the job spec. I elided some details but left the status to show the job is in the failed state.

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: 2019-10-02T14:09:20Z
  ...
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 3a5714e2-e51e-11e9-b027-42010a8c00d0
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      ...
    spec:
      containers:
        ...
        name: train
        resources:
          limits:
            cpu: "16"
            memory: 24Gi
          requests:
            cpu: "4"
            memory: 4Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /secret/gcp
          name: secret-volume
        - mountPath: /src
          name: source
      dnsPolicy: ClusterFirst
      initContainers:
        ...
        imagePullPolicy: IfNotPresent
        name: setup
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /secret/gcp
          name: secret-volume
        - mountPath: /src
          name: source
        workingDir: /app
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: secret-volume
        secret:
          defaultMode: 420
          secretName: user-gcp-sa
      - emptyDir: {}
        name: source
status:
  conditions:
  - lastProbeTime: 2019-10-02T14:32:37Z
    lastTransitionTime: 2019-10-02T14:32:37Z
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 6
  startTime: 2019-10-02T14:09:20Z

Here's the corresponding trial:

apiVersion: kubeflow.org/v1alpha2
kind: Trial
metadata:
  ...
status:
  conditions:
  - lastTransitionTime: 2019-10-02T14:09:20Z
    lastUpdateTime: 2019-10-02T14:09:20Z
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-10-02T14:09:20Z
    lastUpdateTime: 2019-10-02T14:09:20Z
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: 2019-10-02T14:09:20Z

So the trial status is stuck in the Running state, but it should be marked as Failed since the job will never succeed.
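
To make the expected behavior concrete, here is a minimal Python sketch of the check on the Job status. It is only an illustration, not Katib's controller code; the function name `job_has_failed` is hypothetical, and the dict mirrors the `status` section of the Job shown above.

```python
from typing import Any, Dict


def job_has_failed(job_status: Dict[str, Any]) -> bool:
    """Return True if the Job status has a condition of type Failed with status "True"."""
    for cond in job_status.get("conditions") or []:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return True
    return False


# Status copied from the Job above (timestamps omitted).
job_status = {
    "conditions": [
        {
            "type": "Failed",
            "status": "True",
            "reason": "BackoffLimitExceeded",
            "message": "Job has reached the specified backoff limit",
        }
    ],
    "failed": 6,
}

# A job in this state is terminal, so the trial should not stay in Running.
job_is_terminal_failure = job_has_failed(job_status)
```

With the status above, `job_has_failed` reports a failure, which is the signal the trial status should reflect.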

@jlewi
Contributor Author

jlewi commented Oct 3, 2019

Workaround

I deleted the stuck trials and Katib spawned new trials. So manually deleting the stuck trials appears to unjam the experiment.
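
For anyone applying the same workaround, the stuck trials can be picked out with a sketch like the one below. It assumes you have already fetched the Trial objects and the status of each trial's Job as plain dicts, and that you can pair them by trial name; the names `find_stuck_trials`, `trial-a`, and `trial-b` are hypothetical.

```python
from typing import Any, Dict, List


def job_has_failed(job_status: Dict[str, Any]) -> bool:
    """Return True if the Job status has a condition of type Failed with status "True"."""
    return any(
        c.get("type") == "Failed" and c.get("status") == "True"
        for c in job_status.get("conditions") or []
    )


def find_stuck_trials(
    trials: List[Dict[str, Any]],
    job_status_by_trial: Dict[str, Dict[str, Any]],
) -> List[str]:
    """Return names of trials whose latest condition is Running but whose Job has failed."""
    stuck = []
    for trial in trials:
        name = trial["metadata"]["name"]
        conditions = trial.get("status", {}).get("conditions") or []
        still_running = bool(conditions) and conditions[-1].get("type") == "Running"
        if still_running and job_has_failed(job_status_by_trial.get(name, {})):
            stuck.append(name)
    return stuck


# Hypothetical data: trial-a matches the situation in this issue, trial-b is healthy.
running = {"conditions": [{"type": "Created", "status": "True"},
                          {"type": "Running", "status": "True"}]}
trials = [
    {"metadata": {"name": "trial-a"}, "status": running},
    {"metadata": {"name": "trial-b"}, "status": running},
]
job_status_by_trial = {
    "trial-a": {"conditions": [{"type": "Failed", "status": "True",
                                "reason": "BackoffLimitExceeded"}]},
    "trial-b": {"active": 1},
}

stuck = find_stuck_trials(trials, job_status_by_trial)
```

The returned names are the candidates for manual deletion; trials whose jobs are still active are left alone.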

@gaocegege
Member

/assign

@gaocegege
Member

Thanks for the issue. I will fix it soon.
