Submitted tfjobs cease to start running under unknown conditions #203

cwbeitel · 2017-12-06T20:18:05Z

See logs. I sometimes see a previously working deployment stop launching tfjobs; after re-deploying the cluster, the same tfjob deployment then works. The next time it happens I can collect log or config data; let me know what you need.

jlewi · 2017-12-07T00:27:11Z

Please provide the logs of the TfJob operator pod.
Also please provide the result of
kubectl get tfjobs -o yaml

Also if I have access to this cluster please leave it up and running when it happens so I can inspect it.

cwbeitel · 2017-12-07T16:42:03Z

Sure will do.

jlewi · 2017-12-11T22:02:14Z

I think I'm encountering this myself.

- Controller pod has been running for 8 days.
- I have submitted multiple TfJobs that started successfully.
- Currently there is a single TfJob (see tfjobs.yaml.txt).
  - The TfJob doesn't have a runtimeId, which indicates it hasn't been updated by the controller.

The controller log (controller.log.txt) shows the following.

An ADDED event is received:

I1211 21:40:38.516560       1 controller.go:349] event: ADDED {
...
E1211 21:40:38.530912       1 training.go:112] TfJob failed to setup: tbReplicaSpec.LogDir must be specified

I1211 21:40:38.706580 1 controller.go:349] event: MODIFIED {

So I think there is at least one bug: when setup fails at training.go:112, the TfJob status is not updated to indicate the failure.
When I specified LogDir the job was successfully created.
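
To illustrate the behavior being asked for, here is a minimal sketch in Go of a reconcile step that records a setup failure on the job's status instead of only logging it. All of the types and function names (`TfJob`, `TfJobStatus`, `setup`, `reconcile`, the `Phase` values, `TensorBoardLogDir`) are hypothetical stand-ins for this example and are not taken from the operator's source; only the error message text comes from the log above. A real controller would also need to write the updated status back to the API server, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// TfJobPhase is a hypothetical lifecycle phase used only in this sketch.
type TfJobPhase string

const (
	PhaseRunning TfJobPhase = "Running"
	PhaseFailed  TfJobPhase = "Failed"
)

// TfJobSpec holds only the one field relevant here (assumed shape).
type TfJobSpec struct {
	TensorBoardLogDir string
}

// TfJobStatus is what a user would see via `kubectl get tfjobs -o yaml`.
type TfJobStatus struct {
	Phase  TfJobPhase
	Reason string
}

// TfJob is a simplified, hypothetical stand-in for the real resource.
type TfJob struct {
	Name   string
	Spec   TfJobSpec
	Status TfJobStatus
}

// setup validates the spec and returns an error when LogDir is missing.
func setup(job *TfJob) error {
	if job.Spec.TensorBoardLogDir == "" {
		return errors.New("tbReplicaSpec.LogDir must be specified")
	}
	return nil
}

// reconcile runs setup and, on failure, marks the job as failed so the
// error is visible on the resource rather than only in the controller log.
func reconcile(job *TfJob) error {
	if err := setup(job); err != nil {
		job.Status.Phase = PhaseFailed
		job.Status.Reason = fmt.Sprintf("TfJob failed to setup: %v", err)
		return err
	}
	job.Status.Phase = PhaseRunning
	job.Status.Reason = ""
	return nil
}

func main() {
	job := &TfJob{Name: "example"} // LogDir deliberately left empty
	_ = reconcile(job)
	fmt.Printf("%s: phase=%s reason=%q\n", job.Name, job.Status.Phase, job.Status.Reason)
}
```

With something along these lines, a job that fails setup would show a failed phase and a reason when inspected, rather than looking as if the controller never saw it.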

tfjobs.yaml.txt

jlewi · 2017-12-12T19:16:02Z

I opened #218 for the specific issue I encountered. Chris, we can continue to use this issue to track your particular problem.

cwbeitel · 2017-12-13T17:11:12Z

Sounds good.

jlewi · 2018-04-26T04:49:59Z

/lifecycle stale

jlewi mentioned this issue Dec 12, 2017

TfJob should be marked as failed if setup fails #218

Closed

jlewi closed this as completed Apr 26, 2018

k8s-ci-robot added the lifecycle/stale label Apr 26, 2018
