Submitted tfjobs cease to start running under unknown conditions #203

Closed
cwbeitel opened this issue Dec 6, 2017 · 6 comments
Comments

@cwbeitel
Contributor

cwbeitel commented Dec 6, 2017

See logs. Sometimes a previously working deployment stops launching tfjobs, and after re-deploying the cluster the same tfjob submission works again. The next time it happens I can collect log or config data; let me know what you'd need.

@jlewi
Contributor

jlewi commented Dec 7, 2017

Please provide the logs of the TfJob operator pod.
Also please provide the result of
kubectl get tfjobs -o yaml

Also if I have access to this cluster please leave it up and running when it happens so I can inspect it.

@cwbeitel
Contributor Author

cwbeitel commented Dec 7, 2017

Sure will do.

@jlewi
Contributor

jlewi commented Dec 11, 2017

I think I'm encountering this myself.

  • Controller pod has been running for 8 days
  • I have submitted multiple TfJobs that started successfully.
  • Currently there is a single TfJob (see tfjobs.yaml.txt)
    • The TfJob doesn't have runtimeId which indicates it hasn't been updated by the controller.
  • The controller log (controller.log.txt) shows
    • Added event is received

      I1211 21:40:38.516560       1 controller.go:349] event: ADDED {
      ...
      E1211 21:40:38.530912       1 training.go:112] TfJob failed to setup: tbReplicaSpec.LogDir must be specified

      I1211 21:40:38.706580 1 controller.go:349] event: MODIFIED {
    
    
  • So I think there is at least one bug with training.go:112 not updating the TfJob status to indicate that setup failed.
  • When I specified LogDir the job was successfully created.

tfjobs.yaml.txt

@jlewi
Contributor

jlewi commented Dec 12, 2017

I opened #218 for the specific issue I encountered. Chris, we can continue to use this issue to track your particular problem.

@cwbeitel
Contributor Author

Sounds good.

@jlewi
Contributor

jlewi commented Apr 26, 2018

/lifecycle stale
