Add terminationPolicy to TfJobSpec #204
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
Hi @lluunn. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -240,6 +253,14 @@ func (c *TfJobSpec) SetDefaults() error {
            r.setDefaultPSPodTemplateSpec(c.TfImage)
        }
    }
    if c.TerminationPolicy == nil {
        c.TerminationPolicy = &TerminationPolicySpec{
Why are you setting the default policy here and not in
https://github.com/tensorflow/k8s/blob/master/pkg/spec/tf_job.go#L215?
In Validate, can we check that the termination policy is valid? Right now the only valid policy should be MASTER, index=0.
It is in SetDefaults(); I guess you thought it was in Validate because the code was collapsed.
Added validation.
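For readers following the thread, the defaulting step under discussion can be sketched roughly as below. This is a minimal, self-contained illustration, not the code from the PR: the type name `ChiefSpec` and the exact shape of the structs are assumptions, and `TfJobSpec` is trimmed to the single field that matters here. Only `TerminationPolicy`, `TerminationPolicySpec`, `Chief`, `ReplicaName`, `ReplicaIndex`, and the MASTER/0 default come from the conversation.

```go
package main

import "fmt"

// ChiefSpec is a hypothetical name for the type of the Chief field;
// the thread only shows the fields ReplicaName and ReplicaIndex.
type ChiefSpec struct {
	ReplicaName  string
	ReplicaIndex int
}

// TerminationPolicySpec names the replica whose exit ends the job.
type TerminationPolicySpec struct {
	Chief *ChiefSpec
}

// TfJobSpec is trimmed to the one field this sketch needs.
type TfJobSpec struct {
	TerminationPolicy *TerminationPolicySpec
}

// SetDefaults fills in the termination policy only when the user omitted it,
// so an explicitly provided policy is left untouched.
func (c *TfJobSpec) SetDefaults() error {
	if c.TerminationPolicy == nil {
		c.TerminationPolicy = &TerminationPolicySpec{
			Chief: &ChiefSpec{ReplicaName: "MASTER", ReplicaIndex: 0},
		}
	}
	return nil
}

func main() {
	spec := &TfJobSpec{}
	if err := spec.SetDefaults(); err != nil {
		fmt.Println("error:", err)
		return
	}
	chief := spec.TerminationPolicy.Chief
	fmt.Printf("chief replica: %s index: %d\n", chief.ReplicaName, chief.ReplicaIndex)
}
```

Keeping the default in SetDefaults and the check in Validate means Validate can assume the field is populated once defaults have been applied.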
Thanks.
Resolved.
Thanks
pkg/spec/tf_job.go
Outdated
        return errors.New("invalid termination policy, Chief cannot be nil")
    }
    if c.TerminationPolicy.Chief.ReplicaName != "MASTER" || c.TerminationPolicy.Chief.ReplicaIndex != 0 {
        return errors.New("invaliad termination policy, Chief should be MASTER:0")
Spelling mistake "invaliad" -> "invalid". Also make the error message more explicit e.g.
"Chief should have replicaName=MASTER and index=0".
Done.
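A validation check with the more explicit message requested above might look something like the following sketch. It is not the code that was merged: the `ChiefSpec` type name, the trimmed `TfJobSpec`, and the extra nil check on `TerminationPolicy` itself are assumptions added so the example compiles and runs on its own. The field names, the MASTER/0 rule, and the message wording follow the review comments.

```go
package main

import (
	"errors"
	"fmt"
)

// ChiefSpec is a hypothetical name for the type of the Chief field.
type ChiefSpec struct {
	ReplicaName  string
	ReplicaIndex int
}

type TerminationPolicySpec struct {
	Chief *ChiefSpec
}

// TfJobSpec is trimmed to the one field this sketch needs.
type TfJobSpec struct {
	TerminationPolicy *TerminationPolicySpec
}

// Validate rejects any termination policy other than MASTER with index 0,
// which the review describes as the only valid policy for now.
func (c *TfJobSpec) Validate() error {
	// Assumption: guard against a missing policy as well as a missing Chief.
	if c.TerminationPolicy == nil || c.TerminationPolicy.Chief == nil {
		return errors.New("invalid termination policy, Chief cannot be nil")
	}
	chief := c.TerminationPolicy.Chief
	if chief.ReplicaName != "MASTER" || chief.ReplicaIndex != 0 {
		return errors.New("invalid termination policy, Chief should have replicaName=MASTER and index=0")
	}
	return nil
}

func main() {
	bad := &TfJobSpec{
		TerminationPolicy: &TerminationPolicySpec{
			Chief: &ChiefSpec{ReplicaName: "WORKER", ReplicaIndex: 1},
		},
	}
	if err := bad.Validate(); err != nil {
		fmt.Println("validation failed:", err)
	}
}
```

Spelling out both field names in the message tells the user exactly which values to change, which a bare "MASTER:0" does not.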
Thanks,
I approved it, but please fix the minor issue with the error message.
/ok-to-test
@jlewi We have
@DjangoPeng TensorFlow uses chief, so we want to start changing the terminology to be more consistent with what TF uses. MASTER is confusing because every TF replica has a gRPC server called the master that the client talks to.
@jlewi Yep. That's what confused me before. In the TensorFlow level,
For this issue