-
Notifications
You must be signed in to change notification settings - Fork 731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add success policy #1506
feat: Add success policy #1506
Conversation
Signed-off-by: cegao <ce.gao@outlook.com>
/hold |
Signed-off-by: cegao <ce.gao@outlook.com>
It is blocked by kubeflow/common#181 |
Pull Request Test Coverage Report for Build 1562410664
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Pending approval from the PR in kubeflow/common.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/hold cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided not to add the successpolicy into the common since we may move to SuccessCondition in the future. Thus we just implement it in PyTorch.Spec and TF.Spec.
/cc @terrytangyuan
// Leave a succeeded condition for the following two cases: | ||
// 1. If default success policy is used and worker 0 has completed. | ||
// 2. If `SuccessPolicyAllWorkers` success policy is used and all workers are succeeded. | ||
if expected == 0 || (worker0Completed && *pytorchjob.Spec.SuccessPolicy != pytorchv1.SuccessPolicyAllWorkers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L387 might take a bit of time to understand. Maybe it's because the order of the two conditions are not arranged as the comment lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments and conditions are copied from TF. I will refine it soon.
if !ok { | ||
return true, nil | ||
} | ||
podSlices, err := p.getPodSlices(job, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the slice approach definitely works. But what is the point to create a the getPodSlices
method while all we need is just the worker pod with index=0? And what kind of disadvantages will it have to use Get
function or List
with label selector to catch worker0 pod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is to keep consistency with TF controller. The code is copied from it.
@@ -58,3 +58,15 @@ func GetSchedulerName(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec) s | |||
} | |||
return "" | |||
} | |||
|
|||
// GetContainerExitCode gets the container exit code from the given pod. | |||
func GetContainerExitCode(pod *corev1.Pod, name string) int32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
refers the containerName
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
@@ -26,6 +27,7 @@ import ( | |||
commonutil "github.com/kubeflow/common/pkg/util" | |||
"github.com/sirupsen/logrus" | |||
corev1 "k8s.io/api/core/v1" | |||
v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment.
Signed-off-by: cegao <ce.gao@outlook.com>
/assign @zw0610 |
Signed-off-by: cegao <ce.gao@outlook.com>
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Signed-off-by: cegao ce.gao@outlook.com
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: