Feature/support pytorchjob set queue of volcano #1415
Signed-off-by: bert.li <qiankun.li@qq.com>
Hi @qiankunli. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test.
@@ -155,13 +156,18 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
	// Set default priorities to pytorch job
	r.Scheme.Default(pytorchjob)

	// parse volcano Queue from pytorchjob Annotation
What about the other job types?
@@ -155,13 +156,18 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
	// Set default priorities to pytorch job
	r.Scheme.Default(pytorchjob)

	// parse volcano Queue from pytorchjob Annotation
	schedulingPolicy := &commonv1.SchedulingPolicy{
Since the PyTorchJob spec embeds RunPolicy, can we get the scheduling policy directly from pytorchjob.Spec.RunPolicy.SchedulingPolicy? @qiankunli
We should use pytorchjob.Spec.RunPolicy as the argument when reconciling the jobs.
Right now SchedulingPolicy is always nil in pytorch-operator, because the controller hardcodes it when building the RunPolicy. If SchedulingPolicy is set, it is fine to use pytorchjob.Spec.RunPolicy.SchedulingPolicy directly.
// github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go
runPolicy := &commonv1.RunPolicy{
	CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
	TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
	ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
	BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
	SchedulingPolicy:        nil,
}
// Use common to reconcile the job related pod and service
err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)
@Jeffwan I've updated the PR:
runPolicy := &commonv1.RunPolicy{
	CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
	TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
	ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
	BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
	SchedulingPolicy:        pytorchjob.Spec.RunPolicy.SchedulingPolicy,
}
- Can you help update this for the MXNet job as well?
- Actually, since pytorchjob.Spec.RunPolicy is already a commonv1.RunPolicy, we can pass pytorchjob.Spec.RunPolicy instead of constructing a new one. See the xgboost example.
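The difference between copying fields and passing the job's own RunPolicy can be sketched as follows. This is a minimal illustration, not the operator's code: the `RunPolicy`, `SchedulingPolicy`, and `PyTorchJobSpec` types are simplified local stand-ins for the `commonv1` types discussed in this thread (assumed to carry only the fields shown), and `copyWithoutScheduling`, `passThrough`, and `queueOf` are hypothetical helper names.

```go
package main

import "fmt"

// SchedulingPolicy is a simplified stand-in for commonv1.SchedulingPolicy
// (assumption: only the Queue field matters for this sketch).
type SchedulingPolicy struct {
	Queue string
}

// RunPolicy is a simplified stand-in for commonv1.RunPolicy.
type RunPolicy struct {
	BackoffLimit     *int32
	SchedulingPolicy *SchedulingPolicy
}

// PyTorchJobSpec is a simplified stand-in for the job spec that embeds RunPolicy.
type PyTorchJobSpec struct {
	RunPolicy RunPolicy
}

// copyWithoutScheduling mirrors the field-by-field copy quoted in the thread,
// where SchedulingPolicy was hardcoded to nil.
func copyWithoutScheduling(spec *PyTorchJobSpec) *RunPolicy {
	return &RunPolicy{
		BackoffLimit:     spec.RunPolicy.BackoffLimit,
		SchedulingPolicy: nil,
	}
}

// passThrough hands back the job's own RunPolicy, so every field the user
// set, including SchedulingPolicy, is preserved.
func passThrough(spec *PyTorchJobSpec) *RunPolicy {
	return &spec.RunPolicy
}

// queueOf returns the configured queue name, or "" when none is set.
func queueOf(rp *RunPolicy) string {
	if rp == nil || rp.SchedulingPolicy == nil {
		return ""
	}
	return rp.SchedulingPolicy.Queue
}

func main() {
	spec := &PyTorchJobSpec{
		RunPolicy: RunPolicy{SchedulingPolicy: &SchedulingPolicy{Queue: "training"}},
	}
	fmt.Printf("copied: %q\n", queueOf(copyWithoutScheduling(spec))) // copied: ""
	fmt.Printf("passed: %q\n", queueOf(passThrough(spec)))           // passed: "training"
}
```

Passing the embedded policy also means any field added to the policy type later reaches the reconciler without another controller change.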
> Can you help update this for MXNet job as well?

Should we do that in a separate PR?
Signed-off-by: bert.li <qiankun.li@qq.com>
LGTM
Should we add some unit test cases?
@Jeffwan I've updated the PR.
/ok-to-test
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: Jeffwan
/retest
/test kubeflow-tf-operator-presubmit
* support pytorch use volcano-queue
* support pytorch use volcano-queue
  Signed-off-by: bert.li <qiankun.li@qq.com>
* set SchedulingPolicy for runPolicy
  Signed-off-by: bert.li <qiankun.li@qq.com>
* use pytorchjob.Spec.RunPolicy directly
* Feature/support pytorchjob set queue of volcano (#1415)
  * support pytorch use volcano-queue
  * support pytorch use volcano-queue
    Signed-off-by: bert.li <qiankun.li@qq.com>
  * set SchedulingPolicy for runPolicy
    Signed-off-by: bert.li <qiankun.li@qq.com>
  * use pytorchjob.Spec.RunPolicy directly
* fix hyperlinks in the 'overview' section (#1418)
  hyperlinks now point to the latest api reference files.
  issue - #1411
I want to set the queue of the PodGroup created by a PyTorchJob, but there is no SchedulingPolicy in the PyTorchJob struct, so I tried setting the queue name in the scheduling.volcano.sh/queue-name annotation of the PyTorchJob. This is related to issue #1414.
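For reference, the annotation-based approach described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: `SchedulingPolicy` is a local stand-in for `commonv1.SchedulingPolicy` with only a `Queue` field, and `schedulingPolicyFromAnnotations` is a hypothetical helper name. Only the annotation key comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// queueAnnotation is the annotation key named in the PR description.
const queueAnnotation = "scheduling.volcano.sh/queue-name"

// SchedulingPolicy is a simplified stand-in for commonv1.SchedulingPolicy
// (assumption: only the Queue field is needed here).
type SchedulingPolicy struct {
	Queue string
}

// schedulingPolicyFromAnnotations (hypothetical helper) builds a scheduling
// policy from the job's annotations. It returns nil when the annotation is
// missing or blank, so callers can fall back to the default queue.
func schedulingPolicyFromAnnotations(annotations map[string]string) *SchedulingPolicy {
	// Indexing a nil map is safe in Go and yields the zero value "".
	queue := strings.TrimSpace(annotations[queueAnnotation])
	if queue == "" {
		return nil
	}
	return &SchedulingPolicy{Queue: queue}
}

func main() {
	annotations := map[string]string{queueAnnotation: "training"}
	if sp := schedulingPolicyFromAnnotations(annotations); sp != nil {
		fmt.Println("queue:", sp.Queue)
	}
	if sp := schedulingPolicyFromAnnotations(nil); sp == nil {
		fmt.Println("no queue annotation set")
	}
}
```

As the review thread shows, the merged change moved away from annotations and passes pytorchjob.Spec.RunPolicy to the reconciler directly.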