Add a field SchedulerName to TFJob for specifying a scheduler #408
Do we need to define a default scheduler for TFJob?
@gaocegege We probably don't need to define one. If the field is empty, the apiserver will supply the name of the default scheduler (default-scheduler) to the pod.
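For illustration, the defaulting described above can be modelled with a minimal Go sketch. The helper name effectiveSchedulerName and the sample value "kube-batchd" are assumptions made for this example; the real defaulting happens inside Kubernetes, not in tf-operator.

```go
package main

import "fmt"

// defaultSchedulerName is the default scheduler name mentioned in the comment above.
const defaultSchedulerName = "default-scheduler"

// effectiveSchedulerName is a hypothetical helper that models the defaulting:
// an empty name resolves to the default scheduler, anything else is kept.
func effectiveSchedulerName(name string) string {
	if name == "" {
		return defaultSchedulerName
	}
	return name
}

func main() {
	// Empty field: the default scheduler name is used.
	fmt.Println(effectiveSchedulerName(""))
	// Explicit field ("kube-batchd" is only an illustrative value).
	fmt.Println(effectiveSchedulerName("kube-batchd"))
}
```

Under this model, leaving SchedulerName unset on a TFJob needs no special handling in the operator.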
/ok-to-test
Why is the scheduler name a property of the replica rather than a property of the job? I was expecting this PR to enable testing of kube-arbitrator, and I was expecting that to be a job-level setting as opposed to a replica-level setting.
I think all of a TFJob's replicas will be scheduled by one scheduler, which may be kube-arbitrator. So a job-level setting SGTM.
This PR provides a job-level setting. The job-level setting is propagated to the replica-level setting during the creation process. If I'm misunderstanding something, sorry about that, and please point it out.
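A rough sketch of that propagation, using simplified stand-in types rather than the real TFJob and v1.PodSpec definitions. The helper propagateSchedulerName and the type layout are assumptions for illustration only, not the code in this PR.

```go
package main

import "fmt"

// PodSpec is a simplified stand-in for the Kubernetes v1.PodSpec type;
// only the field relevant to this discussion is modelled.
type PodSpec struct {
	SchedulerName string
}

// TFJobSpec is a simplified stand-in for the TFJob spec. SchedulerName is
// the job-level setting; Replicas holds one pod spec per replica.
type TFJobSpec struct {
	SchedulerName string
	Replicas      []PodSpec
}

// propagateSchedulerName (hypothetical) copies the job-level scheduler name
// onto every replica's pod spec and returns the resulting pod specs.
// The input job is not modified.
func propagateSchedulerName(job TFJobSpec) []PodSpec {
	pods := make([]PodSpec, len(job.Replicas))
	for i, replica := range job.Replicas {
		replica.SchedulerName = job.SchedulerName
		pods[i] = replica
	}
	return pods
}

func main() {
	// "kube-batchd" is only an illustrative scheduler name.
	job := TFJobSpec{SchedulerName: "kube-batchd", Replicas: make([]PodSpec, 2)}
	for i, pod := range propagateSchedulerName(job) {
		fmt.Printf("replica %d: scheduler=%q\n", i, pod.SchedulerName)
	}
}
```

If the job-level field is empty, the pods keep an empty SchedulerName in this sketch, which leaves the choice to the default scheduler.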
Thanks. Looks good; I must have misread it before. Should we add a unit test to verify that the pod scheduler is set properly?
Yes, a unit test would be valuable. I'll add it tomorrow.
pkg/trainer/tensorboard.go
@@ -178,6 +178,8 @@ func (s *TBReplicaSet) getDeploymentSpecTemplate(image string) v1.PodTemplateSpe

 	ps.Volumes = append(ps.Volumes, s.Spec.Volumes...)
+
+	ps.SchedulerName = s.Job.SchedulerName()
There is no TensorBoard now so we could remove the changes here.
OK, I'll remove it in the next update. Thanks.
6b621e2 to 0b807f3
@jlewi @gaocegege updated for
Could you take a look?
/approve
/lgtm
You should run
Thanks, I agree. I will file an issue for it.
The CI is probably failing because of invalidly formatted Python files. I created another PR for that issue: #429
@ScorpioCPH @mitake This change likely conflicts with #344. Which PR should be submitted first?
/hold
@mitake Thanks for your PR. LGTM on adding this field, but I don't think we need a special getter for SchedulerName.
@@ -405,3 +405,7 @@ func (j *TrainingJob) name() string {
 func (j *TrainingJob) fullname() string {
 	return j.job.ObjectMeta.GetNamespace() + ":" + j.job.ObjectMeta.GetName()
 }
+
+func (j *TrainingJob) SchedulerName() string {
+	return j.job.Spec.SchedulerName
Why do we need a special getter for SchedulerName?
We can just read it directly from j.job.Spec.SchedulerName.
Because TrainingJob.job is a private member, TrainingJob should provide an accessor method.
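As a sketch of the accessor pattern under discussion, with simplified stand-in types (the constructor NewTrainingJob and the trimmed-down TFJob types are assumptions for this example). Note that in Go an unexported field such as job is hidden only from other packages; code in the same package can still read it directly, so the accessor mainly matters for callers outside the package.

```go
package main

import "fmt"

// TFJobSpec and TFJob are simplified stand-ins for the API types.
type TFJobSpec struct {
	SchedulerName string
}

type TFJob struct {
	Spec TFJobSpec
}

// TrainingJob wraps a TFJob in an unexported field, so code in other
// packages cannot reach j.job.Spec directly.
type TrainingJob struct {
	job *TFJob
}

// NewTrainingJob is a hypothetical constructor used only in this sketch.
func NewTrainingJob(job *TFJob) *TrainingJob {
	return &TrainingJob{job: job}
}

// SchedulerName exposes the job-level scheduler name through an exported
// accessor, matching the method shown in the diff above.
func (j *TrainingJob) SchedulerName() string {
	return j.job.Spec.SchedulerName
}

func main() {
	// "kube-batchd" is only an illustrative scheduler name.
	tj := NewTrainingJob(&TFJob{Spec: TFJobSpec{SchedulerName: "kube-batchd"}})
	fmt.Println(tj.SchedulerName())
}
```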
Yes, the current structure is a little complex:
- We have TFJob for the API object Spec/Status
- Then we create TrainingJob and TFReplicaSet, which keep many caches of TFJob
Also, PodSpec already has a SchedulerName field. Can we reuse it?
PodSpec.SchedulerName isn't suitable for this purpose. See the discussion starting with this message: #398 (comment)
@mitake Thanks for your contribution! Could you rebase on master? I think we could then merge it ASAP.
This commit adds a new field, SchedulerName, to the definition of TFJob. The field specifies the scheduler name for the pods created by tf-operator, so that a scheduler other than the default one can handle them. This is convenient for letting kube-batchd (a component of kube-arbitrator) handle the pods.
@gaocegege Rebased on the latest master, PTAL
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gaocegege, jlewi
@ScorpioCPH Do you want to do another review?
I will merge it soon since there are no more reviews.
/cc @jlewi this is a newer version of #398