Ability to prefer using all gpus on a single node #781
Comments
Yeah, I agree with you. While it is not in tf-operator's scope, we should support the feature via the scheduler kube-arbitrator: https://github.com/kubernetes-incubator/kube-arbitrator/
Thanks for the reply. How can I try out kube-arbitrator along with kubeflow?
You need to enable gang scheduling in tf-operator and let kube-arbitrator schedule the training jobs.
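For illustration, a minimal sketch of what "enable gang scheduling" could mean in practice. The flag name is taken from the EnableGangScheduling option mentioned further down this thread; the image and namespace are placeholders, so check them against the tf-operator version you actually run:

```yaml
# Hypothetical tf-operator Deployment excerpt; only the
# --enable-gang-scheduling flag is the point of this sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-job-operator
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tf-job-operator
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      serviceAccountName: tf-job-operator
      containers:
        - name: tf-job-operator
          image: <your-tf-operator-image>   # placeholder
          args:
            - --enable-gang-scheduling=true
```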
How do I deploy kube-arbitrator? I don't see it deployed with the default Kubeflow installation.
So first of all, you can find or build your own kube-arbitrator image here; after that, you can use the following yaml file (I forgot where to find the sample, so here is my own yaml file).
NOTE: kube-arbitrator needs to collect cluster information (such as Pod, Node, CRD, etc.) for scheduling, so the service account used by the deployment must have permission to access those cluster resources; otherwise, kube-arbitrator will fail to start up. (from the README) On the tf-operator side, there is an option EnableGangScheduling you have to set to true. It should work like the following video.
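Since the original yaml file is not reproduced in this thread, here is a rough sketch of such a deployment. All names and the image reference are placeholders, and the cluster-admin binding is only there to illustrate the permission requirement from the NOTE above; a real setup should grant only the resources kube-arbitrator actually reads:

```yaml
# Hypothetical kube-arbitrator deployment sketch (not the repo sample).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-arbitrator
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-arbitrator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # overly broad; narrow this in practice
subjects:
  - kind: ServiceAccount
    name: kube-arbitrator
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-arbitrator
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-arbitrator
  template:
    metadata:
      labels:
        app: kube-arbitrator
    spec:
      serviceAccountName: kube-arbitrator
      containers:
        - name: kube-arbitrator
          image: <your-kube-arbitrator-image>   # placeholder
```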
@ashahab
@gaocegege IMO, I don't think scheduling all the workers of a TFJob together is in the scope of kube-arbitrator either, since this requirement only happens in jobs like distributed TensorFlow training. Another thing I found is that there is no option for the user to assign schedulerName in the v1alpha2 TFJob spec like we did in v1alpha1, so it seems that we have to add this setting to all the PodSpecs.
In v1alpha1:
// types.go
// SchedulerName specifies the name of scheduler which should handle the TFJob
SchedulerName string `json:"schedulerName,omitempty"`
// replica.go
pod.Spec.SchedulerName = s.Job.SchedulerName()
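Until a job-level field exists again, one possible workaround is to set schedulerName directly in each replica's pod template, since v1alpha2 replica specs embed a full pod template. This is only a sketch: the tfReplicaSpecs layout is assumed to match the installed v1alpha2 CRD, and the scheduler name and image are placeholders:

```yaml
# Sketch of per-replica schedulerName in a v1alpha2 TFJob.
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: dist-mnist
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          schedulerName: <your-kube-arbitrator-scheduler-name>
          containers:
            - name: tensorflow
              image: <your-training-image>
    Worker:
      replicas: 2
      template:
        spec:
          schedulerName: <your-kube-arbitrator-scheduler-name>
          containers:
            - name: tensorflow
              image: <your-training-image>
```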
Got it. Thanks for the info.
Can I use the default scheduler with pod affinity to achieve what we need?
Yes, you can.
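As a rough sketch of that approach with the default scheduler, each replica's pod template can carry a preferred pod-affinity term keyed on the job's pods. The label key tf_job_name is an assumption about how tf-operator labels pods (check with kubectl get pods --show-labels), and preferred affinity only biases placement toward one node, it does not guarantee it:

```yaml
# Fragment of a replica's pod template: prefer co-locating all pods of
# this job on the same node via pod affinity on the hostname topology.
template:
  metadata:
    labels:
      tf_job_name: dist-mnist   # assumed label key
  spec:
    affinity:
      podAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  tf_job_name: dist-mnist
    containers:
      - name: tensorflow
        image: <your-training-image>
        resources:
          limits:
            nvidia.com/gpu: 1
```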
Agree with @ChanYiLin, I am closing the issue. If you have any questions, feel free to add new comments here.
We are interested in having the ability in tf-operator to prefer a single node and use its GPUs if possible. That can dramatically increase training performance if the workers and PS don't have to talk over the network.