Gpu support #3
Conversation
Note: Mesos should be configured as below:
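A minimal sketch of such a configuration, assuming GPUs are advertised to the slave as a custom scalar resource named `gpus` (the resource name, master address, and counts here are assumptions, not the original snippet):

```sh
# Advertise 4 GPUs as a custom scalar resource on this slave; stock Mesos
# treats "gpus" as an opaque counter, so the framework must do the bookkeeping.
mesos-slave --master=zk://master:2181/mesos \
  --resources='cpus:8;mem:16384;gpus:4'
```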
👍 @bhack Would you please try this? Since we don't have a GPU cluster yet.
Yes, sure. /cc @mtamburrano @lenlen
🍺
A major issue I found is that TF always uses the first few GPUs, so when you have several TF computations running at the same time, they will all run on the same GPUs while the other GPUs sit idle. I cannot find a way to specify the GPU UUIDs for a TF program to run on for now.
I'm curious which Mesos release you are working with for GPU resources. It seems Mesos hasn't finished https://issues.apache.org/jira/browse/MESOS-4424
@wsxiaozhang we can add GPU cores as custom resources for Mesos before MESOS-4424 ships.
@windreamer With manual placement, GPUs can be addressed by id, i.e. in the form /gpu:0. See https://www.tensorflow.org/versions/r0.8/how_tos/using_gpu/index.html
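For illustration, a minimal sketch of manual placement against the r0.8-era API, wrapped in a shell heredoc so it stays self-contained (the graph itself is only an example):

```sh
python - <<'EOF'
import tensorflow as tf

# Pin these ops explicitly to the first visible GPU.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2])
    b = tf.constant([5.0, 6.0, 7.0, 8.0], shape=[2, 2])
    c = tf.matmul(a, b)

# log_device_placement prints which device each op actually landed on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
EOF
```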
@bhack yes, but imagine you have 2 TF programs running and your machine has 4 GPUs. Ideally we would allocate 2 GPUs to the first program as "gpu:0" and "gpu:1", and the other 2 GPUs to the second program, also as "gpu:0" and "gpu:1". But in fact the two programs will both use GPU0 and GPU1, while GPU2 and GPU3 stay idle.
But don't the ids of the free GPU resources offered by Mesos change? Or are they ignored by TF's sequential internal device enumeration?
They are ignored by TF; TF only cares about the count. You can see this in the TF code I mentioned: no GPU id is honored there, unfortunately. Maybe a Docker isolator could do the magic of reassigning ids to the allocated GPUs? But I'm not optimistic.
Do you mean, for example, that what is presented as GPU 0 in the container could magically be GPU 3 on the host, etc.?
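As an aside, this kind of renumbering is what the CUDA runtime's CUDA_VISIBLE_DEVICES environment variable provides: a process sees only the listed devices, renumbered from 0. A sketch assuming a 4-GPU host and a hypothetical train.py:

```sh
# Each process sees only its assigned pair, renumbered as GPU 0 and 1,
# so both programs can safely ask for /gpu:0 and /gpu:1.
CUDA_VISIBLE_DEVICES=0,1 python train.py &
CUDA_VISIBLE_DEVICES=2,3 python train.py &
wait
```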
Probably @3XX0 could give us some hint on this for nvidia-docker.
You can use http://c99.millennium.berkeley.edu/documentation/latest/gpu-support/ or, if you need more information about which GPUs to pick, use the custom resource advertised by …
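With nvidia-docker 1.x, specific GPUs can also be selected via the NV_GPU variable, which accepts indices or UUIDs (the image name below is a placeholder):

```sh
# Expose only host GPUs 2 and 3 (or equivalently their UUIDs) to the container.
NV_GPU=2,3 nvidia-docker run --rm some/tensorflow-image nvidia-smi
```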
In the link I read that "Filesystem isolation must be disabled. This means No Docker Support at the moment. Only the mesos containerizer can currently be used."
No Docker support through the Mesos containerizer.
@3XX0 Does …
According to this answer: http://stackoverflow.com/a/26568684
@bhack some ideas updated here. Note: Mesos GPU resources should be defined as below: …
As @3XX0 confirmed, it seems to me that with the current limitation GPU cannot work in the case where we set the DOCKER_IMAGE env variable. Is this right?
@bhack it seems we can use the Docker containerizer with …
"If you use the Docker containerizer with nvidia-docker or the Mesos containerizer without Docker and without filesystem isolation you're fine." This reads to me as saying that on the slave we cannot run tasks inside docker or nvidia-docker containers. Do you have another interpretation?
My interpretation is
…
or
…
so I think we can go with the first option.
If your interpretation is correct, how can we let the Mesos slave use nvidia-docker? I think the slave simply invokes the docker command when it runs with --containerizers=docker,mesos.
@bhack but I have no experience building up a Mesos cluster with CUDA support, so I am not sure about it.
@windreamer I just added support for UUID in … And the exact Mesos slave command is:

```sh
mesos-slave --containerizers=docker,mesos \
  --docker=/usr/bin/nvidia-docker \
  --executor_environment_variables='{"NV_DOCKER": "/usr/bin/docker"}' \
  $(curl -s localhost:3476/mesos/cli)
```
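For context: 3476 is the default port of the nvidia-docker-plugin REST service, and its /mesos/cli endpoint is presumably what prints the extra slave flags (such as the GPU resource and attribute definitions) so they don't have to be written by hand.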
@lenlen thanks for the feedback. I've fixed the errors you mentioned and committed to this PR; please review my fix and try again to check that it is OK for you.
Furthermore, can you provide more information about how the task fails when more GPUs are allocated?
@3XX0 bravo! This is really convenient, thank you for this work. But at present I may temporarily keep the code this way; I will simplify the whole process when …
Here I have a concern about the solution given above. The above configuration will make all containers deployed by Mesos use the …
Yes and? It is a general solution until we upstream all the work in Mesos itself (which might take a while).
@bhack @3XX0 finally we got some GPU instances on AWS and are able to test this PR, and I made some changes: …
So this is what we currently get, and we are planning to deliver a simple script to help set up a CPU/GPU cluster of AWS G2 Ubuntu nodes later. Please feel free to comment; any suggestion is welcome.
I think that you can share this news in the related TF ticket.
@bhack sorry, is anything related to this?
Sorry, I accidentally removed the last digit from the URL... It is https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-5401
@bhack OK, this is not a problem; MESOS-5401 is to resolve the isolation problem in …
Probably it is part of the unified containerizer, which does not require the Docker daemon anymore in 1.0. See http://mesos.apache.org/blog/mesos-1-0-0-released/
Hmm, …
Are you sure? I haven't tried the new Mesos 1.0 yet, but I read:
…
And:
…
From my [incomplete] code dig: YES, we still need Docker installed.
Frankly speaking, I think everything is a MESS to me now: at present, I believe we should do containerization with …
@windreamer Mesos added native Docker image support and nvidia-docker coverage, so I think that with Mesos 1.0 we don't require the Docker daemon anymore.
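A sketch of what an agent launch could look like under that model, using the Mesos 1.0 unified-containerizer flags (the master address and exact isolator list are assumptions based on the 1.0 docs):

```sh
# Unified containerizer: Docker images are fetched and run by Mesos itself,
# so no Docker daemon is needed; gpu/nvidia isolation hands out GPU devices.
mesos-agent --master=zk://master:2181/mesos \
  --containerizers=mesos \
  --image_providers=docker \
  --isolation='filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia'
```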
OK, I think I found the so-called unified containerizer. This whole thing is too complicated; I do not think I want to give this a try.
There are some appreciable motivations for it. Read the introduction at http://mesos.apache.org/documentation/latest/container-image/
@mckelvin pls review
@bhack this is what I can guess for a GPU cluster: …
cf: #1