Multi-GPU operation and data / model Parallelism #876
Comments
Note that training with multiple GPUs + data parallelism is also trivially possible with MPI. Model parallelism is less trivial, though. |
Does data parallelism imply shared model parameters? If it doesn't, and the data is split beforehand, data parallelism can be implemented with a shell script. |
Unfortunately, the weights of the replicas of the same layer have to be synchronized, as described in Section 4.1 of [1]. [1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 [cs.NE] |
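To illustrate why independent runs on pre-split data are not enough, here is a minimal sketch of one synchronous data-parallel step. It assumes plain gradient averaging over equally sized shards on a toy 1-D model; it is not the scheme from [1] and not Caffe code, and the names `grad_mse` and `data_parallel_step` are made up for this example.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, shards, lr):
    """One synchronous SGD step over equally sized data shards.

    Each (xs, ys) pair in `shards` stands in for the mini-batch slice
    that one replica (one GPU) would process.
    """
    # Every replica starts from the same weight and sees only its shard.
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    # Synchronization point: average the per-replica gradients.
    avg_grad = sum(local_grads) / len(local_grads)
    # All replicas apply the same update, so their weights stay identical.
    return w - lr * avg_grad
```

With equally sized shards the averaged gradient equals the full-batch gradient, so the replicas behave like a single model trained on the whole batch. Without the averaging step each replica would drift to its own weights.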
There are tens of thousands of lines in the diff between cuda-convnet2/cudaconvnet and cuda-convnet. It may not be plausible to reproduce all of Alex's work in a short time. Is it acceptable to just wrap it as @soumith did in cuda-convnet2.torch? The members of the Pylearn2 community have successfully wrapped cuda-convnet and are planning to upgrade to cuda-convnet2. |
cc me |
Hope this becomes real in Caffe soon. |
I just had a thought and am posting it here for discussion. |
The training can be done by cuda-convnet2. It's only necessary to convert the model into Caffe's format. |
The "trivially possible data parallelism" can be implemented with CUDA Multi Process Service (MPS). |
@Yangqing, could you give us some hints on how to integrate Caffe and CUDA MPS with MPI? Do the solvers of the processes have to communicate with each other? Or do they share the same CUDA context, which automatically combines the memory of multiple GPUs into a single virtual address space? |
I don't know if on multiple hosts we could explore something with Spark and the Caffe Python bindings. There is also already a deep network experiment on Spark. |
Mixing multiple languages together is not a good idea. |
@kloudkl I'm talking about using the Python bindings already available in Caffe from PySpark. |
The remarks of Evan Sparks @etrain posted by @sergeyk prove that it is plausible to "use cuda-convnet2 to train the models offline, but use Caffe to parse/analyze them and apply the models to new input images". |
With the recent addition of cuDNN to the dev branch of Caffe, are multiple GPUs now supported? The recent article on cuDNN indicates that cuDNN supports parallelism across GPUs, but doesn't mention whether this support is present in the Caffe wrapper. |
Caffe and cuDNN alike are single-GPU libraries at the moment but they can ... Multi-GPU parallelism is still in development in Caffe.
|
In other words, multiple GPUs can be used for tasks like hyperparameter selection, but not for more efficient training of a single model? |
There is something interesting in TBB flow graph parallelization. This is a feature detector example. |
While #2114 and a series of related PRs #1535 (comment) have solved data parallelism, there isn't yet a pull request dedicated to model parallelism. Here is Facebook's implementation just for reference. |
Looks like Nvidia has figured out how to train a model in Caffe with multiple GPUs. This page https://developer.nvidia.com/digits shows a performance comparison of training with different numbers of GPUs. The video on the page shows how to use multiple GPUs with DIGITS. Is there any plan to have this feature in Caffe? |
@PhoenixDai's comment is related to the DIGITS GitHub issue: NVIDIA/DIGITS#92. It seems they have forked their own Caffe version which supports multiple GPUs? |
The Nvidia branch uses #2114 |
Linking to a related SO question. |
The master branch supports multi-GPU training. Please refer to the latest documentation. |
Thank you for the great work! I am now training GoogLeNet on my K80. As you know, the K80 has 2 GPUs, and I enabled both with "-gpu 0,1"; training is faster! I know cuda-convnet2 uses the method introduced in "Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks." Is that the method Caffe uses for multi-GPU parallelism? |
Hi @shelhamer. Does Caffe support the 'model / hybrid parallelism' you mentioned above? |
Hi, is there any estimated date for a version update of Caffe that will allow using multiple GPUs for test / inference? Thanks a lot |
We don't have a plan to add that right now but I would be happy to help.
You can split your dataset and test each part independently, or distribute
items to multiple nets as you go. Are you using the Python API?
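As a rough illustration of the first option (splitting the dataset up front), the sketch below cuts a list of inputs into one contiguous shard per GPU. The helper name `split_into_shards` is made up for this example and is not part of Caffe.

```python
def split_into_shards(items, num_shards):
    """Split items into num_shards contiguous parts whose sizes differ by at most one."""
    items = list(items)
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        size = base + (1 if i < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards
```

Each shard can then be handed to its own process, with each process bound to a single GPU, and the per-shard results concatenated in shard order.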
|
Hi Cypof, thanks for your fast response. I am not sure I completely understand your suggestions. Let me first provide you with more details about the task. We have a web server that receives requests (image files) from users. This server has a queue of requests at any given time. Our workstation has 4 Titan X GPUs on one motherboard, so I need to use the 4 GPUs to speed up (four times) the processing of my queue. The requests will be handled as follows: request 1 GPU:0; I am using Caffe with the Python API. The problem comes with the selection of the GPU. If I select GPU=0 for the first request of the queue with caffe.set_mode_gpu(), then I cannot select GPU=1 for the next request. Even if I load the Caffe model and a model in Torch, Torch cannot use GPU=1 after Caffe has set the device to '0', because Caffe "locks" all other GPUs. So regarding your suggestions 1- "You can split your dataset and test each part independently". 2- "Distribute items to multiple nets as you go" Thank you very much for your help |
@Lisandro79 did you figure out a solution to this issue? I have a similar problem |
Hi @pythonanonuser, unfortunately I have to say that my solution will be to use TensorFlow. I could not find a solution to my problem in Caffe. In my opinion, Caffe's lack of parallelism for testing is a major disadvantage for deploying web applications. I would like to know what other members of the Caffe community have to say about this. I really like Caffe and I would prefer to use it over other libraries, but at the moment I find that the parallel capabilities of TensorFlow plus the use of TensorBoard do make a difference for production. Best |
I meant to prototype something but haven't got to it. I think the easiest way would be to use multiprocessing: either a Queue and one Process per GPU, or maybe a Pool so that you can call map() on your inputs and directly get your outputs. It also depends on where you want to store the results etc. |
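A minimal, untested-on-GPU sketch of the Queue plus one-Process-per-GPU idea follows. The names `gpu_worker`, `run_on_gpus`, and `make_predictor` are assumptions made up for this example. `make_predictor(gpu_id)` is expected to load the model once inside the worker process and return a callable; for pycaffe that factory would presumably call `caffe.set_device(gpu_id)` and `caffe.set_mode_gpu()` before constructing the `caffe.Net`.

```python
import multiprocessing as mp


def gpu_worker(gpu_id, make_predictor, in_q, out_q):
    """Serve jobs from in_q on one GPU until a None sentinel arrives."""
    # Build the model once per process, bound to this worker's GPU.
    predict = make_predictor(gpu_id)
    while True:
        job = in_q.get()
        if job is None:  # sentinel: no more work for this worker
            break
        index, item = job
        # Tag each result with its index so the caller can restore order.
        out_q.put((index, predict(item)))


def run_on_gpus(items, gpu_ids, make_predictor):
    """Process items with one worker process per GPU; results keep input order."""
    items = list(items)
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [
        mp.Process(target=gpu_worker, args=(gpu_id, make_predictor, in_q, out_q))
        for gpu_id in gpu_ids
    ]
    for proc in procs:
        proc.start()
    for job in enumerate(items):
        in_q.put(job)
    for _ in procs:
        in_q.put(None)  # one sentinel per worker
    # Drain the results before joining so no worker blocks on a full pipe.
    results = dict(out_q.get() for _ in items)
    for proc in procs:
        proc.join()
    return [results[i] for i in range(len(items))]
```

Because each worker pulls the next job as soon as it is free, a slow request does not hold up the other GPUs. `make_predictor` should be a module-level function so it can be sent to the worker processes, and `run_on_gpus` should be called under `if __name__ == "__main__":` on platforms that spawn rather than fork.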
Closing as NCCL + pycaffe #4563 is an effective approach to data parallel training of any kind of Caffe net. More involved forms of parallelism can be left to further efforts. |
Multi-GPU operation and data / model / hybrid parallelism are planned and in development for Caffe. The purpose of the thread is to focus the conversation, since this has been asked here, there, and everywhere. There are several ways to approach parallelization, so feel free to discuss your own work to this end here.
Note that Caffe does work with multiple GPUs in a standalone fashion right now: you can train on one GPU while extracting features on another and so on.
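As a small sketch of this standalone use, the helper below builds one independent `caffe train` command per GPU, using the `-solver` and `-gpu` flags of the `caffe` command-line tool. The helper name `standalone_commands` and the solver file names in the usage note are made up for this example.

```python
def standalone_commands(solvers, gpu_ids, caffe_bin="caffe"):
    """Pair each solver file with its own GPU and build one command per pair."""
    if len(solvers) > len(gpu_ids):
        raise ValueError("need at least one GPU per solver")
    return [
        [caffe_bin, "train", "-solver", solver, "-gpu", str(gpu)]
        for solver, gpu in zip(solvers, gpu_ids)
    ]
```

For example, `standalone_commands(["a_solver.prototxt", "b_solver.prototxt"], [0, 1])` yields two argument lists that can each be passed to `subprocess.Popen` so the runs proceed concurrently. The runs share nothing, so this speeds up a sweep over models or hyperparameters, not the training of a single model.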