
Parallel / distributed training #1140

Closed
wants to merge 1 commit into from

Conversation

@cypof (Member) commented Sep 23, 2014

This adds a set of classes to synchronize SGD across multiple solvers. It is based on the Hogwild paper and on our work at Flickr extending that model to GPUs and distributed configurations by streaming gradients between solvers.
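
For readers unfamiliar with the Hogwild approach, here is a minimal, hypothetical sketch (not this PR's actual classes): several threads run SGD on a shared weight buffer without locking, accepting occasional lost updates. `compute_gradient` is a placeholder for a real forward/backward pass.

```cpp
// Minimal Hogwild-style sketch: lock-free SGD by several threads on one
// shared parameter buffer. Illustrative only; not the PR's implementation.
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder gradient; a real solver would backprop through the net.
void compute_gradient(const std::vector<float>& w, std::vector<float>* grad) {
  for (size_t j = 0; j < w.size(); ++j) (*grad)[j] = 0.001f * w[j] + 0.001f;
}

void solver_thread(std::vector<float>* weights, int iters, float lr) {
  std::vector<float> grad(weights->size());
  for (int i = 0; i < iters; ++i) {
    compute_gradient(*weights, &grad);
    // Unsynchronized update: other threads may write the same entries.
    for (size_t j = 0; j < weights->size(); ++j)
      (*weights)[j] -= lr * grad[j];
  }
}

int main() {
  std::vector<float> weights(1 << 16, 0.f);  // shared parameters
  std::vector<std::thread> solvers;
  for (int t = 0; t < 4; ++t)
    solvers.emplace_back(solver_thread, &weights, 100, 0.01f);
  for (auto& s : solvers) s.join();
  return 0;
}
```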

Features

  • Models can be trained in parallel without modification. Caffe’s training code is also mostly untouched.
  • Modular design. The code is broken down into simple components that each synchronize one segment, e.g. CPU/GPU or CPU/LAN. They can be combined to form an architecture, either in-process or across processes by memory-mapping the weights to /dev/shm (see the sketch after this list).
  • Works on commodity hardware. Apparently even on 1G Ethernet, at least for MNIST. Synchronization and SGD run asynchronously to keep both compute and networking resources fully utilized. Bandwidth and latency across machines are optimized using raw sockets and user-space networking.
  • No additional memory used on the GPUs.
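
To illustrate the /dev/shm mechanism mentioned above, here is a minimal sketch of mapping a shared weight buffer into every solver process. The path, buffer size, and helper name are illustrative assumptions, not the PR's actual file layout.

```cpp
// Sketch: share a weight buffer between processes through /dev/shm so that
// in-place updates become visible to every solver process mapping the file.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

float* map_shared_weights(const char* path, size_t count) {
  int fd = open(path, O_RDWR | O_CREAT, 0600);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, count * sizeof(float)) != 0) { close(fd); return nullptr; }
  void* p = mmap(nullptr, count * sizeof(float),
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after closing the descriptor
  return p == MAP_FAILED ? nullptr : static_cast<float*>(p);
}

// Hypothetical usage: every solver process maps the same file and reads or
// writes the weights in place.
// float* weights = map_shared_weights("/dev/shm/caffe_weights", 1 << 20);
```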

Limitations

  • Only supports data-parallelism. Limited forms of model-parallelism should be possible with the same components, but no work has been done on that yet.
  • Training is less stable than on a single GPU. In particular, disabling momentum, at least at the beginning of training, seems to help (a sketch follows this list).
  • No deployment / monitoring tools. We are looking at integrating with IPython.parallel.
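
As an illustration of the momentum workaround mentioned above, the snippet below assumes Caffe's SolverParameter protobuf message and the ReadProtoFromTextFileOrDie helper; it is a sketch of one way to zero momentum for the early phase, not part of this PR.

```cpp
// Illustrative only: load a solver definition and disable momentum,
// assuming Caffe's protobuf-generated SolverParameter and io helpers.
#include "caffe/proto/caffe.pb.h"
#include "caffe/util/io.hpp"

caffe::SolverParameter LoadWithoutMomentum(const char* solver_file) {
  caffe::SolverParameter param;
  caffe::ReadProtoFromTextFileOrDie(solver_file, &param);
  param.set_momentum(0.0f);  // steadier start when training in parallel
  return param;
}
```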

Tests

Early results on MNIST seem to show linear scaling. We tested with up to 6 machines running 4 CPU solvers each, and with 2 machines of 2 GPUs each. GPUs do not perform well on this small network but still seem to scale linearly.

[Figure: MNIST scaling results]

In the weeks to come we plan to start testing on larger networks and clusters. Currently our GPU machines are connected through 1G Ethernet; please contact us if you are interested in helping benchmark on better hardware.

Architecture

We made the Caffe singleton thread-local so that multiple solvers can run in parallel, each on its own thread. Synchronization works by sharing the weight buffers between solvers in the same address space, and by asynchronously measuring and exchanging gradients between address spaces.
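
As a rough illustration of the "measure and exchange gradients" idea (not the PR's raw-socket implementation), each side can periodically compute the difference between its current weights and the copy it last sent, stream that delta to its peer, and apply incoming deltas in place. The helper names below are hypothetical.

```cpp
// Sketch of gradient streaming between address spaces; transport omitted.
#include <cstddef>
#include <vector>

// Delta to stream to a peer: current weights minus the copy sent at the
// previous exchange; then remember what was just sent.
std::vector<float> measure_delta(const std::vector<float>& weights,
                                 std::vector<float>* last_sent) {
  std::vector<float> delta(weights.size());
  for (size_t i = 0; i < weights.size(); ++i) {
    delta[i] = weights[i] - (*last_sent)[i];
    (*last_sent)[i] = weights[i];
  }
  return delta;
}

// Apply a delta received from a peer to the local weights, asynchronously
// to the solver's own SGD updates.
void apply_delta(std::vector<float>* weights, const std::vector<float>& delta) {
  for (size_t i = 0; i < weights->size(); ++i)
    (*weights)[i] += delta[i];
}
```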

Bugs / Todos

  • If the bandwidth between the GPU and host is set too high, machines seem to hang.
  • The hyper-parameter schedule is incorrect in the distributed case: the total iteration count across solvers needs to be tracked, possibly through the monitoring tool.
  • Thread-local Caffe singletons are not destroyed; we need to design a proper shutdown strategy.

@Yangqing (Member)

I just want to say Kudos quickly! This is surely a great improvement :)

@shelhamer (Member)

Round of applause!

This is an excellent PR of a long-awaited feature (and then some, since this covers CPU, GPU, and node-to-node distributed computation). Accomplishing this while insulating the core Caffe code and parallelizing models without modification is certainly a strong plus too.

How about we promote this to a BVLC/caffe branch now to collaborate on the last steps to groom for a swift merge to dev?

@abhi2610

Kudos! Long awaited feature!
I'll try to do some benchmarking on large CPU and reasonable GPU cluster.

@sguada (Contributor) commented Sep 23, 2014

Great PR!


Sergio

@bhack (Contributor) commented Sep 23, 2014

Finally, a PR for one of the top items on the Caffe wishlist. IPython.parallel seems interesting for this scheme.

@bhack mentioned this pull request Sep 23, 2014
@BlGene (Contributor) commented Sep 23, 2014

Sounds good! Ty for the PR!

@shelhamer mentioned this pull request Sep 23, 2014
@shelhamer (Member)

@cypof in lieu of merging I promoted your commit to a BVLC feature branch to collaborate on review, grooming, and merge to dev. The new branch is BVLC/caffe:parallel.

Everyone please join #1148 to help prepare parallelism for merge!

@cypof (Member, Author) commented Sep 23, 2014

That's great news, thanks! @abhi2610, I would love help with benchmarking.

@bug-fixed (Contributor)

Hi @cypof, thank you very much for this great PR!
I have tested gpus.bin with ImageNet; the output below shows it works, but processing time seems longer and memory usage is higher. gpus.bin uses about 40 GB of memory and hogwild.bin about 57 GB. Is this expected? It runs on a cluster where the data sits on a storage node on the same 1000M network as the compute nodes, so data transfer may be the bottleneck. Could you give some advice, please? Thanks!
The hardware is 2 K20m GPUs with ECC on.
The software is CentOS 6.5 and CUDA 6.0 with driver 331.62.

[Screenshot of gpus.bin output]

@shelhamer (Member)

Closing in favor of feature branch for review and collaboration: see #1148.
