
Parallel / distributed training #1140

Closed
wants to merge 1 commit into from

Conversation

@cypof (Member) commented Sep 23, 2014

This adds a set of classes to synchronize SGD across multiple solvers. It is based on the Hogwild paper and on our work at Flickr extending that model to GPUs and distributed configurations by streaming gradients between solvers.
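
For readers unfamiliar with the Hogwild approach, here is a minimal, hypothetical sketch (not this PR's actual classes): several threads run SGD on a shared weight buffer without locking, accepting occasional lost updates. `compute_gradient` is a placeholder for a real forward/backward pass.

```cpp
// Minimal Hogwild-style sketch: lock-free SGD by several threads on one
// shared parameter buffer. Illustrative only; not the PR's implementation.
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder gradient; a real solver would backprop through the net.
void compute_gradient(const std::vector<float>& w, std::vector<float>* grad) {
  for (size_t j = 0; j < w.size(); ++j) (*grad)[j] = 0.001f * w[j] + 0.001f;
}

void solver_thread(std::vector<float>* weights, int iters, float lr) {
  std::vector<float> grad(weights->size());
  for (int i = 0; i < iters; ++i) {
    compute_gradient(*weights, &grad);
    // Unsynchronized update: other threads may write the same entries.
    for (size_t j = 0; j < weights->size(); ++j)
      (*weights)[j] -= lr * grad[j];
  }
}

int main() {
  std::vector<float> weights(1 << 16, 0.f);  // shared parameters
  std::vector<std::thread> solvers;
  for (int t = 0; t < 4; ++t)
    solvers.emplace_back(solver_thread, &weights, 100, 0.01f);
  for (auto& s : solvers) s.join();
  return 0;
}
```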

Features

  • Models can be trained in parallel without modification. Caffe’s training code is also mostly untouched.
  • Modular design. The code is broken down into simple components that each synchronize one segment, e.g. CPU/GPU or CPU/LAN. They can be combined to form an architecture, either in-process or across processes by memory-mapping the weights to /dev/shm (see the sketch after this list).
  • Works on commodity hardware. Apparently even on 1G Ethernet, at least for MNIST. Synchronization and SGD run asynchronously to keep both compute and networking resources fully utilized. Bandwidth and latency across machines are optimized using raw sockets and user-space networking.
  • No additional memory used on the GPUs.
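
To illustrate the /dev/shm mechanism mentioned above, here is a minimal sketch of mapping a shared weight buffer into every solver process. The path, buffer size, and helper name are illustrative assumptions, not the PR's actual file layout.

```cpp
// Sketch: share a weight buffer between processes through /dev/shm so that
// in-place updates become visible to every solver process mapping the file.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

float* map_shared_weights(const char* path, size_t count) {
  int fd = open(path, O_RDWR | O_CREAT, 0600);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, count * sizeof(float)) != 0) { close(fd); return nullptr; }
  void* p = mmap(nullptr, count * sizeof(float),
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after closing the descriptor
  return p == MAP_FAILED ? nullptr : static_cast<float*>(p);
}

// Hypothetical usage: every solver process maps the same file and reads or
// writes the weights in place.
// float* weights = map_shared_weights("/dev/shm/caffe_weights", 1 << 20);
```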

Limitations

  • Only supports data-parallelism. Limited forms of model-parallelism should be possible with the same components, but no work has been done on that yet.
  • Training is less stable than on a single GPU. In particular, disabling momentum, at least at the beginning of training, seems to help (a sketch follows this list).
  • No deployment / monitoring tools. We are looking at integrating with IPython.parallel.
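
As an illustration of the momentum workaround mentioned above, the snippet below assumes Caffe's SolverParameter protobuf message and the ReadProtoFromTextFileOrDie helper; it is a sketch of one way to zero momentum for the early phase, not part of this PR.

```cpp
// Illustrative only: load a solver definition and disable momentum,
// assuming Caffe's protobuf-generated SolverParameter and io helpers.
#include "caffe/proto/caffe.pb.h"
#include "caffe/util/io.hpp"

caffe::SolverParameter LoadWithoutMomentum(const char* solver_file) {
  caffe::SolverParameter param;
  caffe::ReadProtoFromTextFileOrDie(solver_file, &param);
  param.set_momentum(0.0f);  // steadier start when training in parallel
  return param;
}
```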

Tests

Early results on MNIST seem to show linear scaling. We tested with up to 6 machines running 4 CPU solvers each, and with 2 machines of 2 GPUs each. GPUs do not perform well on this small network but still seem to scale linearly.

[Figure: MNIST scaling results]

In the weeks to come we plan to start testing on larger networks and clusters. Currently our GPU machines are connected through 1G Ethernet; please contact us if you are interested in helping benchmark on better hardware.

Architecture

We made the Caffe singleton thread-local so that multiple solvers can run in parallel, each on its own thread. Synchronization works by sharing the weight buffers between solvers in the same address space, and by asynchronously measuring and exchanging gradients between address spaces.
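
As a rough illustration of the "measure and exchange gradients" idea (not the PR's raw-socket implementation), each side can periodically compute the difference between its current weights and the copy it last sent, stream that delta to its peer, and apply incoming deltas in place. The helper names below are hypothetical.

```cpp
// Sketch of gradient streaming between address spaces; transport omitted.
#include <cstddef>
#include <vector>

// Delta to stream to a peer: current weights minus the copy sent at the
// previous exchange; then remember what was just sent.
std::vector<float> measure_delta(const std::vector<float>& weights,
                                 std::vector<float>* last_sent) {
  std::vector<float> delta(weights.size());
  for (size_t i = 0; i < weights.size(); ++i) {
    delta[i] = weights[i] - (*last_sent)[i];
    (*last_sent)[i] = weights[i];
  }
  return delta;
}

// Apply a delta received from a peer to the local weights, asynchronously
// to the solver's own SGD updates.
void apply_delta(std::vector<float>* weights, const std::vector<float>& delta) {
  for (size_t i = 0; i < weights->size(); ++i)
    (*weights)[i] += delta[i];
}
```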

Bugs / Todos

  • If the bandwidth between the GPU and host is set too high, machines seem to hang.
  • The hyper-parameter schedule is incorrect in the distributed case: the total iteration count across solvers needs to be tracked, possibly through the monitoring tool.
  • Thread-local Caffe singletons are not destroyed; we need to design a proper shutdown strategy.

@Yangqing (Member)

I just want to say Kudos quickly! This is surely a great improvement :)

@shelhamer (Member)

Round of applause!

This is an excellent PR of a long-awaited feature (and then some, since this covers CPU, GPU, and node-to-node distributed computation). Accomplishing this while insulating the core Caffe code and parallelizing models without modification is certainly a strong plus too.

How about we promote this to a BVLC/caffe branch now to collaborate on the last steps to groom for a swift merge to dev?

@abhi2610

Kudos! Long awaited feature!
I'll try to do some benchmarking on large CPU and reasonable GPU cluster.

@sguada (Contributor) commented Sep 23, 2014

Great PR!


Sergio

@bhack (Contributor) commented Sep 23, 2014

Finally, a PR for one of the top items on the Caffe wishlist. IPython.parallel seems interesting for this scheme.

@bhack mentioned this pull request Sep 23, 2014
@BlGene (Contributor) commented Sep 23, 2014

Sounds good! Ty for the PR!

@shelhamer mentioned this pull request Sep 23, 2014
@shelhamer (Member)

@cypof in lieu of merging I promoted your commit to a BVLC feature branch to collaborate on review, grooming, and merge to dev. The new branch is BVLC/caffe:parallel.

Everyone please join #1148 to help prepare parallelism for merge!

@cypof (Member, Author) commented Sep 23, 2014

That's great news, thanks! @abhi2610, I would love help with benchmarking.

@bug-fixed (Contributor)

Hi @cypof, thank you very much for this great PR!
I have tested gpus.bin with ImageNet; the output below shows it works, but processing time seems longer and memory usage is higher. gpus.bin uses about 40 GB of memory and hogwild.bin about 57 GB. Is this expected? It runs on a cluster where the data sits on a storage node on the same 1000M network as the compute nodes, so data transfer may be the bottleneck. Could you give some advice, please? Thanks!
The hardware is 2 K20m GPUs with ECC on.
The software is CentOS 6.5 and CUDA 6.0 with driver 331.62.

[Screenshot of gpus.bin output]

@shelhamer (Member)

Closing in favor of feature branch for review and collaboration: see #1148.
