Multi-GPU #2870
Conversation
All of my quick sanity tests are passing. Even though I know this is weak scaling by default (the batch size specified in train_val.prototxt is multiplied by the number of GPUs you run on), I forgot that when validating the accuracy graphs. I still fear this is going to bite users.
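For illustration, a minimal standalone sketch (not Caffe code) of the weak-scaling arithmetic described above; the batch size and GPU count are made-up values:

// Weak scaling: each device runs the full prototxt batch, so the effective
// batch per iteration grows with the device count. Values are hypothetical.
#include <cstdio>

int main() {
  const int prototxt_batch_size = 64;  // hypothetical batch_size from train_val.prototxt
  const int num_gpus = 4;              // e.g. running with '-gpu 0,1,2,3'
  const int effective_batch = prototxt_batch_size * num_gpus;
  std::printf("effective batch per iteration: %d\n", effective_batch);
  return 0;
}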
Force-pushed from ba35568 to e46996b.
OK, training works for me. The thread launch code is much better without the fields; that's great.
Thanks for testing @thatguymike and @cypof. My short test worked, so once we hear from @cdoersch about the EC2 test I think this is ready to merge.
}
for (int i = 0; i < callbacks_.size(); ++i) {
  callbacks_[i]->on_start(&timer, &timing);
}
const bool display = param_.display() && iter_ % param_.display() == 0;
You must add 'timer.Start();' here to restart the timer; otherwise the timing for the gradients at line 266 may be incorrect.
Added to line 224 before forward + backward, thanks.
Training seems to be working fine on EC2.
Force-pushed from f165d86 to 2b51a08.
After discussion with @longjon we decided the timing code is too intrusive to bundle into this change. I have stripped it, but archived the branch with the timing code.
@@ -211,7 +228,9 @@ void Solver<Dtype>::Step(int iters) {
    losses[idx] = loss;
  }
  if (display) {
    LOG(INFO) << "Iteration " << iter_ << ", loss = " << smoothed_loss;
    if (Caffe::root_solver()) {
Probably a bit late to comment on this, and to me not necessary for merge, but these conditional LOG(INFO) calls could be made a bit more compact using LOG_IF, e.g. LOG_IF(INFO, Caffe::root_solver()) << "Iteration...".
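For reference, a minimal standalone sketch of the two logging forms using glog directly; is_root is a stand-in for Caffe::root_solver(), which is not available outside Caffe:

// Conditional logging: an explicit if around LOG(INFO) versus the more
// compact LOG_IF form suggested in the review.
#include <glog/logging.h>

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;
  const bool is_root = true;         // stand-in for Caffe::root_solver()
  const int iter = 100;              // stand-in for iter_
  const double smoothed_loss = 0.5;  // stand-in for smoothed_loss

  if (is_root) {
    LOG(INFO) << "Iteration " << iter << ", loss = " << smoothed_loss;
  }

  // Equivalent, one line:
  LOG_IF(INFO, is_root) << "Iteration " << iter << ", loss = " << smoothed_loss;
  return 0;
}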
Force-pushed from 87a69ea to 53a4dca.
template<typename Dtype>
Params<Dtype>::Params(shared_ptr<Solver<Dtype> > root_solver)
    : size_(total_size<Dtype>(root_solver->net()->params())),
This call to params() and the two other calls below should be replaced with learnable_params() after #2866, I think? (I was debating whether the public params() method should just be removed, or if params() should just return learnable_params_, or...)
Agreed. Making the switch seems to have no effect, though; I have the same test failures before and after.
Force-pushed from 80dbdaa to dd3e064.
@cypof @thatguymike it turns out #2114 was not rigorously checking solver updates; see #2114 (comment). Fixing the test net targets reveals that all the multi-GPU solver tests fail.

Apart from the tests, my experiments to check parallel training on real nets make progress, so there's hope. #2866 is not the problem, as the same failures show up in the multi-GPU branch before the latest rebase when the test is fixed; this can be seen in the branch before the rebase.
I'm fairly positive this is a test artifact due to the random Gaussian targets. The multiple solvers can't reproduce random draws equivalent to the single-solver sequence.
The solution seems to be making the solver tests take fixed external data, such as the HDF5 data used in the existing tests.
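As a toy standalone illustration (not Caffe code) of the artifact: splitting the Gaussian draws across per-solver generators cannot reproduce the single-solver draw sequence; the seeds below are arbitrary:

// One RNG stream of Gaussian targets versus the same number of draws split
// across two per-solver streams; the sequences diverge once the second
// solver's generator takes over.
#include <iostream>
#include <random>
#include <vector>

int main() {
  const int n = 4;

  // Single-solver sequence: one generator produces all n targets.
  std::mt19937 single_gen(1701);
  std::normal_distribution<double> single_dist(0.0, 1.0);
  std::vector<double> single_draws;
  for (int i = 0; i < n; ++i) single_draws.push_back(single_dist(single_gen));

  // Two "solvers": each has its own generator and draws half the targets.
  std::mt19937 gen0(1701), gen1(1702);  // arbitrary seeds
  std::normal_distribution<double> dist0(0.0, 1.0), dist1(0.0, 1.0);
  std::vector<double> multi_draws;
  for (int i = 0; i < n / 2; ++i) multi_draws.push_back(dist0(gen0));
  for (int i = 0; i < n / 2; ++i) multi_draws.push_back(dist1(gen1));

  for (int i = 0; i < n; ++i) {
    std::cout << single_draws[i] << " vs " << multi_draws[i] << "\n";
  }
  return 0;
}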
Force-pushed from bb75c36 to 186d453.
This is now based on #2887, but the multi-GPU solver tests still fail.
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
thanks to discussion by @thatguymike and @flx42
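A toy sketch (not the actual DataReader) of the round-robin distribution listed in the commit message above, with made-up solver and batch counts:

// Batches are read sequentially and handed to solvers in a fixed rotation,
// so each solver sees a disjoint, deterministic subset of the data.
#include <cstdio>
#include <vector>

int main() {
  const int num_solvers = 3;  // hypothetical solver/GPU count
  const int num_batches = 9;  // batches read sequentially from the DB
  std::vector<std::vector<int> > queues(num_solvers);

  for (int batch = 0; batch < num_batches; ++batch) {
    queues[batch % num_solvers].push_back(batch);  // round-robin assignment
  }
  for (int s = 0; s < num_solvers; ++s) {
    std::printf("solver %d gets batches:", s);
    for (size_t i = 0; i < queues[s].size(); ++i) std::printf(" %d", queues[s][i]);
    std::printf("\n");
  }
  return 0;
}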
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices
- Batch size is multiplied by the number of devices
- Split batches between GPUs, and tree-reduce the gradients
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'. Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
- Start with distant nodes in broadcast
- Fix outside loop to loop for full tree depth
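And a toy sketch (not the actual implementation) of the tree-reduce idea from the commit messages above: gradients are summed pairwise up a binary tree of devices, so the root holds the full sum after roughly log2(num_gpus) steps:

// Pairwise tree reduction over per-device "gradients" (single floats standing
// in for whole gradient buffers). Device 0 ends up with the total.
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> grads;  // one value per hypothetical GPU
  grads.push_back(0.1f);
  grads.push_back(0.2f);
  grads.push_back(0.3f);
  grads.push_back(0.4f);

  // At each step, device i accumulates the value from device i + stride.
  for (size_t stride = 1; stride < grads.size(); stride *= 2) {
    for (size_t i = 0; i + stride < grads.size(); i += 2 * stride) {
      grads[i] += grads[i + stride];
    }
  }
  std::printf("sum at root device 0: %.2f\n", grads[0]);  // prints 1.00
  return 0;
}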
I was off yesterday, but looking at it now.
Everyone, see #2903 for the rigorously tested and passing multi-GPU branch. @ronghanghu has developed a parallel data layer solution.
Merged in #2903.
The PR for multi-GPU has been merged into the master branch of Caffe. BVLC/caffe#2870
This is my packaging of #2114 for merge. I figured @cypof @thatguymike and company had made plenty of revisions and that I could help.
This PR is ready to use for data-parallel training of networks, but it has issues with DataReader which are resolved for merge by #2903:
- @ronghanghu's parallel data layer in place of DataReader
- replacing CHECK(false) with LOG(FATAL)
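For reference, a minimal sketch of the CHECK(false) versus LOG(FATAL) change using glog directly; both abort the program, but LOG(FATAL) states the intent of an unconditional fatal error rather than an assertion that can never pass:

#include <glog/logging.h>

// Old style: an assertion that is always false.
void unreachable_old() {
  CHECK(false) << "should not get here";
}

// New style: an explicit fatal error.
void unreachable_new() {
  LOG(FATAL) << "should not get here";
}

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  LOG(INFO) << "demo only; calling either function above would abort";
  return 0;
}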
@cypof @thatguymike @longjon @jeffdonahue please take a look.
@cdoersch could you fire up your parallel training test again?
Reviews and testing by the community are welcome!