Multi-GPU #2870
Conversation
All of my quick sanity tests are passing. Even though I know this is weak scaling by default (the batch size specified in train_val.prototxt is multiplied by the number of GPUs you run on), I forgot that when validating the accuracy graphs. I still fear this is going to bite users.
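For illustration, a minimal standalone sketch (not Caffe code) of the weak-scaling arithmetic described above; the batch size and GPU count are made-up values:

// Weak scaling: each device runs the full prototxt batch, so the effective
// batch per iteration grows with the device count. Values are hypothetical.
#include <cstdio>

int main() {
  const int prototxt_batch_size = 64;  // hypothetical batch_size from train_val.prototxt
  const int num_gpus = 4;              // e.g. running with '-gpu 0,1,2,3'
  const int effective_batch = prototxt_batch_size * num_gpus;
  std::printf("effective batch per iteration: %d\n", effective_batch);
  return 0;
}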
Force-pushed from ba35568 to e46996b.
OK, training works for me. The thread launch code is much better without the fields; that's great.
Thanks for testing @thatguymike and @cypof. My short test worked, so once we hear from @cdoersch about the EC2 test I think this is ready to merge.
}
for (int i = 0; i < callbacks_.size(); ++i) {
  callbacks_[i]->on_start(&timer, &timing);
}
const bool display = param_.display() && iter_ % param_.display() == 0;
You must add 'timer.Start();' here to restart the timer; otherwise the timing for the gradients at line 266 may be incorrect.
Added to line 224 before forward + backward, thanks.
Training seems to be working fine on EC2.
Force-pushed from f165d86 to 2b51a08.
After discussion with @longjon we decided the timing code is too intrusive to bundle into this change. I have stripped it, but archived the branch with the timing code.
@@ -211,7 +228,9 @@ void Solver<Dtype>::Step(int iters) {
    losses[idx] = loss;
  }
  if (display) {
    LOG(INFO) << "Iteration " << iter_ << ", loss = " << smoothed_loss;
    if (Caffe::root_solver()) {
Probably a bit late to comment on this, and to me not necessary for merge, but these conditional LOG(INFO) calls could be made a bit more compact using LOG_IF, e.g. LOG_IF(INFO, Caffe::root_solver()) << "Iteration...".
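For reference, a minimal standalone sketch of the two logging forms using glog directly; is_root is a stand-in for Caffe::root_solver(), which is not available outside Caffe:

// Conditional logging: an explicit if around LOG(INFO) versus the more
// compact LOG_IF form suggested in the review.
#include <glog/logging.h>

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;
  const bool is_root = true;         // stand-in for Caffe::root_solver()
  const int iter = 100;              // stand-in for iter_
  const double smoothed_loss = 0.5;  // stand-in for smoothed_loss

  if (is_root) {
    LOG(INFO) << "Iteration " << iter << ", loss = " << smoothed_loss;
  }

  // Equivalent, one line:
  LOG_IF(INFO, is_root) << "Iteration " << iter << ", loss = " << smoothed_loss;
  return 0;
}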
Force-pushed from 87a69ea to 53a4dca.
template<typename Dtype>
Params<Dtype>::Params(shared_ptr<Solver<Dtype> > root_solver)
    : size_(total_size<Dtype>(root_solver->net()->params())),
This call to params() and the two other calls below should be replaced with learnable_params() after #2866, I think? (I was debating whether the public params() method should just be removed, or if params() should just return learnable_params_, or...)
Agreed. Making the switch seems to have no effect, though; I have the same test failures before and after.
Force-pushed from 80dbdaa to dd3e064.
@cypof @thatguymike it turns out #2114 was not rigorously checking solver updates; see #2114 (comment). Fixing the test net targets reveals that all the multi-GPU solver tests fail.

Apart from the tests, my experiments to check parallel training on real nets make progress, so there's hope. #2866 is not the problem, as the same failures show up in the multi-GPU branch before the latest rebase when the test is fixed; this can be seen in the branch before the rebase.
I'm fairly positive this is a test artifact due to the random Gaussian targets. The multiple solvers can't reproduce random draws equivalent to the single-solver sequence.
The solution seems to be making the solver tests take fixed external data, such as the HDF5 data used in the existing tests.
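As a toy standalone illustration (not Caffe code) of the artifact: splitting the Gaussian draws across per-solver generators cannot reproduce the single-solver draw sequence; the seeds below are arbitrary:

// One RNG stream of Gaussian targets versus the same number of draws split
// across two per-solver streams; the sequences diverge once the second
// solver's generator takes over.
#include <iostream>
#include <random>
#include <vector>

int main() {
  const int n = 4;

  // Single-solver sequence: one generator produces all n targets.
  std::mt19937 single_gen(1701);
  std::normal_distribution<double> single_dist(0.0, 1.0);
  std::vector<double> single_draws;
  for (int i = 0; i < n; ++i) single_draws.push_back(single_dist(single_gen));

  // Two "solvers": each has its own generator and draws half the targets.
  std::mt19937 gen0(1701), gen1(1702);  // arbitrary seeds
  std::normal_distribution<double> dist0(0.0, 1.0), dist1(0.0, 1.0);
  std::vector<double> multi_draws;
  for (int i = 0; i < n / 2; ++i) multi_draws.push_back(dist0(gen0));
  for (int i = 0; i < n / 2; ++i) multi_draws.push_back(dist1(gen1));

  for (int i = 0; i < n; ++i) {
    std::cout << single_draws[i] << " vs " << multi_draws[i] << "\n";
  }
  return 0;
}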
Force-pushed from bb75c36 to 186d453.
This is now based on #2887, but the multi-GPU solver tests still fail.
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
thanks to discussion by @thatguymike and @flx42
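A toy sketch (not the actual DataReader) of the round-robin distribution listed in the commit message above, with made-up solver and batch counts:

// Batches are read sequentially and handed to solvers in a fixed rotation,
// so each solver sees a disjoint, deterministic subset of the data.
#include <cstdio>
#include <vector>

int main() {
  const int num_solvers = 3;  // hypothetical solver/GPU count
  const int num_batches = 9;  // batches read sequentially from the DB
  std::vector<std::vector<int> > queues(num_solvers);

  for (int batch = 0; batch < num_batches; ++batch) {
    queues[batch % num_solvers].push_back(batch);  // round-robin assignment
  }
  for (int s = 0; s < num_solvers; ++s) {
    std::printf("solver %d gets batches:", s);
    for (size_t i = 0; i < queues[s].size(); ++i) std::printf(" %d", queues[s][i]);
    std::printf("\n");
  }
  return 0;
}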
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices
- Batch size is multiplied by the number of devices
- Split batches between GPUs, and tree-reduce the gradients
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'. Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
- Start with distant nodes in broadcast
- Fix outside loop to loop for full tree depth
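And a toy sketch (not the actual implementation) of the tree-reduce idea from the commit messages above: gradients are summed pairwise up a binary tree of devices, so the root holds the full sum after roughly log2(num_gpus) steps:

// Pairwise tree reduction over per-device "gradients" (single floats standing
// in for whole gradient buffers). Device 0 ends up with the total.
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> grads;  // one value per hypothetical GPU
  grads.push_back(0.1f);
  grads.push_back(0.2f);
  grads.push_back(0.3f);
  grads.push_back(0.4f);

  // At each step, device i accumulates the value from device i + stride.
  for (size_t stride = 1; stride < grads.size(); stride *= 2) {
    for (size_t i = 0; i + stride < grads.size(); i += 2 * stride) {
      grads[i] += grads[i + stride];
    }
  }
  std::printf("sum at root device 0: %.2f\n", grads[0]);  // prints 1.00
  return 0;
}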
I was off yesterday, but looking at it now.
Everyone, see #2903 for the rigorously tested and passing multi-GPU branch. @ronghanghu has developed a parallel data layer solution.
Merged in #2903.
The PR for multi-GPU has been merged into the master branch of Caffe. BVLC/caffe#2870
This is my packaging of #2114 for merge. I figured @cypof @thatguymike and company had made plenty of revisions and that I could help.
This PR is ready to use for data-parallel training of networks, but it has issues with DataReader which are resolved for merge by #2903:
- @ronghanghu's parallel data layer in place of DataReader
- replacing CHECK(false) with LOG(FATAL)
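For reference, a minimal sketch of the CHECK(false) versus LOG(FATAL) change using glog directly; both abort the program, but LOG(FATAL) states the intent of an unconditional fatal error rather than an assertion that can never pass:

#include <glog/logging.h>

// Old style: an assertion that is always false.
void unreachable_old() {
  CHECK(false) << "should not get here";
}

// New style: an explicit fatal error.
void unreachable_new() {
  LOG(FATAL) << "should not get here";
}

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  LOG(INFO) << "demo only; calling either function above would abort";
  return 0;
}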
@cypof @thatguymike @longjon @jeffdonahue please take a look.
@cdoersch could you fire up your parallel training test again?
Reviews and testing by the community are welcome!