Adam solver #2856

PatWie · 2015-08-04T09:52:26Z

This commit implements the Adam solver by Kingma et. al for CPU and
GPU. All solver parameters are defined in the caffe.proto. This also
adds an example for the MNIST dataset.

see issue #2827

Before merging, please review the code. I will add changes to this branch (and rebase) if there should be something to change.

shelhamer · 2015-08-04T18:31:48Z

@philkr could you review this if you have a chance?

philkr · 2015-08-04T18:50:39Z

src/caffe/proto/caffe.proto

  optional float delta = 31 [default = 1e-8];
+  // parameters for the Adam solver
+  optional float beta1 = 37 [default = 0.9];


Why not use momentum here, to be consistent with other solvers?

shelhamer · 2015-08-04T19:09:09Z

@PatWie thanks for the solver! All SGD solvers need gradient checks. See for instance the AdaGrad tests https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_gradient_based_solver.cpp#L431-L483

philkr · 2015-08-04T19:09:39Z

src/caffe/solver.cpp

+  const Dtype beta1 = this->param_.beta1();
+  const Dtype beta2 = this->param_.beta2();
+
+  const int t = this->iter_ / this->param_.stepsize() +1;


Why divide by stepsize here?

I think t is the epoch rather than the iteration by the definition of Caffe.

PatWie · 2015-08-04T19:14:26Z

@shelhamer Ah. i didn't realized that there is already a unit-test. I will add one of course.

philkr · 2015-08-04T19:21:14Z

src/caffe/solver.cpp

+    caffe_add(N,
+        this->val_t_[param_id]->cpu_data(),
+        this->val_m_[param_id]->cpu_data(),
+        this->val_m_[param_id]->mutable_cpu_data());


The three commands above can be written as a single caffe_cpu_axpby using beta1 instead of 0 and val_m_ instead of val_t_.

Ah I see. Blas is completely new for me. I will change this tomorrow at all places in my code.

Nanne · 2015-08-05T12:40:42Z

I think there's some confusion going on in the code due to the usage of the stepsize param. Which is made worse by using lr_policy "step" in the MNIST example.

The alpha/stepsize from the paper should be set via the base_lr, and used with lr_policy: "fixed" as I don't see any recommendations for changing alpha during training. This way you can also get rid of gamma and power in the prototxt (the latter wasn't being used anyway).

The stepsize param should only be used together with lr_policy "step", and if we already set the alpha via base_lr then it is not needed at all. t can just be iter_ +1, as it's afaik not needed to compute the effective stepsize. This also removes the need for a check if stepsize > 0 in the header.

Moreover, it makes sense to change the MNIST example to use the recommended value from the paper for the base_lr (0.001) and explicitly set momentum and momentum2 to 0.9 and 0.999 respectively, rather than relying on the default values.

PatWie · 2015-08-05T15:12:29Z

I applied all changes in memory usage, solver_mnist proto, t is now the current iteration instead of epoche. And there is a unit test.
There are still some trivial things todo:

update solver page in the wiki
implement learning rate heuristic 1/sqrt(t) for reproducing results from the paper

philkr · 2015-08-05T17:03:43Z

src/caffe/solver.cpp

+  const Dtype beta2 = this->param_.momentum2();
+
+  // we create alias for memory from the SGD for convenience
+  shared_ptr<Blob<Dtype> > &val_m = this->history_[param_id];


A reference to a shared pointer is never good idea, just copy the pointer.

philkr · 2015-08-05T17:07:00Z

The solver looks good to me now. I haven't tested it though. I do think that caffe_ipow should be it's own PR if we really want to add it. Currently it just bloats this PR request and doesn't add any benefit (see https://en.wikipedia.org/wiki/Amdahl%27s_law).

jeffdonahue · 2015-08-05T20:20:08Z

src/caffe/solver.cpp

+
+    // update v <- \beta_2 m_{t-1} + (1-\beta_2)g_t^2
+    caffe_mul(N,
+      net_params[param_id]->cpu_diff(),


please fix indentation -- use 4 space indents when continuing from previous lines (https://google-styleguide.googlecode.com/svn/trunk/cppguide.html#Spaces_vs._Tabs)

ronghanghu · 2015-08-06T07:10:50Z

@PatWie Thank you for this great PR! I just added some comment on the code. Please fix indent to 4 space indents when continuing from previous lines, and add more test cases. After that, squash into one commit and I can merge.

ronghanghu · 2015-08-09T07:48:59Z

#2836 and #2866 introduced new conflicts to be resolved.

ronghanghu · 2015-08-09T19:24:48Z

src/caffe/solver.cpp

+
+template <typename Dtype>
+void AdamSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
+  const vector<shared_ptr<Blob<Dtype> > >& net_params = this->net_->params();


To be consistent with #2866, use const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params(); instead.

This commit implements the Adam solver by Kingma et. al for CPU and GPU. All solver parameters are defined in the caffe.proto. This also adds an example for the MNIST dataset.

PatWie · 2015-08-10T10:55:22Z

I rebased to the latest master branch commit and fixed up the conflicts with #2836 and #2866.

It seems to be difficult for me to write the other tests (although I am pretty sure, that the implementation is correct) without rewriting fairly large amount of code (mostly duplicating code). In addition, it is not clear what's the favored way and some details like, should the code prevent the usage of weight decay, regularization and other parameters. I think only the following tests make sense (checked the existing tests)

TestLeastSquaresSingle
TestLeastSquaresMultiIter
TestSnapshot
TestSnapshotShare

One has to deal with the following issues:

The snapshot function has to write/read 2 additional parameters (m,v) from the paper which depends on the actual iteration. Currently, it only supports the history-blob.
The test-implementation has to save m, v for testing the derivative. This depends on the actual iteration.
I misused the history_, update_, temp_ members, since Adam does not really rely on them directly. The current commit use again its own members from the Adam-class. How to handle this? Using the initialised members from SDG or use members from the Adam-class?

Possible solutions would be:

Add two additional fields to the SolverState blob and add members to SDGSolver, which would not be used by the other solvers

or

Copy and paste most of the SDGSolver class for Adam

This seems to be a serious design issue. Implementing the solver is fairly easy, but writing nearly the same code again and squeezing the code into the current testing class needs hacky solutions.

To refactore the unit test cases one can use the curiously recurring template pattern to put everything from the solvers into its derived classes and just refer to the members in the base class. But then again the testing method should use the solver as a template not just as a member field

Since, this needs profound changes in the code it would suggest to let a BVLC maintainer decide the next steps or how to rewrite the solver-interface w.r.t. to these issues.

I am glad to help, but don't want to rewrite mostly everything in reference to #2890.

ronghanghu · 2015-08-10T19:00:29Z

@PatWie Thanks for your update!

I checked the math and am quite confident with your solver implementation. However, at this point the PR is still not working, because snapshot currently won't work for AdamSolver.

In addition, it is not clear what's the favored way and some details like, should the code prevent the usage of weight decay, regularization and other parameters.

Since right now all the weight decay for AdamSolver is handled in SGDSolver::Regularize, these weight decay tests become less important, thought it is still good to have them. It is also good have accumulation tests to test whether the gradient accumulation is working correctly.

The snapshot function has to write/read 2 additional parameters (m,v) from the paper which depends on the actual iteration. Currently, it only supports the history-blob.

The original design is to use SGDSolver::history_ to store all history blobs to be saved and recovered during snapshot. The size of history vector does not have to be same as number of learnable parameters . In Adam solver, you can twice as many history blobs as number of learnable parameters, in order to store both m and v. This can be done in a AdamPreSolve function. You may take a look at corresponding implementation in AdaDelta solver #2782 (which also stores two history variables) for how this is done.

https://github.com/matthiasplappert/caffe/blob/adadelta/src/caffe/solver.cpp#L937-L947
https://github.com/matthiasplappert/caffe/blob/adadelta/src/caffe/solver.cpp#L975

The test-implementation has to save m, v for testing the derivative. This depends on the actual iteration.

Test-implementation can have access to solver's history, you may also take a look at AdaDelta #2782 implementation: https://github.com/matthiasplappert/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L299-L307

I misused the history_, update_, temp_ members, since Adam does not really rely on them directly. The current commit use again its own members from the Adam-class. How to handle this? Using the initialised members from SDG or use members from the Adam-class?

You should use SGD::history_ for all history variables that need to be stored and recovered. See above.

To refactore the unit test cases one can use the curiously recurring template pattern to put everything from the solvers into its derived classes and just refer to the members in the base class. But then again the testing method should use the solver as a template not just as a member field
Since, this needs profound changes in the code it would suggest to let a BVLC maintainer decide the next steps or how to rewrite the solver-interface w.r.t. to these issues. I am glad to help, but don't want to rewrite mostly everything in reference to #2890.

I'll handle #2890 after merging Adam and AdaDelta solvers, and I would like to refactor these solvers and look into curiously recurring template pattern.

For now to get this PR merged, we need to make the snapshot work. So let's simply put val_m and val_v into SGD::history_, and read them during test implementation as in #2782, and complete the other 3 tests you think is reasonable. This shouldn't involve too much change in your current code. After that, we can merge this and I can handle the rest in #2890.

jeffdonahue · 2015-08-10T21:00:18Z

Rather than putting both vectors into history_ (which I think would make the indexing pretty messy and unnatural), I'd recommend keeping the two vectors separate and overriding SnapshotSolverState to serialize both of them.

ronghanghu · 2015-08-10T22:25:19Z

@PatWie after a private discussion with @jeffdonahue, I still feel using existing history (and expanding it, like what is done in #2782) would be easier to implement.

Rather than putting both vectors into history_ (which I think would make the indexing pretty messy and unnatural), I'd recommend keeping the two vectors separate and overriding SnapshotSolverState to serialize both of them.

For the indexing, you can create a reference variable like val_m_vec = this->history_; and val_v_vec = this->history_ + net_params.size(); (either in AdamSolver or its member functions as you like), and just use history_ as a storage buffer. In this way the current snapshot functions still work.

After merging this PR and AdaDelta, I can address #2890 and refactor solvers afterwards.

ronghanghu · 2015-08-13T21:36:46Z

Another rebase is needed here after #2909 and #2903

ronghanghu · 2015-08-14T04:28:13Z

carried on in #2918, @PatWie please take a look when you have time.

shelhamer added the focus label Aug 4, 2015

philkr reviewed Aug 4, 2015
View reviewed changes

shelhamer mentioned this pull request Aug 4, 2015

Adaptive Solvers: AdaDelta, RMSprop, and ADAM #2860

Closed

3 tasks

philkr reviewed Aug 4, 2015
View reviewed changes

philkr reviewed Aug 5, 2015
View reviewed changes

ronghanghu added the RH label Aug 5, 2015

jeffdonahue reviewed Aug 5, 2015
View reviewed changes

ronghanghu mentioned this pull request Aug 7, 2015

RMSprop clean up and rebase #2867

Merged

ronghanghu mentioned this pull request Aug 9, 2015

AdaDelta Solver (v3) #2782

Merged

shelhamer mentioned this pull request Aug 9, 2015

Provide an Adam solver #2827

Closed

ronghanghu reviewed Aug 9, 2015
View reviewed changes

Adam solver

260d277

This commit implements the Adam solver by Kingma et. al for CPU and GPU. All solver parameters are defined in the caffe.proto. This also adds an example for the MNIST dataset.

shelhamer mentioned this pull request Aug 10, 2015

Separate solvers by string type and separate source files #2890

Closed

ronghanghu mentioned this pull request Aug 14, 2015

Adam solver #2918

Merged

ronghanghu closed this Aug 14, 2015

Belinda-great mentioned this pull request Nov 12, 2018

Test net output a lot of detection_eval #6605

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adam solver #2856

Adam solver #2856

PatWie commented Aug 4, 2015

shelhamer commented Aug 4, 2015

philkr Aug 4, 2015

shelhamer commented Aug 4, 2015

philkr Aug 4, 2015

PatWie Aug 4, 2015

PatWie commented Aug 4, 2015

philkr Aug 4, 2015

PatWie Aug 4, 2015

Nanne commented Aug 5, 2015

PatWie commented Aug 5, 2015

philkr Aug 5, 2015

philkr commented Aug 5, 2015

jeffdonahue Aug 5, 2015

ronghanghu commented Aug 6, 2015

ronghanghu commented Aug 9, 2015

ronghanghu Aug 9, 2015

PatWie commented Aug 10, 2015

ronghanghu commented Aug 10, 2015

jeffdonahue commented Aug 10, 2015

ronghanghu commented Aug 10, 2015

ronghanghu commented Aug 13, 2015

ronghanghu commented Aug 14, 2015

Adam solver #2856

Adam solver #2856

Conversation

PatWie commented Aug 4, 2015

shelhamer commented Aug 4, 2015

Choose a reason for hiding this comment

shelhamer commented Aug 4, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PatWie commented Aug 4, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nanne commented Aug 5, 2015

PatWie commented Aug 5, 2015

Choose a reason for hiding this comment

philkr commented Aug 5, 2015

Choose a reason for hiding this comment

ronghanghu commented Aug 6, 2015

ronghanghu commented Aug 9, 2015

Choose a reason for hiding this comment

PatWie commented Aug 10, 2015

ronghanghu commented Aug 10, 2015

jeffdonahue commented Aug 10, 2015

ronghanghu commented Aug 10, 2015

ronghanghu commented Aug 13, 2015

ronghanghu commented Aug 14, 2015