From fcaff126952211da1d2109be52a4f73a54f520fd Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Tue, 28 Nov 2017 13:03:18 -0800
Subject: [PATCH 1/2] Updating the Latex equation for Adagrad

---
 paddle/operators/adagrad_op.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/paddle/operators/adagrad_op.cc b/paddle/operators/adagrad_op.cc
index d6686e3ef31659..d19602244bc015 100644
--- a/paddle/operators/adagrad_op.cc
+++ b/paddle/operators/adagrad_op.cc
@@ -80,8 +80,8 @@ Adaptive Gradient Algorithm (Adagrad).
 
 The update is done as follows:
 
-$$momentOut = moment + grad * grad \break
-paramOut = param - learningRate * grad / ($\sqrt{momentOut}$ + \epsilon) \break
+$$moment\_out = moment + grad * grad \\
+param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}
 $$
 
 The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

From 5e7f59d73ecdf900d5ce13c4d3a76e8e5f2f723d Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Thu, 30 Nov 2017 14:05:01 -0800
Subject: [PATCH 2/2] Fixing Latex equations for adadelta, adam and adamax

---
 paddle/operators/adadelta_op.cc | 12 ++++++------
 paddle/operators/adam_op.cc     | 12 +++++++-----
 paddle/operators/adamax_op.cc   | 10 ++++++----
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/paddle/operators/adadelta_op.cc b/paddle/operators/adadelta_op.cc
index 16a7794d5b7bf1..29434a0ee221d1 100644
--- a/paddle/operators/adadelta_op.cc
+++ b/paddle/operators/adadelta_op.cc
@@ -92,12 +92,12 @@ for gradient descent.
 
 Adadelta updates are as follows:
 
-$$avgSquaredGradOut = \rho * avgSquaredGrad + (1 - \rho) * grad * grad \break
-paramUpdate = - $\sqrt{((avgSquaredUpdate + \epsilon) /
-    (avgSquaredGrad_out + \epsilon))}$ * grad \break
-avgSquaredUpdateOut = \rho * avgSquaredUpdate + (1 - \rho) *
-    {(paramUpdate)}^2 \break
-paramOut = param + paramUpdate$$
+$$
+avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\
+param\_update = - \sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\
+avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\
+param\_out = param + param\_update
+$$
 
 )DOC");
   }
diff --git a/paddle/operators/adam_op.cc b/paddle/operators/adam_op.cc
index 03faa2a7c5a486..a268d054842476 100644
--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -112,11 +112,13 @@ adaptive estimates of lower-order moments.
 
 Adam updates:
 
-$$moment_1_{out} = \beta_1 * moment_1 + (1 - \beta_1) * grad \break
-moment_2_{out} = \beta_2 * moment_2 + (1 - \beta_2) * grad * grad \break
-learningRate = learningRate *
-    $\sqrt{(1 - \beta_2_{pow})}$ / (1 - \beta_1_{pow}) \break
-paramOut = param - learningRate * moment_1/ ($\sqrt{(moment_2)} + \epsilon)$$
+$$
+moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\
+moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\
+learning\_rate = learning\_rate *
+    \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\
+param\_out = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon}
+$$
 
 )DOC");
   }
diff --git a/paddle/operators/adamax_op.cc b/paddle/operators/adamax_op.cc
index d5bbc672e18f39..9e7576c961bf72 100644
--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -107,10 +107,12 @@ Adam algorithm based on the infinity norm.
 
 Adamax updates:
 
-$$momentOut = \beta_1 * moment + (1 - \beta_1) * grad \break
-infNormOut = max(\beta_2 * infNorm + \epsilon, |grad|) \break
-learningRate = learningRate /(1 - \beta_1_{pow}) \break
-paramOut = param - learningRate * momentPut / infNormOut$$
+$$
+moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\
+inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\
+learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\
+param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}
+$$
 
 The original paper does not have an epsilon attribute.
 However, it is added here for numerical stability to prevent the
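
As a quick numerical cross-check of the corrected doc equations, here is a minimal NumPy sketch of the four update rules described in these patches. It is not PaddlePaddle code: the function names, argument order, and default hyperparameter values are illustrative assumptions chosen to mirror the variable names in the doc strings (for Adam, the final step applies the freshly updated moments, as in the standard algorithm).

# Minimal NumPy sketch of the corrected update rules. Illustrative only:
# names, argument order, and defaults are not PaddlePaddle's API.
import numpy as np


def adagrad_update(param, grad, moment, learning_rate=0.01, epsilon=1e-6):
    # moment_out = moment + grad * grad
    # param_out  = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
    moment_out = moment + grad * grad
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out


def adadelta_update(param, grad, avg_squared_grad, avg_squared_update,
                    rho=0.95, epsilon=1e-6):
    # avg_squared_grad_out   = rho * avg_squared_grad + (1 - rho) * grad^2
    # param_update           = -sqrt((avg_squared_update + eps) /
    #                                (avg_squared_grad_out + eps)) * grad
    # avg_squared_update_out = rho * avg_squared_update
    #                          + (1 - rho) * param_update^2
    # param_out              = param + param_update
    avg_squared_grad_out = rho * avg_squared_grad + (1 - rho) * grad * grad
    param_update = -np.sqrt((avg_squared_update + epsilon) /
                            (avg_squared_grad_out + epsilon)) * grad
    avg_squared_update_out = (rho * avg_squared_update +
                              (1 - rho) * param_update * param_update)
    param_out = param + param_update
    return param_out, avg_squared_grad_out, avg_squared_update_out


def adam_update(param, grad, moment_1, moment_2, beta_1_pow, beta_2_pow,
                learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    # moment_1_out = beta_1 * moment_1 + (1 - beta_1) * grad
    # moment_2_out = beta_2 * moment_2 + (1 - beta_2) * grad^2
    # lr_t         = learning_rate * sqrt(1 - beta_2_pow) / (1 - beta_1_pow)
    # param_out    = param - lr_t * moment_1_out / (sqrt(moment_2_out) + eps)
    moment_1_out = beta_1 * moment_1 + (1 - beta_1) * grad
    moment_2_out = beta_2 * moment_2 + (1 - beta_2) * grad * grad
    lr_t = learning_rate * np.sqrt(1 - beta_2_pow) / (1 - beta_1_pow)
    param_out = param - lr_t * moment_1_out / (np.sqrt(moment_2_out) + epsilon)
    return param_out, moment_1_out, moment_2_out


def adamax_update(param, grad, moment, inf_norm, beta_1_pow,
                  learning_rate=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    # moment_out   = beta_1 * moment + (1 - beta_1) * grad
    # inf_norm_out = max(beta_2 * inf_norm + epsilon, |grad|)
    # lr_t         = learning_rate / (1 - beta_1_pow)
    # param_out    = param - lr_t * moment_out / inf_norm_out
    moment_out = beta_1 * moment + (1 - beta_1) * grad
    inf_norm_out = np.maximum(beta_2 * inf_norm + epsilon, np.abs(grad))
    lr_t = learning_rate / (1 - beta_1_pow)
    param_out = param - lr_t * moment_out / inf_norm_out
    return param_out, moment_out, inf_norm_out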