
Conversation

@staple (Contributor) commented Mar 6, 2015

An implementation of accelerated gradient descent, a first-order optimization algorithm with faster asymptotic convergence than standard gradient descent.

Design discussion and benchmark results at
https://issues.apache.org/jira/browse/SPARK-1503
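As a rough sketch of the idea (illustrative only, not this PR's code): Nesterov-style acceleration takes a plain gradient step from an extrapolated point rather than from the current iterate. All names below (`AgdSketch`, `grad`, `minimize`) are hypothetical.

```scala
// Hedged sketch of accelerated gradient descent (Nesterov-style momentum)
// on a toy 1-D quadratic f(x) = 0.5 * x * x, with gradient f'(x) = x.
// Names and structure are illustrative, not the PR's actual API.
object AgdSketch {
  def grad(x: Double): Double = x

  def minimize(x0: Double, stepSize: Double, iters: Int): Double = {
    var x = x0
    var xPrev = x0
    for (k <- 1 to iters) {
      // Extrapolate using the momentum term, then take a plain gradient
      // step from the extrapolated point y.
      val y = x + ((k - 1).toDouble / (k + 2)) * (x - xPrev)
      xPrev = x
      x = y - stepSize * grad(y)
    }
    x
  }

  def main(args: Array[String]): Unit = {
    // Iterates approach the minimizer at 0.0.
    println(AgdSketch.minimize(10.0, 0.5, 50))
  }
}
```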

If the implementation seems promising it may make sense to add:

  • documentation about the algorithm
  • usage examples

@SparkQA commented Mar 6, 2015

Test build #28355 has finished for PR 4934 at commit a121bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented:

It is quite hard to choose a proper stepSize in practice, because it depends on the Lipschitz constant, which is usually unknown. It may be better to implement a line search method.
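For illustration, one common line search variant is Armijo backtracking, which shrinks a trial step until a sufficient-decrease condition holds. This is a hedged 1-D sketch, not a proposal for the exact method (TFOCS uses its own backtracking rule); the names `lineSearch`, `t0`, `beta`, and `c` are hypothetical.

```scala
// Hedged sketch of Armijo backtracking line search: one way to pick a
// step size when the Lipschitz constant is unknown.
object Backtracking {
  // f: objective, g: gradient at x, x: current point.
  // t0: initial trial step, beta: shrink factor, c: sufficient-decrease constant.
  def lineSearch(f: Double => Double, g: Double, x: Double,
                 t0: Double = 1.0, beta: Double = 0.5, c: Double = 1e-4): Double = {
    var t = t0
    // Shrink t until the Armijo sufficient-decrease condition holds.
    while (f(x - t * g) > f(x) - c * t * g * g) {
      t *= beta
    }
    t
  }
}
```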

@staple (Contributor, Author) replied:

@mengxr Thanks for taking a look. I was advised by Reza Zadeh to implement a version without line search, at least for the initial implementation.

Please see the discussion here: https://issues.apache.org/jira/browse/SPARK-1503?focusedCommentId=14225295, and in the comments that follow. I also attached some optimization benchmarks to the JIRA, which include the performance of both backtracking line search and non line search implementations. Regarding your point that it's hard to choose a proper stepSize: anecdotally, acceleration does seem somewhat more prone to diverging at a nominal stepSize than the existing gradient descent.

@rezazadeh (Contributor) commented:

Thank you for this PR, @staple!

@mengxr I suggested that @staple first implement without backtracking to keep the PR as simple as possible. According to his plots (see the JIRA), even without backtracking this PR achieves fewer iterations at the same cost per iteration.

Note that backtracking requires several additional map-reduce passes per iteration, which makes it unclear when backtracking is a net win. So I suggested first merging the case that is a clear win (fewer iterations at the same cost per iteration). I think we should merge this without backtracking, and then have another PR to properly evaluate how backtracking affects total cost, with the goal of eventually merging backtracking as well.

It seems @staple has already implemented backtracking (because he has results in the JIRA), but kept them out of this PR to keep it simple, so we can tackle that afterwards.

@mengxr (Contributor) commented Mar 10, 2015

Line search helps if you don't know the Lipschitz constant. With accelerated gradient, it is very easy to blow up if the step size is wrong. I'm okay with not having line search in this version. But we need to consider how the APIs are going to change after we add line search. For example, if we add line search option, what is the semantic of agd.setStepSize(1.0).useLineSearch()?
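One possible resolution of the API question, purely as a sketch: treat stepSize as the initial trial step whenever line search is enabled. The class and method names below mirror the snippet in the comment above but are hypothetical, not a committed Spark API.

```scala
// Hedged sketch: with line search enabled, setStepSize supplies only the
// initial trial step for the search. All names here are hypothetical.
class AcceleratedGradientDescent {
  private var stepSize: Double = 1.0
  private var lineSearch: Boolean = false

  def setStepSize(s: Double): this.type = { stepSize = s; this }
  def useLineSearch(): this.type = { lineSearch = true; this }

  // With line search on, stepSize is interpreted as a starting trial value.
  def initialTrialStep: Double = stepSize
  def usesLineSearch: Boolean = lineSearch
}
```

Under this reading, `agd.setStepSize(1.0).useLineSearch()` is well defined: 1.0 seeds the search rather than fixing the step.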

Btw, I don't think we need to stick to the current GradientDescent API. The accelerated gradient takes a smooth convex function which provides gradient and optionally the Lipschitz constant. The implementation of Nesterov's method doesn't need to know RDDs.

@staple (Contributor, Author) commented Mar 11, 2015

Hi, replying to some of the statements above:

It seems @staple has already implemented backtracking (because he has results in the JIRA), but kept them out of this PR to keep it simple, so we can tackle that afterwards.

I wrote a backtracking implementation (and checked that it performs the same as the TFOCS implementation). Currently it is just a port of the TFOCS version. I'd need a little time to make it Scala/Spark idiomatic, but the turnaround would be pretty fast.

For example, if we add line search option, what is the semantic of agd.setStepSize(1.0).useLineSearch()?

TFOCS supports a suggested initial Lipschitz value (a variable named 'L'), which is just a starting point for line search, so a corresponding behavior would be to treat the step size as only an initial suggestion when line search is enabled. It may be desirable to use a parameter name like 'L' instead of 'stepSize' to make the meaning clearer.

In TFOCS you can disable backtracking line search by setting several parameters (L, Lexact, alpha, and beta) which individually control different aspects of the backtracking implementation.
For Spark it may make sense to provide explicitly configured backtracking modes, for example a fixed Lipschitz bound (no backtracking), backtracking line search based on the TFOCS implementation, or possibly an alternative line search implementation that is more conservative about performing round trip aggregations. A setBacktrackingMode() setter could then configure which mode is used.

Moving forward there may be a need to support acceleration algorithms other than Auslender and Teboulle's. These might be configurable via a setAlgorithm() function.
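The mode-based configuration described above could, for example, be modeled with sealed traits. Every name below is hypothetical, just to make the proposal concrete:

```scala
// Hedged sketch of explicit backtracking/algorithm modes. Names are
// hypothetical, not a proposed final API.
sealed trait BacktrackingMode
case object FixedLipschitzBound extends BacktrackingMode    // no backtracking
case object TfocsBacktracking extends BacktrackingMode      // TFOCS-style line search
case object ConservativeLineSearch extends BacktrackingMode // fewer aggregations per iteration

sealed trait AccelerationAlgorithm
case object AuslenderTeboulle extends AccelerationAlgorithm
```

A `setBacktrackingMode(mode: BacktrackingMode)` setter would then make the choice explicit and exhaustively checkable at compile time.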

Btw, I don't think we need to stick to the current GradientDescent API. The accelerated gradient takes a smooth convex function which provides gradient and optionally the Lipschitz constant. The implementation of Nesterov's method doesn't need to know RDDs.

This is good to know. I had been assuming we would stick with the existing GradientDescent API, including the Gradient and Updater delegates. Currently the applySmooth and applyProjector functions (named after the corresponding TFOCS functions) serve as a bridge between the acceleration implementation (relatively unaware of RDDs) and Spark-specific RDD aggregations.

This seems like a good time to mention that the backtracking implementation in TFOCS uses a system of caching the (expensive to compute) linear operator component of the objective function, which significantly reduces the cost of backtracking. A similar implementation is possible in Spark, though the performance benefit may not be as significant because two round trips would still be required per iteration. (See p. 3 of my design doc linked in the JIRA for more detail.) One reason I suggested not implementing linear operator caching in the design doc is that it's incompatible with the existing Gradient interface. But if we are considering an alternative interface, it may be worth revisiting this issue.
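To illustrate why caching helps: backtracking evaluates trial points of the form y - t * g for varying t, and by linearity A(y - t * g) = A y - t (A g), so once A y and A g are cached, each new trial step needs no further application of A. A toy local sketch with dense arrays (all names hypothetical; the real operator application would be a distributed aggregation):

```scala
// Hedged sketch of linear operator caching during backtracking.
object LinearOpCache {
  type Vec = Array[Double]

  // One (expensive) application of the linear operator A, here a dense
  // matrix-vector product standing in for a distributed computation.
  def applyA(a: Array[Vec], x: Vec): Vec =
    a.map(row => row.zip(x).map { case (r, v) => r * v }.sum)

  // Backtracking tries xTrial = y - t * g for varying t. By linearity,
  // A xTrial = A y - t * (A g), so cached A y and A g make each new
  // trial step free of further applications of A.
  def trialAx(ay: Vec, ag: Vec, t: Double): Vec =
    ay.zip(ag).map { case (u, v) => u - t * v }
}
```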

The objective function “interface” used by TFOCS involves the functions applyLinear (linear operator component of objective), applySmooth (smooth portion of objective), and applyProjector (nonsmooth portion of objective). In addition there are a number of numeric and categorical parameters. Theoretically we could adopt a similar interface (with or without applyLinear, depending) where RDD specific operations are encapsulated within the various apply* functions.
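A minimal sketch of what such an interface might look like, with the representation (e.g. an RDD) hidden behind a type parameter. The names loosely follow the TFOCS function names mentioned above but are otherwise illustrative:

```scala
// Hedged sketch of a TFOCS-like objective interface. Implementations
// would encapsulate any RDD-specific aggregation inside the apply* methods.
trait ObjectiveFunction[X] {
  /** Linear operator component of the objective. */
  def applyLinear(x: X): X
  /** Smooth portion: objective value and gradient at x. */
  def applySmooth(x: X): (Double, X)
  /** Nonsmooth portion: proximal/projection step with the given step size. */
  def applyProjector(x: X, stepSize: Double): X
}

// Toy local instance on plain Doubles: f(x) = 0.5 * x^2, unconstrained.
object Quadratic extends ObjectiveFunction[Double] {
  def applyLinear(x: Double): Double = x
  def applySmooth(x: Double): (Double, Double) = (0.5 * x * x, x)
  def applyProjector(x: Double, stepSize: Double): Double = x // identity prox
}
```

The optimizer itself would then be written against `ObjectiveFunction[X]` and never touch RDDs directly.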

Finally, I wanted to mention that I live in the bay area and am happy to meet in person to discuss this project if that would be helpful.

@staple (Contributor, Author) commented:
Oops, looks like a typo: 'avaialble'.

@srowen (Member) commented Jul 28, 2015

Likewise, is this one stale? I'm not sure this is going to move forward.

@mengxr (Contributor) commented Jul 28, 2015

We refactored the implementation. You can find the latest version at https://github.com/databricks/spark-tfocs. We will send a new PR when the implementation is ready.

@staple Could you close this PR for now?

@asfgit asfgit closed this in 423cdfd Aug 11, 2015