[SPARK-1503][MLLIB] Initial AcceleratedGradientDescent implementation. #4934
Conversation
Test build #28355 has finished for PR 4934 at commit
It is quite hard to choose a proper stepSize in practice, because it depends on the Lipschitz constant, which is usually unknown. It may be better if we can implement a line search method.
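A minimal sketch of what such a line search could look like (plain Python, not Spark or PR code; the function name and the simple 1-D quadratic are illustrative assumptions): backtracking shrinks a trial step until an Armijo-style sufficient-decrease condition holds, so no Lipschitz constant needs to be known in advance.

```python
# Illustrative sketch: backtracking line search for gradient descent on
# f(x) = 0.5 * a * x^2, whose gradient a*x has Lipschitz constant a.
# A fixed step larger than 2/a diverges, which is why choosing a stepSize
# blindly is hard when a is unknown.

def backtracking_step(f, grad, x, step=1.0, beta=0.5):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad(x)
    while f(x - step * g) > f(x) - 0.5 * step * g * g:
        step *= beta
    return x - step * g, step

a = 100.0  # steep quadratic; the line search never sees this value directly
f = lambda x: 0.5 * a * x * x
grad = lambda x: a * x

x = 1.0
for _ in range(50):
    x, used = backtracking_step(f, grad, x)
# converges without knowing the Lipschitz constant in advance
```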
@mengxr Thanks for taking a look. I was advised by Reza Zadeh to implement a version without line search, at least for the initial implementation.
Please see the discussion here: https://issues.apache.org/jira/browse/SPARK-1503?focusedCommentId=14225295, and in the comments that follow. I also attached some optimization benchmarks to the JIRA, which include the performance of both the backtracking line search and non line search implementations. Regarding your point that it's hard to choose a proper stepSize: I can attest that, anecdotally, acceleration seems somewhat more prone to diverging at a nominal stepSize than the existing gradient descent.
Thank you for this PR @staple ! @mengxr I suggested that @staple first implement without backtracking, to keep the PR as simple as possible. According to his plots (see the JIRA), even without backtracking this PR achieves fewer iterations at the same cost per iteration. Note that backtracking requires several additional map-reduce passes per iteration, which makes it unclear when backtracking is best used. So I suggested first merging the case that is a clear win (fewer iterations at the same cost per iteration). I think we should merge this without backtracking, and then open another PR to properly evaluate how backtracking affects total cost, with the goal of merging backtracking as well. It seems @staple has already implemented backtracking (he has results in the JIRA) but kept it out of this PR to keep things simple, so we can tackle that afterwards.
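To illustrate the "fewer iterations at the same cost per iteration" point, here is a plain-Python sketch (not the PR's Scala code; the fixed momentum formula used below is the standard choice for strongly convex problems and is my own assumption) comparing iteration counts of plain and accelerated gradient descent on an ill-conditioned quadratic:

```python
# Sketch: iterations to reach ||v|| < tol on f(x, y) = 0.5 * (kappa*x^2 + y^2),
# an ill-conditioned quadratic with gradient Lipschitz constant kappa.

def grad(v, kappa):
    x, y = v
    return (kappa * x, y)

def norm(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5

def gd(kappa, tol=1e-6, max_iter=100000):
    step = 1.0 / kappa  # 1 / Lipschitz constant
    v = (1.0, 1.0)
    for k in range(max_iter):
        if norm(v) < tol:
            return k
        g = grad(v, kappa)
        v = (v[0] - step * g[0], v[1] - step * g[1])
    return max_iter

def agd(kappa, tol=1e-6, max_iter=100000):
    step = 1.0 / kappa
    # Momentum for strongly convex problems: (sqrt(kappa)-1)/(sqrt(kappa)+1).
    m = (kappa ** 0.5 - 1.0) / (kappa ** 0.5 + 1.0)
    v = prev = (1.0, 1.0)
    for k in range(max_iter):
        if norm(v) < tol:
            return k
        # Extrapolate past the current iterate, then take a gradient step.
        y = (v[0] + m * (v[0] - prev[0]), v[1] + m * (v[1] - prev[1]))
        g = grad(y, kappa)
        prev, v = v, (y[0] - step * g[0], y[1] - step * g[1])
    return max_iter

# The accelerated variant should need far fewer iterations,
# at the same one-gradient-per-iteration cost.
print(gd(100.0), agd(100.0))
```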
Line search helps if you don't know the Lipschitz constant. With accelerated gradient, it is very easy to blow up if the step size is wrong. I'm okay with not having line search in this version, but we need to consider how the APIs are going to change after we add line search. For example, if we add a line search option, what is the semantic of

Btw, I don't think we need to stick to the current
Hi, replying to some of the statements above:
I wrote a backtracking implementation (and checked that it performs the same as the TFOCS implementation). Currently it is just a port of the TFOCS version. I'd need a little time to make it Scala/Spark idiomatic, but the turnaround would be pretty fast.
TFOCS supports a suggested initial Lipschitz value (a variable named 'L'), which is just a starting point for line search, so a corresponding behavior would be to use the step size as just an initial suggestion when line search is enabled. It may be desirable to use a parameter name like 'L' instead of 'stepSize' to make the meaning clearer. In TFOCS you can disable backtracking line search by setting several parameters (L, Lexact, alpha, and beta), which individually control different aspects of the backtracking implementation. Moving forward there may be a need to support additional acceleration algorithms beyond Auslender and Teboulle; these might be configurable via a setAlgorithm() function.
This is good to know. I had been assuming we would stick with the existing GradientDescent API, including the Gradient and Updater delegates. Currently the applySmooth and applyProjector functions (named after the corresponding TFOCS functions) serve as a bridge between the acceleration implementation (which is relatively unaware of RDDs) and the Spark specific RDD aggregations.

This seems like a good time to mention that the backtracking implementation in TFOCS caches the (expensive to compute) linear operator component of the objective function, which significantly reduces the cost of backtracking. A similar implementation is possible in Spark, though the performance benefit may not be as significant because two round trips would still be required per iteration. (See p. 3 of my design doc, linked in the JIRA, for more detail.) One reason I suggested not implementing linear operator caching in the design doc is that it's incompatible with the existing Gradient interface. But if we are considering an alternative interface, it may be worth revisiting this issue.

The objective function "interface" used by TFOCS involves the functions applyLinear (the linear operator component of the objective), applySmooth (the smooth portion of the objective), and applyProjector (the nonsmooth portion of the objective), plus a number of numeric and categorical parameters. Theoretically we could adopt a similar interface (with or without applyLinear, depending) where RDD specific operations are encapsulated within the various apply* functions.

Finally, I wanted to mention that I live in the bay area and am happy to meet in person to discuss this project if that would be helpful.
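For concreteness, here is a hypothetical plain-Python sketch of what such a split objective "interface" might look like. The applyLinear / applySmooth / applyProjector names follow the TFOCS functions mentioned above, but the class and its exact signatures are my own illustrative assumptions, not the actual TFOCS or Spark API:

```python
class LeastSquaresObjective:
    """Sketch of a TFOCS-style split objective: 0.5 * ||A x - b||^2."""

    def __init__(self, A, b):
        # A as a list of rows here; in a Spark port it would be an RDD.
        self.A, self.b = A, b

    def applyLinear(self, x):
        # The expensive linear operator A x. Caching this value is what lets
        # backtracking re-test candidate step sizes without recomputing it.
        return [sum(aij * xj for aij, xj in zip(row, x)) for row in self.A]

    def applySmooth(self, ax):
        # Smooth part of the objective, evaluated at the cached A x.
        # Returns the loss and its gradient with respect to ax.
        r = [ai - bi for ai, bi in zip(ax, self.b)]
        return 0.5 * sum(ri * ri for ri in r), r

    def applyProjector(self, x, step):
        # Nonsmooth part (prox/projection step); identity here because this
        # example has no nonsmooth term.
        return x

obj = LeastSquaresObjective([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])
ax = obj.applyLinear([0.0, 0.0])
loss, grad_ax = obj.applySmooth(ax)
```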
Oops, looks like a typo: 'avaialble'
Likewise, is this one stale? I'm not sure this is going to move forward.
We refactored the implementation. You can find the latest version at https://github.com/databricks/spark-tfocs. We will send a new PR when the implementation is ready. @staple Could you close this PR for now?
An implementation of accelerated gradient descent, a first-order optimization algorithm with faster asymptotic convergence than standard gradient descent.
Design discussion and benchmark results at
https://issues.apache.org/jira/browse/SPARK-1503
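As a rough illustration of the method (and of the Auslender and Teboulle variant discussed in the review), here is a plain-Python sketch of the two-sequence accelerated update as I read it from the TFOCS paper; the exact update formulas are my own reconstruction and should be treated as an assumption, not the PR's code:

```python
import math

def accelerated_gd(grad, x0, step, iters):
    """Auslender-Teboulle style accelerated gradient descent (sketch).

    Maintains two sequences: x (the main iterate) and z (an accumulated
    point); y interpolates between them and is where the gradient is taken.
    """
    x = z = x0
    theta = 1.0
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        g = grad(y)
        z = z - (step / theta) * g      # z-step uses the larger step/theta
        x = (1.0 - theta) * x + theta * z
        # Standard theta recurrence, giving O(1/k^2) convergence in f.
        theta = 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / theta ** 2))
    return x

# Minimize f(x) = 0.5 * x^2 (gradient: x) from x0 = 1.0 with a
# conservative fixed step; the iterate oscillates toward the minimum.
x_final = accelerated_gd(lambda x: x, 1.0, step=0.5, iters=500)
```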
If the implementation seems promising, it may make sense to add: