Adaptive hyper-parameter optimization #161
Computing the next guess could be done e.g. with scikit-optimize; see in particular the Bayesian optimization approach in gp_minimize. Also see MOE, though it's a heavier dependency and its Python packaging doesn't seem to be ideal yet. There are also the Tree-structured Parzen Estimator (TPE) approaches. For instance, this NeuPy blog post gives a general overview of the topic (it also includes a good list of references at the end). I don't know whether any of this is robust enough to consistently outperform random hyper-parameter search for the algorithms used in dask-ml, though.
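For reference, a minimal sketch of what the gp_minimize route could look like; the dataset, estimator, and search space below are placeholders chosen for illustration, and gp_minimize is simply asked to minimize the negated cross-validation score:

from skopt import gp_minimize
from skopt.space import Real, Integer
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Hypothetical search space: regularization strength and number of epochs.
space = [Real(1e-6, 1e-1, prior="log-uniform", name="alpha"),
         Integer(5, 50, name="max_iter")]

def objective(params):
    alpha, max_iter = params
    model = SGDClassifier(alpha=alpha, max_iter=max_iter, tol=1e-3)
    # gp_minimize minimizes, so return the negative CV accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

res = gp_minimize(objective, space, n_calls=20, random_state=0)
print(res.x, -res.fun)  # best parameters and best CV accuracy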
Here is another good candidate algorithm for the sequential learning part: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html
Most (all?) of those sequential learning algorithms yield only one suggestion at a time, though. They have to be tweaked to answer: given the current state, give me the next n_workers points to explore approximately in parallel. It's probably doable to hack some tabu-search-like heuristics, such as: as soon as we get the true result of any evaluation, update the state (see the sketch below).
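A rough sketch of that pattern with scikit-optimize's ask/tell Optimizer, which can suggest several points at once via a constant-liar heuristic; the objective and search space here are placeholders, and the evaluations would really run on workers rather than in the loop:

from skopt import Optimizer

def fit_and_score(alpha):
    # Placeholder for training a model and returning a loss to minimize.
    return (alpha - 0.1) ** 2

opt = Optimizer(dimensions=[(1e-6, 1e-1)], base_estimator="GP")

n_workers = 4
for _ in range(5):
    # Ask for n_workers points at once; "cl_min" fills in pending points
    # with an optimistic guess (constant liar) so the suggestions spread out.
    points = opt.ask(n_points=n_workers, strategy="cl_min")
    losses = [fit_and_score(p[0]) for p in points]
    opt.tell(points, losses)  # update the state with the true results

print(min(opt.yi))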
https://arxiv.org/pdf/1206.2944.pdf gives one strategy for choosing the next point given pending evaluations.
Interesting. My variant is probably going to explore more than what is proposed in "Practical Bayesian Optimization of Machine Learning Algorithms".
I think this is a critical issue. Earlier this year, I spent many hours tuning hyperparams and didn't have the compute resources to waste on poorly performing models. Eventually, I had to reduce my parameter space to a very simple one (which cost me at least two months of research effort).
Hyperband treats hyperparam optimization as a resource allocation problem, not one of fitting and optimizing some function. It tries to speed up random evaluation of hyperparams and does not attempt any explicit optimization. It does this by devoting compute time to well-performing models, which is possible by taking advantage of the iterative training process of ML models. At first, Hyperband trains many models for a few iterations, then stops or "kills" some fraction of the worst performers and repeats. For example, part of Hyperband trains 100 models for 1 iteration, then the best 50 models for 2 iterations, then the best 25 models for 4 iterations, and so on (a small sketch of this successive-halving loop is below).

The theory backing this reformulation (the multi-armed bandit problem) is natural and well studied, though it takes some effort to apply it here. They show Hyperband is competitive with other Bayesian methods (Spearmint, TPE, SMAC) and with random search, and even with random search given twice the computational budget (Figure 8, which I think is their most realistic example). The fact that the input is a computational budget rather than a number of configs to evaluate makes it even more attractive.

I'm planning on working to verify any gains, then implementing this algorithm in Dask. I'll be on this soon: I'll be more free after May 4th, and certainly after the semester ends.
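For illustration, a bare-bones sketch of the successive-halving loop described above; partial_fit and score stand in for whatever incremental training and validation scoring are actually used:

import numpy as np

def successive_halving(models, X, y, X_val, y_val, n_iter=1):
    """Train all models briefly, keep the best half, double the budget, repeat."""
    models = list(models)
    while len(models) > 1:
        for model in models:
            for _ in range(n_iter):
                model.partial_fit(X, y, classes=np.unique(y))
        scores = [model.score(X_val, y_val) for model in models]
        # Keep the better-scoring half of the models.
        order = np.argsort(scores)[::-1]
        models = [models[i] for i in order[: max(1, len(models) // 2)]]
        n_iter *= 2  # surviving models get twice as many iterations next round
    return models[0]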
A while back @mrocklin mentioned that building a framework for adaptive hyperparam selection could allow easier integration into dask-searchcv. I think it'd look something like

from sklearn.base import clone

alg = AdaptiveHyperParameterSearch()
model = ...  # base estimator
models, losses = None, None
while True:
    models = [clone(model).set_params(**config)
              for config in alg.get_configs(models, losses)]
    losses = [validation_loss(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        return alg.best_model, alg.best_config

This is the framework required for adaptive algorithms. It allows both getting a query of new configurations with alg.get_configs and filtering out poorly performing models with alg.filter.
It could look something like this:

from dask.distributed import Client, as_completed

client = Client()

initial_guesses = [...]
problem_definition = ...
optimizer = MyOptimizer(problem_definition)

# Submit just enough work to keep the workers busy.
futures = [client.submit(fit_and_score, params)
           for params in optimizer.initial_guesses(n_workers)]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    optimizer.add_result(result)
    if optimizer.has_converged():
        break
    # Feed the cluster one new guess for every completed evaluation.
    params = optimizer.compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)
I've been following this off/on for a while and am very excited to see this become available. Any update on when the work might be complete?
I believe Scott is planning to finish it off in a couple weeks.
Yup, that's the plan with #221.
Saw that #221 just got merged. Woohoo! Can't wait to use this!
https://distill.pub/2020/bayesian-optimization has some good background reading on Bayesian optimization.
Thanks! I like the simple explanation. Hyperband and Bayesian optimization work well together, and the combination has seen some previous work (https://ml.informatik.uni-freiburg.de/papers/18-ICML-BOHB.pdf). Hyperband samples parameters randomly and runs the different brackets completely in parallel. Instead, why not refine the parameter space between brackets? The most aggressive bracket is run first to narrow the search space, then a slightly less aggressive bracket with a more refined search space, and so on. Here's their algorithm:

# bracket 0 has no early stopping; bracket 5 has the most early stopping
brackets = reversed(range(6))
params = {...}  # user-defined
for bracket in brackets:
    search = SuccessiveHalvingSearchCV(model, params)
    search.fit(...)
    params = refine(params, search)
# return the best-performing model

The implementation backing this is at https://automl.github.io/HpBandSter/build/html/optimizers/bohb.html.
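The refine step above is left abstract. One simple possibility, purely illustrative and not what BOHB actually does (BOHB fits density models over good and bad configurations), is to shrink each numeric range around the bracket's best configuration, assuming params maps names to (low, high) tuples and the search object exposes a scikit-learn-style best_params_:

def refine(params, search, shrink=0.5):
    """Shrink each (low, high) range around the best value from this bracket."""
    new_params = {}
    for name, (low, high) in params.items():
        best = search.best_params_[name]
        half_width = (high - low) * shrink / 2
        new_params[name] = (max(low, best - half_width),
                            min(high, best + half_width))
    return new_params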
Current options for hyper-parameter optimization (grid search, random search) construct a task graph to search many options, submit that graph, and then wait for the result. However, we might instead submit only enough options to saturate our available workers, get some of those results back, and then, based on those results, submit more work asynchronously. This might explore the parameter space more efficiently.
This would deviate from vanilla task graphs and would probably require the futures API (and the as_completed function specifically), which imposes a hard requirement on using the dask.distributed scheduler.
Here is a simplified and naive implementation to show the use of dask.distributed.as_completed:
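A minimal sketch along those lines, with a local Client and placeholder fit_and_score / next_guess functions standing in for real training and a real optimizer (both names are made up for illustration):

import random
from dask.distributed import Client, as_completed

def fit_and_score(params):
    # Placeholder: train a model with ``params`` and return its validation score.
    return {"params": params, "score": random.random()}

def next_guess():
    # Placeholder: random search; an adaptive optimizer would use past results here.
    return {"alpha": 10 ** random.uniform(-6, -1)}

client = Client()
n_workers = len(client.scheduler_info()["workers"])
n_total = 20  # overall evaluation budget

# Submit only enough work to saturate the workers.
futures = [client.submit(fit_and_score, next_guess()) for _ in range(n_workers)]
submitted = n_workers

best = None
seq = as_completed(futures)
for future in seq:
    result = future.result()
    if best is None or result["score"] > best["score"]:
        best = result
    # For every completed evaluation, submit one more until the budget is spent.
    if submitted < n_total:
        seq.add(client.submit(fit_and_score, next_guess()))
        submitted += 1

print(best)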
I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?
Additionally, I suspect that this problem is further complicated by pipelines. The hierarchical / tree-like structure of a pipeline grid search means that it probably makes sense to alter the parameters of the later stages of the pipeline more often than those of the earlier stages (see the small example below). My simple code example above probably doesn't do the full problem justice.
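To make that concrete, here is a small, made-up two-stage example: changing the early PCA stage invalidates all downstream work, while changing the final classifier's parameters does not, so an adaptive search would prefer to perturb the later stage more often (the grid below reflects that by giving the early stage fewer options):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier

pipe = Pipeline([
    ("pca", PCA()),            # expensive early stage
    ("clf", SGDClassifier()),  # cheap late stage
])

param_grid = {
    # Few options for the early stage: changing these invalidates downstream work.
    "pca__n_components": [20, 50],
    # Many options for the late stage: these can vary without refitting PCA.
    "clf__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
}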
cc @ogrisel @stsievert @GaelVaroquaux