Adaptive hyper-parameter optimization #161

Open · mrocklin opened this issue Apr 19, 2018 · 13 comments

@mrocklin (Member)

Current options for hyper-parameter optimization (grid search, random search) construct a task graph to search many options, submit that graph, and then wait for the result. However, we might instead submit only enough options to saturate our available workers, get some of those results back, and then, based on those results, submit more work asynchronously. This might explore the parameter space more efficiently.

This would deviate from vanilla task graphs and would probably require the futures API (and the as_completed function specifically), which imposes a hard dependency on the dask.distributed scheduler.

Here is a simplified and naive implementation showing the use of dask.distributed.as_completed:

from dask.distributed import Client, as_completed

client = Client()

# fit_and_score, good_enough, and compute_next_guess are placeholders
initial_guesses = [...]
futures = [client.submit(fit_and_score, params) for params in initial_guesses]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    if good_enough():
        break
    params = compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Additionally, I suspect that this problem is further complicated by pipelines. The hierarchical, tree-like structure of a pipeline grid search means it probably makes sense to alter the parameters of the later stages of the pipeline more often than the earlier stages. My simple code example above probably doesn't do the full problem justice.

cc @ogrisel @stsievert @GaelVaroquaux

@rth (Contributor) commented Apr 20, 2018

> I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Computing the next guess could be done e.g. with scikit-optimize; see in particular the Bayesian optimization approach in gp_minimize. Also see MOE, though it's a heavier dependency and its Python packaging doesn't seem to be ideal yet.
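
For illustration, a minimal gp_minimize sketch; the objective here is a stand-in for an actual fit-and-score routine:

from skopt import gp_minimize

# Stand-in objective: in practice this would train a model with the
# given hyper-parameters and return a validation loss to minimize.
def objective(params):
    (C,) = params
    return (C - 0.1) ** 2

result = gp_minimize(objective, dimensions=[(1e-3, 1.0)], n_calls=20)
print(result.x, result.fun)  # best parameters and best loss found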

There are also Tree-structured Parzen Estimator (TPE) approaches. For instance, this NeuPy blog post gives a general overview of the topic (it also includes a good list of references at the end).
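
For reference, hyperopt exposes TPE through roughly this interface (the objective is again a stand-in):

from hyperopt import fmin, tpe, hp

# Stand-in objective: hyperopt minimizes the returned value.
def objective(C):
    return (C - 0.1) ** 2

best = fmin(fn=objective, space=hp.uniform("C", 1e-3, 1.0),
            algo=tpe.suggest, max_evals=50)
print(best)  # best hyper-parameters found, e.g. {'C': ...}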

I don't know whether any of this is robust enough to consistently outperform random hyper-parameter search for the algorithms used in dask-ml though.

@ogrisel commented Apr 20, 2018

Here is another good candidate algorithm for the sequential learning part: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html
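
For reference, that method is exposed in dlib as find_min_global; a minimal sketch with a stand-in objective:

import dlib

# Stand-in objective over two hyper-parameters.
def objective(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

best_params, best_score = dlib.find_min_global(
    objective,
    [1e-4, 1e-4],  # lower bounds for (lr, reg)
    [1.0, 1.0],    # upper bounds for (lr, reg)
    50,            # number of objective evaluations
)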

Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though. They have to be tweaked to answer: given the current state, what are the next n_workers candidates to explore approximately in parallel? It's probably doable to hack some tabu-search-like heuristics such as:

  • given state s_t, compute the next best guess g_(t,0) = make_guess(s_t) and schedule eval(g_(t,0));
  • while waiting for eval(g_(t,0)) to finish, pretend it has returned some average/median value e_(t,0), ask for the next "second best guess" g_(t,1) = make_guess(s_t ∪ e_(t,0)), and schedule eval(g_(t,1)) to run in parallel with eval(g_(t,0));
  • keep pretending that the recently scheduled evaluations returned immediate but average results, generating third, fourth, ... best guesses as long as there are free workers.

As soon as we get the true result of any evaluation, let's update state s_t into s_(t+1) and use that to generate the subsequent evaluation candidate.
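
Incidentally, scikit-optimize implements a variant of this "constant liar" idea: when asking for several points at once, each pending point is temporarily assumed to have returned, e.g., the mean of the results observed so far. A minimal sketch (search space and losses are placeholders):

from skopt import Optimizer

opt = Optimizer(dimensions=[(1e-3, 1.0)])  # placeholder search space

# Ask for 4 points to evaluate in parallel; "cl_mean" pretends each
# pending point returned the mean of the observed results so far.
points = opt.ask(n_points=4, strategy="cl_mean")

# ... evaluate the points in parallel, then report the true results:
for x, y in zip(points, [0.5, 0.3, 0.7, 0.2]):  # dummy losses
    opt.tell(x, y)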

@TomAugspurger (Member)

> Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though.

https://arxiv.org/pdf/1206.2944.pdf gives one strategy for choosing the next point given N finished evaluations and J pending ones (section 3.2):

> Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.

@ogrisel commented Apr 20, 2018

Interesting. My variant is probably going to explore more than what is proposed in "Practical Bayesian Optimization of Machine Learning Algorithms".

@stsievert (Member)

I think this is a critical issue. Earlier this year I spent many hours tuning hyperparameters and didn't have the compute resources to waste on poorly performing models. Eventually, I had to reduce my parameter space to a very simple one (which cost me at least two months of research effort).

> …to evaluate stopping criteria and compute optimal next guesses.

Hyperband treats hyperparameter optimization as a resource-allocation problem, not as fitting and optimizing some function. It tries to speed up random evaluation of hyperparameters rather than performing any sort of model-based optimization, and it does so by devoting more compute time to well-performing models.

This is possible by exploiting the iterative training process of ML models. At first, Hyperband trains many models for a few iterations; it then stops, or "kills", some fraction of the worst performers and repeats. That is, part of Hyperband trains 100 models for 1 iteration, then the best 50 models for 2 iterations, then the best 25 models for 4 iterations, and so on. The theory backing this reformulation (the multi-armed bandit problem) is natural and well studied (though it takes some effort to apply it here).
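
To make that loop concrete, here's a toy successive-halving sketch; train and score are stand-ins for partial_fit calls and validation scoring:

import numpy as np

def successive_halving(models, train, score, n_iter=1):
    # Train all surviving models a little, keep the best-scoring
    # half, and double the training budget for the survivors.
    while len(models) > 1:
        for model in models:
            train(model, n_iter)  # e.g. n_iter calls to partial_fit
        scores = [score(model) for model in models]
        best_first = np.argsort(scores)[::-1]  # assume higher is better
        models = [models[i] for i in best_first[: len(models) // 2]]
        n_iter *= 2
    return models[0]

Hyperband itself runs several such brackets, each starting with a different number of models and level of early stopping.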

They show it's competitive with Bayesian methods (Spearmint, TPE, SMAC) as well as random search, and even with random search given twice the computational budget, in figure 8 (which I think is their most realistic example):

[Figure 8 from the Hyperband paper]

I think taking a computational budget as input, rather than a number of configurations to evaluate, makes it even more attractive. I'm planning to verify the gains and then implement this algorithm in Dask. I'll get to this soon: I'll have more time after May 4th, and certainly after the semester ends.

@stsievert (Member) commented May 24, 2018

A while back @mrocklin mentioned that building a framework for adaptive hyperparameter selection could allow easier integration into dask-searchcv. I think it'd look something like:

alg = AdaptiveHyperParameterSearch()

models, losses = None, None
while True:
    configs = alg.get_configs(models, losses)
    models = [model.set_params(**config) for config in configs]
    losses = [validation_loss(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        break

best_model, best_config = alg.best_model, alg.best_config

This is the framework adaptive algorithms require. It allows both asking for new configurations with get_configs and processing the results with filter.
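
For concreteness, here's a toy instantiation of that interface: plain random search over a discrete space with a fixed budget (all names hypothetical; models are assumed to be scikit-learn-style estimators):

import random

class RandomSearch:
    # Toy instantiation of the interface above: propose random
    # configs, remember the best model seen, stop after a budget.
    def __init__(self, space, budget=10):
        self.space, self.budget = space, budget
        self.stop = False
        self.best_model, self.best_config = None, None
        self._best_loss = float("inf")

    def get_configs(self, models, losses):
        self.budget -= 1
        self.stop = self.budget <= 0
        return [{k: random.choice(v) for k, v in self.space.items()}]

    def filter(self, models, losses):
        for model, loss in zip(models, losses):
            if loss < self._best_loss:
                self._best_loss = loss
                self.best_model, self.best_config = model, model.get_params()
        return models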

@ogrisel commented May 28, 2018

But is alg.get_configs a blocking call that waits for all the currently scheduled tasks to complete before returning (therefore implementing a generation-based loop)? I think I'd like the main loop to use the raw dask.distributed API explicitly rather than wrapping it in a class:

from dask.distributed import Client, as_completed

client = Client()

# MyOptimizer, fit_and_score, problem_definition, and n_workers
# are placeholders.
problem_definition = ...
optimizer = MyOptimizer(problem_definition)
futures = [client.submit(fit_and_score, params)
           for params in optimizer.initial_guesses(n_workers)]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    optimizer.add_result(result)
    if optimizer.has_converged():
        break
    params = optimizer.compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

@jimmywan commented Mar 7, 2019

I've been following this on and off for a while and am very excited to see it become available.
Any update on when the work might be complete?

@TomAugspurger (Member) commented Mar 7, 2019 via email

I believe Scott is planning to finish it off in a couple weeks.

@stsievert (Member) commented Mar 8, 2019

> I believe Scott is planning to finish it off in a couple weeks.

Yup, that's the plan with #221.

@jimmywan
Saw that #221 just got merged. Woohoo! Can't wait to use this!

@TomAugspurger (Member) commented May 6, 2020

https://distill.pub/2020/bayesian-optimization has some good background reading on Bayesian optimization.

@stsievert (Member) commented May 7, 2020

> https://distill.pub/2020/bayesian-optimization has some good background

Thanks! I like the simple explanation.

Hyperband and Bayesian optimization work well together, and the combination has seen some previous work (BOHB: https://ml.informatik.uni-freiburg.de/papers/18-ICML-BOHB.pdf). Hyperband samples parameters randomly and runs the different brackets completely in parallel. Instead, why not refine the parameter space between brackets? The most aggressive bracket is run first to narrow the search space; then a slightly less aggressive bracket runs on the refined space, and so on.

Here's their algorithm in pseudocode:

# bracket 0 has no early stopping; bracket 5 has the most early stopping
# (the bracket index would control how aggressively each search stops)
params = {...}  # user-defined search space
for bracket in reversed(range(6)):
    search = SuccessiveHalvingSearchCV(model, params)
    search.fit(...)

    # narrow the search space using this bracket's results
    params = refine(params, search)

# return the best-performing model

The implementation backing this is at https://automl.github.io/HpBandSter/build/html/optimizers/bohb.html.
