Adaptive hyper-parameter optimization #161

Open · mrocklin opened this issue Apr 19, 2018 · 13 comments

@mrocklin (Member)

Current options for hyper-parameter optimization (grid search, random search) construct a task graph to search many options, submit that graph, and then wait for the result. However, we might instead submit only enough options to saturate our available workers, get some of those results back, and then, based on those results, submit more work asynchronously. This might explore the parameter space more efficiently.

This would deviate from vanilla task graphs and would probably require the futures API (and the as_completed function specifically), which imposes a hard dependency on the dask.distributed scheduler.

Here is a simplified and naive implementation showing the use of dask.distributed.as_completed:

from dask.distributed import Client, as_completed

client = Client()

# fit_and_score, good_enough, and compute_next_guess are placeholders
initial_guesses = [...]
futures = [client.submit(fit_and_score, params) for params in initial_guesses]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    if good_enough():
        break
    params = compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Additionally, I suspect that this problem is further complicated by pipelines. The hierarchical, tree-like structure of a pipeline grid search means it probably makes sense to alter the parameters of the later stages of the pipeline more often than the earlier stages. My simple code example above probably doesn't do the full problem justice.

cc @ogrisel @stsievert @GaelVaroquaux

@rth (Contributor) commented Apr 20, 2018

> I strongly suspect that scholarly work exists in this area to evaluate stopping criteria and compute optimal next guesses. What is the state of the art?

Computing the next guess could be done e.g. with scikit-optimize; see in particular the Bayesian optimization approach in gp_minimize. Also see MOE, though it's a heavier dependency and its Python packaging doesn't seem to be ideal yet.
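
For illustration, a minimal gp_minimize sketch; the objective here is a stand-in for an actual fit-and-score routine:

from skopt import gp_minimize

# Stand-in objective: in practice this would train a model with the
# given hyper-parameters and return a validation loss to minimize.
def objective(params):
    (C,) = params
    return (C - 0.1) ** 2

result = gp_minimize(objective, dimensions=[(1e-3, 1.0)], n_calls=20)
print(result.x, result.fun)  # best parameters and best loss found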

There are also Tree-structured Parzen Estimator (TPE) approaches. For instance, this NeuPy blog post gives a general overview of the topic (it also includes a good list of references at the end).
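
For reference, hyperopt exposes TPE through roughly this interface (the objective is again a stand-in):

from hyperopt import fmin, tpe, hp

# Stand-in objective: hyperopt minimizes the returned value.
def objective(C):
    return (C - 0.1) ** 2

best = fmin(fn=objective, space=hp.uniform("C", 1e-3, 1.0),
            algo=tpe.suggest, max_evals=50)
print(best)  # best hyper-parameters found, e.g. {'C': ...}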

I don't know whether any of this is robust enough to consistently outperform random hyper-parameter search for the algorithms used in dask-ml though.

@ogrisel commented Apr 20, 2018

Here is another good candidate algorithm for the sequential learning part: http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html
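
For reference, that method is exposed in dlib as find_min_global; a minimal sketch with a stand-in objective:

import dlib

# Stand-in objective over two hyper-parameters.
def objective(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

best_params, best_score = dlib.find_min_global(
    objective,
    [1e-4, 1e-4],  # lower bounds for (lr, reg)
    [1.0, 1.0],    # upper bounds for (lr, reg)
    50,            # number of objective evaluations
)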

Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though. They have to be tweaked to answer: given the current state, what are the next n_workers candidates to explore approximately in parallel? It's probably doable to hack some tabu-search-like heuristics such as:

  • given state s_t, compute the next best guess g_(t,0) = make_guess(s_t) and schedule eval(g_(t,0));
  • while waiting for eval(g_(t,0)) to finish, pretend it has returned some average/median value e_(t,0), ask for the next "second best guess" g_(t,1) = make_guess(s_t ∪ e_(t,0)), and schedule eval(g_(t,1)) to run in parallel with eval(g_(t,0));
  • keep pretending that the recently scheduled evaluations returned immediate but average results, generating third, fourth, ... best guesses as long as there are free workers.

As soon as we get the true result of any evaluation, let's update state s_t into s_(t+1) and use that to generate the subsequent evaluation candidate.
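
Incidentally, scikit-optimize implements a variant of this "constant liar" idea: when asking for several points at once, each pending point is temporarily assumed to have returned, e.g., the mean of the results observed so far. A minimal sketch (search space and losses are placeholders):

from skopt import Optimizer

opt = Optimizer(dimensions=[(1e-3, 1.0)])  # placeholder search space

# Ask for 4 points to evaluate in parallel; "cl_mean" pretends each
# pending point returned the mean of the observed results so far.
points = opt.ask(n_points=4, strategy="cl_mean")

# ... evaluate the points in parallel, then report the true results:
for x, y in zip(points, [0.5, 0.3, 0.7, 0.2]):  # dummy losses
    opt.tell(x, y)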

@TomAugspurger (Member)

> Most (all?) of those sequential learning algorithms will yield one next suggestion at a time though.

https://arxiv.org/pdf/1206.2944.pdf gives one strategy for choosing the next point given N finished evaluations and J pending ones (section 3.2):

> Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.

@ogrisel commented Apr 20, 2018

Interesting. My variant is probably going to explore more than what is proposed in "Practical Bayesian Optimization of Machine Learning Algorithms".

@stsievert (Member)

I think this is a critical issue. Earlier this year I spent many hours tuning hyperparameters and didn't have the compute resources to waste on poorly performing models. Eventually, I had to reduce my parameter space to a very simple one (which cost me at least two months of research effort).

> …to evaluate stopping criteria and compute optimal next guesses.

Hyperband treats hyperparameter optimization as a resource-allocation problem, not as fitting and optimizing some function. It tries to speed up random evaluation of hyperparameters rather than performing any sort of model-based optimization, and it does so by devoting more compute time to well-performing models.

This is possible by exploiting the iterative training process of ML models. At first, Hyperband trains many models for a few iterations; it then stops, or "kills", some fraction of the worst performers and repeats. That is, part of Hyperband trains 100 models for 1 iteration, then the best 50 models for 2 iterations, then the best 25 models for 4 iterations, and so on. The theory backing this reformulation (the multi-armed bandit problem) is natural and well studied (though it takes some effort to apply it here).
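
To make that loop concrete, here's a toy successive-halving sketch; train and score are stand-ins for partial_fit calls and validation scoring:

import numpy as np

def successive_halving(models, train, score, n_iter=1):
    # Train all surviving models a little, keep the best-scoring
    # half, and double the training budget for the survivors.
    while len(models) > 1:
        for model in models:
            train(model, n_iter)  # e.g. n_iter calls to partial_fit
        scores = [score(model) for model in models]
        best_first = np.argsort(scores)[::-1]  # assume higher is better
        models = [models[i] for i in best_first[: len(models) // 2]]
        n_iter *= 2
    return models[0]

Hyperband itself runs several such brackets, each starting with a different number of models and level of early stopping.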

They show it's competitive with Bayesian methods (Spearmint, TPE, SMAC) as well as random search, and even with random search given twice the computational budget, in figure 8 (which I think is their most realistic example):

[Figure 8 from the Hyperband paper]

I think taking a computational budget as input, rather than a number of configurations to evaluate, makes it even more attractive. I'm planning to verify the gains and then implement this algorithm in Dask. I'll get to this soon: I'll have more time after May 4th, and certainly after the semester ends.

@stsievert (Member) commented May 24, 2018

A while back @mrocklin mentioned that building a framework for adaptive hyperparameter selection could allow easier integration into dask-searchcv. I think it'd look something like:

alg = AdaptiveHyperParameterSearch()

models, losses = None, None
while True:
    configs = alg.get_configs(models, losses)
    models = [model.set_params(**config) for config in configs]
    losses = [validation_loss(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        break

best_model, best_config = alg.best_model, alg.best_config

This is the framework adaptive algorithms require. It allows both asking for new configurations with get_configs and processing the results with filter.
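
For concreteness, here's a toy instantiation of that interface: plain random search over a discrete space with a fixed budget (all names hypothetical; models are assumed to be scikit-learn-style estimators):

import random

class RandomSearch:
    # Toy instantiation of the interface above: propose random
    # configs, remember the best model seen, stop after a budget.
    def __init__(self, space, budget=10):
        self.space, self.budget = space, budget
        self.stop = False
        self.best_model, self.best_config = None, None
        self._best_loss = float("inf")

    def get_configs(self, models, losses):
        self.budget -= 1
        self.stop = self.budget <= 0
        return [{k: random.choice(v) for k, v in self.space.items()}]

    def filter(self, models, losses):
        for model, loss in zip(models, losses):
            if loss < self._best_loss:
                self._best_loss = loss
                self.best_model, self.best_config = model, model.get_params()
        return models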

@ogrisel commented May 28, 2018

But is alg.get_configs a blocking call that waits for all the currently scheduled tasks to complete before returning (therefore implementing a generation-based loop)? I think I'd like the main loop to use the raw dask.distributed API explicitly rather than wrapping it in a class:

from dask.distributed import Client, as_completed

client = Client()

# MyOptimizer, fit_and_score, problem_definition, and n_workers
# are placeholders.
problem_definition = ...
optimizer = MyOptimizer(problem_definition)
futures = [client.submit(fit_and_score, params)
           for params in optimizer.initial_guesses(n_workers)]

seq = as_completed(futures, with_results=True)
for future, result in seq:
    optimizer.add_result(result)
    if optimizer.has_converged():
        break
    params = optimizer.compute_next_guess()
    new_future = client.submit(fit_and_score, params)
    seq.add(new_future)

@jimmywan commented Mar 7, 2019

I've been following this on and off for a while and am very excited to see it become available.
Any update on when the work might be complete?

@TomAugspurger (Member) commented Mar 7, 2019 via email

I believe Scott is planning to finish it off in a couple weeks.

@stsievert (Member) commented Mar 8, 2019

> I believe Scott is planning to finish it off in a couple weeks.

Yup, that's the plan with #221.

@jimmywan
Saw that #221 just got merged. Woohoo! Can't wait to use this!

@TomAugspurger (Member) commented May 6, 2020

https://distill.pub/2020/bayesian-optimization has some good background reading on Bayesian optimization.

@stsievert (Member) commented May 7, 2020

> https://distill.pub/2020/bayesian-optimization has some good background

Thanks! I like the simple explanation.

Hyperband and Bayesian optimization work well together, and the combination has seen some previous work (BOHB: https://ml.informatik.uni-freiburg.de/papers/18-ICML-BOHB.pdf). Hyperband samples parameters randomly and runs the different brackets completely in parallel. Instead, why not refine the parameter space between brackets? The most aggressive bracket is run first to narrow the search space; then a slightly less aggressive bracket runs on the refined space, and so on.

Here's their algorithm in pseudocode:

# bracket 0 has no early stopping; bracket 5 has the most early stopping
# (the bracket index would control how aggressively each search stops)
params = {...}  # user-defined search space
for bracket in reversed(range(6)):
    search = SuccessiveHalvingSearchCV(model, params)
    search.fit(...)

    # narrow the search space using this bracket's results
    params = refine(params, search)

# return the best-performing model

The implementation backing this is at https://automl.github.io/HpBandSter/build/html/optimizers/bohb.html.
