
Commit

Add post draft.

aterenin committed Nov 14, 2024
1 parent dde0961 commit 131ba73
Showing 4 changed files with 163 additions and 6 deletions.
4 changes: 2 additions & 2 deletions content/2023-12-10-Stochastic-Gradient-Descent-GP/index.md
@@ -83,7 +83,7 @@ In the paper, use stochastic gradient descent with Nesterov momentum, gradient c
Let's see how this algorithm performs, in particular how it is affected by observation noise in the likelihood.

{% figure(alt=["Convergence of stochastic gradient descent for the Gaussian process mean"] src=["exact_metrics.svg"] dark_invert=[true]) %}
-**Figure 2.** Convergence of stochastic gradient descent for the Gaussian process posterior mean, in terms of training and test error, along with Euclidean error for the representer weights
+**Figure 2.** Convergence of stochastic gradient descent for the Gaussian process posterior mean, in terms of training and test error, along with Euclidean error for the representer weights.
{% end %}

From this plot, it is clear that stochastic gradient descent does not converge approximately to the correct representer weights.
@@ -151,7 +151,7 @@ This suggests the benign non-convergence, which we previously saw in one dimensi

# Conclusion

-In this, we explored using stochastic gradient descent to approximately compute Gaussian process posteriors, by way of means and function samples.
+In this work, we explored using stochastic gradient descent to approximately compute Gaussian process posteriors, by way of means and function samples.
We examined how to derive appropriate stochastic optimization objectives for doing so, and showed that SGD can produce accurate predictions even in cases where it does not converge to the respective optimum under the given compute budget.
We developed a spectral characterization of the effect of non-convergence in terms of the spectral basis functions.
We showed that, on a Thompson sampling benchmark where well-calibrated uncertainty is critical, SGD matches or exceeds the performance of more computationally expensive baselines.
163 changes: 159 additions & 4 deletions content/2024-12-10-Pandoras-Box-BayesOpt/index.md
@@ -1,6 +1,161 @@
+++
-title = "Redirect"
-template = "redirect.html"
+title = "Cost-aware Bayesian Optimization via the Pandora's Box Gittins Index"
[extra]
-redirect_to = "https://arxiv.org/abs/2406.20062"
-+++
+authors = [
+{name = "Qian Xie", url = "https://qianjanexie.github.io"},
+{name = "Raul Astudillo", url = "https://raulastudillo.netlify.app"},
+{name = "Peter I. Frazier", url = "https://people.orie.cornell.edu/pfrazier/"},
+{name = "Ziv Scully", url = "https://ziv.codes"},
+{name = "Alexander Terenin", url = "https://avt.im/"},
+]
+venue = {name = "NeurIPS", date = 2024-12-10, url = "https://neurips.cc/Conferences/2024"}
+buttons = [
+{name = "Paper", url = "https://openreview.net/forum?id=Ouc1F0Sfb7"},
+{name = "PDF", url = "https://arxiv.org/pdf/2406.20062"},
+{name = "Code", url = "https://github.com/QianJaneXie/PandoraBayesOpt"},
+]
+katex = true
+large_card = true
++++

Bayesian optimization is everywhere: from hyperparameter tuning, to AI-for-science applications, to systems like AlphaGo[^alphago] that have captured the imagination of the general public, it is well-recognized as an effective way to perform black-box global optimization in settings where efficiency is key.
At the core of Bayesian optimization algorithms is the question of *how one should design the acquisition function*, particularly in realistic settings which incorporate evaluation costs and other factors that go beyond the classical framework.

In this work, we introduce a novel cost-aware acquisition function design framework.
To do so, we connect a certain variant of cost-aware Bayesian optimization with the *Pandora's Box* problem from economics—a decision problem whose solution can be reinterpreted as an acquisition function.
Using this connection, we propose the *Pandora's Box Gittins Index* acquisition function for general cost-aware Bayesian optimization problems, named for its connection with Gittins index theory, an abstract framework for deriving acquisition-function-like decision rules for certain classes of decision problems.
We show this acquisition function performs strongly, especially on problems of moderate-to-high dimension.
Our work takes a first step towards bringing ideas from Gittins index theory into Bayesian optimization, which we optimistically believe has significant scope as an acquisition function design framework for cost-aware problems.


# Cost-aware Bayesian Optimization

In Bayesian optimization, we are interested in black-box global optimization of an unknown function `$f:X\to\mathbb{R}$`, which we model using a Gaussian process.
One can formulate a number of *cost-aware* variants of Bayesian optimization, which additionally incorporate a cost function `$c:X\to\mathbb{R}_+$` which models the cost of obtaining another sample.
For instance, in the *expected budget-constrained* variant of the problem, we are interested in algorithms which achieve a small expected *simple regret*
```
$$
\mathbb{E} \sup_{x\in X} f(x) - \mathbb{E} \sup_{1\leq t\leq T} f(x_t)
$$
```
subject to the budget constraint `$\mathbb{E} \sum_{t=1}^T c(x_t) \leq B$`, which holds in expectation.
We allow the algorithm to decide when to stop sampling: we denote the stopping time by `$T$`, and once the algorithm stops it returns the best point observed so far.

Our starting point is to ask: *are there simplified settings where one can analytically derive the optimal algorithm—that is, the one that achieves the smallest expected simple regret*?
As an example, in the non-cost-aware setting, one can derive the classical *expected improvement* acquisition function by considering a one-step greedy approximation to a dynamic program defined using simple regret.
In our paper, we show there is a second, different simplification that leads to an analytic solution—of a spatial rather than temporal character—which we now describe.


# Pandora's Box

Suppose that a decision-making agent is presented with a collection of closed boxes, labeled `$X=\{1,\ldots,N\}$`.
Each box has a reward `$f(x)$` inside, with a known distribution—for instance, Gaussian with mean `$\mu(x)$` and variance `$\sigma^2(x)$`.
The rewards inside all boxes are independent.
The agent is allowed to pay a cost `$c(x)$` to open any of the closed boxes, at which point the reward inside the box is revealed.
The agent is also allowed to take a reward from at most one open box, which ends the decision problem.
The agent's total value is therefore
```
$$
\mathbb{E}\left[\sup_{1\leq t\leq T} f(x_t) - \sum_{t=1}^T c(x_t)\right]
$$
```
where `$T+1$` is the time at which the agent decides to take a reward from an open box.[^bestopenbox]
Figure 1 illustrates the setup.


{% figure(alt=["Pandora's Box"] src=["pandoras_box.svg"] dark_invert=[true]) %}
**Figure 1.** A visual illustration of the Pandora's Box problem, with two closed boxes and one open box.
{% end %}

This already looks a lot like expected budget-constrained optimization, but with two differences:

1. In Bayesian optimization, the set `$X$` need not be finite, and the objective function values `$f(x)$` and `$f(x')$` for `$x \neq x'$` can be correlated.
2. Instead of incorporating costs using an expected budget constraint, we add them to the simple regret objective.

Of these differences, the first is significant: correlations are what allow us to perform Bayesian optimization with Gaussian processes which model smooth functions.
Given the importance of smoothness, this difference indicates we have departed some distance from the Bayesian optimization setup we started with.
So, why should one even consider what happens without correlations?
One reason, stated in our notation, is as follows.

**Theorem (Weitzman 1979).**
The optimal policy of the Pandora's Box problem takes the form of maximizing the acquisition function `$\alpha^\star$`, defined as
```
$$
\alpha^\star(x) = g \quad\text{where}\ g\ \text{solves}\quad \operatorname{EI}_f(x;g) = c(x)
$$
```
where `$\operatorname{EI}_\psi(x;y) = \mathbb{E} \max(0, \psi(x) - y)$` is the expected improvement function.

This means that the optimal policy in the Pandora's Box problem *takes the form of maximizing an acquisition function*.
This specific acquisition function `$\alpha^\star(x)$` is defined in terms of a root-finding problem whose objective resembles the classical expected improvement acquisition function.
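
To make the root-finding concrete, here is a minimal sketch of how one might compute `$\alpha^\star$` for a single Gaussian box (our own illustration, not the paper's reference implementation; all names and numbers are ours). The Gaussian expected improvement has a well-known closed form and is continuous and strictly decreasing in `$g$`, so bisection applies:
```python
import math

def gaussian_ei(mu: float, sigma: float, g: float) -> float:
    """Closed-form E[max(f - g, 0)] for f ~ N(mu, sigma^2)."""
    z = (mu - g) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - g) * cdf + sigma * pdf

def alpha_star(mu: float, sigma: float, cost: float, tol: float = 1e-10) -> float:
    """Solve EI(g) = cost for g by bisection; EI is strictly decreasing in g."""
    lo = hi = mu
    while gaussian_ei(mu, sigma, lo) < cost:  # expand bracket left until EI(lo) >= cost
        lo -= sigma
    while gaussian_ei(mu, sigma, hi) > cost:  # expand bracket right until EI(hi) <= cost
        hi += sigma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # the root lies to the right of mid while EI(mid) still exceeds the cost
        lo, hi = (mid, hi) if gaussian_ei(mu, sigma, mid) > cost else (lo, mid)
    return 0.5 * (lo + hi)

print(alpha_star(mu=0.0, sigma=1.0, cost=0.1))  # fair value of a standard Gaussian box
```
The point of the sketch is only that evaluating `$\alpha^\star$` reduces to a one-dimensional root-finding problem at each point.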

Before proceeding, let us also address the difference between the expected budget-constrained setup we started with, and the cost-per-sample setup used in the Pandora's Box problem.
In short, if tie-breaking is handled correctly, both problems have the same solution as a consequence of Lagrangian duality: we show one can solve the cost-per-sample Pandora's Box with costs `$\lambda c(\cdot)$`, where `$\lambda$` is the Lagrange multiplier, and obtain an optimal policy for an expected-budget-constrained Pandora's Box.[^lagrangian-duality]
This means correlations are the essential difference between the two setups.

# Solving Pandora's Box

Why does the solution to Pandora's Box take the form of maximizing an acquisition function?
What is the meaning of the root-finding problem which defines `$\alpha^\star$`?
One way to approach these questions is to first consider a simpler problem, involving just two boxes: one closed box with reward `$f$` and cost `$c$`, and one open box with reward `$g$`. Figure 2 illustrates this.

{% figure(alt=["Pandora's Box"] src=["pandoras_box_comparison.svg"] dark_invert=[true]) %}
**Figure 2.** A simplified Pandora's Box problem, with one closed and one open box.
{% end %}

Now, there's only one decision: we can either open the box, or leave it closed.
Let's see what the value is in both cases:

1. If we open the box, our reward is `$\mathbb{E} \max(f, g) - c$`.
2. If we don't open the box, our reward is `$g$`.

In the former expression, the maximum appears because we can only take one reward, and always choose to take the larger one.
From these expressions, it is easy to see that it is optimal to open the box if `$\mathbb{E}\max(f-g,0) \geq c$`.
In other words, it is optimal to open the closed box if its expected improvement is larger than its cost to open.
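
As a quick sanity check of this rule, with numbers chosen by us purely for illustration, take a closed box whose reward is standard Gaussian, an open box with `$g = 1/2$`, and cost `$c = 1/10$`: writing `$\varphi$` and `$\Phi$` for the standard Gaussian density and distribution function,
```
$$
\mathbb{E}\max(f - g, 0) = \varphi(\tfrac{1}{2}) - \tfrac{1}{2}\Phi(-\tfrac{1}{2}) \approx 0.352 - 0.154 = 0.198 > \tfrac{1}{10} = c
$$
```
so it is optimal to pay the cost and open the box.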

Now, we ask: *what would `$g$` need to be in order for both actions to be optimal?*
In such a situation, one can think of the closed box and open box as equivalent to one another, since the same expected reward is obtained no matter what decision is made.
As a consequence, the value of `$g$` for which `$\mathbb{E}\max(f-g,0) = c$` can be thought of as a kind of *fair price* or *fair value* for the closed box, depending on the sign convention in use.

The insight of Weitzman—and indeed, of Gittins, who discovered the same idea in a much more general setting—is that the original Pandora's Box problem can be solved by replacing closed boxes with equivalent open boxes one-by-one without affecting the optimal decision.
Using this procedure, the optimal decision is to order all boxes by their fair value, and open boxes according to this order until all remaining fair values are smaller than the best observed reward.
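
The procedure is simple enough to simulate directly. Below is a short, self-contained sketch for independent Gaussian boxes (our own code with illustrative numbers, not taken from the paper; `fair_value` solves the same root-finding problem as above, here via SciPy):
```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def fair_value(mu, sigma, cost):
    """The value g solving E[max(f - g, 0)] = cost for f ~ N(mu, sigma^2)."""
    ei = lambda g: (mu - g) * norm.cdf((mu - g) / sigma) + sigma * norm.pdf((mu - g) / sigma)
    return brentq(lambda g: ei(g) - cost, mu - 10 * sigma, mu + 10 * sigma)

def weitzman(boxes, rng):
    """Open boxes in decreasing order of fair value, stopping once no remaining
    fair value exceeds the best observed reward; return reward minus total cost."""
    indexed = sorted(((fair_value(*box), *box) for box in boxes), reverse=True)
    best, spent = -np.inf, 0.0
    for fv, mu, sigma, cost in indexed:
        if fv <= best:
            break  # every remaining fair value is below the best observed reward
        spent += cost  # pay to open the box and reveal its Gaussian reward
        best = max(best, rng.normal(mu, sigma))
    return best - spent

rng = np.random.default_rng(0)
boxes = [(0.0, 1.0, 0.1), (0.5, 0.5, 0.2), (-0.2, 2.0, 0.3)]  # (mu, sigma, cost)
print(np.mean([weitzman(boxes, rng) for _ in range(10_000)]))  # average net value
```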

# The Pandora's Box Gittins Index Acquisition Function

Our paper's primary contribution is to bring these ideas to Bayesian optimization.
For this, we need to handle correlations, which we propose to do in the obvious way: by plugging the correlated posterior distribution into `$\alpha^\star$`.
This gives the *Pandora's Box Gittins Index* acquisition function
```
$$
\alpha^{\operatorname{PBGI}}(x) = g \quad\text{where}\ g\ \text{solves}\quad \operatorname{EI}_{f\mid y}(x;g) = c(x)
$$
```
defined as the solution of a root-finding problem whose objective involves the posterior distribution and cost function.
One can use this in both cost-aware and classical settings, in the latter case by having the costs be a constant function whose value is treated as a hyperparameter.
Our paper describes a number of properties of this acquisition function, such as its gradient, and connections with other acquisition functions such as UCB and expected improvement.
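
To illustrate, here is a hedged sketch of how one might evaluate `$\alpha^{\operatorname{PBGI}}$` on a finite candidate set, given posterior means and standard deviations from any Gaussian process library (a simplified illustration of ours, not the paper's actual implementation, which lives in the linked code repository):
```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def pbgi(post_mean, post_std, cost):
    """Solve EI_{f|y}(x; g) = c(x) in g independently at each candidate x."""
    def index(mu, sigma, c):
        ei = lambda g: (mu - g) * norm.cdf((mu - g) / sigma) + sigma * norm.pdf((mu - g) / sigma)
        return brentq(lambda g: ei(g) - c, mu - 10 * sigma, mu + 10 * sigma)
    return np.array([index(m, s, c) for m, s, c in zip(post_mean, post_std, cost)])

# Illustrative posterior values; in practice these come from a fitted GP.
mean = np.array([0.1, 0.4, -0.2])
std = np.array([1.0, 0.3, 1.5])
cost = np.full(3, 0.05)  # constant costs recover the classical, non-cost-aware setting
x_next = int(np.argmax(pbgi(mean, std, cost)))  # candidate to evaluate next
```
Consistent with the UCB connection mentioned above, the index increases with the posterior standard deviation, and smaller costs make it more exploratory.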
Let's see how this acquisition function performs.


# Experiments

Baselines

# Future Work

Beyond vanilla cost-aware Bayesian optimization, we think approaches like ours can give rise to a broad class of acquisition functions for settings with more complicated forms of feedback.
For instance, one can consider analogues of the one-closed-box and one-open-box setup, and replace the closed box with a general stochastic process.
In such settings, Gittins index theory applies as well, and the optimal policy for the discrete, uncorrelated problem takes the form of an acquisition function.
The challenge becomes how to compute what the acquisition function is.
We think this approach may be a promising angle of attack for various more complex forms of cost-aware Bayesian optimization, such as freeze-thaw, multi-fidelity, and other related setups.

# Conclusions

In this work, we connected cost-aware Bayesian optimization with the Pandora's Box problem from economics, and used this connection to derive the Pandora's Box Gittins Index acquisition function, taking a first step towards bringing Gittins index theory into Bayesian optimization.

# References

[^alphago]: See [this tweet](https://x.com/NandoDF/status/1791204574004498729) by Nando de Freitas, which calls Bayesian optimization one of the "secret ingredients" of AlphaGo—one which improved its win rate from 50% to 66.5% in self-play games through better hyperparameter tuning.

[^bestopenbox]: If there is more than one open box, we assume the agent always takes the largest reward among all open boxes, since this leads to a higher value and is therefore the optimal decision in this situation. We also need to handle tie-breaking: see the paper's appendix for details on this.

[^lagrangian-duality]: Our results extend the work of Aminian et al., who prove the same result in a certain extension of Pandora's Box under the assumption of discrete support. We prove the result only for the Pandora's Box model, but our argument allows for continuous support—a detail which makes the argument significantly more challenging due to subtleties involving envelope theorems. A discussion on this can be found in the paper.
1 change: 1 addition & 0 deletions content/2024-12-10-Pandoras-Box-BayesOpt/pandoras_box.svg
