Enable inference regret (#2782)
Summary:
Pull Request resolved: #2782

# Context:

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the `BenchmarkRunner`. This produces an `optimization_trace` used for measuring performance. (For MOO, the hypervolume of all points tested is computed.)
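As a toy illustration only (not the actual implementation), the oracle trace for a single-objective minimization problem is just a running best of oracle values:

```python
import numpy as np

# Toy illustration only: oracle values of the points tested so far, evaluated
# noiselessly at the target task and fidelity (made-up numbers).
oracle_values = np.array([3.2, 1.7, 2.5, 0.9])
# For a minimization problem, the trace is the running best value seen so far.
oracle_trace = np.minimum.accumulate(oracle_values)  # [3.2, 1.7, 1.7, 0.9]
```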

While this trace does a good job of capturing whether a good point has been tested, it does not capture *inference regret*: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important, both for getting a good measure of absolute performance and for comparing methods (a sketch of the distinction follows the list below), in contexts such as
* Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
* Multi-fidelity problems, where simply evaluating as many cheap (low-fidelity) arms as possible maximizes the current metric for optimization value
* Noisy problems, if different best-point selection strategies are being considered.
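To make the distinction concrete, here is a minimal sketch using hypothetical `recommend` and `oracle` helpers (these are stand-ins for illustration, not Ax APIs), assuming minimization:

```python
# Hypothetical helpers, for illustration only:
#   recommend(observed_data) -> the parameterization the method would pick,
#       using only data it could realistically observe (possibly noisy);
#   oracle(params) -> the noiseless value at the target task and fidelity.

def best_tested_value(tested_params, oracle):
    # What the current optimization_trace captures: was a good point *tested*?
    return min(oracle(p) for p in tested_params)

def inference_value(observed_data, recommend, oracle):
    # What the new inference trace captures: can the method *identify* a good
    # point from the data it actually observed?
    return oracle(recommend(observed_data))
```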

# Open questions
* Should inference value always be computed? My take: Yes; it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: Always computes this.
* Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: Always computes this.
* If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; `BenchmarkResult.optimization_trace` is either the `inference_trace` or the `oracle_trace`, with the `BenchmarkProblem` specifying which one.
* Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
* To what degree do we want to rely on Ax's `BestPointMixin` functionality, which is pretty stale, is missing functionality we want, requires constructing dummy `Experiment`s, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address (or enable addressing) all these issues.
* When should the trace be updated in async settings?
* This diff adds support for SOO and MOO and for `n_best_points`, but only supports SOO with 1 best point. That's a lot of infra for raising `NotImplementedError`s. Is this what we want?
* In-sample and out-of-sample: Currently, I'm not using these terms at all, since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
* When people develop best-point functionality in the future, would they do it by updating or adding options to `BestPointMixin._get_trace`? I wrote this under the assumption that they would either do that or use a similar method that consumes an `experiment` and an `optimization_config` and can access the `generation_strategy` used.

# This diff

## High-level changes

Technically, this adds "inference value" rather than "inference regret", because it is not relative to the optimum; that gives it the same sign as the default `optimization_trace`. It is always computed and returned on the `BenchmarkResult`. The old trace is renamed the `oracle_trace`. `optimization_trace` continues to exist; it can be either the `oracle_trace` (default) or the `inference_trace`, depending on what the `BenchmarkProblem` specifies. The `BenchmarkMethod` is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but it can be overridden.
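As a usage sketch (not taken verbatim from the diff; argument names other than the new `report_inference_value_as_trace` follow the existing `create_problem_from_botorch` signature):

```python
from ax.benchmark.benchmark_problem import create_problem_from_botorch
from botorch.test_functions.synthetic import Branin

problem = create_problem_from_botorch(
    test_problem_class=Branin,
    test_problem_kwargs={},
    num_trials=30,
    # New in this diff: the BenchmarkResult's optimization_trace will be the
    # inference trace rather than the oracle trace.
    report_inference_value_as_trace=True,
)
```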

There are major limitations:
* *The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value*: The `BenchmarkProblem` specifies `n_best_points`, the number of points returned as "best." For MOO, we would want `n_best_points > 1` and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors: if we use hypervolume and don't cap `n_best_points`, the ideal best-point selector would return every point, and metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting `k` points that maximize expected hypervolume.
* Different best-point selectors can be compared by using a different `BenchmarkMethod`, either by passing different `best_point_kwargs` to the `BenchmarkMethod` or by subclassing `BenchmarkMethod` and overriding `get_best_parameters`.

## Detailed changes

### BenchmarkResult
Docstrings ought to be self-explanatory.
* The old `optimization_trace` becomes `oracle_trace`
* It always has an `inference_trace` as well as an `oracle_trace`
* The `optimization_trace` can be either, depending on what the `BenchmarkProblem` specifies.
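A rough usage sketch, assuming a `problem` and `method` have already been constructed (field names as in the diff below):

```python
from ax.benchmark.benchmark import benchmark_replication

result = benchmark_replication(problem=problem, method=method, seed=0)
result.oracle_trace        # running best oracle value (or hypervolume) of tested arms
result.inference_trace     # oracle value of the recommended point after each trial
result.optimization_trace  # one of the two above, per the problem's setting
```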

### `benchmark_replication`
* Computes the inference value trace by getting the best point(s) after the scheduler runs each trial (and scoring them with the oracle at the end). Note that incomplete trials can thus be used, since best-point selection may happen before a trial completes.
* For MOO, the plan is to find K Pareto-optimal parameterizations (according to the model), get their oracle values, and compute the hypervolume of those oracle values. Concretely: construct a new experiment with one `BatchTrial` whose arms are the K Pareto-optimal parameterizations and whose metrics are oracle values, then use Ax's best-point functionality to get the hypervolume. This avoids re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best points.
* For SOO, finds the K best parameterizations (according to the model) and gets their oracle values. HOWEVER, K > 1 is currently unsupported. (A condensed sketch of the SOO flow follows this list.)
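A condensed sketch of the SOO, `n_best_points=1` flow described above; `run_one_trial` and `oracle_value` are hypothetical stand-ins, and the actual loop appears in the `benchmark.py` diff below:

```python
def sketch_inference_trace(problem, method, scheduler, experiment, run_one_trial, oracle_value):
    best_params_by_trial = []
    for _ in range(problem.num_trials):
        run_one_trial(scheduler)  # hypothetical stand-in for running one trial
        # Best point(s) per the method, using only realistically observable data.
        best_params_by_trial.append(
            method.get_best_parameters(
                experiment=experiment,
                optimization_config=problem.optimization_config,
                n_points=problem.n_best_points,
            )
        )
    # Afterwards, score each recommendation with the oracle to build the trace.
    return [oracle_value(problem, params) for params in best_params_by_trial]
```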

### `BenchmarkProblem`
* Gets an attribute `report_inference_value_as_trace` that makes the `BenchmarkResult`'s `optimization_trace` be the inference value when the problem specifies that inference value should be used. Docstrings should be self-explanatory.
* Adds `n_best_points` to `BenchmarkProblem`. Docstrings should be self-explanatory.

### `BenchmarkMethod`
* Adds a method `get_best_parameters` and an attribute `best_point_kwargs`. If not overridden, `get_best_parameters` uses `BestPointMixin._get_trace` and passes it the `best_point_kwargs`.
* Currently, the only supported argument in `best_point_kwargs` is "use_model_predictions".
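A sketch of comparing best-point selectors by varying `best_point_kwargs`; `gs` and `opts` are assumed to be an existing `GenerationStrategy` and `SchedulerOptions`, respectively:

```python
from ax.benchmark.benchmark_method import BenchmarkMethod

# `gs` and `opts` are assumed to be defined elsewhere (a GenerationStrategy
# and SchedulerOptions, respectively).
empirical_best = BenchmarkMethod(
    name="mbm_empirical_best",
    generation_strategy=gs,
    scheduler_options=opts,
    best_point_kwargs={"use_model_predictions": False},
)
model_best = BenchmarkMethod(
    name="mbm_model_best",
    generation_strategy=gs,
    scheduler_options=opts,
    best_point_kwargs={"use_model_predictions": True},
)
```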

Reviewed By: Balandat

Differential Revision: D61930178
esantorella authored and facebook-github-bot committed Sep 24, 2024
1 parent 8ba8ce3 commit e0a1512
Showing 10 changed files with 408 additions and 25 deletions.
77 changes: 71 additions & 6 deletions ax/benchmark/benchmark.py
@@ -22,14 +22,15 @@
from collections.abc import Iterable
from itertools import product
from logging import Logger
from time import time
from time import monotonic, time

import numpy as np

from ax.benchmark.benchmark_method import BenchmarkMethod
from ax.benchmark.benchmark_problem import BenchmarkProblem
from ax.benchmark.benchmark_result import AggregatedBenchmarkResult, BenchmarkResult
from ax.core.experiment import Experiment
from ax.core.types import TParameterization
from ax.core.utils import get_model_times
from ax.service.scheduler import Scheduler
from ax.service.utils.best_point_mixin import BestPointMixin
@@ -93,12 +94,23 @@ def benchmark_replication(
method: BenchmarkMethod,
seed: int,
) -> BenchmarkResult:
"""Runs one benchmarking replication (equivalent to one optimization loop).
"""
Run one benchmarking replication (equivalent to one optimization loop).
After each trial, the `method` gets the best parameter(s) found so far, as
evaluated based on empirical data. After all trials are run, the `problem`
gets the oracle values of each "best" parameter; this yields the ``inference
trace``. The cumulative maximum of the oracle value of each parameterization
tested is the ``oracle_trace``.
Args:
problem: The BenchmarkProblem to test against (can be synthetic or real)
method: The BenchmarkMethod to test
seed: The seed to use for this replication.
Return:
``BenchmarkResult`` object.
"""

experiment = Experiment(
@@ -113,19 +125,70 @@
generation_strategy=method.generation_strategy.clone_reset(),
options=method.scheduler_options,
)
timeout_hours = scheduler.options.timeout_hours

# list of parameters for each trial
best_params_by_trial: list[list[TParameterization]] = []

is_mf_or_mt = len(problem.runner.target_fidelity_and_task) > 0
# Run the optimization loop.
timeout_hours = scheduler.options.timeout_hours
with with_rng_seed(seed=seed):
scheduler.run_n_trials(max_trials=problem.num_trials)
start = monotonic()
for _ in range(problem.num_trials):
next(
scheduler.run_trials_and_yield_results(
max_trials=1, timeout_hours=timeout_hours
)
)
if timeout_hours is not None:
elapsed_hours = (monotonic() - start) / 3600
timeout_hours = timeout_hours - elapsed_hours
if timeout_hours <= 0:
break

if problem.is_moo or is_mf_or_mt:
# Inference trace is not supported for MOO.
# It's also not supported for multi-fidelity or multi-task
# problems, because Ax's best-point functionality doesn't know
# to predict at the target task or fidelity.
continue

best_params = method.get_best_parameters(
experiment=experiment,
optimization_config=problem.optimization_config,
n_points=problem.n_best_points,
)
best_params_by_trial.append(best_params)

# Construct inference trace from best parameters
inference_trace = np.full(problem.num_trials, np.nan)
for trial_index, best_params in enumerate(best_params_by_trial):
if len(best_params) == 0:
inference_trace[trial_index] = np.nan
continue
# Construct an experiment with one BatchTrial
best_params_oracle_experiment = problem.get_oracle_experiment_from_params(
{0: {str(i): p for i, p in enumerate(best_params)}}
)
# Get the optimization trace. It will have only one point.
inference_trace[trial_index] = BestPointMixin._get_trace(
experiment=best_params_oracle_experiment,
optimization_config=problem.optimization_config,
)[0]

oracle_experiment = problem.get_oracle_experiment_from_experiment(
actual_params_oracle_experiment = problem.get_oracle_experiment_from_experiment(
experiment=experiment
)
optimization_trace = np.array(
oracle_trace = np.array(
BestPointMixin._get_trace(
experiment=oracle_experiment,
experiment=actual_params_oracle_experiment,
optimization_config=problem.optimization_config,
)
)
optimization_trace = (
inference_trace if problem.report_inference_value_as_trace else oracle_trace
)

try:
# Catch any errors that may occur during score computation, such as errors
@@ -155,6 +218,8 @@ def benchmark_replication(
name=scheduler.experiment.name,
seed=seed,
experiment=scheduler.experiment,
oracle_trace=oracle_trace,
inference_trace=inference_trace,
optimization_trace=optimization_trace,
score_trace=score_trace,
fit_time=fit_time,
82 changes: 76 additions & 6 deletions ax/benchmark/benchmark_method.py
@@ -5,16 +5,20 @@

# pyre-strict

import logging
from dataclasses import dataclass
from dataclasses import dataclass, field

from ax.core.experiment import Experiment
from ax.core.optimization_config import (
MultiObjectiveOptimizationConfig,
OptimizationConfig,
)
from ax.core.types import TParameterization

from ax.modelbridge.generation_strategy import GenerationStrategy
from ax.service.utils.best_point_mixin import BestPointMixin
from ax.service.utils.scheduler_options import SchedulerOptions, TrialType
from ax.utils.common.base import Base
from ax.utils.common.logger import get_logger


logger: logging.Logger = get_logger("BenchmarkMethod")
from pyre_extensions import none_throws


@dataclass(frozen=True)
@@ -36,12 +40,78 @@ class BenchmarkMethod(Base):
`get_benchmark_scheduler_options`.
distribute_replications: Indicates whether the replications should be
run in a distributed manner. Ax itself does not use this attribute.
best_point_kwargs: Arguments passed to `get_pareto_optimal_parameters`
(if multi-objective) or `BestPointMixin._get_best_trial` (if
single-objective). Currently, the only supported argument is
`use_model_predictions`. However, note that if multi-objective,
best-point selection is not currently supported and
`get_pareto_optimal_parameters` will raise a `NotImplementedError`.
"""

name: str
generation_strategy: GenerationStrategy
scheduler_options: SchedulerOptions
distribute_replications: bool = False
best_point_kwargs: dict[str, bool] = field(
default_factory=lambda: {"use_model_predictions": False}
)

def get_best_parameters(
self,
experiment: Experiment,
optimization_config: OptimizationConfig,
n_points: int,
) -> list[TParameterization]:
"""
Get ``n_points`` promising points. NOTE: Only SOO with n_points = 1 is
supported.
The expected use case is that these points will be evaluated against an
oracle for hypervolume (if multi-objective) or for the value of the best
parameter (if single-objective).
For multi-objective cases, ``n_points > 1`` is needed. For SOO, ``n_points > 1``
reflects setups where we can choose some points which will then be
evaluated noiselessly or at high fidelity and then use the best one.
Args:
experiment: The experiment to get the data from. This should contain
values that would be observed in a realistic setting and not
contain oracle values.
optimization_config: The ``optimization_config`` for the corresponding
``BenchmarkProblem``.
n_points: The number of points to return.
"""
if isinstance(optimization_config, MultiObjectiveOptimizationConfig):
raise NotImplementedError(
"BenchmarkMethod.get_pareto_optimal_parameters is not currently "
"supported for multi-objective problems."
)

if n_points != 1:
raise NotImplementedError(
f"Currently only n_points=1 is supported. Got {n_points=}."
)

# SOO, n=1 case.
# Note: This has the same effect as Scheduler.get_best_parameters
result = BestPointMixin._get_best_trial(
experiment=experiment,
generation_strategy=self.generation_strategy,
optimization_config=optimization_config,
# pyre-fixme: Incompatible parameter type [6]: In call
# `get_pareto_optimal_parameters`, for 4th positional argument,
# expected `Optional[Iterable[int]]` but got `bool`.
**self.best_point_kwargs,
)
if result is None:
# This can happen if no points are predicted to satisfy all outcome
# constraints.
return []

i, params, prediction = none_throws(result)
return [params]


def get_benchmark_scheduler_options(
24 changes: 24 additions & 0 deletions ax/benchmark/benchmark_problem.py
@@ -74,6 +74,13 @@ class BenchmarkProblem(Base):
search_space: The search space.
runner: The Runner that will be used to generate data for the problem,
including any ground-truth data stored as tracking metrics.
report_inference_value_as_trace: Whether the ``optimization_trace`` on a
``BenchmarkResult`` should use the ``oracle_trace`` (if False,
default) or the ``inference_trace``. See ``BenchmarkResult`` for
more information. Currently, this is only supported for
single-objective problems.
n_best_points: Number of points for a best-point selector to recommend.
Currently, only ``n_best_points=1`` is supported.
"""

name: str
@@ -84,6 +91,17 @@

search_space: SearchSpace = field(repr=False)
runner: BenchmarkRunner = field(repr=False)
report_inference_value_as_trace: bool = False
n_best_points: int = 1

def __post_init__(self) -> None:
if self.n_best_points != 1:
raise NotImplementedError("Only `n_best_points=1` is currently supported.")
if self.report_inference_value_as_trace and self.is_moo:
raise NotImplementedError(
"Inference trace is not supported for MOO. Please set "
"`report_inference_value_as_trace` to False."
)

def get_oracle_experiment_from_params(
self,
@@ -285,6 +303,7 @@ def create_problem_from_botorch(
lower_is_better: bool = True,
observe_noise_sd: bool = False,
search_space: SearchSpace | None = None,
report_inference_value_as_trace: bool = False,
) -> BenchmarkProblem:
"""
Create a `BenchmarkProblem` from a BoTorch `BaseTestProblem`.
@@ -308,6 +327,10 @@
search_space: If provided, the `search_space` of the `BenchmarkProblem`.
Otherwise, a `SearchSpace` with all `RangeParameter`s is created
from the bounds of the test problem.
report_inference_value_as_trace: If True, indicates that the
``optimization_trace`` on a ``BenchmarkResult`` ought to be the
``inference_trace``; otherwise, it will be the ``oracle_trace``.
See ``BenchmarkResult`` for more information.
"""
# pyre-fixme [45]: Invalid class instantiation
test_problem = test_problem_class(**test_problem_kwargs)
@@ -364,4 +387,5 @@
num_trials=num_trials,
observe_noise_stds=observe_noise_sd,
optimal_value=optimal_value,
report_inference_value_as_trace=report_inference_value_as_trace,
)
43 changes: 34 additions & 9 deletions ax/benchmark/benchmark_result.py
@@ -33,15 +33,38 @@ class BenchmarkResult(Base):
name: Name of the benchmark. Should make it possible to determine the
problem and the method.
seed: Seed used for determinism.
optimization_trace: For single-objective problems, element i of the
optimization trace is the oracle value of the "best" point, computed
after the first i trials have been run. For multi-objective
problems, element i of the optimization trace is the hypervolume of
oracle values at a set of points, also computed after the first i
trials (even if these were ``BatchTrials``). Oracle values are
typically ground-truth (rather than noisy) and evaluated at the
target task and fidelity.
oracle_trace: For single-objective problems, element i of the
optimization trace is the best oracle value of the arms evaluated
after the first i trials. For multi-objective problems, element i
of the optimization trace is the hypervolume of the oracle values of
the arms in the first i trials (which may be ``BatchTrial``s).
Oracle values are typically ground-truth (rather than noisy) and
evaluated at the target task and fidelity.
inference_trace: Inference trace comes from choosing a "best" point
based only on data that would be observable in realistic settings
and then evaluating the oracle value of that point. For
multi-objective problems, we find a Pareto set and evaluate its
hypervolume.
There are several ways of specifying the "best" point: One could
pick the point with the best observed value, or the point with the
best model prediction, and could consider the whole search space,
the set of trials completed so far, etc. How the inference trace is
computed is specified by a best-point selector, which is an
attribute of the `BenchmarkMethod`.
Note: This is not "inference regret", which is a lower-is-better value
that is relative to the best possible value. The inference value
trace is higher-is-better if the problem is a maximization problem
or if the problem is multi-objective (in which case hypervolume is
used). Hence, it is signed the same as ``oracle_trace`` and
``optimization_trace``. ``score_trace`` is higher-is-better and
relative to the optimum.
optimization_trace: Either the ``oracle_trace`` or the
``inference_trace``, depending on whether the ``BenchmarkProblem``
specifies ``report_inference_value``. Having ``optimization_trace``
specified separately is useful when we need just one value to
evaluate how well the benchmark went.
score_trace: The scores associated with the problem, typically either
the optimization_trace or inference_value_trace normalized to a
0-100 scale for comparability between problems.
@@ -56,6 +79,8 @@ class BenchmarkResult(Base):
name: str
seed: int

oracle_trace: ndarray
inference_trace: ndarray
optimization_trace: ndarray
score_trace: ndarray

3 changes: 3 additions & 0 deletions ax/benchmark/methods/modular_botorch.py
@@ -48,6 +48,7 @@ def get_sobol_botorch_modular_acquisition(
name: Optional[str] = None,
num_sobol_trials: int = 5,
model_gen_kwargs: Optional[dict[str, Any]] = None,
best_point_kwargs: dict[str, bool] | None = None,
) -> BenchmarkMethod:
"""Get a `BenchmarkMethod` that uses Sobol followed by MBM.
@@ -64,6 +65,7 @@
`BatchTrial`s.
model_gen_kwargs: Passed to the BoTorch `GenerationStep` and ultimately
to the BoTorch `Model`.
best_point_kwargs: Passed to the created `BenchmarkMethod`.
Example:
>>> # A simple example
@@ -138,4 +140,5 @@ def get_sobol_botorch_modular_acquisition(
generation_strategy=generation_strategy,
scheduler_options=scheduler_options or get_benchmark_scheduler_options(),
distribute_replications=distribute_replications,
best_point_kwargs={} if best_point_kwargs is None else best_point_kwargs,
)