Enable inference regret (facebook#2782)
Summary:
Pull Request resolved: facebook#2782

# Context

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the `BenchmarkRunner`. This produces an `optimization_trace` used for measuring performance. (For MOO, the hypervolume of all points tested is computed.) While this trace does a good job of capturing whether a good point has been tested, it does not capture *inference regret*: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important (both for getting a good measure of absolute performance and for comparing methods) in contexts such as
* Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
* Multi-fidelity problems, where simply evaluating as many small arms as possible maximizes the current metric for optimization value
* Noisy problems, if different best-point selection strategies are being considered.

# Open questions

* Should inference value always be computed? My take: Yes, it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: Always computes this.
* Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: Always computes this.
* If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; `BenchmarkResult.optimization_trace` is one of the `inference_value_trace` and the `oracle_trace`, with the `BenchmarkProblem` specifying which one.
* Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
* To what degree do we want to rely on Ax's `BestPointMixin` functionality, which is pretty stale, missing functionality we want, requires constructing dummy `Experiment`s, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address or enable addressing all these issues.
* When should the trace be updated in async settings?
* This diff adds support for SOO and MOO and for `n_best_points`, but only supports SOO with 1 best point. That's a lot of infra for raising `NotImplementedError`s. Is this what we want?
* In-sample and out-of-sample: Currently, I'm not using these terms at all, since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
* When people develop best-point functionality in the future, would they do it by updating or adding options to `BestPointMixin._get_trace`? I wrote this under the assumption that they would either do that or use a similar method that consumes an `experiment` and `optimization_config` and can access the `generation_strategy` used.

# This diff

## High-level changes

Technically, this adds "inference value" rather than "inference regret", because it is not relative to the optimum. That gives it the same sign as the default `optimization_trace`. It is always computed and returned on the `BenchmarkResult`. The old trace is renamed the `oracle_trace`.
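For intuition, here is a toy, self-contained sketch of how the two traces differ on a noisy minimization problem. This is in no way the Ax implementation: the objective, the random-search generator, and the empirical-best selector below are all stand-ins.

```python
import random


def oracle(params: dict[str, float]) -> float:
    """Noiseless objective at the target task/fidelity (toy stand-in)."""
    return (params["x"] - 0.3) ** 2


def observe(params: dict[str, float]) -> float:
    """Noisy observation, as the optimizer actually sees it."""
    return oracle(params) + random.gauss(0.0, 0.1)


def recommend(observations: list[tuple[dict[str, float], float]]) -> dict[str, float]:
    """Empirical-best selector (analogue of use_model_predictions=False)."""
    return min(observations, key=lambda obs: obs[1])[0]


observations: list[tuple[dict[str, float], float]] = []
oracle_trace: list[float] = []     # best oracle value among points tested so far
inference_trace: list[float] = []  # oracle value of the currently recommended point

for _ in range(20):
    candidate = {"x": random.uniform(0.0, 1.0)}  # stand-in for trial generation
    observations.append((candidate, observe(candidate)))
    oracle_trace.append(min(oracle(p) for p, _ in observations))
    inference_trace.append(oracle(recommend(observations)))
```

In this sketch the oracle trace is monotone by construction, while the inference trace can regress when noise misleads the selector; that gap is what the inference value trace is meant to make visible.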
`optimization_trace` continues to exist; it can be either the `oracle_trace` (default) or the `inference_trace`, depending on what the `BenchmarkProblem` specifies. The `BenchmarkMethod` is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but it can be overridden.

There are major limitations:
* *The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value*: The `BenchmarkProblem` specifies `n_best_points`, the number of points returned as the best, and for MOO we would want `n_best_points > 1` and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors: if we use hypervolume and don't cap `n_best_points`, the ideal best-point selector would return every point. Metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting `k` points to maximize expected hypervolume.
* Different best-point selectors can be compared by using a different `BenchmarkMethod`, either by passing different `best_point_kwargs` to the `BenchmarkMethod` or by subclassing `BenchmarkMethod` and overriding `get_best_parameters`.

## Detailed changes

### `BenchmarkResult`

Docstrings ought to be self-explanatory.
* The old `optimization_trace` becomes `oracle_trace`.
* It always has an `inference_value_trace` as well as an `oracle_trace`.
* The `optimization_trace` can be either, depending on what the `BenchmarkProblem` specifies.

### `benchmark_replication`

* Computes inference value each time the scheduler generates a trial. Note that incomplete trials can thus be used, since this computation can happen before the trial completes.
* For MOO, this should find K Pareto-optimal parameterizations (according to the model), get their oracle values, and compute the hypervolume of those oracle values, in the following manner: construct a new experiment with one `BatchTrial` whose arms are the K Pareto-optimal parameterizations and whose metric values are their oracle values, and use Ax's best-point functionality to get the hypervolume. This is done to avoid re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best points.
* For SOO, finds the K best parameterizations (according to the model) and gets their oracle values. HOWEVER, K > 1 is currently unsupported.

### `BenchmarkProblem`

* Gets an attribute `report_inference_value_as_trace` that makes the `BenchmarkResult`'s `optimization_trace` be the inference value trace when the problem specifies that inference value should be used. Docstrings should be self-explanatory.

### `BenchmarkMethod`

* Adds a method `get_best_parameters` and an attribute `best_point_kwargs`. If not overridden, `get_best_parameters` uses `BestPointMixin._get_trace` and passes it the `best_point_kwargs`.
* Currently, the only supported argument in `best_point_kwargs` is "use_model_predictions". (A toy sketch of comparing best-point selectors under this interface appears at the end of this summary.)

Reviewed By: Balandat

Differential Revision: D61930178
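Postscript (illustrative only, not part of the diff): a minimal, self-contained sketch of the two ways described above for comparing best-point selectors, namely passing different `best_point_kwargs` versus overriding `get_best_parameters`. The class below and its signatures are stand-ins, not the actual `BenchmarkMethod` API.

```python
from dataclasses import dataclass, field
from typing import Any

Params = dict[str, float]
Observation = tuple[Params, float]  # (parameterization, observed mean)


@dataclass
class MethodSketch:
    """Stand-in for a BenchmarkMethod carrying the new best-point hook."""

    # In the diff, "use_model_predictions" is the only supported key.
    best_point_kwargs: dict[str, Any] = field(
        default_factory=lambda: {"use_model_predictions": False}
    )

    def get_best_parameters(
        self, observations: list[Observation], n_points: int = 1
    ) -> list[Params]:
        # Default selector in this sketch: empirical best (lowest observed
        # mean, assuming minimization). A use_model_predictions=True path
        # would rank by model-predicted means instead; omitted here.
        ranked = sorted(observations, key=lambda obs: obs[1])
        return [params for params, _ in ranked[:n_points]]


class LastGeneratedMethodSketch(MethodSketch):
    """Override get_best_parameters: recommend the most recent point.

    A deliberately weak selector, included only to show that comparing
    selectors amounts to swapping in a different method object.
    """

    def get_best_parameters(
        self, observations: list[Observation], n_points: int = 1
    ) -> list[Params]:
        return [params for params, _ in observations[-n_points:]]


# Comparing selectors: run the same replication once per method and compare
# the resulting inference traces (oracle values of the recommended points).
methods = [
    MethodSketch(best_point_kwargs={"use_model_predictions": False}),
    MethodSketch(best_point_kwargs={"use_model_predictions": True}),  # branch not implemented in this sketch
    LastGeneratedMethodSketch(),
]
```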