Enable inference regret (facebook#2782)
Summary:
Pull Request resolved: facebook#2782

# Context

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the `BenchmarkRunner`. This produces an `optimization_trace` used for measuring performance. (For MOO, the hypervolume of all points tested is computed.) While this trace does a good job of capturing whether a good point has been tested, it does not capture *inference regret*: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important (both for getting a good measure of absolute performance and for comparing methods) in contexts such as
* Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
* Multi-fidelity problems, where simply evaluating as many small arms as possible maximizes the current metric for optimization value
* Noisy problems, if different best-point selection strategies are being considered.

# Open questions

* Should inference value always be computed? My take: Yes, it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: Always computes this.
* Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: Always computes this.
* If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; `BenchmarkResult.optimization_trace` is one of the `inference_value_trace` and the `oracle_trace`, with the `BenchmarkProblem` specifying which one.
* Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
* To what degree do we want to rely on Ax's `BestPointMixin` functionality, which is pretty stale, missing functionality we want, requires constructing dummy `Experiment`s, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address or enable addressing all these issues.
* When should the trace be updated in async settings?
* This diff adds support for SOO and MOO and for `n_best_points`, but only supports SOO with 1 best point. That's a lot of infra for raising `NotImplementedError`s. Is this what we want?
* In-sample and out-of-sample: Currently, I'm not using these terms at all, since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
* When people develop best-point functionality in the future, would they do it by updating or adding options to `BestPointMixin._get_trace`? I wrote this under the assumption that they would either do that or use a similar method that consumes an `experiment` and `optimization_config` and can access the `generation_strategy` used.

# This diff

## High-level changes

Technically, this adds "inference value" rather than "inference regret", because it is not relative to the optimum. That gives it the same sign as the default `optimization_trace`. It is always computed and returned on the `BenchmarkResult`. The old trace is renamed the `oracle_trace`.
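For intuition, here is a toy, self-contained sketch of how the two traces differ on a noisy minimization problem. This is in no way the Ax implementation: the objective, the random-search generator, and the empirical-best selector below are all stand-ins.

```python
import random


def oracle(params: dict[str, float]) -> float:
    """Noiseless objective at the target task/fidelity (toy stand-in)."""
    return (params["x"] - 0.3) ** 2


def observe(params: dict[str, float]) -> float:
    """Noisy observation, as the optimizer actually sees it."""
    return oracle(params) + random.gauss(0.0, 0.1)


def recommend(observations: list[tuple[dict[str, float], float]]) -> dict[str, float]:
    """Empirical-best selector (analogue of use_model_predictions=False)."""
    return min(observations, key=lambda obs: obs[1])[0]


observations: list[tuple[dict[str, float], float]] = []
oracle_trace: list[float] = []     # best oracle value among points tested so far
inference_trace: list[float] = []  # oracle value of the currently recommended point

for _ in range(20):
    candidate = {"x": random.uniform(0.0, 1.0)}  # stand-in for trial generation
    observations.append((candidate, observe(candidate)))
    oracle_trace.append(min(oracle(p) for p, _ in observations))
    inference_trace.append(oracle(recommend(observations)))
```

In this sketch the oracle trace is monotone by construction, while the inference trace can regress when noise misleads the selector; that gap is what the inference value trace is meant to make visible.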
`optimization_trace` continues to exist; it can be either the `oracle_trace` (default) or the `inference_trace`, depending on what the `BenchmarkProblem` specifies. The `BenchmarkMethod` is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but it can be overridden.

There are major limitations:
* *The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value*: The `BenchmarkProblem` specifies `n_best_points`, the number of points returned as the best, and for MOO we would want `n_best_points > 1` and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors: if we use hypervolume and don't cap `n_best_points`, the ideal best-point selector would return every point. Metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting `k` points to maximize expected hypervolume.
* Different best-point selectors can be compared by using a different `BenchmarkMethod`, either by passing different `best_point_kwargs` to the `BenchmarkMethod` or by subclassing `BenchmarkMethod` and overriding `get_best_parameters`.

## Detailed changes

### `BenchmarkResult`

Docstrings ought to be self-explanatory.
* The old `optimization_trace` becomes `oracle_trace`.
* It always has an `inference_value_trace` as well as an `oracle_trace`.
* The `optimization_trace` can be either, depending on what the `BenchmarkProblem` specifies.

### `benchmark_replication`

* Computes inference value each time the scheduler generates a trial. Note that incomplete trials can thus be used, since this computation can happen before the trial completes.
* For MOO, this should find K Pareto-optimal parameterizations (according to the model), get their oracle values, and compute the hypervolume of those oracle values, in the following manner: construct a new experiment with one `BatchTrial` whose arms are the K Pareto-optimal parameterizations and whose metric values are their oracle values, and use Ax's best-point functionality to get the hypervolume. This is done to avoid re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best points.
* For SOO, finds the K best parameterizations (according to the model) and gets their oracle values. HOWEVER, K > 1 is currently unsupported.

### `BenchmarkProblem`

* Gets an attribute `report_inference_value_as_trace` that makes the `BenchmarkResult`'s `optimization_trace` be the inference value trace when the problem specifies that inference value should be used. Docstrings should be self-explanatory.

### `BenchmarkMethod`

* Adds a method `get_best_parameters` and an attribute `best_point_kwargs`. If not overridden, `get_best_parameters` uses `BestPointMixin._get_trace` and passes it the `best_point_kwargs`.
* Currently, the only supported argument in `best_point_kwargs` is "use_model_predictions". (A toy sketch of comparing best-point selectors under this interface appears at the end of this summary.)

Reviewed By: Balandat

Differential Revision: D61930178
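Postscript (illustrative only, not part of the diff): a minimal, self-contained sketch of the two ways described above for comparing best-point selectors, namely passing different `best_point_kwargs` versus overriding `get_best_parameters`. The class below and its signatures are stand-ins, not the actual `BenchmarkMethod` API.

```python
from dataclasses import dataclass, field
from typing import Any

Params = dict[str, float]
Observation = tuple[Params, float]  # (parameterization, observed mean)


@dataclass
class MethodSketch:
    """Stand-in for a BenchmarkMethod carrying the new best-point hook."""

    # In the diff, "use_model_predictions" is the only supported key.
    best_point_kwargs: dict[str, Any] = field(
        default_factory=lambda: {"use_model_predictions": False}
    )

    def get_best_parameters(
        self, observations: list[Observation], n_points: int = 1
    ) -> list[Params]:
        # Default selector in this sketch: empirical best (lowest observed
        # mean, assuming minimization). A use_model_predictions=True path
        # would rank by model-predicted means instead; omitted here.
        ranked = sorted(observations, key=lambda obs: obs[1])
        return [params for params, _ in ranked[:n_points]]


class LastGeneratedMethodSketch(MethodSketch):
    """Override get_best_parameters: recommend the most recent point.

    A deliberately weak selector, included only to show that comparing
    selectors amounts to swapping in a different method object.
    """

    def get_best_parameters(
        self, observations: list[Observation], n_points: int = 1
    ) -> list[Params]:
        return [params for params, _ in observations[-n_points:]]


# Comparing selectors: run the same replication once per method and compare
# the resulting inference traces (oracle values of the recommended points).
methods = [
    MethodSketch(best_point_kwargs={"use_model_predictions": False}),
    MethodSketch(best_point_kwargs={"use_model_predictions": True}),  # branch not implemented in this sketch
    LastGeneratedMethodSketch(),
]
```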