Enable inference regret #2782

Open
wants to merge 1 commit into base: main
Conversation

esantorella
Contributor

Summary:

Context:

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the BenchmarkRunner. This produces an optimization_trace used for measuring performance. (For MOO, the hypervolume of all points tested is computed.)

While this trace does a good job of capturing whether a good point has been tested, it does not capture inference regret: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important (both for getting a good measure of absolute performance and for comparing methods) in contexts such as the following (a small sketch contrasting the two traces appears after the list):

  • Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
  • Multi-fidelity problems, where simply evaluating as many small (low-fidelity) arms as possible maximizes the current optimization-trace metric
  • Noisy problems, if different best-point selection strategies are being considered.
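
To make the distinction concrete, here is a minimal sketch in plain NumPy (not Ax code); the values and recommendation indices are invented for illustration only.

```python
import numpy as np

# oracle_values[i]: noiseless value, at the target task and fidelity, of the
# i-th point evaluated during the benchmark (assume maximization).
oracle_values = np.array([0.2, 0.9, 0.4, 0.7])

# Oracle trace (status quo): running best of the points actually *tested*.
oracle_trace = np.maximum.accumulate(oracle_values)  # [0.2, 0.9, 0.9, 0.9]

# Inference-value trace: oracle value of the point the *model would recommend*
# after each step. The recommendation indices below are made up; in practice
# they come from a best-point selector. Inference *regret* would subtract the
# inference value from the optimum's value.
recommended = np.array([0, 0, 1, 1])
inference_value_trace = oracle_values[recommended]  # [0.2, 0.2, 0.9, 0.9]
```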

Open questions

  • Should inference value always be computed? My take: Yes; it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: Always computes this.
  • Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: Always computes this.
  • If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; BenchmarkResult.optimization_trace is one of the inference_value_trace and the oracle_trace, with the BenchmarkProblem specifying which one.
  • Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
  • To what degree do we want to rely on Ax's BestPointMixin functionality, which is pretty stale, missing functionality we want, requires constructing dummy Experiments, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address or enable addressing all these issues.
  • When should the trace be updated in async settings?
  • This diff adds support for SOO and MOO and for n_best_points, but only supports SOO with 1 best point. That's a lot of infra for raising NotImplementedErrors. Is this what we want?
  • In sample and out of sample: Currently, I'm not using these terms at all since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
  • When people develop best-point functionality in the future, would they do it by updating or adding options to BestPointMixin._get_trace? I wrote this under the assumption that they would either do that or use a similar method that consumes an experiment and optimization_config and can access the generation_strategy used.

This diff

High-level changes

Technically, this adds "inference value" rather than "inference regret," because it is not relative to the optimum. That gives it the same sign as the default optimization_trace. It is always computed and returned on the BenchmarkResult. The old trace is renamed the oracle_trace. optimization_trace continues to exist; it can be either the oracle_trace (default) or the inference_value_trace, depending on what the BenchmarkProblem specifies. The BenchmarkMethod is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but it can be overridden.

There are major limitations:

  • The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value: The BenchmarkProblem specifies n_best_points, the number of points returned as the best, and for MOO we would want n_best_points > 1 and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors: if we used hypervolume and didn't cap n_best_points, the ideal best-point selector would simply return every point. Metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting k points that maximize expected hypervolume.
  • Different best-point selectors can be compared by using a different BenchmarkMethod, either by passing different best_point_kwargs to the BenchmarkMethod or by subclassing BenchmarkMethod and overriding get_best_parameters (see the selector sketch after this list).
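
As an illustration of why the selector matters on noisy problems, the toy example below (plain Python, not Ax code; all numbers and helper names are invented) shows two simple selectors recommending different arms from the same data. Each selector's inference value is the oracle value of whatever it recommends.

```python
import numpy as np

def empirical_best(observed_means: np.ndarray) -> int:
    """'Empirical best' selector: pick the arm with the best observed mean."""
    return int(np.argmax(observed_means))

def model_best(posterior_means: np.ndarray) -> int:
    """Model-based selector: pick the arm with the best posterior mean."""
    return int(np.argmax(posterior_means))

oracle = np.array([0.3, 0.8, 0.5])              # true (noiseless) values
observed = oracle + np.array([0.5, -0.2, 0.0])  # noisy observations
posterior = np.array([0.35, 0.75, 0.50])        # hypothetical model means

print(oracle[empirical_best(observed)])  # 0.3 -- noise misleads the empirical pick
print(oracle[model_best(posterior)])     # 0.8 -- the model recovers the best arm
```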

Detailed changes

BenchmarkResult

Docstrings ought to be self-explanatory.

  • The old optimization_trace becomes oracle_trace.
  • The result always has an inference_value_trace as well as an oracle_trace.
  • The optimization_trace can be either of these, depending on what the BenchmarkProblem specifies (a minimal sketch follows).
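
A minimal sketch of the selection logic described above, assuming the field names in this summary; this is an illustration, not the actual BenchmarkResult class.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BenchmarkResultSketch:
    """Illustration only; field names follow the description above."""
    oracle_trace: np.ndarray
    inference_value_trace: np.ndarray
    # Taken from BenchmarkProblem.report_inference_value_as_trace.
    report_inference_value_as_trace: bool = False

    @property
    def optimization_trace(self) -> np.ndarray:
        # Backwards-compatible "the" trace: oracle trace by default,
        # inference-value trace if the problem says so.
        return (
            self.inference_value_trace
            if self.report_inference_value_as_trace
            else self.oracle_trace
        )
```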

benchmark_replication

  • Computes inference value each time the scheduler generates a trial. Note that incomplete trials can thus be used, since this computation can happen before the trial completes.
  • For MOO, this should find the K Pareto-optimal parameters (according to the model), get their oracle values, and compute the hypervolume of those oracle values, as follows: construct a new experiment with one BatchTrial whose arms are the K Pareto-optimal parameters and whose metrics are oracle values, and use Ax's best-point functionality to get the hypervolume. This avoids re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best points.
  • For SOO, finds the K best parameters (according to the model) and gets their oracle value. HOWEVER, K > 1 is currently unsupported. (A sketch of this loop follows the list.)
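
A hedged sketch of the loop described above. The helper names and signatures (get_best_parameters, oracle_value_of) are placeholders rather than the actual Ax API, and the sketch simplifies by computing the value once per completed trial, whereas the real benchmark_replication computes it right after each trial is generated (possibly before it completes).

```python
from typing import Callable, Sequence

TParameterization = dict  # placeholder for Ax's parameterization type

def replication_with_inference_trace(
    scheduler,                                              # an Ax Scheduler
    get_best_parameters: Callable[..., Sequence[TParameterization]],
    oracle_value_of: Callable[[TParameterization], float],  # noiseless eval at target task/fidelity
    num_trials: int,
) -> list[float]:
    """Compute an inference-value trace alongside a normal run (sketch only)."""
    inference_trace: list[float] = []
    for _ in range(num_trials):
        # Generate and run one more trial, then ask the method for its
        # recommended point(s). Only SOO with a single best point is supported.
        scheduler.run_n_trials(max_trials=1)
        best = get_best_parameters(n_points=1)
        inference_trace.append(oracle_value_of(best[0]))
    return inference_trace
```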

BenchmarkProblem

  • Gets an attribute report_inference_value_as_trace that makes the BenchmarkResult's optimization_trace be the inference-value trace when the problem specifies that inference value should be used. Docstrings should be self-explanatory.
  • Adds n_best_points (the number of points returned as the best) to BenchmarkProblem. Docstrings should be self-explanatory.

BenchmarkMethod

  • Adds a method get_best_parameters and an attribute best_point_kwargs. If not overridden, get_best_parameters uses BestPointMixin._get_trace and passes it the best_point_kwargs.
  • Currently, the only supported argument in best_point_kwargs is "use_model_predictions" (see the configuration sketch below).
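
A hedged sketch of the flow described above: best_point_kwargs are stored on the method and forwarded by the default get_best_parameters, and two otherwise-identical methods can differ only in their selector. The class below is an illustration with an assumed signature, not the actual BenchmarkMethod.

```python
from typing import Any, Optional

class BenchmarkMethodSketch:
    """Illustration of the flow described above; not the actual Ax class."""

    def __init__(self, best_point_kwargs: Optional[dict[str, Any]] = None) -> None:
        # Only "use_model_predictions" is currently supported.
        self.best_point_kwargs = best_point_kwargs or {"use_model_predictions": False}

    def get_best_parameters(self, experiment, optimization_config, n_points: int):
        # The real default delegates to BestPointMixin._get_trace, passing along
        # self.best_point_kwargs; subclasses override this method to supply a
        # different best-point selector.
        raise NotImplementedError("Sketch only; see BenchmarkMethod in Ax.")

# Two otherwise-identical methods that differ only in their best-point selector:
model_best = BenchmarkMethodSketch({"use_model_predictions": True})
empirical_best = BenchmarkMethodSketch({"use_model_predictions": False})
```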

Reviewed By: Balandat

Differential Revision: D61930178

@facebook-github-bot added the CLA Signed label Sep 24, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D61930178


codecov-commenter commented Sep 24, 2024

Codecov Report

Attention: Patch coverage is 95.49550% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.68%. Comparing base (8ba8ce3) to head (d62296c).

Files with missing lines Patch % Lines
ax/benchmark/benchmark_method.py 82.35% 3 Missing ⚠️
ax/benchmark/benchmark.py 92.85% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2782      +/-   ##
==========================================
- Coverage   95.68%   95.68%   -0.01%     
==========================================
  Files         488      488              
  Lines       47843    47943     +100     
==========================================
+ Hits        45779    45874      +95     
- Misses       2064     2069       +5     



esantorella added a commit to esantorella/Ax that referenced this pull request Sep 24, 2024

esantorella added a commit to esantorella/Ax that referenced this pull request Sep 24, 2024

Labels: CLA Signed, fb-exported
3 participants