Merge: Simulation userguide (#172)
This PR introduces a user guide for BayBE's simulation capabilities. It
briefly introduces the core concepts like `lookup` and the different
kinds of simulation available.

In addition, I realized that there was no example using the
`simulate_transfer_learning` function so far. I consequently added one
example.
AVHopp authored Mar 19, 2024
2 parents eb427ab + 238d955 commit 7397f03
Showing 5 changed files with 3,600 additions and 2 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Changed
- More detailed and sophisticated search space user guide

### Added
- Simulation user guide
- Example for transfer learning backtest utility

## [0.8.1] - 2024-03-11
### Added
- Better human readable `__str__` representation of campaign
127 changes: 125 additions & 2 deletions docs/userguide/simulation.md
@@ -1,4 +1,127 @@
# Simulation

This page will soon contain information about simulations.
In the meantime, please see the [examples](../../examples/examples) instead.
BayBE offers multiple functionalities to "simulate" experimental campaigns with a given lookup mechanism. This user guide briefly introduces how to use the methods available in our [simulation submodule](baybe.simulation).

For a wide variety of applications of this functionality, we refer to the corresponding [examples](../../examples/Backtesting/Backtesting).

## Terminology: What do we mean by "Simulation"?

The term "simulation" can have two slightly different interpretations, depending on the applied context.

1. It can refer to "backtesting" a particular experimental campaign on a fixed finite dataset.
Thus, "simulation" means investigating what experimental trajectory we would have observed if we had used different setups or recommenders and restricted the possible parameter configurations to those contained in the dataset.

2. It can refer to the simulation of an *actual* DOE loop, i.e., recommending experiments and retrieving the corresponding measurements, where the loop closure is realized in the form of a callable (black-box) function that can be queried during the optimization to provide target values. Such a callable could for instance be a simple analytical function or a numerical solver of a set of differential equations that describe a physical system.

## The Lookup Functionality

In BayBE, the simulation submodule supports a wide range of use cases and can even be used for "oracle predictions".
This is enabled by the `lookup` functionality, which accepts fixed data sets, analytical functions, or general callbacks for retrieving target function values.

All functions require a `lookup`, which is used to close the loop and return target values for points in the search space.
It can be provided in the form of a dataframe or a `Callable`.

```{note}
Technically, the `lookup` can also be `None`, in which case the simulation produces random results. This case is not discussed further here.
```

### Using a Dataframe

When using a dataframe, it needs to contain the parameter combinations and their target results.
To ensure that the backtest produces a realistic performance assessment, all possible parameter combinations should be measured and present in the dataframe.
However, this assumption is unrealistic for most applications, since typically not all possible parameter combinations have been measured prior to the optimization.
Consequently, a provided dataframe often contains measurements for only some parameter configurations, while the majority of combinations is missing.
For this case, BayBE offers different ways of handling such "missing" values.
The behavior is configured via the `impute_mode` keyword, which accepts the following choices (a sketch of a dataframe lookup follows the list below):
- ``"error"``: An error will be thrown.
- ``"worst"``: Imputation uses the worst available value for each target.
- ``"best"``: Imputation uses the best available value for each target.
- ``"mean"``: Imputation uses the mean value for each target.
- ``"random"``: A random row will be used as lookup.
- ``"ignore"``: The search space is stripped before recommendations are made so that unmeasured experiments will not be recommended.
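
As an illustration, a minimal lookup dataframe might look as follows (a sketch with hypothetical parameter and target names, which would need to match those of the corresponding campaign):

~~~python
import pandas as pd

# Hypothetical parameters "Temperature" and "Concentration" and target "Yield"
lookup = pd.DataFrame(
    {
        "Temperature": [90, 105, 120],
        "Concentration": [0.057, 0.1, 0.153],
        "Yield": [0.81, 0.87, 0.92],
    }
)
~~~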

### Using a `Callable`

The `Callable` needs to return the target values for any given parameter combination. The only requirement that BayBE imposes is thus that the `Callable` accepts an arbitrary number of floats as input and returns either a single float or a tuple of floats.
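
For instance, a simple analytical function satisfying this interface could look like this (a minimal sketch; the function name is purely illustrative):

~~~python
def sum_of_squares(*x: float) -> float:
    """Accept an arbitrary number of floats and return a single float."""
    return sum(xi**2 for xi in x)
~~~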

## Simulating a Single Experiment

The function [`simulate_experiment`](baybe.simulation.simulate_experiment) is the most basic form of simulation.
It runs a single execution of a DoE loop for either a specified number of iterations or until the search space is fully observed.

To use this function, it is necessary to provide a [`Campaign`](baybe.campaign.Campaign). Although technically optional, we advise to always provide a lookup mechanism as well, since fake results will be produced otherwise. It is possible to specify several additional parameters like the batch size, initial data, or the number of DoE iterations that should be performed:

~~~python
results = simulate_experiment(
    # Necessary
    campaign=campaign,
    # Technically optional but should always be set
    lookup=lookup,
    # Optional
    batch_size=batch_size,
    n_doe_iterations=n_doe_iterations,
    initial_data=initial_data,
    random_seed=random_seed,
    impute_mode=impute_mode,
    noise_percent=noise_percent,
)
~~~

This function returns a dataframe that contains the results. For details on the columns of this dataframe as well as the dataframes returned by the other functions discussed here, we refer to the documentation of the submodule [here](baybe.simulation).

## Simulating Multiple Scenarios

The function [`simulate_scenarios`](baybe.simulation.simulate_scenarios) allows specifying multiple simulation settings at once.
Instead of a single campaign, this function expects a dictionary of campaigns, mapping scenario identifiers to `Campaign` objects.
In addition to the keyword arguments available for `simulate_experiment`, this function has two further keywords:
1. `n_mc_iterations`: This can be used to perform multiple Monte Carlo runs with a single call. We generally advise performing multiple Monte Carlo runs to average out random effects such as the choice of the initial data.
2. `initial_data`: This can be used to provide a list of dataframes, where each dataframe is then used as initial data for an independent run. That is, the function performs one optimization loop per dataframe in this list.

Note that these two keywords are mutually exclusive.

~~~python
lookup = ... # some reasonable lookup, e.g. a Callable
campaign1 = Campaign(...)
campaign2 = Campaign(...)
scenarios = {"Campaign 1": campaign1, "Campaign 2": campaign2}

results = simulate_scenarios(
    scenarios=scenarios,
    lookup=lookup,
    batch_size=batch_size,
    n_doe_iterations=n_doe_iterations,
    n_mc_iterations=n_mc_iterations,
)
~~~
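
The scenario identifiers end up in the `Scenario` column of the returned dataframe, which makes it straightforward to compare the different settings, for instance by plotting the optimization trajectories (a sketch using `seaborn`, with the column names as used in the backtesting example further below and a target named "Target"):

~~~python
import seaborn as sns

ax = sns.lineplot(
    data=results,
    x="Num_Experiments",
    y="Target_CumBest",
    hue="Scenario",
)
~~~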

## Simulating Transfer Learning

The function [`simulate_transfer_learning`](baybe.simulation.simulate_transfer_learning) partitions the search space into its tasks and simulates each task with the training data from the remaining tasks.

```{note}
Currently, this only supports discrete search spaces. See [`simulate_transfer_learning`](baybe.simulation.simulate_transfer_learning) for the reasons.
```

~~~python
task_param = TaskParameter(
    name="Cell Line",
    values=["Liver Cell", "Brain Cell", "Skin Cell"],
)
# Define the search space using the task parameter
searchspace = SearchSpace.from_product(parameters=[param1, param2, task_param])

# Create a suitable campaign
campaign = Campaign(searchspace=searchspace, objective=objective)

# Create a lookup dataframe. Note that this dataframe needs to have a column
# labeled "Cell Line" containing the values "Liver Cell", "Brain Cell" and "Skin Cell"
lookup = DataFrame(...)

results = simulate_transfer_learning(
    campaign=campaign,
    lookup=lookup,
    batch_size=BATCH_SIZE,
    n_doe_iterations=N_DOE_ITERATIONS,
    n_mc_iterations=N_MC_ITERATIONS,
)
~~~
185 changes: 185 additions & 0 deletions examples/Transfer_Learning/backtesting.py
@@ -0,0 +1,185 @@
## Backtesting

# This example demonstrates the use of the
# [`simulate_transfer_learning`](baybe.simulation.simulate_transfer_learning) function
# to learn across tasks:
# * We construct a campaign,
# * define two related test functions,
# * use the data from the first function to train the second,
# * and vice versa.

### Imports

import os
import sys
from pathlib import Path
from typing import Dict

import numpy as np
import pandas as pd
import seaborn as sns
from botorch.test_functions.synthetic import Hartmann

from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter, TaskParameter
from baybe.searchspace import SearchSpace
from baybe.simulation import simulate_scenarios, simulate_transfer_learning
from baybe.targets import NumericalTarget
from baybe.utils.botorch_wrapper import botorch_function_wrapper
from baybe.utils.plotting import create_example_plots

### Settings

# The following settings are used to set up the problem:

SMOKE_TEST = "SMOKE_TEST" in os.environ # reduce the problem complexity in CI pipelines
DIMENSION = 3 # input dimensionality of the test function
BATCH_SIZE = 1 # batch size of recommendations per DOE iteration
N_MC_ITERATIONS = 2 if SMOKE_TEST else 50 # number of Monte Carlo runs
N_DOE_ITERATIONS = 2 if SMOKE_TEST else 10 # number of DOE iterations
POINTS_PER_DIM = 3 if SMOKE_TEST else 7 # number of grid points per input dimension


### Creating the Optimization Objective

# The test functions each have a single output that is to be minimized.
# The corresponding [Objective](baybe.objective.Objective)
# is created as follows:

objective = Objective(
    mode="SINGLE", targets=[NumericalTarget(name="Target", mode="MIN")]
)

### Creating the Search Space

# This example uses the [Hartmann Function](https://botorch.org/api/test_functions.html#botorch.test_functions.synthetic.Hartmann)
# as implemented by `botorch`.
# The bounds of the search space are dictated by the test function and can be extracted
# from the function itself.

BOUNDS = Hartmann(dim=DIMENSION).bounds

# First, we define one
# [NumericalDiscreteParameter](baybe.parameters.numerical.NumericalDiscreteParameter)
# per input dimension of the test function:

discrete_params = [
    NumericalDiscreteParameter(
        name=f"x{d}",
        values=np.linspace(lower, upper, POINTS_PER_DIM),
    )
    for d, (lower, upper) in enumerate(BOUNDS.T)
]


# Next, we define a
# [TaskParameter](baybe.parameters.categorical.TaskParameter) to encode the task context,
# which allows the model to establish a relationship between the training data and
# the data collected during the optimization process.
# Since we perform cross-training here, we do not specify any `active_values`.

task_param = TaskParameter(
    name="Function",
    values=["Hartmann", "Shifted"],
)

# With the parameters at hand, we can now create our search space.

parameters = [*discrete_params, task_param]
searchspace = SearchSpace.from_product(parameters=parameters)

### Defining the Tasks

# To demonstrate the transfer learning mechanism, we consider the problem of optimizing
# the Hartmann function using training data from a shifted, scaled and noisy version
# and vice versa. The model is of course not aware of this relationship but
# needs to infer it from the data gathered during the optimization process.


def shifted_hartmann(*x: float) -> float:
    """Calculate a shifted, scaled and noisy variant of the Hartmann function."""
    noised_hartmann = Hartmann(dim=DIMENSION, noise_std=0.15)
    return 2.5 * botorch_function_wrapper(noised_hartmann)(x) + 3.25


test_functions = {
    "Hartmann": botorch_function_wrapper(Hartmann(dim=DIMENSION)),
    "Shifted": shifted_hartmann,
}

### Generating Lookup Tables

# We generate a single lookup table containing the target values of both functions at
# the given parameter grid.
# The data of one function serves as training data for the model, while the lookup of
# the respective other function is used as the loop-closing element, providing its
# target values during the optimization.

grid = np.meshgrid(*[p.values for p in discrete_params])

lookups: Dict[str, pd.DataFrame] = {}
for function_name, function in test_functions.items():
    lookup = pd.DataFrame({f"x{d}": grid_d.ravel() for d, grid_d in enumerate(grid)})
    lookup["Target"] = tuple(lookup.apply(function, axis=1))
    lookup["Function"] = function_name
    lookups[function_name] = lookup
lookup = pd.concat([lookups["Hartmann"], lookups["Shifted"]]).reset_index()

### Simulation Loop

campaign = Campaign(searchspace=searchspace, objective=objective)

results = simulate_transfer_learning(
    campaign,
    lookup,
    batch_size=BATCH_SIZE,
    n_doe_iterations=N_DOE_ITERATIONS,
    n_mc_iterations=N_MC_ITERATIONS,
)

# For comparison, we also simulate the two tasks without transfer learning as baselines.

# ```{note}
# It is intended to implement a more elegant way of comparing results with and
# without transfer learning in the future.
# ```

for func_name, function in test_functions.items():
    task_param = TaskParameter(
        name="Function", values=["Hartmann", "Shifted"], active_values=[func_name]
    )
    parameters = [*discrete_params, task_param]
    searchspace = SearchSpace.from_product(parameters=parameters)
    result_baseline = simulate_scenarios(
        {f"{func_name}_No_TL": Campaign(searchspace=searchspace, objective=objective)},
        lookups[func_name],
        batch_size=BATCH_SIZE,
        n_doe_iterations=N_DOE_ITERATIONS,
        n_mc_iterations=N_MC_ITERATIONS,
    )

    results = pd.concat([results, result_baseline])

# All that remains is to visualize the results.
# As the example shows, the optimization speed can be significantly increased by
# using even small amounts of training data from related optimization tasks.

results.rename(columns={"Scenario": "Function"}, inplace=True)
# Add column to enable different styles for non-TL examples
results["Uses TL"] = results["Function"].apply(lambda val: "No_TL" not in val)
path = Path(sys.path[0])
ax = sns.lineplot(
    data=results,
    markers=["o", "s"],
    markersize=13,
    x="Num_Experiments",
    y="Target_CumBest",
    hue="Function",
    style="Uses TL",
)
create_example_plots(
    ax=ax,
    path=path,
    base_name="backtesting",
)