COBS is an Off-Policy Policy Evaluation (OPE) Benchmarking Suite. The goal is to provide fine-grained experimental control to carefully tease apart an OPE method's performance across many key conditions.
We'd like to make this repo as useful as possible for the community. We commit to continual refactoring and code review to make sure COBS continues to serve its purpose. Help is always appreciated!
COBS is based on the paper Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning (https://arxiv.org/abs/1911.06854).
To get started with the experimental tools, see Tutorial.ipynb. For standalone example scripts, see example_tabular.py and example_nn.py.
To run example_tabular.py:
python3 example_tabular.py tabular_example_cfg.json
To run example_nn.py:
python3 example_nn.py nn_example_cfg.json
We have migrated from TensorFlow to PyTorch and made COBS easier to use. For the original TensorFlow implementation and replication of the paper, please see the paper branch and run paper.py using the instructions provided at the bottom of that file.
Tested on Python 3.6+.
python3 -m venv cobs-env
source cobs-env/bin/activate
pip3 install -r requirements.txt
Experiments are specified with a JSON configuration file (see tabular_example_cfg.json and nn_example_cfg.json for examples). A configuration contains two parts: the experiment section and the models section. The experiment section instantiates the environment and sets general parameters; the models section specifies which methods to run and their method-specific parameters. The experiment section looks like:
"experiment": {
"gamma": 0.98, # discount factor
"horizon": 5, # horizon of the environment
"base_policy": 0.8, # Probability of deviation from greedy for base policy.
# Note: This parameter means different things depending on the type of policy
"eval_policy": 0.2, # Probability of deviation from greedy for evaluation policy.
# Note: This parameter means different things depending on the type of policy
"stochastic_env": true, # Make environment have stochastic transitions
"stochastic_rewards": false, # Make environment have stochastic rewards
"sparse_rewards": false, # Make environment have sparse rewards
"num_traj": 8, # Number of trajectories to collect from base_policy/behavior_policy (pi_b)
"is_pomdp": false, # Make the environment a POMDP
"pomdp_horizon": 2, # POMDP horizon, if POMDP is true
"seed": 1000, # Seed
"experiment_number": 0, # Label for experiment. Used for distributed compute
"access": 0, # Credentials for AWS. Used for distributed compute
"secret": 0, # Credentials for AWS. Used for distributed compute
"to_regress_pi_b": {
"to_regress": false, # Should we regress pi_b? Is it unknown?
"model": "defaultCNN", # What model to fit pi_b with
# Note: To add your own, see later in the README.md
"max_epochs": 100, # Max number of fitting iterations
"batch_size": 32, # Minibatch size
"clipnorm": 1.0 # Gradient clip
},
"frameskip": 1, # (x_t, a, r, x_{t+frameskip}). Apply action "a" frameskip number of times
"frameheight": 1 # (x_{t:t+frameheight}, a, r, x_{t+1:t+1+frameheight}). State is consider a concatenation of frameheight number of states
},
and the models section (TODO: rename to methods section) looks like:
"models": {
"FQE": {
"model": "defaultCNN", # What model to fit FQE with
# Note: To add your own, see later in the README.md
"convergence_epsilon": 1e-4, # When to stop iterations
"max_epochs": 100, # Max number of fitting iterations
"batch_size": 32, # Minibatch size
"clipnorm": 1.0 # Gradient clip
},
"Retrace": {
"model": "defaultCNN",
"convergence_epsilon": 1e-4,
"max_epochs": 3,
"batch_size": 32,
"clipnorm": 1.0,
"lamb": 0.9 # Lambda, parameter for this family of method
},
...
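The inline # comments above are annotations for this README; JSON itself does not allow comments, so an actual configuration file contains only keys and values. Assembled from the fields above, a minimal configuration skeleton might look like the following (the shipped tabular_example_cfg.json and nn_example_cfg.json may differ):
{
    "experiment": {
        "gamma": 0.98,
        "horizon": 5,
        "base_policy": 0.8,
        "eval_policy": 0.2,
        "stochastic_env": true,
        "stochastic_rewards": false,
        "sparse_rewards": false,
        "num_traj": 8,
        "is_pomdp": false,
        "pomdp_horizon": 2,
        "seed": 1000,
        "experiment_number": 0,
        "access": 0,
        "secret": 0,
        "to_regress_pi_b": {
            "to_regress": false,
            "model": "defaultCNN",
            "max_epochs": 100,
            "batch_size": 32,
            "clipnorm": 1.0
        },
        "frameskip": 1,
        "frameheight": 1
    },
    "models": {
        "FQE": {
            "model": "defaultCNN",
            "convergence_epsilon": 1e-4,
            "max_epochs": 100,
            "batch_size": 32,
            "clipnorm": 1.0
        }
    }
}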
To add a new environment, implement an OpenAI Gym-like environment and place it in the envs directory. The environment should implement the reset, step, and (optionally) render functions. Each environment must also define two variables (a minimal sketch follows below):
self.n_dim # The number of states (if discrete), otherwise set this to 1.
self.n_actions # The number of possible actions
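For illustration only, a minimal environment might look like the sketch below. The environment name, dynamics, and reward are made up; only the reset/step(/render) interface, the Gym-style (state, reward, done, info) return, and the two attributes above are what matters.
class TwoStateChain(object):
    """Hypothetical environment: walk right along a 2-state chain."""

    def __init__(self):
        self.n_dim = 2      # number of states (discrete)
        self.n_actions = 2  # number of possible actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 stays put; reward 1 on reaching the last state.
        self.state = min(self.state + int(action == 1), self.n_dim - 1)
        reward = 1.0 if self.state == self.n_dim - 1 else 0.0
        done = self.state == self.n_dim - 1
        return self.state, reward, done, {}

    def render(self):
        print('state:', self.state)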
See Tutorial.ipynb for how to instantiate the environment during an experiment.
Current Direct Method Baselines: FQE, Retrace, Tree-Backup, Q^pi(lambda), Q-Reg, MRDR, IH, MBased.
To add a new Direct Method, implement one of the Direct Method classes and put the new method in the algos directory.
Suppose your new method is called NewMethod and it works by fitting a Q function.
Modify line 149 of experiment.py by adding:
...
elif 'NewMethod' == model:
    new_method = NewMethod()  ## Instantiates the method
    new_method.fit(behavior_data, pi_e, cfg, cfg.models[model]['model'])  ## Fits the method
    new_method_Qs = new_method.get_Qs_for_data(behavior_data, cfg)  ## Gets Q(s, a) for each s in the data and each a in the action space
    out = self.estimate(new_method_Qs, behavior_data, gamma, model, true)  ## Gets Direct and Hybrid estimates and error
    dic.update(out)
...
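The call sites above assume NewMethod exposes fit and get_Qs_for_data with these argument lists; a hypothetical skeleton (internals omitted) would be:
class NewMethod(object):
    """Hypothetical direct method that estimates Q^{pi_e}(s, a) from logged data."""

    def fit(self, behavior_data, pi_e, cfg, model_name):
        # Train a Q-function approximator on the logged trajectories,
        # e.g. by iterating a Bellman-style regression until convergence.
        ...

    def get_Qs_for_data(self, behavior_data, cfg):
        # Return Q(s, a) for every state in the data and every action in the
        # action space, in the per-trajectory layout self.estimate(...) expects.
        ...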
Suppose your new method is called NewMethod and it works by fitting a weight function.
Modify line 149 of experiment.py by adding:
...
elif 'NewMethod' == model:
    new_method = NewMethod()  ## Instantiates the method
    new_method.fit(behavior_data, pi_e, cfg, cfg.models[model]['model'])  ## Fits the method
    new_method_output = new_method.evaluate(behavior_data, cfg)  ## Evaluates the method
    dic.update({'NewMethod': [new_method_output, (new_method_output - true)**2]})  ## Updates results
...
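Likewise, the weight-based path only needs fit and evaluate, where evaluate returns the scalar value estimate that is compared against the true value above. A hypothetical skeleton:
class NewMethod(object):
    """Hypothetical weight-based method, e.g. one that learns importance weights."""

    def fit(self, behavior_data, pi_e, cfg, model_name):
        # Learn a weighting/correction function from the logged trajectories.
        ...

    def evaluate(self, behavior_data, cfg):
        # Return a single scalar estimate of the evaluation policy's value.
        ...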
Current Hybrid Method Baselines: DR, WDR, MAGIC
TODO: How to add your own.
Current IPS Method Baselines: Naive, IS, Per-Decision IS, WIS, Per-Decision WIS
TODO: How to add your own.
Add your own NN architecture to a new file in the models directory. Then modify the get_model_from_name function in factory.py:
from ope.models.YourNN import YourNN
def get_model_from_name(name):
    ...
    elif name == 'YourNN':
        return YourNN
    ...
You can now add your own NN as a method's model in the configuration:
"SomeMethod": {
"model": "YourNN", # YourNN model
...other params....
},
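Since COBS now runs on PyTorch, YourNN would typically be an nn.Module. The sketch below is illustrative only; in particular, the constructor arguments are assumptions, so match whatever the existing models in ope/models (e.g. defaultCNN) accept.
import torch.nn as nn

class YourNN(nn.Module):
    """Hypothetical Q-network: maps a flattened state to one value per action."""

    def __init__(self, state_dim, n_actions, hidden=64):  # illustrative signature
        super(YourNN, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)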
There are currently two available policy types.
- Basic Policy: pi(.|s) = [prob(a=0), prob(a=1),..., prob(a=n)]
- Epsilon-Greedy: pi(.|s) = Greedy(s) with probability 1-e, and a uniformly random action from {1, ..., n} otherwise
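As a concrete (hypothetical) illustration of the epsilon-greedy convention, where e is the deviation probability set by base_policy / eval_policy and the random draw is uniform over all n actions (COBS's own policy classes may differ in this detail):
import numpy as np

def epsilon_greedy_probs(greedy_action, n_actions, eps):
    """pi(.|s): a uniformly random action with probability eps, the greedy action otherwise."""
    probs = np.full(n_actions, eps / n_actions)
    probs[greedy_action] += 1.0 - eps
    return probs

print(epsilon_greedy_probs(greedy_action=2, n_actions=4, eps=0.8))
# -> [0.2 0.2 0.4 0.2]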
If you use COBS, please use the following BibTeX entry.
@inproceedings{voloshin2021empirical,
    title={Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning},
    author={Cameron Voloshin and Hoang Minh Le and Nan Jiang and Yisong Yue},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
    year={2021},
    url={https://openreview.net/forum?id=IsK8iKbL-I}
}