# evaluating_rewards

`evaluating_rewards` is a library to compare and evaluate reward functions. The accompanying paper, *Quantifying Differences in Reward Functions*, describes the methods implemented in this repository.
To install `evaluating_rewards`, clone the repository and run:

```bash
pip install evaluating_rewards/
```

To install in developer mode, so that edits are immediately available:

```bash
pip install -e evaluating_rewards/
```
The package requires Python 3.6 or later; Python 2 is not supported.
The `evaluating_rewards.analysis.dissimilarity_heatmaps.plot_epic_heatmap` script provides a convenient front-end to generate heatmaps of EPIC distance between reward models. For example, to replicate Figure 2(a) from the paper, simply run:

```bash
python -m evaluating_rewards.analysis.dissimilarity_heatmaps.plot_epic_heatmap with point_mass paper
# If you want higher-precision results, as in the paper (warning: this will take several hours), use:
# python -m evaluating_rewards.analysis.dissimilarity_heatmaps.plot_epic_heatmap with point_mass paper high_precision
```
`plot_epic_heatmap` uses the `evaluating_rewards.epic_sample` module to compute the EPIC distance. You may wish to use this module directly, for example when integrating EPIC into an existing evaluation suite. Check out this notebook for an example of how to use `epic_sample`.
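If you only need the distance itself rather than the heatmap tooling, the computation EPIC performs is short enough to sketch from scratch. The snippet below is a minimal NumPy illustration of the definition in the paper: canonically shape each reward using expectations over sampled states and actions, then take the Pearson distance between the shaped rewards. It deliberately does not use `epic_sample` or any other API from this library; the function names, sample sizes and toy reward functions are all made up for the example.

```python
# Self-contained sketch of the EPIC definition from the paper. This does NOT
# use evaluating_rewards.epic_sample; all names and toy rewards are illustrative.
import numpy as np

def canonicalize(reward_fn, transitions, bg_states, bg_actions, discount=0.99):
    """Canonically shaped rewards of `reward_fn` on a batch of (s, a, s') transitions.

    C(R)(s, a, s') = R(s, a, s')
                     + E[discount * R(s', A, S') - R(s, A, S') - discount * R(S, A, S')],
    with S, S' ~ D_S and A ~ D_A approximated by the background samples.
    """
    def mean_from(state):
        # E_{A ~ D_A, S' ~ D_S}[R(state, A, S')]
        return np.mean([reward_fn(state, A, Sp) for A in bg_actions for Sp in bg_states])

    mean_all = np.mean([mean_from(S) for S in bg_states])  # E_{S,A,S'}[R(S, A, S')]
    return np.array([
        reward_fn(s, a, sp) + discount * mean_from(sp) - mean_from(s) - discount * mean_all
        for s, a, sp in transitions
    ])

def epic_distance(reward_a, reward_b, transitions, bg_states, bg_actions, discount=0.99):
    """Pearson distance sqrt((1 - rho) / 2) between canonically shaped rewards."""
    ca = canonicalize(reward_a, transitions, bg_states, bg_actions, discount)
    cb = canonicalize(reward_b, transitions, bg_states, bg_actions, discount)
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# Toy check: a potential-shaped version of a reward is at (near-)zero EPIC distance.
rng = np.random.default_rng(0)
base = lambda s, a, sp: np.sin(s) + a ** 2
shaped = lambda s, a, sp: base(s, a, sp) + 0.99 * np.cos(sp) - np.cos(s)  # potential shaping
transitions = [(rng.normal(), rng.normal(), rng.normal()) for _ in range(128)]
background = rng.normal(size=16)
print(epic_distance(base, shaped, transitions, background, background))  # ~0.0
```

Potential shaping cancels exactly under the canonicalization, so the shaped reward in the toy check lands at (approximately) zero EPIC distance from the base reward; invariance to positive rescaling comes from the Pearson step.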
`evaluating_rewards` consists of:

- the main package, containing some reward learning algorithms and distance methods. In particular:
  - `epic_sample.py` contains the implementation of EPIC.
  - `rewards.py` and `comparisons.py` define deep reward models and associated comparison methods, and `tabular.py` the equivalent in a tabular (finite-state) setting.
  - `serialize.py` loads reward models, both from this package and other projects.
- `envs`, a sub-package defining some simple environments and associated hard-coded reward models and (in some cases) policies.
- `experiments`, a sub-package defining helper methods supporting experiments we have performed.
To use `evaluating_rewards`, you will need some reward models to compare. We provide native support for loading reward models output by imitation, an open-source implementation of AIRL and GAIL. It is also simple to add new formats to `serialize.py`.
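As a rough illustration of what loading a reward model might look like, the sketch below shows the kind of call `serialize.py` is responsible for. Treat every specific in it as an assumption rather than documentation: the loader name `load_reward`, the reward-type string, the checkpoint path and the environment id are hypothetical placeholders, so consult `serialize.py` for the actual registry of formats and exact signatures.

```python
# Hypothetical sketch only: the loader name, reward-type string, path and env id
# are assumptions; check serialize.py for the real registry and signatures.
import gym
from stable_baselines.common.vec_env import DummyVecEnv

import evaluating_rewards  # assumed to register the package's gym environments
from evaluating_rewards import serialize

# Vectorized environment the reward model will be evaluated on (env id assumed).
venv = DummyVecEnv([lambda: gym.make("evaluating_rewards/PointMassLine-v0")])

# Assumed call: load a reward model produced by imitation from a checkpoint directory.
model = serialize.load_reward(
    reward_type="imitation/RewardNet_unshaped-v0",        # placeholder format identifier
    reward_path="/path/to/imitation/checkpoints/final",   # placeholder path
    venv=venv,
)
```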
Note that, in addition to the Python requirements, you will need to install GNU Parallel to run the bash scripts below, for example with `sudo apt-get install parallel`.
By default, results are saved to `$HOME/output`; set the `EVAL_OUTPUT_ROOT` environment variable to override this.
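For example, to send results to a scratch directory instead (the path below is only an illustration):

```bash
# Write results somewhere other than the default $HOME/output (path is illustrative).
export EVAL_OUTPUT_ROOT="$HOME/scratch/evaluating_rewards_output"
```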
To replicate the results in Section 6.1, "Comparing hand-designed reward functions", run:

```bash
./runners/comparison/hardcoded_npec.sh
./runners/comparison/hardcoded_figs.sh
```
The first script, `hardcoded_npec.sh`, computes the NPEC distance between the hand-designed reward functions. This is relatively slow, since it requires training a deep network. The second script, `hardcoded_figs.sh`, plots the EPIC, NPEC and ERC heatmaps. For NPEC it uses the distances precomputed by the first script; the EPIC and ERC distances are computed on demand, since they are relatively cheap to compute.
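For intuition about why ERC in particular is cheap: it is based on the correlation between episode returns, so it only needs the return of each sampled episode under the two reward functions. The sketch below illustrates that idea with a Pearson distance over returns; it is not the library's implementation, and the episode format, sampling and toy rewards are assumptions made for the example.

```python
# Rough sketch of an episode-return-correlation distance (not the library's ERC
# implementation; episode format and toy rewards here are illustrative).
import numpy as np

def episode_return(reward_fn, episode, discount=1.0):
    """Discounted return of one episode, given as a sequence of (s, a, s') transitions."""
    return sum(discount ** t * reward_fn(s, a, sp) for t, (s, a, sp) in enumerate(episode))

def return_distance(reward_a, reward_b, episodes, discount=1.0):
    """Pearson distance sqrt((1 - rho) / 2) between per-episode returns."""
    ra = np.array([episode_return(reward_a, ep, discount) for ep in episodes])
    rb = np.array([episode_return(reward_b, ep, discount) for ep in episodes])
    rho = np.corrcoef(ra, rb)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# Toy check: a positive affine transformation of a reward gives perfectly
# correlated returns, so the distance is ~0.
rng = np.random.default_rng(0)
episodes = [[(rng.normal(), rng.normal(), rng.normal()) for _ in range(20)]
            for _ in range(64)]
r1 = lambda s, a, sp: s + a
r2 = lambda s, a, sp: 3.0 * (s + a) + 0.5
print(return_distance(r1, r2, episodes))  # ~0.0
```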
The distances in gridworlds are produced by a separate script, exploiting the tabular nature of the problem. Run:

```bash
# NPEC
python -m evaluating_rewards.analysis.dissimilarity_heatmaps.plot_gridworld_heatmap with paper
# EPIC
python -m evaluating_rewards.analysis.dissimilarity_heatmaps.plot_gridworld_heatmap with paper kind=fully_connected_random_canonical_direct
```
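What makes the tabular setting cheap is that the canonicalization can be computed exactly rather than by sampling. The snippet below is a self-contained NumPy illustration of exact EPIC for a tabular reward under uniform state and action distributions; it is not the code behind `plot_gridworld_heatmap`, and the toy rewards are made up.

```python
# Exact EPIC distance for tabular rewards under uniform distributions -- a
# self-contained sketch, not the code used by plot_gridworld_heatmap.
import numpy as np

def canonicalize_tabular(R, discount=0.99):
    """Canonically shape a tabular reward R[s, a, s'] with uniform D_S and D_A."""
    mean_from = R.mean(axis=(1, 2))  # E_{A,S'}[R(s, A, S')] for each state s
    mean_all = R.mean()              # E_{S,A,S'}[R(S, A, S')]
    return (R
            + discount * mean_from[None, None, :]
            - mean_from[:, None, None]
            - discount * mean_all)

def tabular_epic_distance(R_a, R_b, discount=0.99):
    """Pearson distance between canonically shaped rewards over all transitions."""
    ca = canonicalize_tabular(R_a, discount).ravel()
    cb = canonicalize_tabular(R_b, discount).ravel()
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# Toy 3x3 gridworld-sized check: shaping and rescaling leave the distance at ~0,
# while an unrelated random reward is far from 0.
rng = np.random.default_rng(0)
n_states, n_actions = 9, 4
R = rng.normal(size=(n_states, n_actions, n_states))
potential = rng.normal(size=n_states)
R_shaped = 2.0 * (R + 0.99 * potential[None, None, :] - potential[:, None, None])
print(tabular_epic_distance(R, R_shaped))                   # ~0.0
print(tabular_epic_distance(R, rng.normal(size=R.shape)))   # close to sqrt(1/2)
```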
To replicate the results in Section 6.2, "Predicting policy performance from reward distance", and Section 6.3, "Sensitivity of reward distance to visitation state distribution", you should run:

```bash
./runners/transfer_point_maze.sh
python -m evaluating_rewards.scripts.pipeline.combined_distances with point_maze_learned_good high_precision
```
The first script, `transfer_point_maze.sh`:

0) trains a synthetic expert policy using RL on the ground-truth reward;
1) trains reward models via IRL on demonstrations from this expert, and via preference comparison and regression with labels from the ground-truth reward;
2a) computes the NPEC distance of these rewards from the ground truth;
2b) trains policies using RL on the learned reward models;
3) evaluates the resulting policies.
The second script, `combined_distances`, computes the EPIC and ERC distances of the learned reward models from the ground truth and generates a table combining all the results.
This library is licensed under the terms of the Apache license. See LICENSE for more information.
DeepMind holds the copyright to all work prior to October 4th, conducted during the lead author's (Adam Gleave) internship at DeepMind. Subsequent modifications, conducted at UC Berkeley, are copyright of the author. Disclaimer: This is not an officially supported Google or DeepMind product.