We re-implement the following baseline agents in JAX (a minimal loss sketch is given after the list):
- TD3BC
- CQL
- COMBO
- IQL
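To give a flavor of the re-implementations, below is a minimal sketch of the TD3+BC actor loss in JAX. The function names and batch layout are illustrative and do not reflect the repo's actual module structure.

```python
import jax.numpy as jnp

def td3bc_actor_loss(pi_params, q_params, actor_fn, critic_fn, batch, alpha=2.5):
    """TD3+BC actor loss: scaled Q maximization plus a behavior-cloning term."""
    actions = actor_fn(pi_params, batch["observations"])
    q = critic_fn(q_params, batch["observations"], actions)
    lam = alpha / (jnp.abs(q).mean() + 1e-8)      # adaptive Q scaling from the TD3+BC paper
    bc = jnp.mean((actions - batch["actions"]) ** 2)
    return -(lam * q).mean() + bc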
In this work, we take a closer look at the behaviors of current SOTA offline RL agents, particularly their learned representations, value functions, and policies. Surprisingly, we find that the most performant offline RL agents sometimes have relatively low-quality representations and inaccurate value functions. Specifically, a performant offline RL policy usually selects the better sub-optimal actions while avoiding the bad ones.
In this experiment, we run linear representation probing experiments and evaluate some recently proposed representation metrics.
python exp_representation/run_exp.py
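The basic idea behind a linear probe is sketched below (an assumed workflow, not the exact script): freeze the agent's encoder, regress a probing target (e.g., a Monte-Carlo return) from the frozen features with ridge regression, and report the regression error.

```python
import jax.numpy as jnp

def linear_probe_mse(features, targets, reg=1e-3):
    """Closed-form ridge regression from frozen features to probing targets."""
    x = jnp.concatenate([features, jnp.ones((features.shape[0], 1))], axis=1)
    w = jnp.linalg.solve(x.T @ x + reg * jnp.eye(x.shape[1]), x.T @ targets)
    return jnp.mean((x @ w - targets) ** 2)
```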
In this experiment, we evaluate the ability of the learned value functions to correctly rank actions of different quality.
python exp_value_function/value_ranking_exp.py
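The kind of check the value-ranking experiment performs can be sketched as follows (illustrative names, not the script's API): given candidate actions with known returns, measure how often the learned Q-function orders a better action above a worse one.

```python
import jax.numpy as jnp

def pairwise_ranking_accuracy(q_values, true_returns):
    """Fraction of action pairs whose Q ordering matches the return ordering."""
    dq = q_values[:, None] - q_values[None, :]
    dr = true_returns[:, None] - true_returns[None, :]
    mask = jnp.abs(dr) > 1e-8                      # ignore tied returns
    return jnp.sum((dq * dr > 0) & mask) / jnp.maximum(jnp.sum(mask), 1)
```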
In this experiment, we directly evaluate the learned policy of each agent.
# policy ranking experiment
python exp_policy/policy_ranking_exp.py
# ood action experiment
python exp_policy/ood_action_pi_exp.py
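As a rough illustration of the OOD-action measurement (assumed, not the exact script), one can ask how far the policy's chosen actions fall from the dataset actions at the same states; larger gaps suggest the policy queries more out-of-distribution actions.

```python
import jax.numpy as jnp

def ood_action_gap(policy_actions, dataset_actions):
    """Mean L2 distance between the policy's action and the dataset action per state."""
    return jnp.mean(jnp.linalg.norm(policy_actions - dataset_actions, axis=-1))
```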
We present a variant of IQL, which relaxes the in-sample constraint in the policy improvement step.
./relaxed_iql/run_exp.sh
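For context, standard IQL extracts the policy with advantage-weighted regression, and the in-sample constraint comes from taking the log-likelihood only on dataset actions; a minimal sketch of that standard loss is shown below (illustrative signatures, not this repo's API). The relaxation itself lives in `relaxed_iql/`.

```python
import jax.numpy as jnp

def awr_policy_loss(log_prob_fn, pi_params, q_values, v_values, batch,
                    beta=3.0, clip=100.0):
    """Advantage-weighted behavior cloning: weight log-likelihood by exp(beta * A)."""
    adv = q_values - v_values
    weights = jnp.minimum(jnp.exp(beta * adv), clip)   # clipped for stability
    log_probs = log_prob_fn(pi_params, batch["observations"], batch["actions"])
    return -(weights * log_probs).mean()
```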
We further investigate the use of a learned dynamics model for model-free offline RL agents.
# run miql
python miql/main.py
# run mtd3bc
python mtd3bc/main.py
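The general recipe for using a learned dynamics model with a model-free agent is sketched below under assumed function names: roll the model out for a few steps from dataset states and append the synthetic transitions to the training data. The exact procedure used by mIQL and mTD3BC is in the respective directories.

```python
def model_rollout(dynamics_fn, reward_fn, policy_fn, params, start_obs, horizon=5):
    """Generate short synthetic rollouts starting from dataset states."""
    obs, transitions = start_obs, []
    for _ in range(horizon):
        act = policy_fn(params["policy"], obs)
        next_obs = dynamics_fn(params["dynamics"], obs, act)
        rew = reward_fn(params["reward"], obs, act)
        transitions.append((obs, act, rew, next_obs))
        obs = next_obs
    return transitions
```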