To read the blog post writeup, click here.
We want to build scalable, aligned RL agents; one promising approach is reward learning. An ideal reward learning algorithm should generalize well, so we use the Procgen benchmark to test how well some of these algorithms generalize.
So far, we have investigated two reward learning algorithms (both fit a reward model with a preference-ranking loss; a sketch of that loss follows the list):
- Deep Reinforcement Learning from Human Preferences
- Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
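Both methods train the reward model with the same Bradley-Terry-style cross-entropy loss over pairs of trajectory segments; DRLHP gets the pairwise labels from human preferences, while T-REX gets them from the ranking of (suboptimal) demonstrations. A minimal PyTorch sketch of that shared loss — the architecture and names here are illustrative, not our exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Small CNN mapping 64x64x3 Procgen frames to per-frame scalar rewards."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=3), nn.LeakyReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.LeakyReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, 64), nn.LeakyReLU(),  # 6x6 spatial size for 64x64 input
            nn.Linear(64, 1),
        )

    def forward(self, segment):           # segment: (T, 3, 64, 64)
        return self.net(segment).sum()    # predicted return of the whole segment

def preference_loss(reward_model, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy over a segment pair.
    label = 0 if seg_a is preferred / ranked higher, 1 if seg_b is."""
    returns = torch.stack([reward_model(seg_a), reward_model(seg_b)])
    return F.cross_entropy(returns.unsqueeze(0), torch.tensor([label]))
```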
There are other algorithms we could test, but given limited time we only looked at these two. We implemented PPO in PyTorch by adapting the baselines package (an upgrade to stable-baselines is pending) so that the agent trains against a custom, learned reward function instead of the environment reward.
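Concretely, the adaptation amounts to swapping the reward PPO sees for the reward model's output. A minimal sketch of such a wrapper, assuming the old four-tuple gym step API and a reward model like the one sketched above (names are illustrative, not our exact code):

```python
import gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's reward with the learned reward model's output,
    so PPO optimizes the learned reward rather than the true Procgen reward."""

    def __init__(self, env, reward_model, device="cpu"):
        super().__init__(env)
        self.reward_model = reward_model.to(device).eval()
        self.device = device

    def step(self, action):
        obs, true_reward, done, info = self.env.step(action)
        with torch.no_grad():
            frame = torch.as_tensor(obs, dtype=torch.float32, device=self.device)
            frame = frame.permute(2, 0, 1).unsqueeze(0)  # HWC frame -> NCHW batch
            learned_reward = self.reward_model(frame).item()
        info["true_reward"] = true_reward  # kept only for evaluation and logging
        return obs, learned_reward, done, info
```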
Procgen has 16 different environments; based on some initial testing we decided to focus on four: coinrun, bigfish, starpilot, and fruitbot.
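For reference, a sketch of how these environments are created through gym; num_levels, start_level, and distribution_mode are standard procgen options, and the particular values here are illustrative:

```python
import gym

ENV_NAMES = ["coinrun", "bigfish", "starpilot", "fruitbot"]

def make_env(name, num_levels=200, start_level=0, distribution_mode="easy"):
    """Create one Procgen environment; limiting num_levels leaves a held-out set
    of unseen levels for measuring generalization."""
    return gym.make(
        f"procgen:procgen-{name}-v0",
        num_levels=num_levels,
        start_level=start_level,
        distribution_mode=distribution_mode,
    )

envs = {name: make_env(name) for name in ENV_NAMES}
```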
An explanation of how to run the T-REX code is in the trex repository.