A minimal and stable Proximal Policy Optimization (PPO), tested on IsaacGymEnvs.
- Python (tested on 3.7)
- PyTorch (tested on 1.8.1)
Follow the official installation instructions for Isaac Gym and the IsaacGymEnvs repo.
Optional steps for cleaner code and fewer dependencies:
- Under the `isaacgymenvs` directory, the `cfg` and `learning` subdirectories and the `train.py` file can be removed.
- The dependency on `rl-games` can be removed.
To train a policy on Cartpole, run
python train.py task=Cartpole
Cartpole training should converge to an optimal policy within a few seconds of starting.
In the `configs` directory, we provide the main config file and template configs for the `Cartpole` and `AllegroHand` tasks. We use Hydra for config management, following IsaacGymEnvs.
To train on additional tasks, follow the template configs to define `[new_task].yaml` under `configs/task` and `[new_task]PPO.yaml` under `configs/train`.
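As a small illustration of the naming convention above, the following sketch computes the two file paths a new task would need. The helper `new_task_config_paths` and the task name `MyTask` are hypothetical and not part of this repo; only the `configs/task` and `configs/train` locations and the `PPO.yaml` suffix come from the text.

```python
from pathlib import PurePosixPath


def new_task_config_paths(task_name, config_root="configs"):
    """Return the (task config, train config) paths expected for a new task.

    Hypothetical helper: it only encodes the naming convention
    configs/task/[new_task].yaml and configs/train/[new_task]PPO.yaml.
    """
    root = PurePosixPath(config_root)
    task_cfg = root / "task" / f"{task_name}.yaml"
    train_cfg = root / "train" / f"{task_name}PPO.yaml"
    return task_cfg, train_cfg


# "MyTask" is a placeholder task name used for illustration only.
task_cfg, train_cfg = new_task_config_paths("MyTask")
print(task_cfg)
print(train_cfg)
```

Copying the Cartpole or AllegroHand templates to these two paths and editing them is likely the quickest starting point.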
Logging to TensorBoard and WandB is supported by default.
Our PPO results match IsaacGymEnvs' default RL implementation, in terms of both training speed and performance.
- `task=TASK` - Selects which task to use. Options correspond to the config for each environment in `configs/task`.
- `num_envs=NUM_ENVS` - Selects the number of environments to use (overriding the default number of environments set in the task config).
- `seed=SEED` - Sets a seed value for randomizations, and overrides the default seed set up in the task config.
- `device_id=DEVICE_ID` - Device used for physics simulation and the RL algorithm.
- `graphics_device_id=GRAPHICS_DEVICE_ID` - Which Vulkan graphics device ID to use for rendering. Defaults to 0. Note that this may differ from the CUDA device ID and does not follow PyTorch-like device syntax.
- `pipeline=PIPELINE` - Which API pipeline to use. Defaults to `gpu`; can also be set to `cpu`. When using the `gpu` pipeline, all data stays on the GPU and everything runs as fast as possible. When using the `cpu` pipeline, simulation can run on either CPU or GPU, depending on the `sim_device` setting, but a copy of the data is always made on the CPU at every step.
- `test=TEST` - If set to `True`, only runs inference on the policy and does not do any training.
- `checkpoint=CHECKPOINT_PATH` - Path to the checkpoint to load for training or testing.
- `headless=HEADLESS` - Whether to run in headless mode.
- `output_name=OUTPUT_NAME` - Sets the output folder name.
- `wandb_mode=WANDB_MODE` - Options for using WandB.
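To make the `key=value` override syntax concrete, here is a minimal pure-Python sketch of how such overrides could be merged into a nested config. It is an illustration only: Hydra's real override grammar is richer, and the functions `parse_value` and `apply_overrides` as well as the default values in `config` are assumptions made for this example.

```python
def parse_value(raw):
    """Convert an override string to bool, int, or float when possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # fall back to the plain string


def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Placeholder defaults, not the values shipped in this repo's configs.
config = {"task": "Cartpole", "num_envs": 512, "train": {"ppo": {"gamma": 0.99}}}
apply_overrides(
    config,
    ["task=AllegroHand", "num_envs=1024", "test=True", "train.ppo.gamma=0.95"],
)
print(config)
```

On the command line, the same overrides would simply be appended after `python train.py`.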
The main configs to experiment with are:
- `train.network.mlp.units`
- `train.ppo.gamma`
- `train.ppo.tau`
- `train.ppo.learning_rate`
- `train.ppo.lr_schedule`
- `train.ppo.kl_threshold` (only relevant when `lr_schedule == 'kl'`)
- `train.ppo.e_clip`
- `train.ppo.horizon_length`
- `train.ppo.minibatch_size`
- `train.ppo.max_agent_steps`
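For intuition about what several of these knobs control, the sketch below shows textbook versions of the pieces they typically feed into. It is not this repo's code and makes several assumptions: that `tau` is the GAE lambda (as in rl_games naming), that `e_clip` is the PPO clipping range, and that the `'kl'` schedule rescales the learning rate when the measured KL drifts away from `kl_threshold`. The function names and the constants in `adapt_learning_rate` (factor 1.5, bounds `1e-6` and `1e-2`) are illustrative assumptions.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, tau=0.95):
    """Generalized Advantage Estimation for one env over one rollout.

    rewards, values, dones have length horizon_length; dones[t] marks that
    the episode ended at step t. last_value bootstraps the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])
        # One-step TD error, cut off at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * tau * not_done * gae
        advantages[t] = gae
    return advantages


def clipped_surrogate(ratio, advantage, e_clip=0.2):
    """PPO clipped policy loss for a single sample (to be minimized)."""
    clipped_ratio = max(1.0 - e_clip, min(ratio, 1.0 + e_clip))
    return -min(ratio * advantage, clipped_ratio * advantage)


def adapt_learning_rate(lr, kl, kl_threshold,
                        factor=1.5, min_lr=1e-6, max_lr=1e-2):
    """Assumed KL-adaptive rule: shrink lr if KL is too large, grow if too small."""
    if kl > 2.0 * kl_threshold:
        return max(lr / factor, min_lr)
    if kl < 0.5 * kl_threshold:
        return min(lr * factor, max_lr)
    return lr


advantages = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
    last_value=0.0, dones=[False, False, False], gamma=0.5, tau=1.0,
)
print(advantages)
print(clipped_surrogate(ratio=1.5, advantage=1.0, e_clip=0.2))
print(adapt_learning_rate(lr=3e-4, kl=0.05, kl_threshold=0.016))
```

Under these assumptions, a larger `gamma` or `tau` gives longer-horizon but noisier advantage estimates, while a smaller `e_clip` restricts how far each update can move the policy.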
We recommend the default values for other configs, but of course, RL is RL :)
Here are some helpful guides to tuning PPO hyperparameters:
- The 37 Implementation Details of Proximal Policy Optimization
I also documented a few general takeaways in this tweet.
Yes, rl_games has great performance, but it can be hard to use.
If all you're looking for is a simple, clean, performant PPO that is easy to modify and extend, try this repo :))) And feel free to give feedback to make this better!
Please use the following bibtex if you find this repo helpful and would like to cite:
@misc{minimal-stable-PPO,
author = {Lin, Toru},
title = {A minimal and stable PPO},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ToruOwO/minimal-stable-PPO}},
}
Shout-out to hora and rl_games, which this code implementation referenced!