ToruOwO/minimal-stable-PPO

minimal-stable-PPO

A minimal and stable implementation of Proximal Policy Optimization (PPO), tested on IsaacGymEnvs.

Requirements

  • Python (tested on 3.7)
  • PyTorch (tested on 1.8.1)

Training on IsaacGymEnvs

Follow the instructions here to install Isaac Gym and the IsaacGymEnvs repo.

Optional instructions for cleaner code and dependencies:

  • Under the isaacgymenvs directory, the cfg and learning subdirectories and the train.py file can be removed.
  • The dependency on rl-games on this line can be removed.

First example

To train a policy on Cartpole, run

python train.py task=Cartpole

Cartpole should converge to an optimal policy within a few seconds of starting.

In the configs directory, we provide the main config file and template configs for the Cartpole and AllegroHand tasks. We use Hydra for config management, following IsaacGymEnvs.

Custom tasks

To train on additional tasks, follow the template configs to define [new_task].yaml under configs/task and [new_task]PPO.yaml under configs/train.
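
As a quick sanity check before launching a run on a new task, the naming convention above can be sketched in Python. This is only an illustration: the helper names `task_config_paths` and `missing_configs` and the task name `MyTask` are hypothetical and not part of this repo.

```python
from pathlib import Path
from typing import Dict, List


def task_config_paths(task_name: str, config_root: str = "configs") -> Dict[str, Path]:
    """Return the two config files expected for a task, per the README convention."""
    root = Path(config_root)
    return {
        "task": root / "task" / f"{task_name}.yaml",
        "train": root / "train" / f"{task_name}PPO.yaml",
    }


def missing_configs(task_name: str, config_root: str = "configs") -> List[Path]:
    """List the expected config files that do not exist yet."""
    paths = task_config_paths(task_name, config_root)
    return [path for path in paths.values() if not path.exists()]


if __name__ == "__main__":
    # "MyTask" is a placeholder task name used only for illustration.
    for path in missing_configs("MyTask"):
        print(f"missing config: {path}")
```

Running this from the repo root lists whichever of the two files still has to be created from the templates.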

Results

Logging to TensorBoard and WandB is supported by default.

Our PPO results match IsaacGymEnvs' default RL implementation in both training speed and final performance.

Cartpole in 40 seconds

AllegroHand in 3 hours

Key arguments and parameters

Main config (config.yaml)

  • task=TASK - Selects which task to use. Options correspond to the config for each environment in configs/task.
  • num_envs=NUM_ENVS - Selects the number of environments to use (overriding the default number of environments set in the task config).
  • seed=SEED - Sets a seed value for randomizations, and overrides the default seed set up in the task config.
  • device_id=DEVICE_ID - Device used for physics simulation and the RL algorithm.
  • graphics_device_id=GRAPHICS_DEVICE_ID - Which Vulkan graphics device ID to use for rendering. Defaults to 0. Note - this may be different from CUDA device ID, and does not follow PyTorch-like device syntax.
  • pipeline=PIPELINE - Which API pipeline to use. Defaults to gpu, can also set to cpu. When using the gpu pipeline, all data stays on the GPU and everything runs as fast as possible. When using the cpu pipeline, simulation can run on either CPU or GPU, depending on the sim_device setting, but a copy of the data is always made on the CPU at every step.
  • test=TEST - If set to True, only runs inference on the policy and does not do any training.
  • checkpoint=CHECKPOINT_PATH - Path to the checkpoint to load for training or testing.
  • headless=HEADLESS - Whether to run in headless mode.
  • output_name=OUTPUT_NAME - Sets the output folder name.
  • wandb_mode=WANDB_MODE - Options for using WandB.
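
These arguments are passed as Hydra-style `key=value` overrides on the command line. The sketch below assembles such a command from a dictionary, which can be handy when launching several runs from a script. The helper name `build_train_command` and the specific values (512 environments, seed 42) are illustrative assumptions, not defaults from this repo.

```python
import shlex
from typing import Dict, List, Union

OverrideValue = Union[str, int, float, bool]


def build_train_command(overrides: Dict[str, OverrideValue]) -> List[str]:
    """Turn a dict of overrides into an argument list for train.py."""
    cmd = ["python", "train.py"]
    for key, value in overrides.items():
        # Each override is a single key=value token, as in `task=Cartpole`.
        cmd.append(f"{key}={value}")
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        {"task": "Cartpole", "num_envs": 512, "seed": 42, "headless": True}
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch training, pass the list to
    # subprocess.run(cmd, check=True) from the repo root.
```

Building the command as a list rather than a single string avoids shell-quoting problems if a value, such as a checkpoint path, contains spaces.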

RL config (train/[task_name]PPO.yaml)

The main configs to experiment with are:

  • train.network.mlp.units
  • train.ppo.gamma
  • train.ppo.tau
  • train.ppo.learning_rate
  • train.ppo.lr_schedule
  • train.ppo.kl_threshold (only relevant when lr_schedule == 'kl')
  • train.ppo.e_clip
  • train.ppo.horizon_length
  • train.ppo.minibatch_size
  • train.ppo.max_agent_steps
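
To give some intuition for three of these parameters, here is a minimal pure-Python sketch of Generalized Advantage Estimation and the PPO clipped objective. It assumes that `tau` plays the role of the GAE lambda (as it does in rl_games) and that `e_clip` is the clipping range of the probability ratio. The function names are hypothetical, the default values are arbitrary, and this is not the code used in this repo.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,
    tau: float = 0.95,
) -> List[float]:
    """Advantages for one rollout of a single environment.

    values[t] is the value estimate at step t, last_value is the estimate
    for the state after the final step, and dones[t] is True if the
    episode ended at step t. `tau` is assumed to be the GAE lambda.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors (weight gamma * tau).
        gae = delta + gamma * tau * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio: float, advantage: float, e_clip: float = 0.2) -> float:
    """PPO clipped objective for one sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - e_clip, min(ratio, 1.0 + e_clip))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advantages = compute_gae(
        rewards=[1.0, 1.0, 1.0],
        values=[0.5, 0.5, 0.5],
        last_value=0.5,
        dones=[False, False, False],
    )
    print(advantages)
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

Under these assumptions, a smaller `tau` relies more on the value function (lower variance, more bias), while a smaller `e_clip` keeps each update closer to the policy that collected the data.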

We recommend the default value for other configs, but of course, RL is RL :)

Here are some helpful guides to tuning PPO hyperparameters:

The 37 Implementation Details of Proximal Policy Optimization

Engstrom L, Ilyas A, Santurkar S, Tsipras D, Janoos F, Rudolph L, Madry A. Implementation matters in deep policy gradients: A case study on PPO and TRPO. International Conference on Learning Representations, 2020.

Andrychowicz M, Raichuk A, Stańczyk P, Orsini M, Girgin S, Marinier R, Hussenot L, Geist M, Pietquin O, Michalski M, Gelly S. What matters in on-policy reinforcement learning? A large-scale empirical study. International Conference on Learning Representations, 2021.

Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, 2016 (pp. 1329-1338). PMLR.

I also documented a few general takeaways in this tweet.

Wait, doesn't IsaacGymEnvs already provide RL training scripts?

Yes. rl_games has great performance but can be hard to use.

If all you're looking for is a simple, clean, performant PPO that is easy to modify and extend, try this repo :))) And feel free to give feedback to make this better!

Citation

Please use the following BibTeX entry if you find this repo helpful and would like to cite it:

@misc{minimal-stable-PPO,
  author = {Lin, Toru},
  title = {A minimal and stable PPO},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ToruOwO/minimal-stable-PPO}},
}

Acknowledgement

Shout-out to hora and rl_games, which this code implementation referenced!
