This repository features a PyTorch-based implementation of PPO using TransformerXL (TrXL). It aims to provide a clean baseline/reference implementation showing how to employ memory-based agents using Transformers and PPO.
- Episodic Transformer Memory
  - TransformerXL (TrXL)
  - Gated TransformerXL (GTrXL)
- Environments
  - Proof-of-concept Memory Task (PocMemoryEnv)
  - CartPole
    - Masked velocity
  - Minigrid Memory
    - Visual Observation Space 3x84x84
    - Egocentric Agent View Size 3x3 (default 7x7)
    - Action Space: forward, rotate left, rotate right
  - MemoryGym
    - Mortar Mayhem
    - Mystery Path
    - Searing Spotlights (WIP)
- Tensorboard
- Enjoy (watch a trained agent play)
@article{pleines2023trxlppo,
title = {TransformerXL as Episodic Memory in Proximal Policy Optimization},
author = {Pleines, Marco and Pallasch, Matthias and Zimmer, Frank and Preuss, Mike},
journal= {Github Repository},
year = {2023},
url = {https://github.com/MarcoMeter/episodic-transformer-memory-ppo}
}
- Installation
- Train a model
- Enjoy a model
- Episodic Transformer Memory Concept
- Hyperparameters
  - Episodic Transformer Memory
  - General
  - Schedules
- Add Environment
- Tensorboard
- Results
Install PyTorch 1.12.1 for your platform. We recommend using Anaconda.
Create an Anaconda environment:
conda create -n transformer-ppo python=3.7 --yes
conda activate transformer-ppo
CPU:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cpuonly -c pytorch
CUDA:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
Install the remaining requirements and you are good to go:
pip install -r requirements.txt
Training is launched via train.py. The --config flag specifies the path to the YAML config file containing the hyperparameters. The --run-id flag is used to distinguish training runs. After training, the trained model is saved to ./models/$run-id$.nn.
python train.py --config configs/minigrid.yaml --run-id=my-trxl-training
To watch an agent exploit its trained model, execute enjoy.py. Some pre-trained models can be found in ./models/. The model to enjoy is specified using the --model flag.
python enjoy.py --model=models/mortar_mayhem_grid_trxl.nn
Hyperparameter | Description |
---|---|
num_blocks | Number of transformer blocks |
embed_dim | Embedding size of every layer inside a transformer block |
num_heads | Number of heads used in the transformer's multi-head attention mechanism |
memory_length | Length of the sliding episodic memory window |
positional_encoding | Relative and learned positional encodings can be used |
layer_norm | Whether to apply layer normalization before or after every transformer component. Pre layer normalization refers to the identity map re-ordering. |
gtrxl | Whether to use Gated TransformerXL |
gtrxl_bias | Initial value for GTrXL's bias weight |
gamma | Discount factor |
lamda | Smoothing parameter (lambda) that trades off bias and variance when calculating the Generalized Advantage Estimation (GAE) |
updates | Number of cycles that the entire PPO algorithm is being executed |
n_workers | Number of environments that are used to sample training data |
worker_steps | Number of steps an agent samples data in each environment (batch_size = n_workers * worker_steps) |
epochs | Number of times that the whole batch of data is used for optimization using PPO |
n_mini_batch | Number of mini batches that are trained throughout one epoch |
value_loss_coefficient | Multiplier of the value function loss to constrain it |
hidden_layer_size | Number of hidden units in each linear hidden layer |
max_grad_norm | Gradients are clipped by the specified max norm |
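
How gamma and lamda from the table above interact can be illustrated with a minimal, framework-free sketch of GAE. The function name compute_gae and its argument layout are assumptions made for illustration; they are not taken from this repository's code.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lamda=0.95):
    """Sketch of Generalized Advantage Estimation over one worker's trajectory.

    rewards, values, dones: per-step lists of equal length (hypothetical layout).
    last_value: value estimate of the observation following the final step.
    """
    advantages = [0.0] * len(rewards)
    last_advantage = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the advantage of its successor.
    for t in reversed(range(len(rewards))):
        # Do not bootstrap across an episode boundary.
        mask = 0.0 if dones[t] else 1.0
        # One-step temporal difference error.
        delta = rewards[t] + gamma * next_value * mask - values[t]
        # Exponentially weighted sum of TD errors, weighted by gamma * lamda.
        last_advantage = delta + gamma * lamda * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(compute_gae([1.0, 1.0], [0.0, 0.0], [False, True], 0.0,
                      gamma=0.5, lamda=0.5))
```

With lamda close to 0 the estimate relies mostly on the one-step TD error (lower variance, more bias); with lamda close to 1 it approaches the full discounted return minus the value baseline.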
These schedules can be used to polynomially decay the learning rate, the entropy bonus coefficient and the clip range.
Hyperparameter | Description |
---|---|
learning_rate_schedule | The learning rate used by the AdamW optimizer |
beta_schedule | Beta is the entropy bonus coefficient that is used to encourage exploration |
clip_range_schedule | Strength of clipping losses done by the PPO algorithm |
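
A polynomial decay of this kind can be sketched as follows. The parameter names (initial, final, max_decay_steps, power) are assumptions chosen for illustration and may not match the keys used in the yaml configs.

```python
def polynomial_decay(initial, final, max_decay_steps, power, current_step):
    """Decay a value from `initial` to `final` over `max_decay_steps` updates.

    power=1.0 yields a linear decay; larger powers decay faster early on.
    """
    # After the decay horizon (or if nothing is to be decayed) stay at `final`.
    if current_step >= max_decay_steps or initial == final:
        return final
    remaining = 1.0 - current_step / max_decay_steps
    return (initial - final) * remaining ** power + final


if __name__ == "__main__":
    # Hypothetical learning rate schedule: 3e-4 decayed linearly to 1e-4.
    for update in (0, 50, 100, 150):
        print(update, polynomial_decay(3.0e-4, 1.0e-4, 100, 1.0, update))
```

The same function could be applied once per update to the learning rate, the entropy coefficient and the clip range, each with its own initial and final value.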
Follow these steps to train another environment:
- Implement a wrapper of your desired environment. It needs the properties observation_space, action_space and max_episode_steps. The needed functions are render(), reset() and step().
- Extend the create_env() function in utils.py by adding another if-statement that queries the environment's "type".
- Adjust the "type" and "name" key inside the environment's yaml config.
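
The first step above can be sketched as a thin wrapper class. Everything named here is hypothetical: DummyEnv stands in for the environment to be wrapped, MyEnvWrapper is an illustrative name, and the (observation, reward, done, info) return layout of step() as well as the step-limit handling are assumptions rather than the repository's documented interface.

```python
class DummyEnv:
    """Hypothetical stand-in for the environment that is being wrapped."""
    observation_space = (4,)  # placeholder; a real env would expose a space object
    action_space = 2          # placeholder for a discrete action space

    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 1.0, False, {}

    def render(self):
        pass


class MyEnvWrapper:
    """Exposes the three properties and three functions listed above."""

    def __init__(self, env, max_steps=128):
        self._env = env
        self._max_steps = max_steps
        self._t = 0

    @property
    def observation_space(self):
        return self._env.observation_space

    @property
    def action_space(self):
        return self._env.action_space

    @property
    def max_episode_steps(self):
        return self._max_steps

    def reset(self):
        self._t = 0
        return self._env.reset()

    def step(self, action):
        obs, reward, done, info = self._env.step(action)
        self._t += 1
        # Assumption: the wrapper ends the episode once the step limit is hit.
        if self._t >= self._max_steps:
            done = True
        return obs, reward, done, info

    def render(self):
        return self._env.render()


if __name__ == "__main__":
    env = MyEnvWrapper(DummyEnv(), max_steps=3)
    env.reset()
    print([env.step(0)[2] for _ in range(3)])
```

The remaining two steps then only register the wrapper: one more branch in create_env() for the new "type" value, and a yaml config whose "type" and "name" keys point at it.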
Note that only environments with visual or vector observations are supported. The environment's action space can be either discrete or multi-discrete.
During training, tensorboard summaries are saved to summaries/run-id/timestamp.
Run tensorboard --logdir=summaries to watch the training statistics in your browser using the URL http://localhost:6006/.
Every experiment is repeated on 5 random seeds. Each model checkpoint is evaluated on 50 unknown environment seeds, which are repeated 5 times. Hence, one data point aggregates 1250 (5x5x50) episodes. Rliable is used to retrieve the interquartile mean and the bootstrapped confidence interval. The training is conducted using the more sophisticated DRL framework neroRL. The clean GRU-PPO baseline can be found here.
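
To make the aggregation concrete, the following is a plain-Python sketch of an interquartile mean with a percentile bootstrap confidence interval. It does not use or reproduce Rliable's API (Rliable applies a stratified bootstrap); the function names and the resampling scheme are assumptions for illustration only.

```python
import random


def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (lowest and highest 25% dropped)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, num_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval of the interquartile mean."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        interquartile_mean(rng.choices(scores, k=len(scores)))
        for _ in range(num_resamples)
    )
    lower_index = int((1 - confidence) / 2 * num_resamples)
    upper_index = int((1 + confidence) / 2 * num_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Hypothetical episode returns; one real data point would hold 1250 of them.
    returns = [0.1, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.0]
    print(interquartile_mean(returns), bootstrap_ci(returns))
```

The interquartile mean is less sensitive to outlier episodes than the plain mean, which is why it suits a data point that pools many seeds and repetitions.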
Mystery Path Grid (Goal & Origin Hidden)
TrXL and GTrXL have identical performance. See Issue #7.