TransformerXL as Episodic Memory in Proximal Policy Optimization

This repository features a PyTorch-based implementation of PPO using TransformerXL (TrXL). It aims to provide a clean baseline and reference implementation showing how to successfully employ memory-based agents using Transformers and PPO.

Features

- Episodic Transformer Memory
  - TransformerXL (TrXL)
  - Gated TransformerXL (GTrXL)
- Environments
  - Proof-of-concept Memory Task (PocMemoryEnv)
  - CartPole
    - Masked velocity
  - Minigrid Memory
    - Visual Observation Space 3x84x84
    - Egocentric Agent View Size 3x3 (default 7x7)
    - Action Space: forward, rotate left, rotate right
  - MemoryGym
    - Mortar Mayhem
    - Mystery Path
    - Searing Spotlights (WIP)
- Tensorboard
- Enjoy (watch a trained agent play)

Citing this work

@article{pleines2023trxlppo,
  title = {TransformerXL as Episodic Memory in Proximal Policy Optimization},
  author = {Pleines, Marco and Pallasch, Matthias and Zimmer, Frank and Preuss, Mike},
  journal= {Github Repository},
  year = {2023},
  url = {https://github.com/MarcoMeter/episodic-transformer-memory-ppo}
}

Installation

Install PyTorch 1.12.1 for your platform. We recommend using Anaconda.

Create Anaconda environment:

conda create -n transformer-ppo python=3.7 --yes
conda activate transformer-ppo

CPU:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cpuonly -c pytorch

CUDA:

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

Install the remaining requirements and you are good to go:

pip install -r requirements.txt

Train a model

Training is launched via train.py. --config specifies the path to the YAML config file that contains the hyperparameters. --run-id distinguishes training runs. After training, the trained model is saved to ./models/$run-id$.nn.

python train.py --config configs/minigrid.yaml --run-id=my-trxl-training

Enjoy a model

To watch an agent exploit its trained model, execute enjoy.py. Some pre-trained models can be found in: ./models/. The to-be-enjoyed model is specified using the --model flag.

python enjoy.py --model=models/mortar_mayhem_grid_trxl.nn

Episodic Transformer Memory Concept

Hyperparameters

Episodic Transformer Memory

Hyperparameter	Description
num_blocks	Number of transformer blocks
embed_dim	Embedding size of every layer inside a transformer block
num_heads	Number of heads used in the transformer's multi-head attention mechanism
memory_length	Length of the sliding episodic memory window
positional_encoding	Relative and learned positional encodings can be used
layer_norm	Whether to apply layer normalization before or after every transformer component. Pre layer normalization refers to the identity map re-ordering.
gtrxl	Whether to use Gated TransformerXL
gtrxl_bias	Initial value for GTrXL's bias weight
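
To illustrate what memory_length controls, the following is a minimal sketch of a sliding window over past time steps. The function name memory_window and the exact boundary handling are assumptions made for illustration; the repository's own indexing may differ.

```python
def memory_window(step, memory_length):
    """Return the indices of the past time steps visible at `step`.

    Assumption: the window covers at most `memory_length` steps that
    directly precede the current step.
    """
    start = max(0, step - memory_length)
    return list(range(start, step))


# With memory_length=3 the window grows until it holds 3 steps, then slides.
for step in (0, 2, 5):
    print(step, memory_window(step, memory_length=3))
```

A larger memory_length lets the agent attend to events further in the past, at the cost of more memory and compute per step.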

General

gamma	Discount factor
lamda	Lambda parameter of Generalized Advantage Estimation (GAE), which trades off bias against variance in the advantage estimates
updates	Number of cycles that the entire PPO algorithm is being executed
n_workers	Number of environments that are used to sample training data
worker_steps	Number of steps an agent samples data in each environment (batch_size = n_workers * worker_steps)
epochs	Number of times that the whole batch of data is used for optimization using PPO
n_mini_batch	Number of mini batches that are trained throughout one epoch
value_loss_coefficient	Multiplier of the value function loss to constrain it
hidden_layer_size	Number of hidden units in each linear hidden layer
max_grad_norm	Gradients are clipped by the specified max norm
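
The roles of gamma and lamda can be seen in a small GAE computation. This is a plain-Python sketch of the standard GAE recursion, not the repository's code; the function name compute_gae and its argument layout are assumptions.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lamda=0.95):
    """Compute Generalized Advantage Estimates for one trajectory.

    rewards, values, dones are lists of equal length; last_value is the
    value estimate of the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the estimate of its successor.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, cut at episode ends.
        gae = delta + gamma * lamda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


print(compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, [False, True], gamma=1.0, lamda=1.0))
```

With lamda close to 0 the estimate relies mostly on the one-step TD error (lower variance, more bias); with lamda close to 1 it approaches the full discounted return minus the value baseline.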

Schedules

These schedules can be used to polynomially decay the learning rate, the entropy bonus coefficient and the clip range.

learning_rate_schedule	The learning rate used by the AdamW optimizer
beta_schedule	Beta is the entropy bonus coefficient that is used to encourage exploration
clip_range_schedule	Strength of clipping losses done by the PPO algorithm
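
A polynomial decay of this kind can be sketched as follows. The parameter names (initial, final, max_decay_steps, power) are assumptions chosen for illustration and may not match the keys used in the yaml configs.

```python
def polynomial_decay(initial, final, max_decay_steps, power, current_step):
    """Decay a value from `initial` to `final` over `max_decay_steps` updates."""
    # Once the schedule has run out, stay at the final value.
    if current_step >= max_decay_steps or initial == final:
        return final
    remaining = 1.0 - current_step / max_decay_steps
    return (initial - final) * (remaining ** power) + final


# power=1.0 yields a linear decay from 1.0 to 0.0 over 10 updates.
for update in (0, 5, 10, 20):
    print(update, polynomial_decay(1.0, 0.0, 10, 1.0, update))
```

Setting initial equal to final keeps the value constant, which is a simple way to disable a schedule under these assumptions.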

Add Environment

Follow these steps to train another environment:

1. Implement a wrapper of your desired environment. It needs the properties observation_space, action_space and max_episode_steps. The needed functions are render(), reset() and step().
2. Extend the create_env() function in utils.py by adding another if-statement that queries the environment's "type".
3. Adjust the "type" and "name" keys inside the environment's yaml config.

Note that only environments with visual or vector observations are supported. The environment's action space can be either discrete or multi-discrete.
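
The steps above can be sketched as follows. Everything here is hypothetical: the toy CountingEnvWrapper, the "Counting" type string, the Gym-like return value of step(), and the signature of create_env(), which may differ from the one in utils.py. The real code most likely expects Gym-style space objects rather than the plain values used to keep this sketch dependency-free.

```python
class CountingEnvWrapper:
    """Hypothetical wrapper: the episode ends after a fixed number of steps."""

    def __init__(self, max_steps=5):
        self._max_episode_steps = max_steps
        self._step_count = 0

    @property
    def observation_space(self):
        # Assumption: a plain shape tuple stands in for a vector observation space.
        return (1,)

    @property
    def action_space(self):
        # Assumption: an integer stands in for a discrete space with two actions.
        return 2

    @property
    def max_episode_steps(self):
        return self._max_episode_steps

    def reset(self):
        self._step_count = 0
        return [0.0]

    def step(self, action):
        # Assumption: Gym-like return value (observation, reward, done, info).
        self._step_count += 1
        obs = [float(self._step_count)]
        reward = 1.0 if action == 1 else 0.0
        done = self._step_count >= self._max_episode_steps
        return obs, reward, done, {}

    def render(self):
        print(f"step {self._step_count}/{self._max_episode_steps}")


def create_env(config):
    """Sketch of the additional if-statement that queries the "type" key."""
    if config["type"] == "Counting":
        return CountingEnvWrapper()
    raise ValueError(f"Unknown environment type: {config['type']}")


env = create_env({"type": "Counting", "name": "Counting-v0"})
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    obs, reward, done, info = env.step(1)
    episode_return += reward
print(episode_return)
```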

Tensorboard

During training, tensorboard summaries are saved to summaries/run-id/timestamp.

Run tensorboard --logdir=summaries to watch the training statistics in your browser using the URL http://localhost:6006/.

Results

Every experiment is repeated on 5 random seeds. Each model checkpoint is evaluated on 50 unknown environment seeds, which are repeated 5 times. Hence, one data point aggregates 1250 (5x5x50) episodes. Rliable is used to retrieve the interquartile mean and the bootstrapped confidence interval. The training is conducted using the more sophisticated DRL framework neroRL. The clean GRU-PPO baseline can be found here.
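
For readers unfamiliar with these statistics, here is a simplified plain-Python sketch of the interquartile mean and a percentile bootstrap confidence interval. It does not call Rliable and is not the evaluation code used for the results; the function names and the synthetic scores are assumptions for illustration only.

```python
import random


def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (lowest and highest 25% dropped)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval of the interquartile mean."""
    rng = random.Random(seed)
    estimates = sorted(
        interquartile_mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# One data point aggregates 5 training seeds x 5 repetitions x 50 environment seeds.
episodes_per_data_point = 5 * 5 * 50

# Synthetic scores for illustration only, not results from this repository.
rng = random.Random(42)
scores = [rng.random() for _ in range(episodes_per_data_point)]
print(interquartile_mean(scores), bootstrap_ci(scores))
```

The interquartile mean is less sensitive to outlier episodes than the plain mean, which is why it is commonly paired with bootstrapped intervals when comparing few training runs.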

Mystery Path Grid (Goal & Origin Hidden)

TrXL and GTrXL have identical performance. See Issue #7.

