This repository contains PyTorch implementations of the papers Sample Efficient Actor-Critic with Experience Replay (a.k.a. ACER) and Asynchronous Methods for Deep Reinforcement Learning (a.k.a. A3C).
The A3C paper introduced several key ideas that can be summarized as follows:
- Asynchronous updates from multiple parallel agents decorrelate the agents' data into a more stationary process, instead of maintaining an Experience Replay memory. This removes the restriction to off-policy methods and also reduces the memory and computation needed per real interaction with the environment.
- Architectures that share layers between the policy and the value function, providing better and more efficient representation learning and feature extraction (see the sketch after this list).
- An updating scheme that operates on fixed-length segments of experience (say, 5 or 20 time-steps), which increases the stationarity of the agent's data.
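As a rough illustration of the shared-representation idea, the policy and value heads can sit on top of a single convolutional trunk. The sketch below is illustrative only and assumes 84x84 stacked Atari frames; the repository's actual architecture lives in NN/model.py and may differ:

```python
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Illustrative sketch: policy and value heads share one convolutional trunk."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        # Shared feature extractor (assumes 84x84 inputs, the usual Atari preprocessing).
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)  # action logits
        self.value_head = nn.Linear(512, 1)           # state-value estimate

    def forward(self, x):
        features = self.trunk(x / 255.0)  # one shared representation feeds both heads
        return self.policy_head(features), self.value_head(features)
```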
However, A3C's lack of Experience Replay means it is considerably sample-inefficient: the number of interactions with the environment needed to solve a task is consequently high.
To address this deficiency of A3C, ACER builds an actor-critic method on A3C's core structure and adds the benefits of thread-based Experience Replay to improve sample efficiency. More precisely, in the ACER algorithm each parallel agent performs A3C-like on-policy updates on the one hand and, on the other, keeps its own Experience Replay buffer to perform off-policy updates.
ACER also utilizes more advanced techniques such as Truncated Importance Sampling with Bias Correction, Stochastic Dueling Network architectures, and Efficient Trust Region Policy Optimization to further improve stability (a common challenge in Policy Gradient methods) and to increase sample efficiency even more.
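For the discrete-action case, the truncation-with-bias-correction weights from the ACER paper can be sketched as follows (illustrative only; variable names are not the repository's):

```python
import torch

def truncated_is_weights(pi_probs, mu_probs, actions, c=10.0):
    """Sketch of ACER's truncated importance sampling with bias correction.

    pi_probs, mu_probs: [batch, n_actions] action probabilities under the current
    policy and the behaviour (replayed) policy; actions: [batch] taken action indices.
    """
    rho = pi_probs / (mu_probs + 1e-8)                # importance ratios for all actions
    rho_taken = rho.gather(1, actions.unsqueeze(-1)).squeeze(-1)
    truncated = rho_taken.clamp(max=c)                # min(c, rho_t) for the taken action
    bias_correction = (1.0 - c / rho).clamp(min=0.0)  # [1 - c / rho(a)]_+ for every action
    return truncated, bias_correction
```

The truncated term bounds the variance of the off-policy gradient, while the bias-correction term (weighted by the current policy's probabilities in the full gradient) compensates for what the truncation removes.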
This repository contains the discrete implementation of ACER here and of A3C here.
Although continuous implementations are also provided here for ACER and here for A3C, they have not been tested yet; they will be added to the current work once they are suitably debugged and validated.
Number of parallel agents = 8.
The x-axis corresponds to the episode number.
(Plots: ACER's Running Episode Reward and Running Episode Length for each environment.)
Number of parallel agents = 2.
The x-axis corresponds to the episode number.
(Plots: Running Episode Reward and Running Episode Length for ACER (left) and Recurrent A3C (right).)
- The sample efficiency promised by ACER is evident: as the left plot shows, a score of ≅21 is reached in about 600 episodes vs. roughly 1.7k episodes for the Recurrent A3C on the right.
Parameter | Value |
---|---|
lr | 1e-4 |
entropy coefficient | 0.001 |
gamma | 0.99 |
k (rollout length) | 20 |
total memory size (Aggregation of all parallel agents' replay buffers) | 6e+5 |
per agent replay memory size | 6e+5 // (number of agents * rollout length) |
c (used in truncated importance sampling) | 10 |
δ (delta used in trust-region computation) | 1 |
replay ratio | 4 |
polyak average coefficient | 0.01 (= 1 - 0.99) |
critic loss coefficient | 0.5 |
max grad norm | 40 |
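Two of the entries above can be made concrete with a short sketch (illustrative only; names below are not the repository's, and the per-agent buffer is assumed to store k-step rollouts):

```python
import torch.nn as nn

# Per-agent replay memory size, following the formula in the table
# (8 parallel agents, rollout length k = 20):
total_memory_size = 600_000
per_agent_memory_size = total_memory_size // (8 * 20)  # = 3750 entries per agent

# Polyak averaging with coefficient 0.01: after each update, softly move
# ACER's average policy network toward the online network.
def soft_update(avg_net: nn.Module, online_net: nn.Module, tau: float = 0.01):
    for avg_param, param in zip(avg_net.parameters(), online_net.parameters()):
        avg_param.data.mul_(1.0 - tau).add_(tau * param.data)
```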
- PyYAML == 5.4.1
- cronyo == 0.4.5
- gym == 0.17.3
- numpy == 1.19.2
- opencv_contrib_python == 4.4.0.44
- psutil == 5.5.1
- torch == 1.6.0
pip3 install -r requirements.txt
usage: main.py [-h] [--env_name ENV_NAME] [--interval INTERVAL] [--do_train]
[--train_from_scratch] [--seed SEED]
Variable parameters based on the configuration of the machine or user's choice
optional arguments:
-h, --help show this help message and exit
--env_name ENV_NAME Name of the environment.
--interval INTERVAL The interval specifies how often different parameters
should be saved and printed, counted by episodes.
--do_train The flag determines whether to train the agent or play with it.
--train_from_scratch The flag determines whether to train from scratch or continue previous tries.
--seed SEED The randomness' seed for torch, numpy, random & gym[env].
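For reference, an argument parser consistent with the help text above could be set up roughly like this (a sketch only; the defaults and store_true behaviour shown here are placeholders, not necessarily what the repository uses):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Variable parameters based on the configuration of the machine or user's choice")
parser.add_argument("--env_name", type=str, default="PongNoFrameskip-v4",
                    help="Name of the environment.")
parser.add_argument("--interval", type=int, default=100,
                    help="How often parameters are saved and printed, counted by episodes.")
parser.add_argument("--do_train", action="store_true",
                    help="Train the agent instead of playing with it.")
parser.add_argument("--train_from_scratch", action="store_true",
                    help="Train from scratch or continue previous tries.")
parser.add_argument("--seed", type=int, default=123,
                    help="Randomness seed for torch, numpy, random & gym[env].")
args = parser.parse_args()
```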
- In order to train the agent with the default arguments, execute the following command with the `--do_train` flag; otherwise the agent will be tested (you may change the environment and random seed as you wish):
python3 main.py --do_train --env_name="PongNoFrameskip-v4" --interval=200
- If you want to keep training from your previous run, execute the following command (i.e., add the `--train_from_scratch` flag):
python3 main.py --do_train --env_name="PongNoFrameskip-v4" --interval=200 --train_from_scratch
- Pre-trained weights of the agents shown playing in the Results section are provided. If you want to test them yourself, please do the following:
- First, extract your desired weights from the `.tar.xz` archive to get the `.pth` file, then rename the `<env_name>_net_weights.pth` file to `net_weights.pth`. For example: `Breakout_net_weights.pth` -> `net_weights.pth`.
- Create a folder named Models in the root directory of the project and make sure it is empty.
- Create another folder with an arbitrary name inside the Models folder. For example:
mkdir Models/ Models/temp_folder
- Put your `net_weights.pth` file in your temp_folder.
- Run the above command without the `--do_train` flag:
python3 main.py --env_name="PongNoFrameskip-v4"
- All runs with 8 parallel agents were carried out on paperspace.com [Free-GPU, 8 Cores, 30 GB RAM].
- All runs with 2 parallel agents were carried out on Google Colab [CPU Runtime, 2 Cores, 12 GB RAM].
- PongNoFrameskip-v4
- BreakoutNoFrameskip-v4
- SpaceInvadersNoFrameskip-v4
- AssaultNoFrameskip-v4
- Verify and add results of the Continuous version of ACER
- Verify and add results of the Continuous version of A3C
.
├── Agent
│ ├── __init__.py
│ ├── memory.py
│ └── worker.py
├── LICENSE
├── main.py
├── NN
│ ├── __init__.py
│ ├── model.py
│ └── shared_optimizer.py
├── Pre-Trained Weights
│ ├── Breakout_net_weights.tar.xz
│ ├── PongACER_net_weights.tar.xz
│ ├── PongRecurrentA3C_net_weights.tar.xz
│ └── SpaceInvaders_net_weights.tar.xz
├── Readme files
│ ├── Gifs
│ │ ├── Breakout.gif
│ │ ├── PongACER.gif
│ │ ├── PongRecurrentA3C.gif
│ │ └── SpaceInvaders.gif
│ └── Plots
│ ├── Breakout_ep_len.png
│ ├── Breakout_reward.png
│ ├── PongACER_ep_len.png
│ ├── PongACER_reward.png
│ ├── PongRecurrentA3C_ep_len.png
│ ├── PongRecurrentA3C_reward.png
│ ├── SpaceInvaders_ep_len.png
│ └── SpaceInvaders_reward.png
├── README.md
├── requirements.txt
├── training_configs.yml
└── Utils
├── atari_wrappers.py
├── __init__.py
├── logger.py
├── play.py
└── utils.py
- The Agent package includes agent-specific components such as its memory and thread-based worker functions.
- The NN package includes the neural network's structure and its optimizer settings.
- Utils includes utility code common to most RL projects that handles auxiliary tasks such as logging and wrapping the Atari environments.
- Pre-Trained Weights is the directory where the pre-trained weights are stored.
- The GIFs and plot images used in this README are located in the Readme files directory.
- Sample Efficient Actor-Critic with Experience Replay, Wang et al., 2016
- Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., 2016
- OpenAI Baselines: ACKTR & A2C
The current code was inspired by the following implementations, especially the first one: