This repository contains the code from my MSc thesis. While it can be used to reproduce my results, the main purpose of making it available is to provide a starting point for people wanting to train a DDPG agent in a simulation environment, in this case Webots.
The agent is a standard e-puck robot and the environment is a 4 x 4 meter arena with 10 randomly placed obstacles. The task is to learn mapless navigation to the target location.
The agent is an e-puck robot with 8 proximity sensors, compass and GPS. It can be found here.
The agent uses a DDPG implementation based on the one by Nimish Sanghi in his helpful book Deep Reinforcement Learning with Python.
The original is written for OpenAI Gym; here it has been modified to operate with Webots, and several extensions have been added. These include:
- A prioritised replay buffer to improve learning speed.
- Use of the SELU activation function for one layer in the actor and critic networks for its regularising properties.
- Parameterised sigmoid activation function for the actor output layer to stabilise the action output values.
- Implemented from the paper by Hossney et al.
- Addition of n-step returns to further stabilise learning.
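As an illustration of the last extension, the sketch below shows one common way to collapse a short window of consecutive transitions into a single n-step transition before it is stored in the replay buffer. It is a minimal sketch and not the code used in this repository: the function name `n_step_transition`, the tuple layout `(state, action, reward, next_state, done)` and the value of `gamma` are all assumptions made for the example.

```python
from collections import deque


def n_step_transition(window, gamma):
    """Collapse consecutive (state, action, reward, next_state, done) tuples
    into one n-step transition (hypothetical helper, illustration only)."""
    state, action = window[0][0], window[0][1]
    next_state, done = window[-1][3], window[-1][4]
    n_step_reward = 0.0
    for k, (_, _, reward, nxt, terminal) in enumerate(window):
        # Discount each reward by how many steps it lies in the future.
        n_step_reward += (gamma ** k) * reward
        if terminal:
            # Stop accumulating at the end of the episode.
            next_state, done = nxt, True
            break
    return state, action, n_step_reward, next_state, done


# Usage: a sliding window of the 3 most recent transitions (n = 3).
window = deque(maxlen=3)
for t in range(3):
    window.append((t, 0.0, 1.0, t + 1, False))
transition = n_step_transition(list(window), gamma=0.5)
```

With n-step transitions, the critic target bootstraps from the state n steps ahead, so the discount applied to the bootstrapped Q-value is `gamma ** n` rather than `gamma`.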
Exploration is an important topic within DDPG. The original paper uses Ornstein-Uhlenbeck noise, but this is not more effective than simple Gaussian noise in this case. It should be noted that the noise is not related to environment exploration, but to actuator (wheel drive) exploration.
Initial environment exploration is achieved by providing random forward motion to the robot, thus gathering experiences for the agent to learn from. After this, the standard DDPG exploration method of adding noise to the action output is used to refine the policy further over time.
It is strongly recommended that you use this method of actuator exploration, and that you pair it with a form of environment exploration suited to your own training environment.
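The two exploration phases described above can be sketched as follows. This is an illustrative sketch only, not the repository's code: it assumes wheel commands normalised to the range [-1, 1], and the names `warmup_steps` and `sigma` and their default values are hypothetical.

```python
import random


def warmup_action(rng):
    """Environment exploration: random forward motion on both wheels."""
    return [rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)]


def noisy_action(policy_action, sigma, rng):
    """Actuator exploration: add Gaussian noise, then clip to [-1, 1]."""
    return [min(1.0, max(-1.0, a + rng.gauss(0.0, sigma)))
            for a in policy_action]


def select_action(step, policy_action, rng, warmup_steps=1000, sigma=0.1):
    """Use random forward motion first, then the noisy policy output."""
    if step < warmup_steps:
        return warmup_action(rng)
    return noisy_action(policy_action, sigma, rng)


# Usage: early steps ignore the policy, later steps perturb it.
rng = random.Random(0)
early = select_action(0, [0.0, 0.0], rng)
late = select_action(5000, [0.99, -0.99], rng, sigma=0.5)
```

Clipping after the noise is added keeps the command inside the range the wheel motors accept, whatever noise scale is chosen.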
In this case, the training environment is very simple. The agent is placed in a 4 x 4 meter arena with 10 randomly placed obstacles and is tasked with navigating to a randomly placed target location. It is rewarded for moving towards the target and penalised for colliding with obstacles. The episode ends when the agent reaches the target or runs out of time. Note that the episode is not reset on collision with an obstacle, since doing so prevents effective learning in this case.
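A reward and termination rule of the kind described above might look like the sketch below. The actual reward shaping used in the supervisor script is not reproduced here; every constant (`progress_scale`, `collision_penalty`, `goal_bonus`, `goal_radius`, `max_steps`) is a hypothetical placeholder chosen only to make the example runnable.

```python
def step_outcome(prev_dist, curr_dist, collided, step, max_steps=500,
                 goal_radius=0.1, progress_scale=10.0,
                 collision_penalty=1.0, goal_bonus=10.0):
    """Return (reward, done, reached) for one environment step.

    prev_dist and curr_dist are distances to the target in metres.
    """
    # Reward progress towards the target (negative when moving away).
    reward = progress_scale * (prev_dist - curr_dist)
    if collided:
        # Penalise the collision, but do NOT end the episode.
        reward -= collision_penalty
    reached = curr_dist <= goal_radius
    if reached:
        reward += goal_bonus
    timed_out = step + 1 >= max_steps
    # Only reaching the target or running out of time ends the episode.
    done = reached or timed_out
    return reward, done, reached


# Usage: a collision costs reward but the episode carries on.
reward, done, reached = step_outcome(1.0, 1.0, collided=True, step=0)
```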
As shown below, the agent starts in the bottom left and must reach the randomly placed orange target (top right). In navigating to the target, it must avoid the obstacles.
The target location placement during training and testing is randomised and covers the area quite evenly as shown here.
The agent starts in the bottom-left, so the target is placed such that some navigation is required to reach it. From a practical perspective, arranging the training so that the target is always in the top right, with obstacles between the start and target locations, will not work, since the agent does not see enough of the region between the obstacles and the target to learn what to do there. The random placement operates as a form of Curriculum Learning, allowing the agent to collect experiences of the full journey to the target.
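One simple way to place a target at random while still requiring some navigation is rejection sampling, sketched below. This is an assumption about how such placement could be done, not the repository's implementation: the arena is assumed to be centred on the origin, and `START`, `min_dist` and `margin` are hypothetical values.

```python
import math
import random

ARENA_SIZE = 4.0       # metres, from the 4 x 4 arena described above
START = (-1.8, -1.8)   # hypothetical bottom-left start position


def sample_target(rng, min_dist=1.0, margin=0.2, max_tries=1000):
    """Sample a target uniformly in the arena, at least min_dist from START."""
    half = ARENA_SIZE / 2.0 - margin  # keep the target away from the walls
    for _ in range(max_tries):
        x = rng.uniform(-half, half)
        y = rng.uniform(-half, half)
        if math.hypot(x - START[0], y - START[1]) >= min_dist:
            return x, y
    raise RuntimeError("could not place a target; relax min_dist or margin")


# Usage: draw a batch of targets to inspect their spread.
rng = random.Random(1)
targets = [sample_target(rng) for _ in range(100)]
```

Because near and far targets are both sampled, the agent sees short, easy journeys as well as long ones, which is the curriculum effect described above.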
Install Webots and set it up with your IDE of choice like this. I use PyCharm, but it should work with most IDEs.
Create a new environment and install the requirements from requirements.txt.
pip install numpy notebook matplotlib scikit-learn pandas
Install PyTorch from here.
Clone this repository and open it in your IDE.
git clone
Open Webots and load the world file e-puck_mapless_nav_con1.wbt
Hyperparameters are set in two places: the supervisor script and the hyperparameters file.
The agent is run through Webots and will offer the following options:
- 't' to train a policy
- 'r' to run a policy
- 'i' to measure interference while training a policy
- 'q' to quit
During the run, data will be saved to the sqlite3 database in the logs directory.
To view performance, access the database in the normal way. An example is given in the Jupyter notebook provided here.
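As a rough illustration of such a query, the sketch below sums the step rewards per episode for one run using Python's built-in `sqlite3` module. It uses only the `agent_data` columns listed in the schema in this README and fills an in-memory database with made-up rows so that it runs on its own. The helper name `episode_summary` and the run id `"demo"` are hypothetical; for real data you would connect to the database file in the logs directory instead (assumed here to be `logs/data_logs.db`).

```python
import sqlite3


def episode_summary(conn, run_id):
    """Return (episode, total_reward, goal_achieved) rows for one run."""
    query = """
        SELECT episode,
               SUM(step_reward)   AS total_reward,
               MAX(goal_achieved) AS goal_achieved
        FROM agent_data
        WHERE run_id = ?
        GROUP BY episode
        ORDER BY episode
    """
    return conn.execute(query, (run_id,)).fetchall()


# Demo on an in-memory database with invented rows (not real results).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_data (run_id TEXT, episode INTEGER, "
    "step_reward FLOAT, episode_end BOOL, goal_achieved BOOL, run_type TEXT)"
)
conn.executemany(
    "INSERT INTO agent_data VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("demo", 1, 0.5, 0, 0, "train"),
        ("demo", 1, -1.0, 1, 0, "train"),
        ("demo", 2, 0.25, 0, 0, "train"),
        ("demo", 2, 2.0, 1, 1, "train"),
    ],
)
summary = episode_summary(conn, "demo")
```

The resulting rows can be passed straight to pandas or matplotlib, both of which are installed with the requirements, to plot reward per episode.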
When training an agent, the weights and replay buffer will be stored in the 'models' and 'replays' directories. To run a pre-trained agent, place the weights and replay you want to load into the respective 'load' directories. The agent will load the most recently modified file in these directories.
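Selecting the most recently modified file in a directory can be done along the lines below. This is a minimal sketch of the idea, not the loader used in this repository; the function name `latest_file` and the file names in the demo are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def latest_file(directory):
    """Return the most recently modified file in directory, or None if empty."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


# Demo in a temporary directory with two files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old_weights.pt"
    new = Path(tmp) / "new_weights.pt"
    old.write_bytes(b"")
    new.write_bytes(b"")
    os.utime(old, (1_000, 1_000))  # older modification time
    os.utime(new, (2_000, 2_000))  # newer modification time
    newest_name = latest_file(tmp).name
```

Because selection depends on modification time rather than file name, copying a file into a 'load' directory is enough to make it the one that is picked up, provided the copy updates the timestamp.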
Two databases are used. The main database is data_logs.db and has the following schema:
sqlite> .schema
CREATE TABLE agent_data (
run_id TEXT,
episode INTEGER,
step_reward FLOAT,
episode_end BOOL,
goal_achieved BOOL,
run_type TEXT,
CREATE TABLE optimise_data (
run_id TEXT,
q_loss FLOAT,
a_loss FLOAT,
inter FLOAT,
q_val FLOAT,
episode INTEGER,
The second database is an application log and has the following schema:
sqlite> .schema
CREATE TABLE app_log (
level TEXT,
function TEXT,
message TEXT);