This project is based on source code offered by the course: deep-reinforcement-learning/ddpg-bipedal. I've extended the code to run with multiple agents, so I can use the second version of the environment with 20 agents.
I chose the DDPG (Deep Deterministic Policy Gradients) algorithm because it can handle continuous action spaces, which this environment requires, and it seemed easier than discretization (see Chapter 1 of the course). Continuous spaces make it more difficult to train an agent because the action space becomes high-dimensional. In contrast, DQN solves problems with high-dimensional observation spaces, but it can only handle discrete and low-dimensional action spaces. A neural network approximates these values in a convenient way. The algorithm also benefits from two separate neural networks (actor and critic): the target networks are only updated every second training step (see hyperparameters for details), with a soft-update TAU of 1e-3.
For more details see Continuous Control with Deep Reinforcement Learning
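The soft update itself is the standard DDPG blending of local weights into the target network; below is a minimal sketch, assuming `local_model` and `target_model` are PyTorch `nn.Module` instances as in the course template.

```python
def soft_update(local_model, target_model, tau=1e-3):
    """Blend local weights into the target network:
    theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```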
Source: Unity-Technologies/ml-agents
- Set-up: Double-jointed arm which can move to target locations.
- Goal: The agent must move its hand to the goal location, and keep it there.
- Agents: The environment contains 10 agents linked to a single Brain.
- Agent Reward Function (independent):
- +0.1 Each step agent's hand is in goal location.
- Brains: One Brain with the following observation/action space.
- Vector Observation space: 26 variables corresponding to position, rotation, velocity, and angular velocities of the two arm Rigidbodies.
- Vector Action space: (Continuous) Size of 4, corresponding to torque applicable to two joints.
- Visual Observations: None.
- Reset Parameters: Two, corresponding to goal size, and goal movement speed.
- Benchmark Mean Reward: 30
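For reference, the sizes above can be read directly from the environment; the short sketch below assumes the Udacity `unityagents` wrapper, and the `Reacher.app` path is only a placeholder.

```python
from unityagents import UnityEnvironment

# The binary path is a placeholder; point it at your local Reacher build.
env = UnityEnvironment(file_name='Reacher.app')

brain_name = env.brain_names[0]                      # a single Brain controls all agents
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)                    # 1 or 20, depending on the environment version
action_size = brain.vector_action_space_size         # 4 continuous torque values
state_size = env_info.vector_observations.shape[1]   # size of the observation vector per agent
print(num_agents, state_size, action_size)
```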
The program is structured in the following way:
- `Continuous_Control.ipynb` - Notebook with additional information about the program structure
- `ddpg_agent.py` - Defines the Agent class, which is used in the notebook
- `ddpg_model.py` - Defines the actor and critic neural networks for the local and target networks
- `README.md` - Instructions on how to set up this repository and start the agent
- `Report.md` - This file
- `rewards_plot.png` - Plot of the rewards per episode for the best training
- `solved_model_weights.pth` - Saved model weights (PyTorch)
- `unity-environment.log` - Autogenerated log file from the Unity environment (not written by me)
The base function for DDPG was copied from ddpg-bipedal to get the main loops running. I changed the environment setup according to my first project. Then I extended the code to work with more than one agent, depending on the number of agents the environment provides. When I ran the code, too little information about the training was provided, so I extended the code to print some information about the scores per timestep.
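The extended multi-agent loop looks roughly like the sketch below; the `agent` methods and the exact print format are simplified assumptions rather than the literal code in `ddpg_agent.py`.

```python
import numpy as np

def run_episode(env, agent, brain_name, num_agents, max_t=1000):
    """One training episode: all agents share a single DDPG agent and replay buffer."""
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    agent.reset()
    for t in range(max_t):
        actions = agent.act(states)                       # one action vector per agent
        env_info = env.step(actions)[brain_name]
        next_states = env_info.vector_observations
        rewards = env_info.rewards
        dones = env_info.local_done
        for s, a, r, ns, d in zip(states, actions, rewards, next_states, dones):
            agent.step(s, a, r, ns, d)                    # store every agent's experience
        states = next_states
        scores += rewards
        print('\rTimestep {}\tScore: {:.2f}\tmin: {:.2f}\tmax: {:.2f}'.format(
            t, scores.mean(), scores.min(), scores.max()), end='')
        if np.any(dones):
            break
    return scores
```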
After a few training runs I decided to extend the initialization of the Agent to parametrize the needed hyperparameters.
- `BUFFER_SIZE = int(1e6)` (replay buffer size)
- `BATCH_SIZE = 1024` (minibatch size)
- `GAMMA = 0.98` (discount factor)
- `TAU = 1e-3` (for soft update of target parameters)
- `LR_ACTOR = 1e-3` (learning rate of the actor)
- `LR_CRITIC = 1e-4` (learning rate of the critic)
- `WEIGHT_DECAY = 1e-9` (L2 weight decay)
- `MU = 0.` (mean reversion level)
- `THETA = 0.05` (mean reversion speed or mean reversion rate)
- `SIGMA = 0.02` (random factor influence; source: https://de.wikipedia.org/wiki/Ornstein-Uhlenbeck-Prozess)
- `N_TIME_STEPS = 2` (only learn every n time steps)
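To keep the values above tunable from the notebook, the Agent constructor simply accepts them as arguments; this is a hedged sketch of the signature, not the exact code in `ddpg_agent.py`.

```python
class Agent:
    """DDPG agent with hyperparameters passed in instead of hard-coded module constants."""

    def __init__(self, state_size, action_size, random_seed,
                 buffer_size=int(1e6), batch_size=1024, gamma=0.98, tau=1e-3,
                 lr_actor=1e-3, lr_critic=1e-4, weight_decay=1e-9,
                 mu=0.0, theta=0.05, sigma=0.02, n_time_steps=2):
        self.state_size = state_size
        self.action_size = action_size
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.n_time_steps = n_time_steps
        self.t_step = 0
        # In the real implementation the actor/critic networks, their optimizers
        # (lr_actor, lr_critic, weight_decay), the Ornstein-Uhlenbeck noise
        # (mu, theta, sigma) and the replay buffer (buffer_size) are built here.
```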
Actor: contains 2 fully connected layers (256, 128) with leaky ReLU activations.
Critic: contains 3 fully connected layers (256, 256, 128) with leaky ReLU activations. The network builds a critic (value) network that maps (state, action) pairs to Q-values. See `ddpg_model.py` --> `class Critic(nn.Module)` for more details.
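A hedged sketch of that architecture in PyTorch: the layer widths follow the description above, while details such as weight initialization and where the action enters the critic are simplified relative to `ddpg_model.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps states to deterministic actions in [-1, 1]."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps (state, action) pairs to Q-values; the action joins after the first layer."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256 + action_size, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(torch.cat((x, action), dim=1)))
        x = F.leaky_relu(self.fc3(x))
        return self.fc4(x)
```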
First I started with the same parameters as for the bipedal environment. After two episodes I had horrible results (around 0.5 points), so I changed the setup in the following way. At first the learning seemed very slow and the scores were fluctuating, so I checked the NN model and extended the actor model by a layer. The improvement of the score values seemed more stable after this, but still quite slow. Every update at episode end seemed to reduce the score value. Increasing the timesteps reduced this problem, but did not stop it.
Then I raised the update rate, in the hope that the learning would speed up. This was the case, but the updates for each episode threw the scores back again. So I disabled the soft update (setting TAU = 1) to see if this would stop the throwbacks.
It partially stopped the throwbacks, but after a few steps the scores broke down again. So it seems that a mixture of frequent updates to the target network could help, thus I implemented this as a function.
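For reference, setting TAU = 1 as tried above turns the soft update sketched earlier into a plain hard copy of the local weights:

```python
def hard_update(local_model, target_model):
    """With TAU = 1 the soft update collapses to a hard copy:
    theta_target = 1 * theta_local + (1 - 1) * theta_target = theta_local."""
    target_model.load_state_dict(local_model.state_dict())
```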
The performance with 20 agents was terrible on my local server (i7 with 12 cores, 1x GTX 1080 Ti), so I switched to the single-agent environment. The speed was faster with this environment; CPU utilization was around 10-15%, GPU around 20%.
-- New beginning --
In the last trainings I could not get the agent to learn anything. For a few episodes the agent learned a little, after that it seemed to forget again. Thus I reset every hyperparameter back to its default and set `weight_decay` down to 0. The learning speed increased. I raised the `batch_size` to speed up the training, which had less effect. Then I lowered the noise (`theta` and `sigma`), which finally gave me a stable training. So it seems that reducing the weight decay (which appears to be the L2 regularization in PyTorch) and the artificial noise did the job.
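Lowering `theta` and `sigma` makes the Ornstein-Uhlenbeck exploration noise gentler; below is a minimal sketch of such a noise process using numpy (the class in `ddpg_agent.py` may differ in details).

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process used to perturb the actor's actions during training."""
    def __init__(self, size, mu=0.0, theta=0.05, sigma=0.02, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Start each episode at the mean reversion level."""
        self.state = self.mu.copy()

    def sample(self):
        """dx = theta * (mu - x) + sigma * N(0, 1); smaller theta/sigma -> gentler noise."""
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * self.rng.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state
```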
After about 100 episodes the training seemed to overfit, so first I added a small `weight_decay` again (1e-9) and reduced `gamma` to 0.98. To test and play around with the stability of the network, I implemented `N_TIME_STEPS` and set it to 2. With these settings the algorithm reached a score of 31.76 after episode 6 and stayed stable until episode 100. It seems that some overfitting still takes place and could be tuned further.
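`N_TIME_STEPS` simply gates how often a learning update is triggered; a hedged sketch of the idea is shown below (the `memory` and `learn` members are assumptions, not the literal code).

```python
# Inside the Agent class sketched earlier:
def step(self, state, action, reward, next_state, done):
    """Store the experience, but only run a learning update every n_time_steps calls."""
    self.memory.add(state, action, reward, next_state, done)
    self.t_step = (self.t_step + 1) % self.n_time_steps
    if self.t_step == 0 and len(self.memory) > self.batch_size:
        experiences = self.memory.sample()
        self.learn(experiences, self.gamma)  # learn() ends with the soft target update
```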
See rewards_plot.png, included in this repository.

    Episode 1 Timestep 999 Score: 0.45 min: 0.00 max: 1.38
    Episode 2 Timestep 999 Score: 1.61 min: 0.83 max: 3.08
    Episode 3 Timestep 999 Score: 4.88 min: 1.34 max: 9.10
    Episode 4 Timestep 999 Score: 13.64 min: 4.85 max: 25.60
    Episode 5 Timestep 999 Score: 24.77 min: 16.52 max: 35.08
    Episode 6 Timestep 999 Score: 31.76 min: 23.56 max: 35.89
    ...
    Episode 100 Timestep 999 Score: 35.46 min: 26.03 max: 39.52
    Episode 100 Average Score: 35.94 Score: 35.46
    Environment solved in 0 episodes! Average Score: 35.94
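
The "solved in 0 episodes" line is simply the episode counter minus the 100-episode averaging window; the hedged sketch below shows that check, with `run_episode` standing in for one real training episode.

```python
from collections import deque
import numpy as np

def run_episode():
    """Hypothetical stand-in for one training episode; returns the mean agent score."""
    return np.random.uniform(25, 40)

n_episodes = 200
scores_window = deque(maxlen=100)   # rolling window of the last 100 episode scores

for i_episode in range(1, n_episodes + 1):
    scores_window.append(run_episode())
    if i_episode >= 100 and np.mean(scores_window) >= 30.0:
        # At i_episode = 100 this reports "solved in 0 episodes".
        print('\nEnvironment solved in {} episodes!\tAverage Score: {:.2f}'.format(
            i_episode - 100, np.mean(scores_window)))
        break
```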
- Discard this version of the code. I will use DeepRL/examples.py at master · ShangtongZhang/DeepRL and change PyTorch to TensorFlow. This code's structure is fantastic and should be easy to extend with different environments.
- Check parallelization for the environment. The 20-agent version does not use all of the 12 CPUs; only one CPU runs at 98% and the GPU at 20%.
- Check the Rainbow and PPO algorithms (Rainbow: Combining Improvements in Deep Reinforcement Learning)
- Only update the target network if the current episode score is at least as good as the previous episode's (see the sketch after this list)
- Tune against overfitting a bit more, so the score won't decrease over time
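The conditional target update from the list above could look roughly like the following sketch; it reuses the `soft_update` helper sketched earlier, and the attribute names are assumptions, since this is not implemented in this repository.

```python
def maybe_update_targets(agent, episode_score, previous_score, tau=1e-3):
    """Only blend local weights into the target networks if the score did not drop."""
    if episode_score >= previous_score:
        soft_update(agent.actor_local, agent.actor_target, tau)
        soft_update(agent.critic_local, agent.critic_target, tau)
```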