This project is based on source code offered by the course: deep-reinforcement-learning/ddpg-bipedal. I've extended the code to run with multiple agents, so I can use the second version of the environment with 20 agents.
I chose the DDPG (Deep Deterministic Policy Gradients) algorithm because it can handle continuous action spaces, which this environment requires, and it seemed easier than discretization (see Chapter 1 of the course). Continuous spaces make it more difficult to train an agent because the action space becomes high-dimensional. In contrast, DQN solves problems with high-dimensional observation spaces, but it can only handle discrete and low-dimensional action spaces. A neural network approximates these values in a convenient way. The algorithm also benefits from two separate neural networks (actor and critic): the target networks are only updated every second training step (see hyperparameters for details), with a soft-update TAU of 1e-3.
For more details see Continuous Control with Deep Reinforcement Learning
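The soft update itself is the standard DDPG blending of local weights into the target network; below is a minimal sketch, assuming `local_model` and `target_model` are PyTorch `nn.Module` instances as in the course template.

```python
def soft_update(local_model, target_model, tau=1e-3):
    """Blend local weights into the target network:
    theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```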
Source: Unity-Technologies/ml-agents
- Set-up: Double-jointed arm which can move to target locations.
- Goal: The agent must move its hand to the goal location, and keep it there.
- Agents: The environment contains 10 agents linked to a single Brain.
- Agent Reward Function (independent):
- +0.1 Each step agent's hand is in goal location.
- Brains: One Brain with the following observation/action space.
- Vector Observation space: 26 variables corresponding to position, rotation, velocity, and angular velocities of the two arm Rigidbodies.
- Vector Action space: (Continuous) Size of 4, corresponding to torque applicable to two joints.
- Visual Observations: None.
- Reset Parameters: Two, corresponding to goal size, and goal movement speed.
- Benchmark Mean Reward: 30
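For reference, the sizes above can be read directly from the environment; the short sketch below assumes the Udacity `unityagents` wrapper, and the `Reacher.app` path is only a placeholder.

```python
from unityagents import UnityEnvironment

# The binary path is a placeholder; point it at your local Reacher build.
env = UnityEnvironment(file_name='Reacher.app')

brain_name = env.brain_names[0]                      # a single Brain controls all agents
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)                    # 1 or 20, depending on the environment version
action_size = brain.vector_action_space_size         # 4 continuous torque values
state_size = env_info.vector_observations.shape[1]   # size of the observation vector per agent
print(num_agents, state_size, action_size)
```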
The program is structured in the following way:
- `Continuous_Control.ipynb` - Notebook with additional information about the program structure
- `ddpg_agent.py` - Defines the Agent class, which is used in the notebook
- `ddpg_model.py` - Defines the actor and critic neural networks for the local and target networks
- `README.md` - Instructions on how to set up this repository and start the agent
- `Report.md` - This file
- `rewards_plot.png` - Plot of the rewards per episode for the best training
- `solved_model_weights.pth` - Saved model weights (PyTorch)
- `unity-environment.log` - Autogenerated log file from the Unity environment (not written by me)
The base function for DDPG was copied from ddpg-bipedal to get the main loops running. I changed the environment setup according to my first project. Then I extended the code to work with more than one agent, depending on the number of agents the environment provides. When I ran the code, too little information about the training was provided, so I extended the code to print some information about the scores per timestep.
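The extended multi-agent loop looks roughly like the sketch below; the `agent` methods and the exact print format are simplified assumptions rather than the literal code in `ddpg_agent.py`.

```python
import numpy as np

def run_episode(env, agent, brain_name, num_agents, max_t=1000):
    """One training episode: all agents share a single DDPG agent and replay buffer."""
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)
    agent.reset()
    for t in range(max_t):
        actions = agent.act(states)                       # one action vector per agent
        env_info = env.step(actions)[brain_name]
        next_states = env_info.vector_observations
        rewards = env_info.rewards
        dones = env_info.local_done
        for s, a, r, ns, d in zip(states, actions, rewards, next_states, dones):
            agent.step(s, a, r, ns, d)                    # store every agent's experience
        states = next_states
        scores += rewards
        print('\rTimestep {}\tScore: {:.2f}\tmin: {:.2f}\tmax: {:.2f}'.format(
            t, scores.mean(), scores.min(), scores.max()), end='')
        if np.any(dones):
            break
    return scores
```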
After a few training runs I decided to extend the initialization of the Agent to parametrize the needed hyperparameters.
- `BUFFER_SIZE = int(1e6)` (replay buffer size)
- `BATCH_SIZE = 1024` (minibatch size)
- `GAMMA = 0.98` (discount factor)
- `TAU = 1e-3` (for soft update of target parameters)
- `LR_ACTOR = 1e-3` (learning rate of the actor)
- `LR_CRITIC = 1e-4` (learning rate of the critic)
- `WEIGHT_DECAY = 1e-9` (L2 weight decay)
- `MU = 0.` (mean reversion level)
- `THETA = 0.05` (mean reversion speed or mean reversion rate)
- `SIGMA = 0.02` (random factor influence; source: https://de.wikipedia.org/wiki/Ornstein-Uhlenbeck-Prozess)
- `N_TIME_STEPS = 2` (only learn every n time steps)
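To keep the values above tunable from the notebook, the Agent constructor simply accepts them as arguments; this is a hedged sketch of the signature, not the exact code in `ddpg_agent.py`.

```python
class Agent:
    """DDPG agent with hyperparameters passed in instead of hard-coded module constants."""

    def __init__(self, state_size, action_size, random_seed,
                 buffer_size=int(1e6), batch_size=1024, gamma=0.98, tau=1e-3,
                 lr_actor=1e-3, lr_critic=1e-4, weight_decay=1e-9,
                 mu=0.0, theta=0.05, sigma=0.02, n_time_steps=2):
        self.state_size = state_size
        self.action_size = action_size
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.n_time_steps = n_time_steps
        self.t_step = 0
        # In the real implementation the actor/critic networks, their optimizers
        # (lr_actor, lr_critic, weight_decay), the Ornstein-Uhlenbeck noise
        # (mu, theta, sigma) and the replay buffer (buffer_size) are built here.
```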
Actor: contains 2 fully connected layers (256, 128) with leaky ReLU activations.
Critic: contains 3 fully connected layers (256, 256, 128) with leaky ReLU activations. The network builds a critic (value) network that maps (state, action) pairs to Q-values. See `ddpg_model.py` --> `class Critic(nn.Module)` for more details.
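A hedged sketch of that architecture in PyTorch: the layer widths follow the description above, while details such as weight initialization and where the action enters the critic are simplified relative to `ddpg_model.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps states to deterministic actions in [-1, 1]."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps (state, action) pairs to Q-values; the action joins after the first layer."""
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256 + action_size, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(torch.cat((x, action), dim=1)))
        x = F.leaky_relu(self.fc3(x))
        return self.fc4(x)
```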
First I started with the same parameters as for the bipedal environment. After two episodes I had horrible results (around 0.5 points), so I changed the setup in the following way. At first the learning seemed very slow and the scores were fluctuating, so I checked the NN model and extended the actor model by a layer. The improvement of the score values seemed more stable after this, but still quite slow. Every update at episode end seemed to reduce the score value. Increasing the timesteps reduced this problem, but did not stop it.
Then I raised the update rate, in the hope that the learning would speed up. This was the case, but the updates for each episode threw the scores back again. So I disabled the soft update (setting TAU = 1) to see if this would stop the throwbacks.
It partially stopped the throwbacks, but after a few steps the scores broke down again. So it seems that a mixture of frequent updates to the target network could help, thus I implemented this as a function.
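For reference, setting TAU = 1 as tried above turns the soft update sketched earlier into a plain hard copy of the local weights:

```python
def hard_update(local_model, target_model):
    """With TAU = 1 the soft update collapses to a hard copy:
    theta_target = 1 * theta_local + (1 - 1) * theta_target = theta_local."""
    target_model.load_state_dict(local_model.state_dict())
```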
The performance with 20 agents was terrible on my local server (i7 with 12 cores, 1x GTX 1080 Ti), so I switched to the single-agent environment. The speed was faster with this environment; CPU utilization was around 10-15%, GPU around 20%.
-- New beginning --
In the last trainings I could not get the agent to learn anything. For a few episodes the agent learned a little, after that it seemed to forget again. Thus I reset every hyperparameter back to its default and set `weight_decay` down to 0. The learning speed increased. I raised the `batch_size` to speed up the training, which had less effect. Then I lowered the noise (`theta` and `sigma`), which finally gave me a stable training. So it seems that reducing the weight decay (which appears to be the L2 regularization in PyTorch) and the artificial noise did the job.
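Lowering `theta` and `sigma` makes the Ornstein-Uhlenbeck exploration noise gentler; below is a minimal sketch of such a noise process using numpy (the class in `ddpg_agent.py` may differ in details).

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process used to perturb the actor's actions during training."""
    def __init__(self, size, mu=0.0, theta=0.05, sigma=0.02, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Start each episode at the mean reversion level."""
        self.state = self.mu.copy()

    def sample(self):
        """dx = theta * (mu - x) + sigma * N(0, 1); smaller theta/sigma -> gentler noise."""
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * self.rng.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state
```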
After about 100 episodes the training seemed to overfit, so first I added a small `weight_decay` again (1e-9) and reduced `gamma` to 0.98. To test and play around with the stability of the network, I implemented `N_TIME_STEPS` and set it to 2. With these settings the algorithm reached a score of 31.76 after episode 6 and stayed stable until episode 100. It seems that some overfitting still takes place and could be tuned further.
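`N_TIME_STEPS` simply gates how often a learning update is triggered; a hedged sketch of the idea is shown below (the `memory` and `learn` members are assumptions, not the literal code).

```python
# Inside the Agent class sketched earlier:
def step(self, state, action, reward, next_state, done):
    """Store the experience, but only run a learning update every n_time_steps calls."""
    self.memory.add(state, action, reward, next_state, done)
    self.t_step = (self.t_step + 1) % self.n_time_steps
    if self.t_step == 0 and len(self.memory) > self.batch_size:
        experiences = self.memory.sample()
        self.learn(experiences, self.gamma)  # learn() ends with the soft target update
```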
See rewards_plot.png, included in this repository.

    Episode 1 Timestep 999 Score: 0.45 min: 0.00 max: 1.38
    Episode 2 Timestep 999 Score: 1.61 min: 0.83 max: 3.08
    Episode 3 Timestep 999 Score: 4.88 min: 1.34 max: 9.10
    Episode 4 Timestep 999 Score: 13.64 min: 4.85 max: 25.60
    Episode 5 Timestep 999 Score: 24.77 min: 16.52 max: 35.08
    Episode 6 Timestep 999 Score: 31.76 min: 23.56 max: 35.89
    ...
    Episode 100 Timestep 999 Score: 35.46 min: 26.03 max: 39.52
    Episode 100 Average Score: 35.94 Score: 35.46
    Environment solved in 0 episodes! Average Score: 35.94
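
The "solved in 0 episodes" line is simply the episode counter minus the 100-episode averaging window; the hedged sketch below shows that check, with `run_episode` standing in for one real training episode.

```python
from collections import deque
import numpy as np

def run_episode():
    """Hypothetical stand-in for one training episode; returns the mean agent score."""
    return np.random.uniform(25, 40)

n_episodes = 200
scores_window = deque(maxlen=100)   # rolling window of the last 100 episode scores

for i_episode in range(1, n_episodes + 1):
    scores_window.append(run_episode())
    if i_episode >= 100 and np.mean(scores_window) >= 30.0:
        # At i_episode = 100 this reports "solved in 0 episodes".
        print('\nEnvironment solved in {} episodes!\tAverage Score: {:.2f}'.format(
            i_episode - 100, np.mean(scores_window)))
        break
```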
- Discard this version of the code. I will use DeepRL/examples.py at master · ShangtongZhang/DeepRL and change PyTorch to TensorFlow. This code's structure is fantastic and should be easy to extend with different environments.
- Check parallelization for the environment. The 20-agent version does not use all of the 12 CPUs; only one CPU runs at 98% and the GPU at 20%.
- Check the Rainbow and PPO algorithms (Rainbow: Combining Improvements in Deep Reinforcement Learning)
- Only update the target network if the current episode score is at least as good as the previous episode's (see the sketch after this list)
- Tune against overfitting a bit more, so the score won't decrease over time
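The conditional target update from the list above could look roughly like the following sketch; it reuses the `soft_update` helper sketched earlier, and the attribute names are assumptions, since this is not implemented in this repository.

```python
def maybe_update_targets(agent, episode_score, previous_score, tau=1e-3):
    """Only blend local weights into the target networks if the score did not drop."""
    if episode_score >= previous_score:
        soft_update(agent.actor_local, agent.actor_target, tau)
        soft_update(agent.critic_local, agent.critic_target, tau)
```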