This is a Trust Region Policy Optimization (TRPO) implementation for continuous action space systems. This repo uses some methods from .
- Documentation: dependencies, how to save the network
- Experience Replay
- Make the code foolproof: check log folders, save and load folders
- Reward plotter for only one run
- Automatic log generation bash script for the given plots
- Various minimal examples
- --max-iteration-number {int} : the maximum number of episodes
- --batch-size {int} : the batch size of each episode
- --episode-length {int} : the length of each episode (environments may limit this internally, e.g. Pendulum has a length of 200)
- --log : if added, the mean cumulative reward will be logged at the end of training
- --log-dir {string} : the logging directory
- --log-prefix {string} : the name of your log file
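For reference, here is a minimal argparse sketch that mirrors the flags documented above. The default values are illustrative assumptions, not the repository's actual defaults.

```python
import argparse

# Sketch of a parser matching the documented flags.
# Defaults are illustrative assumptions, not the repo's real values.
parser = argparse.ArgumentParser(description="TRPO training")
parser.add_argument("--max-iteration-number", type=int, default=100,
                    help="maximum number of episodes")
parser.add_argument("--batch-size", type=int, default=5000,
                    help="batch size of each episode")
parser.add_argument("--episode-length", type=int, default=200,
                    help="length of each episode (the environment may cap this)")
parser.add_argument("--log", action="store_true",
                    help="log the mean cumulative reward at the end of training")
parser.add_argument("--log-dir", type=str, default="log",
                    help="logging directory")
parser.add_argument("--log-prefix", type=str, default="run",
                    help="name of the log file")
args = parser.parse_args()
```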
If the --log flag is added to the command, the average cumulative reward will be logged automatically at the end of training.
Recommended way of logging:
First, create a log directory:
mkdir log
mkdir log/example
Then run train.py:
python train.py --log --log-dir "log/example"
The log file will appear in the given folder. If you run the same command multiple times, the log code will automatically enumerate the log file names.
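The enumeration could look roughly like the sketch below; the function name and the file extension are assumptions, not the exact code in this repo.

```python
import os

def next_log_path(log_dir, prefix="run", ext=".csv"):
    """Return a log file path that does not clash with earlier runs (sketch).

    Produces prefix_0.csv, prefix_1.csv, ...; the extension is an assumption.
    """
    i = 0
    while os.path.exists(os.path.join(log_dir, f"{prefix}_{i}{ext}")):
        i += 1
    return os.path.join(log_dir, f"{prefix}_{i}{ext}")

# e.g. next_log_path("log/example", prefix="pendulum") -> "log/example/pendulum_0.csv"
```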
- Trust Region Policy Optimization
- High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Towards Generalization and Simplicity in Continuous Control
- Deep Reinforcement Learning that Matters
If there are other good implementations, please let me know so I can add them to the list.
- Bootstrapping works much better than Monte Carlo returns
- Increasing the batch size speeds up learning, but the simulations take too long
- Training the policy and value networks with data from the same time step results in poor learning performance, even if the value training is performed after the policy optimization. Training the value function with the previous batch's data solves the problem. Using more than one previous batch does not improve the results.
- A high value training iteration number results in overfitting, and a low one causes poor learning. Note, however, that these experiments are performed with minibatches of size batch_size/iter, i.e. the minibatch size is not constant. (TODO: add constant minibatch size)
The experiments are performed in the Pendulum-v0 environment.
In this experiment, two different ways of estimating the return are compared:
1- Monte Carlo : the return is calculated from the discounted return of the next state (i.e. the discounted sum of the rewards actually observed)
2- Bootstrap : the return is calculated from the discounted value approximation of the next state
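A minimal sketch of the two estimators on a single trajectory; the function and argument names are illustrative.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted sum of the rewards actually observed after each step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def bootstrap_returns(rewards, next_state_values, gamma=0.99):
    """One-step target: reward plus the discounted value estimate of the next state."""
    return np.asarray(rewards) + gamma * np.asarray(next_state_values)
```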
In this experiment, we train the system with 4 different batch sizes.
In , they used the previous batch's data to train the value function to avoid overfitting; used the previous+current batch. Here, we test different combinations of both to see the difference.
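A sketch of how those data combinations for value-function fitting might be assembled; the batch layout (dicts with "obs" and "returns" arrays) is an assumption.

```python
import numpy as np

def value_training_data(prev_batch, curr_batch, mode="previous"):
    """Select the rollout data used to fit the value network (sketch).

    mode="previous"         : only the previous iteration's batch
    mode="previous+current" : previous and current batches concatenated
    """
    if mode == "previous":
        return prev_batch["obs"], prev_batch["returns"]
    if mode == "previous+current":
        obs = np.concatenate([prev_batch["obs"], curr_batch["obs"]])
        rets = np.concatenate([prev_batch["returns"], curr_batch["returns"]])
        return obs, rets
    raise ValueError(f"unknown mode: {mode}")
```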
We test the number of value training iterations. The experiment is performed with a batch size of 5k, and the minibatch size is 5k/iter_num.
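A sketch of the minibatch split used here: a 5k batch divided into iter_num minibatches, so the minibatch size shrinks as the iteration count grows. Names are illustrative.

```python
import numpy as np

def value_minibatches(obs, returns, iter_num, rng=None):
    """Yield iter_num minibatches of size len(obs) // iter_num (sketch).

    Any remainder after the integer division is dropped.
    """
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(obs))
    mb_size = len(obs) // iter_num  # e.g. 5000 // iter_num
    for i in range(iter_num):
        mb = idx[i * mb_size:(i + 1) * mb_size]
        yield obs[mb], returns[mb]
```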