This is a Trust Region Policy Optimization (TRPO) implementation for continuous action space systems. This repo uses some methods from .
- Documentation: dependencies, how to save the network
- Experience Replay
- Make the code foolproof: check log folders, save and load folders
- Reward plotter for only one run
- Automatic log generation bash script for the given plots
- Various minimal examples
- --max-iteration-number {int} : the maximum number of episodes
- --batch-size {int} : the batch size of each episode
- --episode-length {int} : the length of each episode (environments may limit this internally, e.g. Pendulum has a length of 200)
- --log : if added, the mean cumulative reward will be logged at the end of training
- --log-dir {string} : the logging directory
- --log-prefix {string} : the name of your log file
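For reference, here is a minimal argparse sketch that mirrors the flags documented above. The default values are illustrative assumptions, not the repository's actual defaults.

```python
import argparse

# Sketch of a parser matching the documented flags.
# Defaults are illustrative assumptions, not the repo's real values.
parser = argparse.ArgumentParser(description="TRPO training")
parser.add_argument("--max-iteration-number", type=int, default=100,
                    help="maximum number of episodes")
parser.add_argument("--batch-size", type=int, default=5000,
                    help="batch size of each episode")
parser.add_argument("--episode-length", type=int, default=200,
                    help="length of each episode (the environment may cap this)")
parser.add_argument("--log", action="store_true",
                    help="log the mean cumulative reward at the end of training")
parser.add_argument("--log-dir", type=str, default="log",
                    help="logging directory")
parser.add_argument("--log-prefix", type=str, default="run",
                    help="name of the log file")
args = parser.parse_args()
```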
If the --log flag is added to the command, the average cumulative reward will be logged automatically at the end of training.
Recommended way of logging:
First, create a log directory:
mkdir log
mkdir log/example
Then run train.py:
python train.py --log --log-dir "log/example"
The log file will appear in the given folder. If you run the same command multiple times, the log code will automatically enumerate the log file names.
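The enumeration could look roughly like the sketch below; the function name and the file extension are assumptions, not the exact code in this repo.

```python
import os

def next_log_path(log_dir, prefix="run", ext=".csv"):
    """Return a log file path that does not clash with earlier runs (sketch).

    Produces prefix_0.csv, prefix_1.csv, ...; the extension is an assumption.
    """
    i = 0
    while os.path.exists(os.path.join(log_dir, f"{prefix}_{i}{ext}")):
        i += 1
    return os.path.join(log_dir, f"{prefix}_{i}{ext}")

# e.g. next_log_path("log/example", prefix="pendulum") -> "log/example/pendulum_0.csv"
```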
- Trust Region Policy Optimization
- High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Towards Generalization and Simplicity in Continuous Control
- Deep Reinforcement Learning that Matters
If there are other good implementations, please let me know so I can add them to the list.
- Bootstrapping works much better than Monte Carlo returns
- Increasing the batch size speeds up learning, but the simulations take too long
- Training the policy and value networks with data from the same time step results in poor learning performance, even if the value training is performed after the policy optimization. Training the value function with the previous batch's data solves the problem. Using more than one previous batch does not improve the results.
- A high value training iteration number results in overfitting, and a low one causes poor learning. Note, however, that these experiments are performed with minibatches of size batch_size/iter, i.e. the minibatch size is not constant. (TODO: add constant minibatch size)
The experiments are performed in the Pendulum-v0 environment.
In this experiment, two different ways of estimating the return are compared:
1- Monte Carlo : the return is calculated from the discounted return of the next state (i.e. the discounted sum of the rewards actually observed)
2- Bootstrap : the return is calculated from the discounted value approximation of the next state
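A minimal sketch of the two estimators on a single trajectory; the function and argument names are illustrative.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted sum of the rewards actually observed after each step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def bootstrap_returns(rewards, next_state_values, gamma=0.99):
    """One-step target: reward plus the discounted value estimate of the next state."""
    return np.asarray(rewards) + gamma * np.asarray(next_state_values)
```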
In this experiment, we train the system with 4 different batch sizes.
In , they used the previous batch's data to train the value function to avoid overfitting; used the previous+current batch. Here, we test different combinations of both to see the difference.
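A sketch of how those data combinations for value-function fitting might be assembled; the batch layout (dicts with "obs" and "returns" arrays) is an assumption.

```python
import numpy as np

def value_training_data(prev_batch, curr_batch, mode="previous"):
    """Select the rollout data used to fit the value network (sketch).

    mode="previous"         : only the previous iteration's batch
    mode="previous+current" : previous and current batches concatenated
    """
    if mode == "previous":
        return prev_batch["obs"], prev_batch["returns"]
    if mode == "previous+current":
        obs = np.concatenate([prev_batch["obs"], curr_batch["obs"]])
        rets = np.concatenate([prev_batch["returns"], curr_batch["returns"]])
        return obs, rets
    raise ValueError(f"unknown mode: {mode}")
```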
We test the number of value training iterations. The experiment is performed with a batch size of 5k, and the minibatch size is 5k/iter_num.
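A sketch of the minibatch split used here: a 5k batch divided into iter_num minibatches, so the minibatch size shrinks as the iteration count grows. Names are illustrative.

```python
import numpy as np

def value_minibatches(obs, returns, iter_num, rng=None):
    """Yield iter_num minibatches of size len(obs) // iter_num (sketch).

    Any remainder after the integer division is dropped.
    """
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(obs))
    mb_size = len(obs) // iter_num  # e.g. 5000 // iter_num
    for i in range(iter_num):
        mb = idx[i * mb_size:(i + 1) * mb_size]
        yield obs[mb], returns[mb]
```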