CS294-112 HW 2: Policy Gradient

Usage

To run all experiments and plot figures for the report, run

bash run_4.sh
bash run_5.sh
bash run_7.sh
bash run_811.sh
bash run_812.sh
bash run_813.sh
bash run_82.sh
bash run_93.sh

All data will be saved in data/; all figures will be saved in results/.

Results

Problem 1

1a

For each term in equation 12, conditioning on s_t and taking the inner expectation over a_t ~ π_θ(a_t | s_t) gives zero. Therefore, the whole sum is zero and the state-dependent baseline adds no bias to the gradient estimate; a sketch of the argument is given below.
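
A sketch of this step in LaTeX, assuming equation 12 denotes the baseline term ∑_t E_{τ∼p_θ(τ)}[∇_θ log π_θ(a_t | s_t) b(s_t)] from the handout:

```latex
\begin{aligned}
\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  &= \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[ b(s_t)\,
     \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big] \Big] \\
  &= \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[ b(s_t)
     \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t \Big] \\
  &= \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[ b(s_t)\, \nabla_\theta
     \int \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t \Big]
   = \mathbb{E}_{s_t \sim p_\theta(s_t)}\big[ b(s_t)\, \nabla_\theta 1 \big] = 0 .
\end{aligned}
```

Summing over t then gives zero for the whole baseline term.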

1b

a. Future states and actions are independent of previous states and actions given the current state, by the Markov property of the MDP.

b. By the tower property, the expectation over the full trajectory can be taken first over the prefix (s_{1:t}, a_{1:t-1}) and then over the rest of the trajectory conditioned on that prefix. Using the conditional independence from part a, the inner expectation reduces to an expectation over a_t given s_t, which is zero as in 1a. Therefore, each term in equation 12 is again zero, and so is the sum; a sketch follows.
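
A sketch of the iterated-expectation argument, under the same reading of equation 12:

```latex
\begin{aligned}
\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  &= \mathbb{E}_{s_{1:t},\, a_{1:t-1}}\Big[
     \mathbb{E}_{s_{t+1:T},\, a_{t:T} \mid s_{1:t},\, a_{1:t-1}}\big[
       \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] \Big] \\
  &= \mathbb{E}_{s_{1:t},\, a_{1:t-1}}\Big[ b(s_t)\,
     \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[
       \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big] \Big]
   = 0 .
\end{aligned}
```

The first line uses the tower property; the second uses part a (given s_t, the action a_t is independent of the earlier states and actions), and the inner expectation vanishes exactly as in 1a.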

Problem 4

  • Reward-to-go performs better than the trajectory-centric estimator when advantage centering is off; it converges faster and has lower variance (a sketch of the two estimators follows this list).
  • Advantage centering helps reduce the variance after convergence.
  • A larger batch size also helps reduce the variance.
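
For concreteness, a minimal NumPy sketch of the two return estimators and of advantage centering; this is an illustration, not the starter code's implementation.

```python
import numpy as np

def trajectory_centric_returns(rewards, gamma):
    """Every time step gets the same total discounted return of the trajectory."""
    total = sum(gamma**t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go_returns(rewards, gamma):
    """Each time step t gets only the discounted rewards from t onward."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def center_advantages(adv, eps=1e-8):
    """Advantage centering: normalize to zero mean and unit standard deviation."""
    return (adv - adv.mean()) / (adv.std() + eps)

# Toy example: reward-to-go assigns smaller returns to later steps,
# while the trajectory-centric estimator repeats the same value.
rewards = np.array([1.0, 1.0, 1.0, 0.0])
print(trajectory_centric_returns(rewards, 0.99))
print(reward_to_go_returns(rewards, 0.99))
print(center_advantages(reward_to_go_returns(rewards, 0.99)))
```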

Problem 5

Problem 7

Problem 8

Within the range of parameters I tested, larger batch sizes and higher learning rates both gave better performance.

I chose a batch size of 50000 and a learning rate of 0.02.

Bonus 3

I experimented with taking multiple gradient descent steps on the same batch of data on InvertedPendulum.

I needed to decrease the learning rate to make this work. Taking several steps on the same batch has roughly the same effect as increasing the learning rate (although the two are not exactly equivalent from an optimization perspective); a sketch is given below.
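
A minimal TF1-style sketch of taking several gradient steps on one batch; the toy model, loss, and all names here are illustrative stand-ins for the policy-gradient objective and training op in train_pg_f18.py.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as in the original dependencies

# Toy regression graph standing in for the policy-gradient loss.
x_ph = tf.placeholder(tf.float32, [None, 1])
y_ph = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x_ph, 1)
loss = tf.reduce_mean(tf.square(pred - y_ph))

# Lower learning rate than in the single-step setting (Bonus 3 observation).
update_op = tf.train.AdamOptimizer(learning_rate=5e-3).minimize(loss)

num_grad_steps = 4  # extra gradient steps taken on the same batch
batch = {x_ph: np.random.randn(64, 1), y_ph: np.random.randn(64, 1)}

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(num_grad_steps):
        sess.run(update_op, feed_dict=batch)  # same batch reused every step
```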

Original README

Dependencies:

  • Python 3.5
  • Numpy version 1.14.5
  • TensorFlow version 1.10.5
  • MuJoCo version 1.50 and mujoco-py 1.50.1.56
  • OpenAI Gym version 0.10.5
  • seaborn
  • Box2D==2.3.2

Before doing anything, first replace gym/envs/box2d/lunar_lander.py with the provided lunar_lander.py file.

The only file that you need to look at is train_pg_f18.py, which you will implement.

See the HW2 PDF for further instructions.