- RL David Silver, Part II
- Lecture 6: Value Function Approximation
- Lecture 7: Policy Gradient
- 1 Introduction
- Policy Objective Functions
- Softmax Policy
- Gaussian Policy
- One-Step MDPs
- Policy Gradient Theorem
- Monte-Carlo Policy Gradient (REINFORCE)
- Puck World Example
- Reducing Variance Using a Critic
- Estimating the Action-Value Function
- Reducing Variance Using a Baseline
- Estimating the Advantage Function (1)
- Estimating the Advantage Function (2)
- Alternative Policy Gradient Directions
- 2 Finite Difference Policy Gradient
- 3 Monte-Carlo Policy Gradient
- 4 Actor-Critic Policy Gradient
- 1 Introduction
- Lecture 8: Integrating Learning and Planning
- Lecture 9: Exploration and Exploitation
- Lecture 10: Classic Games
For q(s,a,w), your features should be extracted from both the state and the action.
- e.g. self.featExtractor.getFeatures(state, action); a minimal sketch follows below.
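A minimal sketch of this kind of action-dependent feature extraction with a linear q(s,a,w); the extractor here is an illustrative stand-in, not the actual featExtractor from the exercise code:

```python
import numpy as np

# Hypothetical stand-in for featExtractor.getFeatures(state, action):
# the features depend on BOTH the state and the action.
def extract_features(state, action, n_actions):
    s = np.asarray(state, dtype=float)
    feats = np.zeros(s.size * n_actions)
    feats[action * s.size:(action + 1) * s.size] = s  # action-conditioned block
    return feats

# Linear action-value approximation: q(s, a, w) = phi(s, a) . w
def q_value(state, action, w, n_actions):
    return extract_features(state, action, n_actions) @ w
```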
Second figure: the action-in architecture, where the network takes both the state and the action as input and outputs a single value q(s,a,w).
Third figure: the action-out architecture, where the network takes only the state as input and outputs q(s,a,w) for every action at once.
- Is the action-out architecture the one used for the Atari games?
- When using function approximation, the eligibility trace has the same size as your parameter vector.
- It accumulates a decaying record of all the features you have seen so far.
We might have one feature, x₁, saying how far I am from the wall if I move forwards, and another feature, x₂, saying how far I am from the wall if I move backwards (see the sketch below).
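A minimal sketch of the backward-view TD(λ) update with linear function approximation, assuming v(s,w) = x(s)·w with features like x₁, x₂ above (names are illustrative):

```python
import numpy as np

def td_lambda_step(w, e, x_s, reward, x_s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One backward-view TD(lambda) step for v(s, w) = x(s) . w.
    The eligibility trace e has the same shape as w and accumulates a
    decaying record of every feature vector seen so far."""
    e = gamma * lam * e + x_s                         # decay, then add current features
    td_error = reward + gamma * x_s_next @ w - x_s @ w
    w = w + alpha * td_error * e                      # credit all recently seen features
    return w, e
```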
We store all of our experience instead of throwing it away after using it once.
It is just like going back to supervised learning.
Shuffle the data and do one step of stochastic gradient descent; shuffle again and do another step of stochastic gradient descent.
As said before, methods like Sarsa and TD can blow up when combined with neural networks; this method, however, is stable with neural networks.
We are actually running two neural networks, and we typically freeze the old network and bootstrap towards it (sketched below).
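A minimal sketch of these two tricks together (experience replay plus a frozen "old" network used for the bootstrap target); a linear q(s,a,w) stands in for the neural network, and all names are illustrative:

```python
import random
from collections import deque
import numpy as np

# Hedged sketch of the two stabilisation ideas discussed above:
# (1) experience replay: store transitions, then repeatedly sample mini-batches
#     and do stochastic gradient descent, like supervised learning;
# (2) fixed Q-targets: bootstrap from a frozen copy w_old of the parameters.

def phi(state, action, n_actions):
    s = np.asarray(state, dtype=float)
    f = np.zeros(s.size * n_actions)
    f[action * s.size:(action + 1) * s.size] = s
    return f

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def replay_update(w, w_old, buffer, n_actions, batch_size=32, gamma=0.99, alpha=0.01):
    """One pass of SGD over a sampled mini-batch, using the frozen weights
    w_old for the Q-learning bootstrap target."""
    for s, a, r, s_next, done in buffer.sample(batch_size):
        q_next = 0.0 if done else max(phi(s_next, b, n_actions) @ w_old
                                      for b in range(n_actions))
        td_error = (r + gamma * q_next) - phi(s, a, n_actions) @ w
        w = w + alpha * td_error * phi(s, a, n_actions)
    return w

# Every C updates, refresh the frozen network: w_old = w.copy()
```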
Idea: maximise the policy objective function J(θ).
- policy objective function J(θ): a function of the parameterised policy πθ(s,a)
- start value
- average value
- average reward per time step
- policy gradient
- ∇θ J(θ)
- Δθ = α·∇θ J(θ)
- compute policy gradient
- score function: (d/dx f(x)) / f(x) = d/dx log f(x)
- ∇θ πθ(s,a) = πθ(s,a) ∇θ log πθ(s,a)
- the ∇θ log πθ(s,a) term is called the score function
- what exactly is the score function?
- Softmax policy score function :
- ∇θ log πθ(s,a) = ɸ(s,a) - 𝔼πθ[ ɸ(s,·) ]
- the features of the action we actually took, minus the average of the features over all the actions we might have taken.
- use score function to compute ∇θ J(θ)
- ∇θ J(θ) = 𝔼πθ [ score_function· Qπθ (s, a) ]
- to update θ (stochastic gradient ascent):
- θ ← θ + α·score_function·Qπθ(s,a)
- using a critic: θ ← θ + α·score_function·Qw(s,a)
- w ← w + β·δ·φ(s,a), where δ is the TD error used to train the critic
- use Advantage Function
- ∇θ J(θ) = 𝔼πθ [ score_function· (Qw(s, a) - Vv(s) ) ]
- In practice we can use an approximate TD error
- ∇θ J(θ) = 𝔼πθ[ score_function·δv ], where δv = r + γ·Vv(s′) − Vv(s)
Parameterise the policy directly, rather than the value function.
Whenever state aliasing occurs, a stochastic policy can do better than a deterministic policy.
- start value
- In episodic environments we can use the start value
- This basically says: if I always start in some state s₁, or have some distribution over start states, what total reward will I get from that start state onwards?
- For example, in an Atari game, I play all the way through to the end and then look at the score I got.
- average value
- In continuing environments we can use the average value
- Consider the probability of being in any state, times the value from that state onwards; we average the value over all states.
- Maybe from s₁ we get 10 points onwards and from s₂ we get 20 points onwards, so we average those and say the objective is 15.
- Or the average reward per time-step
- In a continuing environment where I just go round and round, what I care about is gaining the most reward per time step. So we are not summing reward over time; we take the average of my immediate rewards over the entire distribution of states I visit.
- So the objective is basically: there is some probability that I'm in a state, some probability that I take an action from that state under my policy, and an immediate reward for that step. What we care about is getting the most reward per time step.
These three objectives end up leading to essentially the same policy gradient.
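For reference, the three objectives written out (standard forms; dπθ(s) denotes the stationary distribution of the Markov chain induced by πθ):
- start value (episodic): J₁(θ) = Vπθ(s₁) = 𝔼πθ[ v₁ ]
- average value (continuing): JavV(θ) = ∑ₛ dπθ(s)·Vπθ(s)
- average reward per time-step: JavR(θ) = ∑ₛ dπθ(s)·∑ₐ πθ(s,a)·R(s,a)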
We use the softmax policy for discrete actions.
The score function is just the features of the action we actually took, minus the average of the features over all the actions we might have taken.
So the score function is basically asking: how much more of this feature do I have than usual?
When we run this kind of policy gradient algorithm, what we are actually saying is: if a feature occurs more than usual and we get reward, then we want to adjust the policy to do more of that.
∇θ log πθ(s,a)
= ∇θ log ( exp(ɸ(s,a)ᵀθ) / ∑_a′ exp(ɸ(s,a′)ᵀθ) )
= ∇θ ɸ(s,a)ᵀθ − ∇θ log ∑_a′ exp(ɸ(s,a′)ᵀθ)
= ɸ(s,a) − ( ∑_a′ exp(ɸ(s,a′)ᵀθ)·ɸ(s,a′) ) / ∑_a′ exp(ɸ(s,a′)ᵀθ)
= ɸ(s,a) − ∑_b πθ(b|s)·ɸ(s,b)
= ɸ(s,a) − 𝔼πθ[ ɸ(s,·) ]
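A small numeric sketch of this identity (illustrative features; it checks the analytic score against a finite-difference gradient of log πθ):

```python
import numpy as np

def softmax_policy(theta, feats):
    """feats[a] holds phi(s, a); returns pi_theta(.|s) for a softmax policy."""
    logits = feats @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def score(theta, feats, a):
    """grad_theta log pi_theta(s, a) = phi(s, a) - sum_b pi(b|s) phi(s, b)."""
    return feats[a] - softmax_policy(theta, feats) @ feats

# Finite-difference check on a random instance (3 actions, 4 features)
rng = np.random.default_rng(0)
feats, theta, a, eps = rng.normal(size=(3, 4)), rng.normal(size=4), 1, 1e-6
numeric = np.array([
    (np.log(softmax_policy(theta + eps * np.eye(4)[i], feats)[a])
     - np.log(softmax_policy(theta - eps * np.eye(4)[i], feats)[a])) / (2 * eps)
    for i in range(4)])
assert np.allclose(numeric, score(theta, feats, a), atol=1e-5)
```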
The Gaussian policy is used in continuous action spaces, e.g. for the AIBO robot.
score function: ∇θ log πθ(s,a) = ( (a − μ(s))·ɸ(s) ) / σ²
- a: the action we actually took
- μ(s): the mean action, a linear combination of state features, μ(s) = ɸ(s)ᵀθ (with a fixed variance σ²)
The final expectation 𝔼 is the whole point of using the likelihood ratio trick. We start with something which is an expectation, take the gradient, and recover something which is still an expectation: an expectation under the policy, over start states and over the actions we pick, of the gradient of the log policy multiplied by the reward -- the expectation of the score times the reward.
It shows that if we want to increase the objective function and get more reward, we just need to move in the direction determined by the score times the reward.
Again, the score tells us how to adjust the policy to get more or less of something, and the reward tells us what is good or bad.
So if you get a really big reward, you want to move in the direction that gets more of it; if you get a negative reward, you want to move in the opposite direction.
The policy gradient theorem says the gradient is an expectation of the score function multiplied by the action value Q. It again tells you how to adjust the policy to get more or less of a particular action, multiplied by how good that particular action was.
Use stochastic gradient ascent to optimise the parameters.
- sample Q: use the return vₜ as an unbiased sample of Qπθ(sₜ,aₜ)
- So we are in a state, we take an action, and we see what return we got; we use that return to estimate Q and plug it into our policy gradient to give us an estimated direction to move in.
- In every episode, at each step, we adjust the parameters a little bit in the direction of the score multiplied by the return we got from that point onwards (sketched in code below).
- It is slow: MC policy gradient methods tend to be slow because they have very high variance.
- The rest of this class is about using similar ideas but making them more efficient.
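A minimal sketch of the REINFORCE update for a softmax policy (the episode format and feature handling are illustrative assumptions):

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """Monte-Carlo policy gradient: theta <- theta + alpha * score * v_t at every
    step t, where v_t is the sampled return from that step onwards.
    `episode` is a list of (feats, action, reward); feats[a] holds phi(s_t, a)."""
    # Compute the return v_t for every time step
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # One stochastic-gradient-ascent step per visited (state, action)
    for (feats, a, _), v_t in zip(episode, returns):
        logits = feats @ theta
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        score = feats[a] - pi @ feats            # grad_theta log pi_theta(s, a)
        theta = theta + alpha * score * v_t
    return theta
```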
So that's going to bring us to the final family of algorithms -- actor critic methods.
- If I'm playing an Atari game, in one particular episode I might get a score of 1000 and in the next episode a score of 0, just from the randomness of what happens: over the course of 10000 steps there are many, many different random events that might occur. So the return is very noisy.
- The main idea of actor-critic methods is that instead of using the return to estimate the action-value function, we explicitly estimate the action-value function using a critic, i.e. a value function approximator.
- combine value function approximation with our policy methods.
- plug Qw into our policy gradient as a substitute for Qπ.
- actor: the actor is the thing doing things in the world; it contains the policy, picks actions, and actually makes the decisions about what to do in the world.
- critic: the critic doesn't take any decisions; it just watches what the actor does and evaluates whether it is good or bad, saying e.g. those decisions were good, they got a score of 1000, or they got a score of -1000.
- the main idea is now to use an approximate policy gradient instead of the true policy gradient
- We adjust the actor, i.e. the policy, in the direction which, according to the critic, will get more reward.
- So the critic says: I think if you go in this direction you can actually do better; and then the actor moves in the direction of that gradient.
- The way we do that is to take our original policy gradient algorithm and replace the true action-value function with this estimated, approximate value function, where a neural network or whatever estimates it.
- At each step we move a little bit, using stochastic gradient ascent, in the direction of the score multiplied by a sample from our function approximator. The critic says "I think this is good" or "I think this is bad", and we move a little in the direction that gets more or less of the things the critic says are good or bad (a code sketch of this loop follows below).
So how do we estimate the action value function ?
- We can think of this as another form of generalised policy iteration.
- We start off with a policy, evaluate that policy using the critic, and then, instead of doing greedy policy improvement, we take a gradient step to get a better policy.
- To choose the first policy you can pick an arbitrary one: initialise the policy parameters θ however you want.
- There is no greedy step any more, no ε really; the policy itself determines how we move around the environment.
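A minimal sketch of one step of that loop (a QAC-style actor-critic with a softmax actor and a linear critic Qw(s,a) = ɸ(s,a)·w; the interface is illustrative):

```python
import numpy as np

def softmax(theta, feats):
    """feats[a] holds phi(s, a); returns the actor's pi_theta(.|s)."""
    logits = feats @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def actor_critic_step(theta, w, feats, a, r, feats_next, a_next,
                      alpha=0.01, beta=0.1, gamma=0.99):
    """One Sarsa-style actor-critic update.
    Critic: linear Qw(s, a) = phi(s, a) . w, moved towards the TD target.
    Actor : theta moves in the direction  score * Qw(s, a)."""
    q_sa = feats[a] @ w
    td_error = r + gamma * (feats_next[a_next] @ w) - q_sa

    w = w + beta * td_error * feats[a]                 # critic update (TD(0))

    score = feats[a] - softmax(theta, feats) @ feats   # grad_theta log pi_theta(s, a)
    theta = theta + alpha * score * q_sa               # actor update

    return theta, w
```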
Now we have the basic idea of the actor-critic algorithm. What we want to do next is make it better, so we now consider some tricks.
The first trick, and perhaps the easiest and best one, is to reduce variance using what's called a baseline.
The idea is to subtract some baseline function B(s) from the policy gradient, and this can be done in a way that doesn't change the direction of ascent: it changes the variance of the estimator, so we can reduce the variance without changing the expectation. In other words, we subtract a term of the form 𝔼πθ[ ∇θ log πθ(s,a)·B(s) ] from the policy gradient, and this term turns out to be zero.
- B(s) does not depend on the action, so it can be pulled outside the sum over actions.
- We can pull the gradient ∇θ outside of the sum ∑ₐ; since ∑ₐ πθ(s,a) = 1 for every state, we are taking the gradient of a constant, so the whole subtracted term is zero.
A good baseline is the state value function, B(s) = Vπθ(s); subtracting it gives the advantage function Aπθ(s,a) = Qπθ(s,a) − Vπθ(s).
- It tells us how much better than usual it is to take action a in state s.
There is an easier, and probably better, way to estimate the advantage.
This is probably the most commonly used variant of the actor critic.
It uses the idea that the TD error is an unbiased sample of the advantage function.
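In one line, using the notation above: the true TD error is δπθ = r + γ·Vπθ(s′) − Vπθ(s), and
𝔼πθ[ δπθ | s,a ] = 𝔼πθ[ r + γ·Vπθ(s′) | s,a ] − Vπθ(s) = Qπθ(s,a) − Vπθ(s) = Aπθ(s,a),
so the expected TD error is exactly the advantage. In practice we use the approximate TD error δv = r + γ·Vv(s′) − Vv(s), which only needs one set of critic parameters v (no separate Qw).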
There is a really important question in actor-critic algorithms: we used this critic, replaced the true value with some approximation, plugged it in, and hoped that the gradient we follow is still correct. But how do we know that is valid? How do we know this is really pushing us in the correct gradient direction?
The answer is that, amazingly, if you choose the value function approximator carefully, it is possible to pick one that doesn't introduce any bias at all. In other words, despite approximating the value function rather than having the true one, we can still be guaranteed to follow the true gradient with our policy update.
That approach is called compatible function approximation. The main idea is that, to have a compatible function approximator, the features we use are themselves the score function: we build up features for our critic where the features are the score of our policy.
If we use that particular type of feature, and a linear combination of those features, then we guarantee we don't bias the gradient: we still follow the true gradient direction when we use these compatible features.
One last idea, a more recent one and a useful one to know about: the Natural Policy Gradient.
We start from real experience and build a model; now we can use this model to generate sampled experience.
The advantage of this approach is that, even though we have only seen that limited real experience, we can sample as many trajectories as our computation budget allows.
Having a model is like having unlimited data.
So: start from real experience, build a model, sample experience, and now the final step is to learn from that sampled experience, applying, say, MC learning to these trajectories (see the sketch below).
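A minimal sketch of that pipeline with a table-lookup model (all names illustrative):

```python
import random
from collections import defaultdict

# Hedged sketch of "learn a model from real experience, then learn from
# sampled experience": a table-lookup model stores observed (s, a) outcomes,
# and we sample imagined transitions from it for model-free learning.

class TableLookupModel:
    def __init__(self):
        self.outcomes = defaultdict(list)    # (s, a) -> list of (reward, s_next)

    def learn(self, s, a, r, s_next):
        self.outcomes[(s, a)].append((r, s_next))

    def sample(self, s, a):
        """Sample a reward and next state with their empirical probabilities."""
        return random.choice(self.outcomes[(s, a)])

# Usage: fit the model from real transitions, then draw as many sampled
# transitions as the computation budget allows and apply, e.g., MC learning
# or Q-learning to those sampled trajectories.
```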
- Information state space
- I'm in the state where I've tried left 3 times and right 1 time, and we can ask how good it is to act in this state, where the state summarises the information the agent has accumulated.
- So I might be in a state I've never seen before.
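A tiny sketch of what such an information state could look like for a two-action problem (an illustrative assumption: just counts of what has been tried and what was observed):

```python
# Information state = everything the agent has accumulated so far, e.g. for a
# two-armed problem: (pulls_left, wins_left, pulls_right, wins_right).
# "Tried left 3 times and right 1 time" is one such state; almost every new
# step produces an information state never seen before.

def next_information_state(s_info, action, win):
    pl, wl, pr, wr = s_info
    if action == "left":
        return (pl + 1, wl + int(win), pr, wr)
    return (pl, wl, pr + 1, wr + int(win))

s = (3, 2, 1, 0)                                    # left tried 3 times (2 wins), right once
s = next_information_state(s, "right", win=True)    # -> (3, 2, 2, 1)
```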