Hey, this code and the AWAC paper are awesome! Thanks for sharing this library; I've been reading some of it lately, trying to understand and apply the AWAC paper. :)
However, I have a question about how the Q(s,a) term of the advantage function is implemented in the library:
rlkit/rlkit/torch/sac/awac_trainer.py, line 554 (commit c81509d)
Here, q1_pred and q2_pred are both computed directly from the learned Q1 and Q2 functions.
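To make sure I'm reading it right, here's a simplified sketch of what I understand the advantage/weight computation to be. This is not the actual trainer code; qf1, qf2, policy, and beta are stand-ins for the trainer's critics, policy, and temperature, and I'm assuming policy(obs) returns a torch.distributions-style object with rsample():

```python
import torch

def awac_weights_current(qf1, qf2, policy, obs, actions, beta=1.0):
    """Sketch of the advantage weights as I understand the current code:
    the Q(s,a) term is taken directly from the learned critics."""
    # Q(s,a) term: min over the two learned critics, evaluated at the buffer action
    q_pred = torch.min(qf1(obs, actions), qf2(obs, actions))

    # V(s) baseline: critics evaluated at actions sampled from the current policy
    new_actions = policy(obs).rsample()
    v_pi = torch.min(qf1(obs, new_actions), qf2(obs, new_actions))

    # A(s,a) = Q(s,a) - V(s), turned into exponential weights for the policy loss
    advantage = q_pred - v_pi
    return torch.exp(advantage / beta)
```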
I was wondering about this because, as I understand it, the code uses Q(s,a) directly instead of the 1-step return r(s,a) + γ·Q(s',a') for the Q(s,a) term of the advantage function. It also seems that a 1-step return estimate is already computed and stored in the variable q_target:
rlkit/rlkit/torch/sac/awac_trainer.py, line 496 (commit c81509d)
Is there a reason for using Q(s,a) obtained directly as min(Q1(s,a), Q2(s,a)) instead of the 1-step return r(s,a) + γ·min(Q1(s',a'), Q2(s',a'))? Couldn't that introduce more bias into the return estimate?
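Concretely, the change I'm describing would be something like the sketch below (again with hypothetical names, not the library's code; rewards and terminals come from the replay batch, and gamma is the discount):

```python
import torch

def awac_weights_one_step(qf1, qf2, policy, obs, rewards, next_obs,
                          terminals, gamma=0.99, beta=1.0):
    """Sketch of the variant I tried: replace the Q(s,a) term with the
    1-step return r(s,a) + gamma * min(Q1(s',a'), Q2(s',a'))."""
    # Sample a' from the current policy at the next state
    next_actions = policy(next_obs).rsample()
    target_q = torch.min(qf1(next_obs, next_actions), qf2(next_obs, next_actions))

    # 1-step return, masking out the bootstrap term at terminal states
    q_one_step = rewards + (1.0 - terminals) * gamma * target_q

    # Same V(s) baseline as in the current formulation
    new_actions = policy(obs).rsample()
    v_pi = torch.min(qf1(obs, new_actions), qf2(obs, new_actions))

    advantage = q_one_step - v_pi
    return torch.exp(advantage / beta)
```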
Also, I ran some tests switching to the latter formulation in my own implementation on a different problem, and the results were very similar, so it's unclear to me whether there's actually any benefit to switching from one to the other.
Thanks in advance!