introduce recurrent sac-discrete #1
Conversation
Hi, I ran this command:
Hi, yes, I did observe that Markovian SAC-Discrete is unstable on CartPole across seeds. You may try disabling auto-tuning of alpha and grid-searching over fixed alpha values (a sketch of what that might look like is below). I did not have much insight into it.
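A minimal sketch of a fixed-alpha setup, for illustration only; the flag and variable names here are assumptions and may not match this repository's actual configuration keys:

```python
import torch

# Hypothetical flags for illustration; the repository's actual config keys may differ.
automatic_entropy_tuning = False  # disable auto-tuning of the temperature
fixed_alpha = 0.1                 # grid-search over values such as 0.01, 0.05, 0.1, 0.3

if automatic_entropy_tuning:
    # Learned temperature: optimize log_alpha toward a target entropy.
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
    alpha = log_alpha.exp()  # recomputed after each alpha update in practice
else:
    # Fixed temperature: a constant used directly in the actor and critic losses.
    alpha = torch.tensor(fixed_alpha)
```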
It seems that your task has sparse rewards. I guess the entropy is high in the early learning stage, and through training it decreases to a threshold where the agent can exploit its "optimal" behavior to receive some positive rewards.
Yeah, that's right. However, while the agent does experience rewards during the 0-10k period, the policy gradient doesn't seem to be large. The policy during evaluation also didn't change much, often not achieving any success at all, even though the reward is not so sparse that getting a positive one is hard.
@hai-h-nguyen Did you ever find a solution to these issues? I am experiencing somewhat similar behaviour.
Hi @RobertMcCarthy97, it might help to start alpha at a smaller value, such as 0.01, rather than the starting value in the code (1.0). That way the agent explores less initially.
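A minimal sketch of that change, assuming the temperature is parameterized as alpha = exp(log_alpha) as in common SAC implementations (variable names are illustrative and may not match this repository):

```python
import math
import torch

# log_alpha = 0.0 corresponds to the default alpha = 1.0.
# Starting from log(0.01) instead gives alpha = 0.01, so the entropy bonus,
# and hence the amount of initial exploration, is much smaller.
initial_alpha = 0.01
log_alpha = torch.tensor(math.log(initial_alpha), requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

alpha = log_alpha.exp()  # used in the actor/critic losses; still auto-tuned from here on
```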
This PR introduces a recurrent SAC-discrete algorithm for POMDPs with discrete action spaces.
The code is heavily based on the open-sourced SAC-discrete implementation https://github.com/ku2482/sac-discrete.pytorch/blob/master/sacd/agent/sacd.py and the SAC-discrete paper https://arxiv.org/abs/1910.07207.
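As background, a minimal sketch of what a recurrent discrete actor can look like, i.e. a GRU that encodes the observation history before a categorical policy head. This is a generic illustration under those assumptions, not the exact architecture used in this PR:

```python
import torch
import torch.nn as nn

class RecurrentCategoricalActor(nn.Module):
    """Categorical policy conditioned on the observation history via a GRU."""

    def __init__(self, obs_dim: int, num_actions: int, hidden_size: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs_seq: torch.Tensor, hidden: torch.Tensor = None):
        # obs_seq: (batch, seq_len, obs_dim); hidden carries the summary of past
        # observations (the agent's internal state) across steps or chunks.
        features, hidden = self.encoder(obs_seq, hidden)
        logits = self.head(features)  # (batch, seq_len, num_actions)
        dist = torch.distributions.Categorical(logits=logits)
        return dist, hidden
```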
We provide two sanity checks on classic gym discrete control environments: CartPole-v0 and LunarLander-v2. The PR lists commands for running both the Markovian and recurrent SAC-discrete algorithms, where target_entropy sets the ratio of the target entropy: ratio * log(|A|).

Results (learning-curve plots are attached in the original PR): on CartPole-v0, Markovian SAC-discrete is sensitive to target_entropy but can solve the task with the max return of 200, while recurrent SAC-discrete can solve the task with max return 200 within 10 episodes. On LunarLander-v2, Markovian SAC-discrete can solve the task with return over 200, and recurrent SAC-discrete can nearly solve the task at one target_entropy value.
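For reference, a minimal sketch of how a target-entropy ratio is typically turned into an actual target entropy and an alpha loss in SAC-discrete implementations such as the one this PR builds on (function and variable names are illustrative; the exact code in this PR may differ):

```python
import math
import torch

def make_target_entropy(num_actions: int, ratio: float) -> float:
    # The maximum entropy of a discrete policy over |A| actions is log(|A|),
    # so the target entropy is ratio * log(|A|) with ratio in (0, 1].
    return ratio * math.log(num_actions)

def alpha_loss(log_alpha: torch.Tensor,
               action_probs: torch.Tensor,      # (batch, |A|) policy probabilities
               log_action_probs: torch.Tensor,  # (batch, |A|) matching log-probabilities
               target_entropy: float) -> torch.Tensor:
    # Policy entropy per state: -sum_a pi(a|s) * log pi(a|s).
    entropy = -(action_probs * log_action_probs).sum(dim=-1)
    # Gradient descent on this loss raises alpha when entropy < target
    # and lowers alpha when entropy > target.
    return (log_alpha * (entropy - target_entropy).detach()).mean()
```

For example, with |A| = 2 (CartPole) and a ratio of 0.7 (an arbitrary illustrative value), the target entropy is 0.7 * ln(2) ≈ 0.49 nats.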