train_baseline might not be working in the Cleanup environment #144
All of the agents' rewards saturate to 0 at around ~390 episodes when training with the default configuration of the Cleanup environment in train_baseline.py. All of the agents just keep getting 0 reward until the end of training.

Comments
Sorry for the difficulties you've been having and thank you for trying out the baselines!
Thank you so much for your kind reply ^_^ !! Still, I have some concerns. For the second part, I will look through the default parameters in Ray and will let you know if I find any critical parameters that affect performance.
One thing I would suggest trying is just disabling the hyperparameters in that file and using the default hyperparameters in Ray, with a possible sweep over the learning rate and the training batch size. To my mind those are usually the most critical hyperparameters.
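For reference, a minimal sketch of what that sweep might look like with Ray Tune, leaving everything else at RLlib's defaults. The trainer string, environment name ("cleanup"), stopping budget, and sweep values are all illustrative assumptions, not the settings actually used in train_baseline.py:

```python
# Minimal sketch: sweep only the learning rate and training batch size,
# letting every other hyperparameter fall back to RLlib's defaults.
# Assumes the Cleanup env has already been registered under the name "cleanup"
# (e.g. via ray.tune.registry.register_env), as the repo's training script does.
import ray
from ray import tune

if __name__ == "__main__":
    ray.init()
    tune.run(
        "A3C",                               # baseline trainer family used in this repo
        stop={"timesteps_total": int(5e8)},  # arbitrary budget for illustration
        config={
            "env": "cleanup",                # placeholder registered env name
            "num_workers": 4,
            "lr": tune.grid_search([1e-3, 1e-4, 1e-5]),
            "train_batch_size": tune.grid_search([500, 1000, 2000]),
        },
    )
```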
This isn't super unusual: without some initial luck with agents deleting enough of the initial waste cells, they never learn to get any apples. Increasing (making more negative) the value of the entropy coefficient, which should encourage exploration, may also help with the 0 score.
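As a concrete illustration of the entropy suggestion, a sketch of the relevant config fragment is below. The values are made up, and the sign convention (a negative coefficient, made more negative for a stronger entropy bonus) simply follows the comment above; check the convention in your RLlib version, since newer releases use a positive entropy_coeff instead:

```python
# Sketch only: strengthen the entropy bonus to encourage exploration.
# Values are illustrative; verify the sign convention for your RLlib version.
config = {
    "env": "cleanup",        # placeholder env name, as in the sketch above
    "entropy_coeff": -0.1,   # more negative than typical defaults -> more exploration
    "lr": 1e-4,
    "train_batch_size": 1000,
}
```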
Hey, just to chime in: actually, agents collectively scoring 0 reward in Cleanup is very typical. When I was using these environments at DeepMind, it was understood that 0 was the normal score for A3C agents. If you check out the Inequity Aversion paper https://arxiv.org/pdf/1803.08884.pdf, they report an average collective return of near 0 for 5 A3C agents in Cleanup. I know that my paper reports a higher score; that was pretty atypical, and really only because I did a super extensive hyperparameter sweep for the baseline. So you can consider 0 the expected score. Agents have difficulty solving this task because of the partial observability (they can't see the apples appear when they clean the river), the delayed rewards (they have to clean the river, walk to the apple patch, and obtain apples before they can build that association), and because even if one agent learns to clean the river, the other agents will exploit it so heavily that they consume all the apples before it can get there to harvest anything from its cleaning. So it will eventually un-learn the cleaning behavior. Hope this helps!
Thanks for your comments! It helps a lot. I am a beginner in MARL and hoping to try some of my ideas for solving these social dilemmas. Just the phrase "super extensive hyperparameter sweep" sounds pretty scary to me, since I'm working on my personal computer. By the way, I loved the way you encoded the intrinsic motivation, and thanks for the open-source reproduction of the environments.
Hi, we've found a few bugs that may be contributing to your difficulties reproducing the results, and we will ping you here as soon as they're resolved; apologies! Additionally, you may want to try focusing on Harvest; I've found it to be less sensitive to hyperparams.
Actually, I had some fun experimenting with the Cleanup and Harvest environments. However, one thing that confused me was that the collective returns soared much earlier than reported in the paper. Thank you for your kind replies!!
Hi @natashamjaques, I think you might be able to answer this question best?
Hi @joonleesky, I'm pretty sure that timesteps is the total number of environment steps. It's perfectly possible that you've just found a better set of hyperparams. Would you mind posting what those hyperparams are so that I can:
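For anyone comparing their curves against the paper's, a rough sketch (not part of the repo) of plotting collective return against total environment steps from the progress.csv that Ray Tune writes into each trial directory. The file path is a placeholder; "timesteps_total" and "episode_reward_mean" are the standard Tune/RLlib result keys, and in RLlib's multi-agent setup the latter should correspond to the reward summed over all agents:

```python
# Rough sketch: plot collective return vs. total environment steps
# from a Ray Tune trial's progress.csv.
import os
import pandas as pd
import matplotlib.pyplot as plt

csv_path = os.path.expanduser("~/ray_results/<experiment>/<trial>/progress.csv")  # placeholder path
df = pd.read_csv(csv_path)

plt.plot(df["timesteps_total"], df["episode_reward_mean"])
plt.xlabel("environment steps (timesteps_total)")
plt.ylabel("mean collective episode reward")
plt.title("Cleanup baseline training curve")
plt.show()
```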
Hi @joonleesky, I'd still love to know what hyperparams you wound up using!
Hello @eugenevinitsky, sorry for the late reply! I was in the middle of the semester. Actually, I've had some fun experiments adopting the reward schemes from https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a and some other methods. Even though the performance was somewhat unstable, I will soon commit the training scripts with hyperparameters, the model scripts, and the results, to help other people who need them. It was really fun to watch how training progressed. Thank you!
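For readers who land here, a generic sketch of what "adopting a reward scheme" along those lines could look like: blending a hand-crafted shaping bonus with the environment reward and annealing the bonus away over training. This is purely illustrative and not necessarily what the linked gist or @joonleesky implemented; `shaping_bonus` is a hypothetical per-agent term (e.g. a small reward for firing the cleaning beam at waste cells in Cleanup):

```python
# Generic sketch (not from this repo or the linked gist): blend a shaping bonus
# with the raw environment reward and anneal the bonus away over training, so
# agents eventually optimize the true objective.
def mixed_reward(env_reward: float,
                 shaping_bonus: float,
                 step: int,
                 anneal_steps: int = 1_000_000) -> float:
    """Return the annealed mixture of true and shaped reward."""
    alpha = max(0.0, 1.0 - step / anneal_steps)  # 1 -> 0 as training progresses
    return env_reward + alpha * shaping_bonus
```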
Hello @joonleesky, I ran into the same problem as you, but I couldn't find your commits. Could you share a link to your contributions? I would be very grateful.