
train_baseline might not be working in cleanup environment #144

Open
joonleesky opened this issue Apr 2, 2019 · 13 comments

@joonleesky

All of the agents' rewards saturate to 0 at around episode ~390 when training with the default configuration of the cleanup environment in train_baseline.py.
All of the agents then receive 0 reward until the end of training.

@eugenevinitsky
Owner

Sorry for the difficulties you've been having and thank you for trying out the baselines!
There are three aspects to this. One, 390 episodes is not a lot; if you look at the paper the training curves come from, they take many, many more iterations: https://arxiv.org/abs/1810.08647
Two, some of the default parameters of Ray, such as the way the underlying value functions are implemented, may differ from the paper, so simply taking the default hyperparameters from the paper may not be sufficient.
Three, in my understanding, most of the time the scores in Cleanup actually come out as zero, and a score above zero is the exception rather than the rule. Is this third point correct @natashamjaques?

@joonleesky
Author

Thank you so much for your kind reply ^_^ !! Still, I have some concerns.
Even though I have trained for around 20,000,000 time steps (20,000 episodes), all of my agents' rewards have stayed at 0 since episode 390 without any change.

For the second part, I will look through the default parameters in Ray and will let you know if I find any critical parameters that could affect performance.

@eugenevinitsky
Owner

One thing I would suggest trying is just disabling the hyperparameters in that file and using the default hyperparameters in Ray, with a possible sweep over the learning rate and the training batch size. To my mind those are usually the most critical hyperparameters.
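
For concreteness, here is a minimal sketch of what such a sweep might look like with Ray Tune. The API details vary by Ray version, the environment name "cleanup" is a placeholder for whatever the env is registered as in train_baseline.py, and the value ranges are purely illustrative:

```python
import ray
from ray import tune

ray.init()

tune.run(
    "A3C",
    config={
        "env": "cleanup",  # placeholder: use the name the env is actually registered under
        # Sweep the two hyperparameters suggested above; values are illustrative.
        "lr": tune.grid_search([1e-3, 5e-4, 1e-4, 5e-5]),
        "train_batch_size": tune.grid_search([500, 1000, 2000]),
    },
    stop={"episodes_total": 20000},
)
```

Everything else is left at RLlib's defaults, which is the point of the exercise.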

@eugenevinitsky
Owner

This isn't super unusual; without some initial luck in agents deleting enough of the initial waste cells, they never learn to get any apples. Increasing (making more negative) the value of the entropy coefficient, which should encourage exploration, may also help with the 0 score.
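
As a rough sketch of that suggestion (values illustrative, and note that the sign convention for entropy_coeff differs across RLlib versions; the wording above assumes a convention where a more negative value gives a larger entropy bonus):

```python
from ray import tune

# Illustrative sweep over the entropy coefficient to encourage exploration.
# Merge this into the trainer config passed to tune.run(), as in the sweep
# sketch in the previous comment.
entropy_sweep = {
    "entropy_coeff": tune.grid_search([-1e-4, -1e-3, -1e-2]),
}
```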

@natashamjaques
Collaborator

Hey, just to chime in: actually, agents collectively scoring 0 reward in Cleanup is very typical. When I was using these environments at DeepMind, it was understood that 0 was the normal score for A3C agents. If you check out the Inequity Aversion paper https://arxiv.org/pdf/1803.08884.pdf, they report an average collective return of near 0 for 5 A3C agents in Cleanup.

I know that my paper reports a higher score; this was pretty atypical and actually just because I did a super extensive hyperparameter sweep for the baseline. But you can consider 0 the expected score.

Agents have difficulty solving this task because of the partial observability (they can't see the apples appear when they clean the river), the delayed rewards (they have to clean the river, walk to the apple patch, and obtain apples to build that association), and because even if one agent learns to clean the river, the other agents will exploit it so heavily that they consume all the apples before it can harvest anything resulting from its cleaning. So it will eventually un-learn the cleaning behavior.

Hope this helps!

@joonleesky
Author

Thanks for your comments! They help a lot.
I just found it a bit odd that the causal influence paper's A3C baseline in Cleanup was around 50~100 and the Inequity Aversion paper's baseline seemed to be around 10~50, rather than 0 all the way.

I am a beginner in MARL and am hoping to try some of my own ideas on these social dilemmas. The phrase "super extensive hyperparameter sweep" sounds a bit scary to me, since I'm working on my personal computer.

By the way, I loved the way you encoded the intrinsic motivation, and thanks for the open-source reproduction of the environments.

@eugenevinitsky
Owner

Hi, we've found a few bugs that may be contributing to your difficulties reproducing, and we will ping you here as soon as they're resolved; apologies! Additionally, you may want to try focusing on Harvest; I've found it to be less sensitive to hyperparams.

@joonleesky
Author

joonleesky commented Apr 28, 2019

Actually, I had some fun experimenting with the Cleanup and Harvest environments.
With the A3C algorithm, I was able to more or less reproduce the results.
Below are the results of re-implementing the paper Inequity aversion improves cooperation in intertemporal social dilemmas.

CleanUp - Original Paper: [image]

CleanUp - My Experiment: [image]

However, one thing that confused me is that the collective returns soared much earlier than reported in the paper.
I thought timesteps was the number of timesteps of a single agent. In this multi-agent setting, is timesteps equal to single-agent timesteps * number of agents, or might this be the result of the bugs you mentioned?
Or maybe I'm just good at hyper-parameter tuning? LOL

Thank you for your kind replies!!

@eugenevinitsky
Owner

Hi @natashamjaques, I think you might be able to answer this question best?

@eugenevinitsky
Owner

Hi @joonleesky, I'm pretty sure that timesteps is the total number of environment steps. It's perfectly possible that you've just found a better set of hyperparams. Would you mind posting what those hyperparams are so that I can:
(1) investigate the issue
(2) put those hyperparams into the project?
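
For what it's worth, a rough bookkeeping sketch of the distinction, assuming the 1000-step episodes implied by the numbers earlier in the thread (20,000,000 steps over 20,000 episodes) and 5 agents:

```python
episodes   = 20_000
horizon    = 1_000           # env steps per episode (assumed from the numbers above)
num_agents = 5

env_steps   = episodes * horizon        # 20,000,000 -- what the "timesteps" counter reports
agent_steps = env_steps * num_agents    # 100,000,000 individual agent transitions
```

So if the counter were per-agent, the same run would read five times higher.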

@eugenevinitsky
Owner

Hi @joonleesky, I'd still love to know what hyperparams you wound up using!

@joonleesky
Author

Hello @eugenevinitsky, sorry for the late reply! I was in the middle of the semester.
I've tested the inequity aversion model again with 5 random seeds and the same hyper-parameters, but the soaring performance was observable in only 1 of the 5 runs.
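
For readers landing here, a minimal sketch of the inequity-aversion reward adjustment from the paper linked above (https://arxiv.org/pdf/1803.08884.pdf); the function name and parameter values are illustrative, not the paper's tuned settings or this repo's implementation:

```python
import numpy as np

def inequity_adjusted_rewards(rewards, smoothed, alpha=5.0, beta=0.05,
                              gamma=0.99, lam=0.975):
    """Sketch of the subjective reward: the extrinsic reward minus penalties for
    disadvantageous and advantageous inequity, computed from temporally
    smoothed rewards e_i."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    # Temporal smoothing: e_i(t) = gamma * lambda * e_i(t-1) + r_i(t)
    smoothed = gamma * lam * np.asarray(smoothed, dtype=float) + rewards

    adjusted = np.empty(n)
    for i in range(n):
        diff = smoothed - smoothed[i]
        disadvantage = np.maximum(diff, 0.0).sum()   # others doing better than agent i
        advantage = np.maximum(-diff, 0.0).sum()     # agent i doing better than others
        adjusted[i] = (rewards[i]
                       - (alpha / (n - 1)) * disadvantage
                       - (beta / (n - 1)) * advantage)
    return adjusted, smoothed
```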

Actually, I've had some fun experimenting with adopting the reward scheme from https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a and some other methods.

Even though the performance was somewhat unstable, I will soon commit the training scripts with hyper-parameters, the model scripts, and the results to help anyone else who needs them.

It was really fun for me to watch how training progressed, as shown below. Thank you!

[animated GIF]

@LUCKYGT

LUCKYGT commented Nov 15, 2020

Hello @joonleesky, I ran into the same problem as you, but I couldn't find your commits. Could you share a link to your contributions? I would be very grateful.
