Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in Reward function #72

Open
hepengli opened this issue May 31, 2021 · 9 comments
Open

Bug in Reward function #72

hepengli opened this issue May 31, 2021 · 9 comments

Comments

@hepengli
Copy link

hepengli commented May 31, 2021

I found a bug in the reward function in the file "./env/starcraft2/starcraft2.py", line 729. The bug is that when the enemies heal or regenerate shield, the allies will receive rewards. The location of the bug is in the function "reward_battle(self)", Line 729:

image

In Line 729, the argument "delta_enemy" is negative when the enemy heals or regenerates shield. However, Line 729 uses abs() to convert "delta_enemy + delta_deaths" to a positive reward. This means the allies are rewarded when the enemy heals.

One consequence of this problem is that the allies may only learn to hurt the enemies but never kill them so that they can receive rewards when the enemies heal afterwards. In that case, a policy can learn to increase rewards but never win the game.

@hepengli hepengli changed the title Reward for Protos has some problem Bug in Reward function Jun 28, 2021
@douglasrizzo
Copy link
Contributor

Just to be clear, the problem you mention happens when delta_enemy is negative with a larger magnitude than delta_deaths, which would mean that agents are rewarded by injuring enemies, while not killing them. In this scenario, delta_enemy + delta_deaths would be negative, but the absolute of that would be positive.

Since self.reward_only_positive means we want to ignore negative rewards, I believe the best solution is to take the maximum of 0 and delta_enemy + delta_deaths. It would behave more closely like the reward when self.reward_only_positive == False, but never being negative and ignoring ally damage and deaths.

if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally

@hepengli
Copy link
Author

hepengli commented Jul 9, 2021

Just to be clear, the problem you mention happens when delta_enemy is negative with a larger magnitude than delta_deaths, which would mean that agents are rewarded by injuring enemies, while not killing them. In this scenario, delta_enemy + delta_deaths would be negative, but the absolute of that would be positive.

Since self.reward_only_positive means we want to ignore negative rewards, I believe the best solution is to take the maximum of 0 and delta_enemy + delta_deaths. It would behave more closely like the reward when self.reward_only_positive == False, but never being negative and ignoring ally damage and deaths.

if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally

Yes! And this situation happens especially in Maps with Protoss, which can regenerate shields. And I found in the map "3s5z_vs_3s6z" that the allies can learn a strategy to increase the reward without winning the game. Specifically, the allies can learn a pattern to injure the enemies a little and immediately run away from them by hiding in a corner and waiting the enemies to recover, and then repeat. However, the problem can be solved when modifying the reward function to this:

if self.reward_only_positive:
reward = max(0, delta_enemy + delta_deaths) # shield regeneration
else:
reward = delta_enemy + delta_deaths - delta_ally

douglasrizzo added a commit to douglasrizzo/smac that referenced this issue Jul 9, 2021
@douglasrizzo
Copy link
Contributor

I sent a quick PR with this tiny change. Check if it solves the issues you have in your experiments.

@hepengli
Copy link
Author

hepengli commented Jul 9, 2021

Thanks! I think this will do. I will check this again and get back to you soon.

@samvelyan
Copy link
Collaborator

samvelyan commented Jul 19, 2021

Thanks both for pointing this out and for sending the PR #76. We're going to see how to best integrate this fix in the upcoming SMAC versions to avoid confusion. One issue we've noticed is that some people compare results using different SMAC/StarCraftII versions and report unfair comparisons between methods in their work.

@douglasrizzo
Copy link
Contributor

I have just watched a similar behavior happen in maps which have Medivacs (MMM and MMM2). They have a healing power and my agents have learned to wait for the Medivac to heal the last unit before killing it, sometimes at the expense of the match.

@hepengli
Copy link
Author

Yes! I've also noticed this situation happens in MMM and MMM2. But your solution by changing the only positive reward to "reward = max(0, delta_enemy + delta_deaths)" is able to fix this issue.

samvelyan added a commit to benellis3/smac that referenced this issue Apr 19, 2022
@xihuai18
Copy link

xihuai18 commented Jul 8, 2024

Why not merge #76 ?

@samvelyan
Copy link
Collaborator

Given how much the benchmark has been used by the community, fixing this issue now will result in unfair comparisons with existing work. Therefore, we will not merge it with the main branch in this repo.

If you really want to use that particular version of SMAC, you are welcome to use the branch of #76: https://github.com/douglasrizzo/smac/tree/patch-1. But you must make it clear that this is not the standard version of the benchmark when presenting those results.

Lastly, this issue is resolved in SMACv2, the second version of the benchmark which I encourage you to use instead: https://github.com/oxwhirl/smacv2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants