Environments
Checkers
The map contains apples and lemons. The first player (red) is very sensitive and scores 5 for the team for an apple (green square) and −5 for a lemon (yellow square). The second player (blue) is less sensitive and scores 1 for the team for an apple and −1 for a lemon. There is a wall of lemons between the players and the apples. Apples and lemons disappear when collected, and the environment resets when all apples are eaten. The key is that the sensitive agent should eat the apples, while the less sensitive agent should leave them to its teammate and instead clear the way by eating the obstructing lemons.
Reference Paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning (Section 4.2)
Action Space: 0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
Observation: agent coordinates + 3x3 mask around the agent + steps in the environment.
Versions:
Name | Description |
---|---|
Checkers-v0 | Each agent receives only its local observation |
Checkers-v1 | Each agent receives the local observations of all other agents |
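A minimal random-agent rollout sketch is shown below. The environment id `ma_gym:Checkers-v0` and the `n_agents` attribute are assumptions about how this repository registers and exposes the environment through the Gym API; adjust them to the actual registration.

```python
import gym

# Assumed entry point; the exact id depends on how this repo registers its envs.
env = gym.make('ma_gym:Checkers-v0')

obs_n = env.reset()                            # one observation per agent (red, blue)
done_n = [False for _ in range(env.n_agents)]  # n_agents is an assumed attribute
team_reward = 0

while not all(done_n):
    # one action per agent from {0: Down, 1: Left, 2: Up, 3: Right, 4: Noop}
    action_n = env.action_space.sample()
    obs_n, reward_n, done_n, info = env.step(action_n)
    team_reward += sum(reward_n)

env.close()
print('episode team reward:', team_reward)
```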
CrossOver
This is a grid-world environment with 2 agents (red and blue), where each agent wants to move to its corresponding home location (marked by a box outlined in the same colour). The challenging part of the game is the narrow corridor through which only one agent can pass at a time; the agents need to coordinate so as not to block the pathway for each other. A reward of +5 is given to each agent for reaching its home cell. The episode ends when both agents have reached their home cells or after a maximum of 100 steps in the environment.
Action Space: 0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
Observation: agent coordinates + steps in the environment.
Reference Paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning (Section 4.2)
Versions:
Name | Description |
---|---|
CrossOver-v0 | Each agent receives only its own position coordinates |
CrossOver-v1 | Each agent receives the position coordinates of all other agents |
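The two versions differ only in what each agent observes. A small sketch for inspecting that difference follows; the environment ids and the assumption that `observation_space` is a per-agent list of spaces reflect common multi-agent Gym wrappers and may need adjusting for this repository.

```python
import gym

# Both ids are assumptions about how this repo registers the environments.
for env_id in ('ma_gym:CrossOver-v0', 'ma_gym:CrossOver-v1'):
    env = gym.make(env_id)
    # v0: each agent observes only its own position coordinates (+ step count);
    # v1: each agent additionally observes the other agent's coordinates,
    # so its observation vector should be longer.
    print(env_id, [space.shape for space in env.observation_space])
    env.close()
```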
PredatorPrey
Predator-prey involves a grid world in which multiple predators attempt to capture randomly moving prey. Agents have a 5 × 5 view and select one of five actions ∈ {Left, Right, Up, Down, Stop} at each time step. Prey move by selecting a uniformly random action at each time step. We define the "catching" of a prey as the prey being in a cell cardinally adjacent to at least one predator. Each agent's observation includes its own coordinates, its agent ID, and the coordinates of the prey relative to itself, if observed. Because the agent ID is part of the observation, agents can take on separate roles even when the neural-network parameters are shared.

We test with two different grid worlds: (i) a 5 × 5 grid world with two predators and one prey, and (ii) a 7 × 7 grid world with four predators and two prey. We modify the general predator-prey task so that a positive reward is given only if multiple predators catch a prey simultaneously, requiring a higher degree of cooperation. The predators get a team reward of 1 if two or more of them catch a prey at the same time, but they are given a negative reward −P if only one predator catches the prey. We experiment with three values of P: 0.5, 1.0, and 1.5. The task terminates when a prey is caught by more than one predator. A caught prey is regenerated at a random position whenever the task terminates, and the game proceeds for a fixed 100 steps.
Action Space: 0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
Name | Description |
---|---|
PredatorPrey5x5-v0 | |
PredatorPrey5x5-v1 | |
PredatorPrey5x5-v2 | |
PredatorPrey5x5-v3 | |
The same versions apply to PredatorPrey7x7.
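The cooperative reward rule above can be restated as a small helper. This is only an illustrative sketch of the rule as described, not code taken from the environment implementation.

```python
def predator_team_reward(num_catching_predators: int, penalty_p: float) -> float:
    """Team reward for one prey at a single time step, restating the rule above.

    num_catching_predators: predators with the prey in a cardinally adjacent cell.
    penalty_p: the P value (0.5, 1.0 or 1.5 in the experiments above).
    """
    if num_catching_predators >= 2:
        return 1.0          # cooperative catch: shared team reward
    if num_catching_predators == 1:
        return -penalty_p   # lone catch is penalised
    return 0.0              # no catch

# A lone catch with P = 1.5 costs the team 1.5; a joint catch always earns 1.
assert predator_team_reward(1, 1.5) == -1.5
assert predator_team_reward(3, 0.5) == 1.0
```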
Fetch
This task tests whether two agents can synchronize their behaviour when picking up objects and returning them to a drop point. In the Fetch task, both players start on the same side of the map and have pickup points on the opposite side. A player scores 3 points for the team for a pick-up, and another 5 points for dropping off the item at the drop point near the starting position. The pickup then becomes available to either player again. It is optimal for the agents to cycle such that, when one player reaches the pickup point, the other returns to base, ready to pick up again.
Reference Paper: Learning Multiagent Communication with Backpropagation
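The Fetch scoring cycle can be summarised with a tiny helper; the constant names and the event representation below are illustrative assumptions, not repository code.

```python
PICKUP_REWARD = 3    # team points for picking up the item
DROPOFF_REWARD = 5   # team points for dropping it at the drop point

def fetch_team_score(events):
    """Sum the team score over a sequence of 'pickup' / 'dropoff' events."""
    rewards = {'pickup': PICKUP_REWARD, 'dropoff': DROPOFF_REWARD}
    return sum(rewards[event] for event in events)

# Two full pickup/drop-off cycles (one per player) earn 2 * (3 + 5) = 16 points.
print(fetch_team_score(['pickup', 'dropoff', 'pickup', 'dropoff']))
```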
Contributions are Welcome!