Standard API for multi-agent environments #934

Closed
nicomon24 opened this issue Mar 6, 2018 · 21 comments

Comments

@nicomon24

Hi everyone,
I'm developing a multi-agent env (multi-snake, from the latest Request for Research) and I thought that having a common API for multi-agent environments would be great. Here are some of my thoughts:

  • Real-time multi-agent environments: each player takes an action at every step call (e.g. multi-snake)
    • Step function should accept a vector of actions, while returning a vector of observations, vector of rewards, vector of done flags, vector of infos (as in https://github.com/openai/multiagent-particle-envs/blob/master/multiagent/environment.py)
    • Reset function should also return a vector of observations
    • We could support different observation and action spaces for different players (e.g. players have different visibility or different sets of possible actions) by providing a list of players in which each element is a tuple (observation_space, action_space, reward_range)
    • Render function should render the observation of a particular player
  • Turn-based multi-agent environments: e.g. go or tic-tac-toe
    • Step() should take a player index and an action (check internally if it's the current player's turn)
    • We could also support games in which a turn consists of multiple actions (e.g. Risk), by making the step function return a turn_end signal
    • Same issue of different observations/actions as before
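
A minimal sketch of the real-time variant described above, assuming a toy game in which each agent walks along a line toward a goal cell. The class name `SimultaneousLineEnv` and the plain tuples in `players` are hypothetical stand-ins (no Gym spaces are used here); only the list-in, list-out shape of `reset()` and `step()` comes from the proposal.

```python
from typing import Dict, List, Tuple


class SimultaneousLineEnv:
    """Toy simultaneous-move environment (hypothetical, not part of Gym).

    Every agent acts on every step. Action 0 = stay, 1 = move right.
    An agent is done once it reaches the last cell of the line.
    """

    def __init__(self, n_agents: int = 2, size: int = 5):
        self.n_agents = n_agents
        self.size = size
        # One (observation_space, action_space, reward_range) tuple per player.
        # Plain tuples stand in for real space objects in this sketch.
        self.players = [((0, size - 1), (0, 1), (0.0, 1.0)) for _ in range(n_agents)]
        self.positions: List[int] = [0] * n_agents

    def reset(self) -> List[int]:
        # Returns a vector of observations, one per agent.
        self.positions = [0] * self.n_agents
        return list(self.positions)

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], List[bool], List[Dict]]:
        if len(actions) != self.n_agents:
            raise ValueError("expected exactly one action per agent")
        obs_n, rew_n, done_n, info_n = [], [], [], []
        for i, action in enumerate(actions):
            if action not in (0, 1):
                raise ValueError(f"invalid action {action!r} for agent {i}")
            self.positions[i] = min(self.positions[i] + action, self.size - 1)
            at_goal = self.positions[i] == self.size - 1
            obs_n.append(self.positions[i])
            rew_n.append(1.0 if at_goal else 0.0)
            done_n.append(at_goal)
            info_n.append({})
        return obs_n, rew_n, done_n, info_n


env = SimultaneousLineEnv(n_agents=2, size=3)
obs_n = env.reset()
obs_n, rew_n, done_n, info_n = env.step([1, 0])  # agent 0 moves, agent 1 stays
```

Per-agent done flags let one agent finish while the others keep playing; whether the episode as a whole should end then is left to the calling code in this sketch.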

I'm currently working on this, so I wanted to discuss here if any idea comes up :)

@FirefoxMetzger
Contributor

I think we should differentiate between actual real-time games (where there is actual, continuous time) and games where multiple agents make moves in the same turn.

The latter is just a special case of a turn-based multi-agent environment where agents take turns proposing a move and the environment updates jointly after all proposals are available (to the environment).

I think it would also make implementation a lot simpler. You could append a "player=1" as optional input parameter to step, render, action_space (getter), observation_space (getter), [...] and have the environment worry about the logic behind it. From an agent's pov it would look a lot like the usual environments.

The only problem I see is reset. Since the agent(s) don't actually reset the environment but rather the learning algorithm does, it gets a bit clunky as to how to pass the initial observations to each agent.

Regarding your proposal for multi-agent environments, I think it should be up to the learning algorithm to decide which agent to give control over the current turn. Take self-play as an example, where there is no need to enforce which player's turn it is, as both are the same. Still, the environment needs to track turns to keep up with potentially different observations or actions.

The same line of thought goes for "multi-step" turn-based environments. It should be up to the algorithm to decide whom to give control over the current turn. If the environment features "multi-step" turns the algorithm has to take care of adhering to that (or not, depending on the setup).

@nicomon24
Author

I think we should differentiate between actual real-time games (where there is actual, continuous time) and games where multiple agents make moves in the same turn.

Good point, it would be great! Does gym have any environment with continuous time yet to take inspiration?

The latter is just a special case of a turn-based multi-agent environment where agents take turns proposing a move and the environment updates jointly after all proposals are available (to the environment).

So every agent makes a blocking call to the step method? Because it must wait for every agent before getting an observation

Regarding your proposal for multi-agent environments I think it should be up to the learning algorithm to decide which agent to give control over the current turn.

Should we not try to abstract "environment mechanics" from the particular algorithm chosen? This is obviously my personal idea :) For example, in self-play, the environment should not know anything about the fact that we have the same agent playing itself, while it should know that there are N players that must take turns or can take actions jointly.

@FirefoxMetzger
Contributor

FirefoxMetzger commented Mar 6, 2018

I gave this a bit more thought and have to revise my idea. I no longer think it makes sense to have a player=1 kind of parameter in either step(), render() or similar. In fact, the current API already supports (almost) all aspects discussed so far.

render() is actually render(self, mode="human"), and while "human", "rgb_array" and "ansi" have special meaning, nothing stops an environment from implementing mode="player_one" to render views for specific players if they are only partially observable and different. So that part can go unchanged.

Reset function should also return a vector of observations

Currently, I would agree. The orchestrating code will have to worry about getting observations to each agent. Again, there is no real need to change the existing API since multidimensional spaces are already supported (e.g. a tuple-space) =). reset() is something that the orchestrating code around the agent(s) is concerned with, not the actual agent. [I previously referred to that code as "learning algorithm" - very poor phrasing on my end, sorry]

step() is more interesting. I think there is no argument that observation and action are already flexible enough to support multi-agent environments as is. done doesn't need to be anything more than a flag, because its main purpose is to signal to the orchestrating code when an episode terminates.
The only thing stopping your suggestion is reward, because according to the spec it only supports float. Then again, this is Python and you can easily ignore this detail of the spec and return an array. That is probably the only thing that is missing from the spec to implement what you are looking for.

We could support different observation and action spaces for different players

The current API already supports that. The action space (and observation space) can change after every step, so using the "trick" of decomposing one step where every agent makes its move into multiple steps where agents get to "propose" moves can accomplish unique observation and action spaces.

So every agent makes a blocking call to the step method?

This forces people who want to use the API to write multi-threaded code. I think the costs will outweigh the benefit (but that is my opinion, feel free to disagree). Instead agents can split their "pick_action" and "update" parts (which many do anyway) and hand control back to the orchestrating code in between. That way each agent's "pick_action" can be executed and passed to the environment by calling step() N times (accounting for different observability and actions). On the N-th call the environment updates (which produces the actual reward(s) and observations) and the orchestrating code can pass the appropriate new observations and rewards to each agent's "update" part.
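
A rough sketch of that "propose N times, update on the N-th call" pattern, with no threads involved. Everything here is an assumption made for illustration: the class `ProposeThenCommitEnv`, the helper `run_round`, and the toy rule that the highest proposal earns a reward of 1.0. Returning a list of rewards deliberately ignores the float-only part of the spec, as mentioned above.

```python
class ProposeThenCommitEnv:
    """Buffers one proposal per agent and resolves them jointly (hypothetical)."""

    def __init__(self, n_agents, max_rounds=3):
        self.n_agents = n_agents
        self.max_rounds = max_rounds
        self.reset()

    def reset(self):
        self.pending = []            # proposals collected in the current round
        self.last_joint_action = ()  # observation: the previous joint action
        self.rounds = 0
        return self.last_joint_action

    def step(self, action):
        self.pending.append(action)
        info = {"proposals_received": len(self.pending)}
        if len(self.pending) < self.n_agents:
            # Still waiting for other agents: state is unchanged, no reward yet.
            return self.last_joint_action, None, False, info
        # N-th call: apply all proposals at once and compute per-agent rewards.
        best = max(self.pending)
        rewards = [1.0 if a == best else 0.0 for a in self.pending]
        self.last_joint_action = tuple(self.pending)
        self.pending = []
        self.rounds += 1
        done = self.rounds >= self.max_rounds
        return self.last_joint_action, rewards, done, info


def run_round(env, policies, observation):
    """Orchestrating code: one pick_action per agent, then hand back the joint result.

    Assumes len(policies) == env.n_agents, so the last step() call is the
    one that triggers the joint update.
    """
    for policy in policies:
        observation, rewards, done, _ = env.step(policy(observation))
    return observation, rewards, done


env = ProposeThenCommitEnv(n_agents=2, max_rounds=1)
obs = env.reset()
obs, rewards, done = run_round(env, [lambda o: 3, lambda o: 5], obs)
```

The orchestrating code can then pass `rewards[i]` and the new observation to agent i's "update" part.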

Does gym have any environment with continuous time yet to take inspiration?

I don't know actually. RL generally has a problem with continuous time as it is based on MDPs, which assume you get to make a sequence of discrete choices.

@nicomon24
Author

I don't know actually. RL generally has a problem with continuous time as it is based on MDPs, which assume you get to make a sequence of discrete choices.

Yeah, I was kinda reflecting on this, in fact any digital device will "discretize" time, so no worries.

About your "tricks", you're obviously right, there is almost nothing that cannot be done with the current standard environment (in fact, I already did it in my env :) ). Since I've seen different repos of multi-agent environments that use different and specific approaches, I was more interested in finding common "guidelines" for the creation of new multi-agent environments, in order to make them "consistent" with each other (I think the simple and standard interface of gym is in fact its main strength).

@cjm715
Contributor

cjm715 commented Jul 28, 2018

If I understand correctly, it looks like the normal environment can be used for the multi-agent case but breaks the requirements of the Env class for the reward. This could be solved by simply changing the requirements of the reward in the Env class to a generic object rather than strictly a float. I'm not sure how this change would propagate to code that assumes the reward is a float; those dependencies would have to be modified.

Alternatively, maybe a new Multiagent env class could be introduced. Also, it would be nice to have a few multiagent environments in this repository to demonstrate a Standard API for multiagent environments.

@ishan00

ishan00 commented Oct 14, 2018

@nicomon24, @FirefoxMetzger. What is the current status of this issue? Does gym support any multi-agent environment yet?

@nicomon24
Author

AFAIK the situation is still the same, no common strategy on multi-agent environments inside gym, while there are more multi-agent envs outside gym which can be used as inspiration (this tool is very cool if you want to look at some, just filter the multi-agent envs)

@gvgramazio

Beyond multi-agent environments, I think the general Env class could also be extended to account for the conditions under which decisions are taken: simultaneously, turn-based, or turn- and phase-based.

  • In Pong we have two players that act at the same time. In this case the observation could be shared by both of them and the action could be the concatenation of the actions of the two players. One smaller problem is the reward, which is currently defined as a float. The biggest problem is how the step method should be handled in order to take an input from both agents.
  • In chess, checkers, battleship, etc. we have turns. Each agent gets rewards and an observation from the step function, but it should also receive at least an observation (and maybe also a reward) when the other player makes a step. This could be a bit difficult to implement.
  • In poker, we have turns but also phases. This means that the observation_space changes at each phase of the game. With more than two players it is also possible that the number of currently active players (i.e. those who haven't folded yet) varies.

@VinQbator

+1
Keeping an eye on the topic. I'm currently dealing with some multi-agent experiments and trying to keep them compatible with gym and keras-rl. A standardized interface would be really nice to have.

@alsora

alsora commented Mar 23, 2019

+1
I am currently developing a Briscola RL framework (a popular Italian turn-based card game).
Now I would like to integrate it with gym in order to also test more advanced algorithms and have more standard APIs.

I'm not an expert on the gym APIs, but I think that the main issue here is that there are different possible "timings" for when to observe/act/get_reward.

In a standard, single agent, environment we have something like this

def step(self, action):
    done = self.act(action)
    state = self.observe()
    reward = self.get_reward()
    return (state, reward, done, {})

In multi-agents environments, the order of these operations may be different and sometimes it will be hard to encapsulate all in the same method.

In a chess-like environment, the state after action 1 will not be the state from which action 2 is executed (because another agent plays in between)

for agent in agents:
    state = env.observe(agent.id)
    action = agent.select_action(state)
    done = env.act(action)
    reward = env.get_reward()

Additionally, as in my specific case, the reward after a turn can only be computed after both agents have acted

for agent in agents:
    state = env.observe(agent.id)
    action = agent.select_action(state)
    done = env.act(action)

multi_rewards = env.get_reward()

@billtubbs

This is a good discussion. I'm also interested in multi-agent environments.

@alsora has identified one important issue:

I think that the main issue here is that there are different possible "timings" for when to observe/act/get_reward.

E.g. in Tic-Tac-Toe, player 1 may get a terminal reward after player 2's move.

I stumbled into this problem a few years ago when I implemented a tic-tac-toe game environment (as a learning exercise).

I can't see how that could be achieved with the current Gym method:

observation, reward, done, info = env.step(action)

If it's any help, the way I solved it in the end was to have three object types:

  • agents
  • environment (the game)
  • controller

Here is how it worked:

>>> game = TicTacToeGame()
>>> players = [HumanPlayer("Joe"), TDLearner("TD")]
>>> ctrl = GameController(game, players)
>>> ctrl.play()

With this structure in place, the GameController basically manages the communication of rewards so they can be passed to each agent immediately after they take an action or later after all agents made their move.
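
To make the agents / environment / controller split concrete, here is a small runnable sketch. It is not the tic-tac-toe code referred to above: `CountingGame`, `ScriptedPlayer` and the method names `select_action` and `receive_reward` are hypothetical, and a simpler counting game stands in for tic-tac-toe. It does show the same timing issue, since the losing player only learns its terminal reward after the opponent's final move.

```python
class CountingGame:
    """Players alternately add 1 or 2; whoever reaches `target` wins (toy game)."""

    def __init__(self, target=5):
        self.target = target
        self.total = 0
        self.winner = None

    @property
    def done(self):
        return self.winner is not None

    def act(self, player_index, amount):
        if amount not in (1, 2):
            raise ValueError("amount must be 1 or 2")
        self.total = min(self.total + amount, self.target)
        if self.total == self.target:
            self.winner = player_index


class ScriptedPlayer:
    """Stand-in for a learning agent: always adds the same amount."""

    def __init__(self, name, amount):
        self.name = name
        self.amount = amount
        self.rewards = []

    def select_action(self, observation):
        return self.amount

    def receive_reward(self, reward):
        self.rewards.append(reward)


class GameController:
    """Owns the turn order and decides when each player is told its reward."""

    def __init__(self, game, players):
        self.game = game
        self.players = players

    def play(self):
        while not self.game.done:
            for index, player in enumerate(self.players):
                self.game.act(index, player.select_action(self.game.total))
                if self.game.done:
                    break
        # Terminal rewards go to every player, including the one who did not
        # make the last move.
        for index, player in enumerate(self.players):
            player.receive_reward(1.0 if index == self.game.winner else -1.0)
        return self.game.winner


players = [ScriptedPlayer("Joe", 2), ScriptedPlayer("TD", 1)]
winner = GameController(CountingGame(target=5), players).play()
```

Because the controller, not the environment, delivers the rewards, the same game class works whether rewards are handed out after each move or only once everyone has acted.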

Hope that helps. If you do implement multiple agent environments it would be good to allow for the idea of agents that can communicate with each other or share the same value function for example.

@koulanurag
Contributor

I stumbled upon this discussion thread and wanted to share similar work: ma-gym

Please refer to the Usage Wiki for details on its compatibility with OpenAI Gym.

I hope this might be useful to others.

@stale

stale bot commented Oct 17, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 17, 2019

@AdamGleave
Contributor

I think this issue should remain open. It's a major gap in Gym's current API that will only become more acute over time with the renewed emphasis on multi-agent systems (OpenAI Five, AlphaStar, ...) in modern deep RL.

@yannbouteiller

yannbouteiller commented Jan 6, 2020

I have a kinda philosophical question on this matter.

Wandering across projects that use gym in MARL environments, I have noticed that people usually use Python lists for e.g. action_n, whereas rllib, which I use a lot, typically uses dictionaries (e.g. action_dict) for everything. What would you consider the proper way?

For example, in multiagent gym environments I found, the return of step(): obs_n, rew_n, done_n, info_n would be python lists, whereas in rllib they would be dictionaries with one entry for each agent name. This easily allows for variable number of agents over time, heterogeneous episode lengths, heterogeneous observation/action spaces etc.
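
To illustrate the difference, here is a minimal sketch of a dictionary-keyed step() in which agents have different episode lengths. It is only loosely modelled on the dictionary style described above and is not rllib's actual API: the class `DictKeyedEnv`, its `lifetimes` argument, and the aggregate `"__all__"` key as used here are assumptions of this example.

```python
class DictKeyedEnv:
    """Toy env where each named agent lives for its own number of steps."""

    def __init__(self, lifetimes):
        # e.g. {"a": 1, "b": 2}: agent "a" finishes after 1 step, "b" after 2.
        self.lifetimes = dict(lifetimes)
        self.steps_left = {}

    def reset(self):
        self.steps_left = dict(self.lifetimes)
        # One observation per active agent (here: its remaining steps).
        return dict(self.steps_left)

    def step(self, action_dict):
        unknown = set(action_dict) - set(self.steps_left)
        if unknown:
            raise KeyError(f"actions for inactive agents: {sorted(unknown)}")
        obs, rew, done, info = {}, {}, {}, {}
        for name, action in action_dict.items():
            self.steps_left[name] -= 1
            finished = self.steps_left[name] == 0
            obs[name] = self.steps_left[name]
            rew[name] = float(action)
            done[name] = finished
            info[name] = {}
            if finished:
                # Finished agents drop out; later dicts no longer mention them.
                del self.steps_left[name]
        done["__all__"] = not self.steps_left  # True once no agent is active
        return obs, rew, done, info


env = DictKeyedEnv({"a": 1, "b": 2})
obs = env.reset()
obs, rew, done, info = env.step({"a": 1, "b": 0})
obs, rew, done, info = env.step({"b": 1})  # only "b" is still active
```

With lists, the second call would need a placeholder action for the finished agent or a re-indexing convention; with dictionaries the key set itself says who is acting.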

Actually I asked this question to some OpenAI people at NeurIPS, but they didn't really know.

@dat-boris

I like that @yannbouteiller. Using an action dict like rllib's approach is sound [1] for arbitrary agents entering and exiting.

Would love to hear from others who have more experience with rllib's multi-agent environment, to see whether any use case is missing from rllib's API.

[1] https://ray.readthedocs.io/en/latest/rllib-env.html#multi-agent-and-hierarchical

@MouseAndKeyboard
Contributor

MouseAndKeyboard commented Jan 8, 2020

I've been thinking about this for a while as I've had multiple projects where having a standard multi-agent API would be great.
In order to achieve "real-time" multi-agent games and separate observations + rewards I think it would be necessary to take some form of "Server-to-Client" architecture in the API

  • The simulator (server) would run the environment and should execute the agents' (clients') actions as requested.
  • An important concept to note is that there has to be some implementation of a "default action". For example, in StarCraft II presumably the default action is doing nothing but, in contrast, in an environment where the agent is controlling a bio-inspired character (e.g. a humanoid), the default action should not be to 'do nothing' as that might mean making the character go limp.
  • A server-client architecture is also more logical because each agent will receive their own observations and rewards independent of others.
  • server-client also allows for agents to connect and disconnect which may help model real world situations where agents can join and leave a simulation.
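
A minimal in-process sketch of the "default action" idea from the list above, assuming a tick-driven simulator. There is no real networking here, and `TickServer` with its `connect`, `disconnect`, `submit` and `tick` methods is a hypothetical name set made up for this example.

```python
class TickServer:
    """Collects actions from connected agents; fills gaps with per-agent defaults."""

    def __init__(self):
        self.defaults = {}  # agent_id -> default action for that agent
        self.inbox = {}     # agent_id -> action submitted since the last tick

    def connect(self, agent_id, default_action):
        # Each agent registers its own default, since "do nothing" is not
        # a sensible default for every kind of agent.
        self.defaults[agent_id] = default_action

    def disconnect(self, agent_id):
        self.defaults.pop(agent_id, None)
        self.inbox.pop(agent_id, None)

    def submit(self, agent_id, action):
        if agent_id not in self.defaults:
            raise KeyError(f"agent {agent_id!r} is not connected")
        self.inbox[agent_id] = action

    def tick(self):
        """Build the joint action for this tick and clear the inbox."""
        joint = {
            agent_id: self.inbox.get(agent_id, default)
            for agent_id, default in self.defaults.items()
        }
        self.inbox.clear()
        return joint


server = TickServer()
server.connect("marine", default_action="noop")
server.connect("humanoid", default_action="hold_pose")
server.submit("marine", "attack")
joint_action = server.tick()  # the humanoid did not submit, so it holds its pose
```

The joint action returned by `tick()` could then be passed to an ordinary synchronous step(), which keeps the environment itself unaware of who was late or disconnected.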

@yannbouteiller

I've been thinking about this for a while as I've had multiple projects where having a standard multi-agent API would be great.
In order to achieve "real-time" multi-agent games and separate observations + rewards I think it would be necessary to take some form of "Server-to-Client" architecture in the API

  • The simulator (server) would run the environment and should execute the agents' (clients') actions as requested.
  • An important concept to note is that there has to be some implementation of a "default action". For example, in StarCraft II presumably the default action is doing nothing but, in contrast, in an environment where the agent is controlling a bio-inspired character (e.g. a humanoid), the default action should not be to 'do nothing' as that might mean making the character go limp.
  • A server-client architecture is also more logical because each agent will receive their own observations and rewards independent of others.
  • server-client also allows for agents to connect and disconnect which may help model real world situations where agents can join and leave a simulation.

All the RL algorithms I know assume a time-step, even in the real time setting, and many MARL algorithms use the global information during training with a global time-step (e.g. centralized critics). I have been thinking about this as well but I don't really see the point of a client-server asynchronous architecture over the current synchronous way of using step() considering these facts?

@MouseAndKeyboard
Contributor

You're leaving it up to the person writing the agents to call step at periodic intervals. Ideally you would want this to be handled by the environment, otherwise sharing gyms would also require the user to write new code each time.

Perhaps a client-server architecture might be overkill, but it naively seems quite intuitive/user friendly.

@eager-seeker

eager-seeker commented Feb 6, 2020

I don't really see the point

calling step at periodic intervals

it would be good to allow for the idea of agents that can communicate with each other

Allowing between-env-step inter-agent communication (if, indeed, this is a valid MARL approach) is a possible use-case for asynchronous interactions and multi-node, distributed arrangements. I've tried this with SPADE but have been having to figure out some scale and memory issues. I'd tried some of the gym environment approaches mentioned above (thanks, contributors), but with a single coordinator game-master or tournament-master agent (context is games in this case). This is how I've managed to allow for the inter-agent interactions between steps. Has anyone found a better solution for allowing distributed agents to interact with a central multi-agent gym environment, while allowing for such async interactions between steps?

@jkterry1
Collaborator

Check out https://github.com/PettingZoo-Team/PettingZoo. It's linked from the readme now too.
