Commit

Merge branch 'dev' into to_torch_numpy

duburcqa authored Jul 21, 2020
2 parents 2cb3a97 + 8c32d99 commit 5b2995d
Showing 30 changed files with 1,535 additions and 136 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -143,3 +143,5 @@ MUJOCO_LOG.TXT
*.pth
.vscode/
.DS_Store
*.zip
*.pstats
1 change: 1 addition & 0 deletions README.md
@@ -38,6 +38,7 @@ Here is Tianshou's other features:
- Support any type of environment state (e.g. a dict, a self-defined class, ...) [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#user-defined-environment-and-different-state-representation)
- Support customized training process [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#customize-training-process)
- Support n-step returns estimation for all Q-learning based algorithms
- Support multi-agent RL easily [Usage](https://tianshou.readthedocs.io/en/latest/tutorials/cheatsheet.html#multi-agent-reinforcement-learning)

In Chinese, Tianshou means divinely ordained, i.e., a gift one is born with. Tianshou is a reinforcement learning platform, and the RL algorithms do not learn from humans. So taking the name "Tianshou" means that there is no teacher to study with; the agents instead learn by themselves through constant interaction with the environment.

Binary file added docs/_static/images/marl.png
Binary file added docs/_static/images/tic-tac-toe.png
1 change: 1 addition & 0 deletions docs/contributor.rst
@@ -6,3 +6,4 @@ We always welcome contributions to help make Tianshou better. Below are an incom
* Jiayi Weng (`Trinkle23897 <https://github.com/Trinkle23897>`_)
* Minghao Zhang (`Mehooz <https://github.com/Mehooz>`_)
* Alexis Duburcq (`duburcqa <https://github.com/duburcqa>`_)
* Kaichao You (`youkaichao <https://github.com/youkaichao>`_)
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ Here is Tianshou's other features:
* Support any type of environment state (e.g. a dict, a self-defined class, ...): :ref:`self_defined_env`
* Support customized training process: :ref:`customize_training`
* Support n-step returns estimation :meth:`~tianshou.policy.BasePolicy.compute_nstep_return` for all Q-learning based algorithms
* Support multi-agent RL easily (a tutorial is available at :doc:`/tutorials/tictactoe`)

中文文档位于 https://tianshou.readthedocs.io/zh/latest/

@@ -71,6 +72,7 @@ Tianshou is still under development, you can also check out the documents in sta
tutorials/dqn
tutorials/concepts
tutorials/batch
tutorials/tictactoe
tutorials/trick
tutorials/cheatsheet

43 changes: 43 additions & 0 deletions docs/tutorials/cheatsheet.rst
@@ -244,3 +244,46 @@ But the state stored in the buffer may be a shallow-copy. To make sure each of y
def step(a):
    ...
    return copy.deepcopy(self.graph), reward, done, {}

.. _marl_example:

Multi-Agent Reinforcement Learning
----------------------------------

This is related to `Issue 121 <https://github.com/thu-ml/tianshou/issues/121>`_. The discussion is still ongoing.

With its flexible core APIs, Tianshou can support multi-agent reinforcement learning with minimal effort.

Currently, we support three types of multi-agent reinforcement learning paradigms:

1. Simultaneous move: at each timestep, all the agents take their actions simultaneously (example: MOBA games)

2. Cyclic move: players take their actions in turn (example: the game of Go)

3. Conditional move: at each timestep, the environment conditionally selects an agent to take an action (example: `Pig Game <https://en.wikipedia.org/wiki/Pig_(dice_game)>`_)

We mainly address these multi-agent RL problems by converting them into traditional RL formulations.

For simultaneous move, the solution is simple: we can just add a ``num_agent`` dimension to state, action, and reward. Nothing else is going to change.
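
For illustration only (the array names and shapes below are hypothetical, not part of Tianshou's API), stacking everything along a leading agent axis could look like this:
::

    import numpy as np

    num_agents, obs_dim = 3, 8

    # one timestep for all agents: stack along a leading agent axis
    state = np.zeros((num_agents, obs_dim))    # per-agent observations
    action = np.zeros(num_agents, dtype=int)   # one action per agent
    reward = np.zeros(num_agents)              # one reward per agent

    # the rest of the single-agent pipeline simply carries this extra axis:
    # next_state, reward, done, info = env.step(action)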

Types 2 & 3 (cyclic move and conditional move) can be unified into a single framework: at each timestep, the environment selects an agent with id ``agent_id`` to play. Since the multiple agents are usually wrapped into one object (which we call the "abstract agent"), we can pass ``agent_id`` to the abstract agent, leaving it to call the specific agent.

In addition, the legal actions in multi-agent RL often vary with the timestep (just as in Go), so the environment should also pass a legal action mask to the "abstract agent". The mask is a boolean array in which ``True`` marks available actions and ``False`` marks illegal actions at the current step. Below is a figure that illustrates the abstract agent.

.. image:: /_static/images/marl.png
    :align: center
    :height: 300

The above description gives rise to the following formulation of multi-agent RL:
::

    action = policy(state, agent_id, mask)
    (next_state, next_agent_id, next_mask), reward = env.step(action)
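
As a minimal sketch (the class below is illustrative and assumes one policy callable per agent; it is not Tianshou's actual interface), the "abstract agent" can dispatch on ``agent_id`` and apply the mask before picking an action:
::

    import numpy as np

    class AbstractAgent:
        """Hold one policy per agent, dispatch by agent_id, respect the mask."""

        def __init__(self, policies):
            self.policies = policies  # e.g. {agent_id: callable returning Q-values}

        def __call__(self, state, agent_id, mask):
            q = np.asarray(self.policies[agent_id](state), dtype=float)
            q[~np.asarray(mask, dtype=bool)] = -np.inf  # forbid illegal actions
            return int(np.argmax(q))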

By constructing a new state ``state_ = (state, agent_id, mask)``, we can essentially return to the typical formulation of RL:
::

    action = policy(state_)
    next_state_, reward = env.step(action)
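
Concretely, one possible wrapper that performs this bundling (a hypothetical helper, assuming the environment returns ``(state, agent_id, mask)`` and a reward as in the formulation above) lets any standard single-agent loop be reused unchanged:
::

    def wrap_step(env, action):
        """Bundle (state, agent_id, mask) into one observation object."""
        (state, agent_id, mask), reward = env.step(action)
        state_ = {"obs": state, "agent_id": agent_id, "mask": mask}
        return state_, reward

    # standard RL interaction loop on the wrapped state:
    # state_ = ...  # from env.reset(), wrapped the same way
    # action = policy(state_)
    # state_, reward = wrap_step(env, action)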

Following this idea, we write a tiny example of playing `Tic Tac Toe <https://en.wikipedia.org/wiki/Tic-tac-toe>`_ against a random player using a Q-learning algorithm. The tutorial is at :doc:`/tutorials/tictactoe`.
2 changes: 1 addition & 1 deletion docs/tutorials/dqn.rst
@@ -88,7 +88,7 @@ We use the defined ``net`` and ``optim``, with extra policy hyper-parameters, to

policy = ts.policy.DQNPolicy(net, optim,
discount_factor=0.9, estimation_step=3,
use_target_network=True, target_update_freq=320)
target_update_freq=320)
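
For context, a minimal sketch of how ``net`` and ``optim`` might be defined before this call (the network architecture and the CartPole-sized shapes below are illustrative assumptions, not taken from this diff):
::

    import numpy as np
    import torch
    from torch import nn

    state_shape, action_shape = (4,), (2,)  # e.g. CartPole-like dimensions

    class Net(nn.Module):
        def __init__(self, state_shape, action_shape):
            super().__init__()
            self.model = nn.Sequential(
                nn.Linear(int(np.prod(state_shape)), 128), nn.ReLU(inplace=True),
                nn.Linear(128, int(np.prod(action_shape))),
            )

        def forward(self, obs, state=None, info={}):
            # accept raw numpy observations and return (logits, hidden state)
            if not isinstance(obs, torch.Tensor):
                obs = torch.tensor(obs, dtype=torch.float)
            logits = self.model(obs.reshape(obs.shape[0], -1))
            return logits, state

    net = Net(state_shape, action_shape)
    optim = torch.optim.Adam(net.parameters(), lr=1e-3)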


Setup Collector