Refactor HER #351

Merged 45 commits on May 3, 2021
Commits
948005e
Start refactoring HER
araffin Mar 11, 2021
7574bfd
Fixes
araffin Mar 11, 2021
5c7db11
Additional fixes
araffin Mar 11, 2021
47b88da
Faster tests
araffin Mar 11, 2021
c67ffe0
WIP: HER as a custom replay buffer
araffin Mar 12, 2021
15d3e12
New replay only version (working with DQN)
araffin Mar 13, 2021
04ef2cf
Add support for all off-policy algorithms
araffin Mar 13, 2021
e94ced6
Fix saving/loading
araffin Mar 13, 2021
f0dfdc1
Remove ObsDictWrapper and add VecNormalize tests with dict
araffin Mar 13, 2021
e3875b5
Stable-Baselines3 v1.0 (#354)
araffin Mar 17, 2021
12c8be0
Merge branch 'master' into feat/dict_observations
araffin Mar 17, 2021
1e2eae6
Add gym-pybullet-drones project (#358)
JacopoPan Mar 19, 2021
e1ee87f
Include SuperSuit in projects (#359)
jkterry1 Mar 20, 2021
8a08078
Fix default arguments + add bugbear (#363)
araffin Mar 25, 2021
a4851b1
Merge branch 'master' into feat/dict_observations
araffin Mar 25, 2021
ba73d15
Add code of conduct + update doc (#373)
araffin Mar 31, 2021
c0966f3
Make installation command compatible with ZSH (#376)
tom-doerr Apr 2, 2021
c29c43b
Add handle timeouts param
araffin Apr 5, 2021
5166d51
Merge branch 'feat/dict_observations' into feat/refactor-her
araffin Apr 5, 2021
21bb70c
Fixes
araffin Apr 5, 2021
8606561
Fixes (buffer size, extend test)
araffin Apr 5, 2021
866afa9
Fix `max_episode_length` redefinition
araffin Apr 5, 2021
2f397df
Fix potential issue
araffin Apr 5, 2021
4138f96
Add some docs on dict obs
Miffyli Apr 5, 2021
4f12135
Merge branch 'master' into feat/dict_observations
araffin Apr 6, 2021
6bc42f9
Fix performance bug
araffin Apr 7, 2021
a46109d
Fix slowdown
araffin Apr 7, 2021
1ed15bf
Add package to install (#378)
tom-doerr Apr 10, 2021
6b42c96
Fix backward compat + add test
araffin Apr 13, 2021
a6d04fd
Fix VecEnv detection
araffin Apr 13, 2021
9f95a4b
Update doc
araffin Apr 13, 2021
3dc5493
Fix vec env check
araffin Apr 13, 2021
ddbe0e9
Support for `VecMonitor` for gym3-style environments (#311)
vwxyzjn Apr 13, 2021
7f28cdf
Reformat
araffin Apr 13, 2021
c430402
Fixed loading of ``ent_coef`` for ``SAC`` and ``TQC``, it was not opt…
araffin Apr 15, 2021
5d47296
Add test for GAE + rename `RolloutBuffer.dones` for clarification (#375)
araffin Apr 16, 2021
c69f7cd
Fixed saving of `A2C` and `PPO` policy when using gSDE (#401)
araffin Apr 19, 2021
613a141
Merge branch 'master' into feat/dict_observations
araffin Apr 21, 2021
22512a0
Merge branch 'feat/dict_observations' into feat/refactor-her
araffin Apr 21, 2021
b1861b3
Improve doc and replay buffer loading
araffin Apr 21, 2021
fecdfe8
Add support for images
araffin Apr 21, 2021
5bbb2f6
Fix doc
araffin Apr 21, 2021
83a8124
Update Procgen doc
araffin May 3, 2021
85cb6f5
Update changelog
araffin May 3, 2021
72049f9
Update docstrings
araffin May 3, 2021
128 changes: 128 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
antonin [dot] raffin [at] dlr [dot] de.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
15 changes: 9 additions & 6 deletions README.md
@@ -37,7 +37,7 @@ you can take a look at the issues [#48](https://github.com/DLR-RM/stable-baselin
| Type hints | :heavy_check_mark: |


### Planned features (v1.1+)
### Planned features

Please take a look at the [Roadmap](https://github.com/DLR-RM/stable-baselines3/issues/1) and [Milestones](https://github.com/DLR-RM/stable-baselines3/milestones).

@@ -49,11 +49,13 @@ A migration guide from SB2 to SB3 can be found in the [documentation](https://st

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## RL Baselines3 Zoo: A Collection of Trained RL Agents
## RL Baselines3 Zoo: A Training Framework for Stable Baselines3 Reinforcement Learning Agents

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo). is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.
[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL).

It also provides basic scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

In addition, it includes a collection of tuned hyperparameters for common environments and RL algorithms, and agents trained with those settings.

Goals of this repository:

@@ -92,6 +94,7 @@ Install the Stable Baselines3 package:
```
pip install stable-baselines3[extra]
```
**Note:** Some shells such as Zsh require quotation marks around brackets, i.e. `pip install 'stable-baselines3[extra]'` ([More Info](https://stackoverflow.com/a/30539963)).

This includes optional dependencies like Tensorboard, OpenCV or `atari-py` to train on Atari games. If you do not need those, you can use:
```
@@ -111,9 +114,9 @@ import gym

from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
env = gym.make("CartPole-v1")

model = PPO('MlpPolicy', env, verbose=1)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
4 changes: 2 additions & 2 deletions docs/README.md
@@ -6,9 +6,9 @@ This folder contains documentation for the RL baselines.
### Build the Documentation

#### Install Sphinx and Theme

Execute this command in the project root:
```
pip install sphinx sphinx-autobuild sphinx-rtd-theme
pip install -e .[docs]
```

#### Building the Docs
Binary file added docs/_static/img/net_arch.png
Binary file added docs/_static/img/sb3_loop.png
Binary file added docs/_static/img/sb3_policy.png
5 changes: 5 additions & 0 deletions docs/guide/callbacks.rst
@@ -185,6 +185,11 @@ It will save the best model if ``best_model_save_path`` folder is specified and
You can pass a child callback via the ``callback_on_new_best`` argument. It will be triggered each time there is a new best model.


.. warning::

    You need to make sure that ``eval_env`` is wrapped the same way as the training environment, for instance by using the ``VecTransposeImage`` wrapper if you have a channel-last image as input.
    The ``EvalCallback`` class outputs a warning if that is not the case.


.. code-block:: python

10 changes: 8 additions & 2 deletions docs/guide/custom_env.rst
@@ -13,6 +13,12 @@ That is to say, your environment must implement the following methods (and inher
channel-first or channel-last.


.. note::

Although SB3 supports both channel-last and channel-first images as input, we recommend using the channel-first convention when possible.
Under the hood, when a channel-last image is passed, SB3 uses a ``VecTransposeImage`` wrapper to re-order the channels.
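The axis re-ordering mentioned in the note can be illustrated with plain NumPy (a minimal sketch of the transpose itself; the real ``VecTransposeImage`` wrapper operates on vectorized environments):

```python
import numpy as np

# A channel-last (H, W, C) image observation, as returned by many gym environments
obs = np.zeros((84, 84, 3), dtype=np.uint8)

# The re-ordering performed under the hood: (H, W, C) -> (C, H, W)
transposed = np.transpose(obs, (2, 0, 1))
print(transposed.shape)  # (3, 84, 84)
```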



.. code-block:: python

@@ -29,9 +35,9 @@ That is to say, your environment must implement the following methods (and inher
            # They must be gym.spaces objects
            # Example when using discrete actions:
            self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
            # Example for using image as input (can be channel-first or channel-last):
            # Example for using image as input (channel-first; channel-last also works):
            self.observation_space = spaces.Box(low=0, high=255,
                                                shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
                                                shape=(N_CHANNELS, HEIGHT, WIDTH), dtype=np.uint8)

        def step(self, action):
            ...
112 changes: 108 additions & 4 deletions docs/guide/custom_policy.rst
@@ -3,8 +3,8 @@
Custom Policy Network
=====================

Stable Baselines3 provides policy networks for images (CnnPolicies)
and other type of input features (MlpPolicies).
Stable Baselines3 provides policy networks for images (CnnPolicies),
other types of input features (MlpPolicies), and multiple different inputs (MultiInputPolicies).


.. warning::
@@ -13,9 +13,49 @@ and other type of input features (MlpPolicies).
which handles bounds more correctly.


SB3 Policy
^^^^^^^^^^

Custom Policy Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^
SB3 networks are separated into two main parts (see figure below):

- A features extractor (usually shared between actor and critic when applicable, to save computation)
whose role is to extract features (i.e. convert to a feature vector) from high-dimensional observations, for instance, a CNN that extracts features from images.
This is the ``features_extractor_class`` parameter. You can change the default parameters of that features extractor
by passing a ``features_extractor_kwargs`` parameter.

- A (fully-connected) network that maps the features to actions/value. Its architecture is controlled by the ``net_arch`` parameter.
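The two-part split above can be sketched in plain PyTorch (a schematic illustration, not SB3's actual classes; the ``[64, 64]`` layer sizes are an assumption for the example):

```python
import torch as th
from torch import nn

obs_dim, n_actions = 4, 2

# Part 1: features extractor -- for vector observations this is just a Flatten layer
features_extractor = nn.Flatten()

# Part 2: fully-connected network mapping the features to actions/values,
# analogous to what the net_arch parameter controls
net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = th.zeros(1, obs_dim)  # a batch with a single observation
action_logits = net(features_extractor(obs))
print(action_logits.shape)  # torch.Size([1, 2])
```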


.. note::

All observations are first pre-processed (e.g. images are normalized, discrete obs are converted to one-hot vectors, ...) before being fed to the features extractor.
In the case of vector observations, the features extractor is just a ``Flatten`` layer.


.. image:: ../_static/img/net_arch.png


SB3 policies are usually composed of several networks (actor/critic networks + target networks when applicable) together
with the associated optimizers.

Each of these networks has a features extractor followed by a fully-connected network.

.. note::

When we refer to "policy" in Stable-Baselines3, this is usually an abuse of language compared to RL terminology.
In SB3, "policy" refers to the class that handles all the networks useful for training,
so not only the network used to predict actions (the "learned controller").



.. image:: ../_static/img/sb3_policy.png


.. .. figure:: https://cdn-images-1.medium.com/max/960/1*h4WTQNVIsvMXJTCpXm_TAw.gif


Custom Network Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^

One way of customising the policy network architecture is to pass arguments when creating the model,
using the ``policy_kwargs`` parameter:
@@ -109,6 +149,70 @@ that derives from ``BaseFeaturesExtractor`` and then pass it to the model when t
model.learn(1000)


Multiple Inputs and Dictionary Observations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Stable Baselines3 supports handling of multiple inputs by using a ``Dict`` Gym space. This can be done using
``MultiInputPolicy``, which by default uses the ``CombinedExtractor`` features extractor to turn multiple
inputs into a single vector, handled by the ``net_arch`` network.

By default, ``CombinedExtractor`` processes multiple inputs as follows:

1. If the input is an image (automatically detected, see ``common.preprocessing.is_image_space``), process it with the Nature Atari CNN and
output a latent vector of size ``64``.
2. If the input is not an image, flatten it (no layers).
3. Concatenate all previous vectors into one long vector and pass it to policy.

Much like above, you can define custom feature extractors. The following example assumes the environment has two keys in the
observation space dictionary: "image" is a (1, H, W) image, and "vector" is a (D,)-dimensional vector. We process "image" with a simple
downsampling and "vector" with a single linear layer.

.. code-block:: python

    import gym
    import torch as th
    from torch import nn

    from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

    class CustomCombinedExtractor(BaseFeaturesExtractor):
        def __init__(self, observation_space: gym.spaces.Dict):
            # We do not know features-dim here before going over all the items,
            # so put something dummy for now. PyTorch requires calling
            # nn.Module.__init__ before adding modules
            super(CustomCombinedExtractor, self).__init__(observation_space, features_dim=1)

            extractors = {}

            total_concat_size = 0
            # We need to know the size of the output of this extractor,
            # so go over all the spaces and compute output feature sizes
            for key, subspace in observation_space.spaces.items():
                if key == "image":
                    # We will just downsample one channel of the image by 4x4 and flatten.
                    # Assume the image is single-channel (subspace.shape[0] == 1)
                    extractors[key] = nn.Sequential(nn.MaxPool2d(4), nn.Flatten())
                    total_concat_size += (subspace.shape[1] // 4) * (subspace.shape[2] // 4)
                elif key == "vector":
                    # Run through a simple MLP
                    extractors[key] = nn.Linear(subspace.shape[0], 16)
                    total_concat_size += 16

            self.extractors = nn.ModuleDict(extractors)

            # Update the features dim manually
            self._features_dim = total_concat_size

        def forward(self, observations) -> th.Tensor:
            encoded_tensor_list = []

            # self.extractors contains nn.Modules that do all the processing.
            for key, extractor in self.extractors.items():
                encoded_tensor_list.append(extractor(observations[key]))
            # Return a (B, self._features_dim) PyTorch tensor, where B is the batch dimension.
            return th.cat(encoded_tensor_list, dim=1)


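To sanity-check the sizes involved, the concatenation logic can be exercised without an environment (a gym-free sketch using plain shape tuples in place of ``gym`` spaces):

```python
import torch as th
from torch import nn

# Shapes standing in for the "image" and "vector" subspaces of the example above
shapes = {"image": (1, 16, 16), "vector": (5,)}

extractors = nn.ModuleDict({
    # (1, 16, 16) -> MaxPool2d(4) -> (1, 4, 4) -> Flatten -> 16 features
    "image": nn.Sequential(nn.MaxPool2d(4), nn.Flatten()),
    "vector": nn.Linear(shapes["vector"][0], 16),
})
# 4 * 4 (downsampled image) + 16 (vector features) = 32
features_dim = (shapes["image"][1] // 4) * (shapes["image"][2] // 4) + 16

obs = {key: th.zeros(8, *shape) for key, shape in shapes.items()}
features = th.cat([extractor(obs[key]) for key, extractor in extractors.items()], dim=1)
print(features.shape)  # torch.Size([8, 32])
```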

On-Policy Algorithms
^^^^^^^^^^^^^^^^^^^^
3 changes: 3 additions & 0 deletions docs/guide/developer.rst
@@ -31,6 +31,9 @@ Each algorithm has two main methods:
- ``.train()`` which updates the parameters using samples from the buffer


.. image:: ../_static/img/sb3_loop.png


Where to start?
===============
