Refactor HER #351

Merged 45 commits on May 3, 2021
Commits
948005e
Start refactoring HER
araffin Mar 11, 2021
7574bfd
Fixes
araffin Mar 11, 2021
5c7db11
Additional fixes
araffin Mar 11, 2021
47b88da
Faster tests
araffin Mar 11, 2021
c67ffe0
WIP: HER as a custom replay buffer
araffin Mar 12, 2021
15d3e12
New replay only version (working with DQN)
araffin Mar 13, 2021
04ef2cf
Add support for all off-policy algorithms
araffin Mar 13, 2021
e94ced6
Fix saving/loading
araffin Mar 13, 2021
f0dfdc1
Remove ObsDictWrapper and add VecNormalize tests with dict
araffin Mar 13, 2021
e3875b5
Stable-Baselines3 v1.0 (#354)
araffin Mar 17, 2021
12c8be0
Merge branch 'master' into feat/dict_observations
araffin Mar 17, 2021
1e2eae6
Add gym-pybullet-drones project (#358)
JacopoPan Mar 19, 2021
e1ee87f
Include SuperSuit in projects (#359)
jkterry1 Mar 20, 2021
8a08078
Fix default arguments + add bugbear (#363)
araffin Mar 25, 2021
a4851b1
Merge branch 'master' into feat/dict_observations
araffin Mar 25, 2021
ba73d15
Add code of conduct + update doc (#373)
araffin Mar 31, 2021
c0966f3
Make installation command compatible with ZSH (#376)
tom-doerr Apr 2, 2021
c29c43b
Add handle timeouts param
araffin Apr 5, 2021
5166d51
Merge branch 'feat/dict_observations' into feat/refactor-her
araffin Apr 5, 2021
21bb70c
Fixes
araffin Apr 5, 2021
8606561
Fixes (buffer size, extend test)
araffin Apr 5, 2021
866afa9
Fix `max_episode_length` redefinition
araffin Apr 5, 2021
2f397df
Fix potential issue
araffin Apr 5, 2021
4138f96
Add some docs on dict obs
Miffyli Apr 5, 2021
4f12135
Merge branch 'master' into feat/dict_observations
araffin Apr 6, 2021
6bc42f9
Fix performance bug
araffin Apr 7, 2021
a46109d
Fix slowdown
araffin Apr 7, 2021
1ed15bf
Add package to install (#378)
tom-doerr Apr 10, 2021
6b42c96
Fix backward compat + add test
araffin Apr 13, 2021
a6d04fd
Fix VecEnv detection
araffin Apr 13, 2021
9f95a4b
Update doc
araffin Apr 13, 2021
3dc5493
Fix vec env check
araffin Apr 13, 2021
ddbe0e9
Support for `VecMonitor` for gym3-style environments (#311)
vwxyzjn Apr 13, 2021
7f28cdf
Reformat
araffin Apr 13, 2021
c430402
Fixed loading of ``ent_coef`` for ``SAC`` and ``TQC``, it was not opt…
araffin Apr 15, 2021
5d47296
Add test for GAE + rename `RolloutBuffer.dones` for clarification (#375)
araffin Apr 16, 2021
c69f7cd
Fixed saving of `A2C` and `PPO` policy when using gSDE (#401)
araffin Apr 19, 2021
613a141
Merge branch 'master' into feat/dict_observations
araffin Apr 21, 2021
22512a0
Merge branch 'feat/dict_observations' into feat/refactor-her
araffin Apr 21, 2021
b1861b3
Improve doc and replay buffer loading
araffin Apr 21, 2021
fecdfe8
Add support for images
araffin Apr 21, 2021
5bbb2f6
Fix doc
araffin Apr 21, 2021
83a8124
Update Procgen doc
araffin May 3, 2021
85cb6f5
Update changelog
araffin May 3, 2021
72049f9
Update docstrings
araffin May 3, 2021
128 changes: 128 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
antonin [dot] raffin [at] dlr [dot] de.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
15 changes: 9 additions & 6 deletions README.md
@@ -37,7 +37,7 @@ you can take a look at the issues [#48](https://github.com/DLR-RM/stable-baselin
| Type hints | :heavy_check_mark: |


### Planned features (v1.1+)
### Planned features

Please take a look at the [Roadmap](https://github.com/DLR-RM/stable-baselines3/issues/1) and [Milestones](https://github.com/DLR-RM/stable-baselines3/milestones).

@@ -49,11 +49,13 @@ A migration guide from SB2 to SB3 can be found in the [documentation](https://st

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## RL Baselines3 Zoo: A Collection of Trained RL Agents
## RL Baselines3 Zoo: A Training Framework for Stable Baselines3 Reinforcement Learning Agents

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo). is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.
[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL).

It also provides basic scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

In addition, it includes a collection of tuned hyperparameters for common environments and RL algorithms, and agents trained with those settings.

Goals of this repository:

@@ -92,6 +94,7 @@ Install the Stable Baselines3 package:
```
pip install stable-baselines3[extra]
```
**Note:** Some shells such as Zsh require quotation marks around brackets, i.e. `pip install 'stable-baselines3[extra]'` ([More Info](https://stackoverflow.com/a/30539963)).

This includes optional dependencies like Tensorboard, OpenCV or `atari-py` to train on Atari games. If you do not need those, you can use:
```
@@ -111,9 +114,9 @@ import gym

from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
env = gym.make("CartPole-v1")

model = PPO('MlpPolicy', env, verbose=1)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
4 changes: 2 additions & 2 deletions docs/README.md
@@ -6,9 +6,9 @@ This folder contains documentation for the RL baselines.
### Build the Documentation

#### Install Sphinx and Theme

Execute this command in the project root:
```
pip install sphinx sphinx-autobuild sphinx-rtd-theme
pip install -e .[docs]
```

#### Building the Docs
Binary file added docs/_static/img/net_arch.png
Binary file added docs/_static/img/sb3_loop.png
Binary file added docs/_static/img/sb3_policy.png
5 changes: 5 additions & 0 deletions docs/guide/callbacks.rst
@@ -185,6 +185,11 @@ It will save the best model if ``best_model_save_path`` folder is specified and
You can pass a child callback via the ``callback_on_new_best`` argument. It will be triggered each time there is a new best model.


.. warning::

    You need to make sure that ``eval_env`` is wrapped the same way as the training environment, for instance by using the ``VecTransposeImage`` wrapper if you have a channel-last image as input.
    The ``EvalCallback`` class outputs a warning if that is not the case.


.. code-block:: python

10 changes: 8 additions & 2 deletions docs/guide/custom_env.rst
@@ -13,6 +13,12 @@ That is to say, your environment must implement the following methods (and inher
channel-first or channel-last.


.. note::

Although SB3 supports both channel-last and channel-first images as input, we recommend using the channel-first convention when possible.
Under the hood, when a channel-last image is passed, SB3 uses a ``VecTransposeImage`` wrapper to re-order the channels.
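The axis re-ordering mentioned in the note can be illustrated with plain NumPy (a minimal sketch of the transpose itself; the real ``VecTransposeImage`` wrapper operates on vectorized environments):

```python
import numpy as np

# A channel-last (H, W, C) image observation, as returned by many gym environments
obs = np.zeros((84, 84, 3), dtype=np.uint8)

# The re-ordering performed under the hood: (H, W, C) -> (C, H, W)
transposed = np.transpose(obs, (2, 0, 1))
print(transposed.shape)  # (3, 84, 84)
```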



.. code-block:: python

@@ -29,9 +35,9 @@ That is to say, your environment must implement the following methods (and inher
            # They must be gym.spaces objects
            # Example when using discrete actions:
            self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
            # Example for using image as input (can be channel-first or channel-last):
            # Example for using image as input (channel-first; channel-last also works):
            self.observation_space = spaces.Box(low=0, high=255,
                                                shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
                                                shape=(N_CHANNELS, HEIGHT, WIDTH), dtype=np.uint8)

        def step(self, action):
            ...
112 changes: 108 additions & 4 deletions docs/guide/custom_policy.rst
@@ -3,8 +3,8 @@
Custom Policy Network
=====================

Stable Baselines3 provides policy networks for images (CnnPolicies)
and other type of input features (MlpPolicies).
Stable Baselines3 provides policy networks for images (CnnPolicies),
other types of input features (MlpPolicies), and multiple different inputs (MultiInputPolicies).


.. warning::
@@ -13,9 +13,49 @@ and other type of input features (MlpPolicies).
which handles bounds more correctly.


SB3 Policy
^^^^^^^^^^

Custom Policy Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^
SB3 networks are separated into two main parts (see figure below):

- A features extractor (usually shared between actor and critic when applicable, to save computation)
whose role is to extract features (i.e. convert to a feature vector) from high-dimensional observations, for instance, a CNN that extracts features from images.
This is the ``features_extractor_class`` parameter. You can change the default parameters of that features extractor
by passing a ``features_extractor_kwargs`` parameter.

- A (fully-connected) network that maps the features to actions/value. Its architecture is controlled by the ``net_arch`` parameter.
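The two-part split above can be sketched in plain PyTorch (a schematic illustration, not SB3's actual classes; the ``[64, 64]`` layer sizes are an assumption for the example):

```python
import torch as th
from torch import nn

obs_dim, n_actions = 4, 2

# Part 1: features extractor -- for vector observations this is just a Flatten layer
features_extractor = nn.Flatten()

# Part 2: fully-connected network mapping the features to actions/values,
# analogous to what the net_arch parameter controls
net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = th.zeros(1, obs_dim)  # a batch with a single observation
action_logits = net(features_extractor(obs))
print(action_logits.shape)  # torch.Size([1, 2])
```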


.. note::

All observations are first pre-processed (e.g. images are normalized, discrete obs are converted to one-hot vectors, ...) before being fed to the features extractor.
In the case of vector observations, the features extractor is just a ``Flatten`` layer.


.. image:: ../_static/img/net_arch.png


SB3 policies are usually composed of several networks (actor/critic networks + target networks when applicable) together
with the associated optimizers.

Each of these networks has a features extractor followed by a fully-connected network.

.. note::

When we refer to "policy" in Stable-Baselines3, this is usually an abuse of language compared to RL terminology.
In SB3, "policy" refers to the class that handles all the networks useful for training,
so not only the network used to predict actions (the "learned controller").



.. image:: ../_static/img/sb3_policy.png


.. .. figure:: https://cdn-images-1.medium.com/max/960/1*h4WTQNVIsvMXJTCpXm_TAw.gif


Custom Network Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^

One way of customising the policy network architecture is to pass arguments when creating the model,
using the ``policy_kwargs`` parameter:
@@ -109,6 +149,70 @@ that derives from ``BaseFeaturesExtractor`` and then pass it to the model when t
model.learn(1000)


Multiple Inputs and Dictionary Observations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Stable Baselines3 supports handling of multiple inputs by using a ``Dict`` Gym space. This can be done using
``MultiInputPolicy``, which by default uses the ``CombinedExtractor`` features extractor to turn multiple
inputs into a single vector, handled by the ``net_arch`` network.

By default, ``CombinedExtractor`` processes multiple inputs as follows:

1. If the input is an image (automatically detected, see ``common.preprocessing.is_image_space``), process it with the Nature Atari CNN and
output a latent vector of size ``64``.
2. If the input is not an image, flatten it (no layers).
3. Concatenate all previous vectors into one long vector and pass it to policy.

Much like above, you can define custom feature extractors. The following example assumes the environment has two keys in the
observation space dictionary: "image" is a (1, H, W) image, and "vector" is a (D,)-dimensional vector. We process "image" with a simple
downsampling and "vector" with a single linear layer.

.. code-block:: python

    import gym
    import torch as th
    from torch import nn

    from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

    class CustomCombinedExtractor(BaseFeaturesExtractor):
        def __init__(self, observation_space: gym.spaces.Dict):
            # We do not know features-dim here before going over all the items,
            # so put something dummy for now. PyTorch requires calling
            # nn.Module.__init__ before adding modules
            super(CustomCombinedExtractor, self).__init__(observation_space, features_dim=1)

            extractors = {}

            total_concat_size = 0
            # We need to know the size of the output of this extractor,
            # so go over all the spaces and compute output feature sizes
            for key, subspace in observation_space.spaces.items():
                if key == "image":
                    # We will just downsample one channel of the image by 4x4 and flatten.
                    # Assume the image is single-channel (subspace.shape[0] == 1)
                    extractors[key] = nn.Sequential(nn.MaxPool2d(4), nn.Flatten())
                    total_concat_size += (subspace.shape[1] // 4) * (subspace.shape[2] // 4)
                elif key == "vector":
                    # Run through a simple MLP
                    extractors[key] = nn.Linear(subspace.shape[0], 16)
                    total_concat_size += 16

            self.extractors = nn.ModuleDict(extractors)

            # Update the features dim manually
            self._features_dim = total_concat_size

        def forward(self, observations) -> th.Tensor:
            encoded_tensor_list = []

            # self.extractors contains nn.Modules that do all the processing.
            for key, extractor in self.extractors.items():
                encoded_tensor_list.append(extractor(observations[key]))
            # Return a (B, self._features_dim) PyTorch tensor, where B is the batch dimension.
            return th.cat(encoded_tensor_list, dim=1)


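To sanity-check the sizes involved, the concatenation logic can be exercised without an environment (a gym-free sketch using plain shape tuples in place of ``gym`` spaces):

```python
import torch as th
from torch import nn

# Shapes standing in for the "image" and "vector" subspaces of the example above
shapes = {"image": (1, 16, 16), "vector": (5,)}

extractors = nn.ModuleDict({
    # (1, 16, 16) -> MaxPool2d(4) -> (1, 4, 4) -> Flatten -> 16 features
    "image": nn.Sequential(nn.MaxPool2d(4), nn.Flatten()),
    "vector": nn.Linear(shapes["vector"][0], 16),
})
# 4 * 4 (downsampled image) + 16 (vector features) = 32
features_dim = (shapes["image"][1] // 4) * (shapes["image"][2] // 4) + 16

obs = {key: th.zeros(8, *shape) for key, shape in shapes.items()}
features = th.cat([extractor(obs[key]) for key, extractor in extractors.items()], dim=1)
print(features.shape)  # torch.Size([8, 32])
```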

On-Policy Algorithms
^^^^^^^^^^^^^^^^^^^^
3 changes: 3 additions & 0 deletions docs/guide/developer.rst
@@ -31,6 +31,9 @@ Each algorithm has two main methods:
- ``.train()`` which updates the parameters using samples from the buffer


.. image:: ../_static/img/sb3_loop.png


Where to start?
===============
