diff --git a/Gallery.md b/Gallery.md
index fa46a192..924b8ff4 100644
--- a/Gallery.md
+++ b/Gallery.md
@@ -24,12 +24,14 @@ Users are also welcome to contribute their own training examples and demos to th
-| Algorithm | Tags | Refs |
-|:-----------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|:-------------------------------:|
-| [PPO](https://arxiv.org/abs/1707.06347) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/cartpole/) |
-| [MAPPO](https://arxiv.org/abs/2103.01955) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
-| [JRPO](https://arxiv.org/abs/2302.07515) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
-| [MAT](https://arxiv.org/abs/2205.14953) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
+| Algorithm | Tags | Refs |
+|:-------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|:-------------------------------:|
+| [PPO](https://arxiv.org/abs/1707.06347) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/cartpole/) |
+| [PPO-continuous](https://arxiv.org/abs/1707.06347) | ![continuous](https://img.shields.io/badge/-continuous-green) | [code](./examples/mujoco/) |
+| [Dual-clip PPO](https://arxiv.org/abs/1912.09729) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/cartpole/) |
+| [MAPPO](https://arxiv.org/abs/2103.01955) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
+| [JRPO](https://arxiv.org/abs/2302.07515) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
+| [MAT](https://arxiv.org/abs/2205.14953) | ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
 
 ## Demo List
 
@@ -38,8 +40,9 @@ Users are also welcome to contribute their own training examples and demos to th
 
 | Environment/Demo | Tags | Refs |
 |:------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|:-------------------------------:|
-| [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/cartpole/) |
+| [MuJoCo](https://github.com/deepmind/mujoco) | ![continuous](https://img.shields.io/badge/-continuous-green) | [code](./examples/mujoco/) |
+| [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/cartpole/) |
 | [MPE: Simple Spread](https://pettingzoo.farama.org/environments/mpe/simple_spread/) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) ![MARL](https://img.shields.io/badge/-MARL-yellow) | [code](./examples/mpe/) |
 | [Super Mario Bros](https://github.com/Kautenja/gym-super-mario-bros) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/super_mario/) |
-| [Gym Retro](https://github.com/openai/retro) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/retro/) |
+| [Gym Retro](https://github.com/openai/retro) | ![discrete](https://img.shields.io/badge/-discrete-brightgreen) | [code](./examples/retro/) |
\ No newline at end of file
diff --git a/README.md b/README.md
index 69751b45..03736d50 100644
--- a/README.md
+++ b/README.md
@@ -71,12 +71,14 @@ Currently, the features supported by OpenRL include:
 
 Algorithms currently supported by OpenRL (for more details, please refer to [Gallery](./Gallery.md)):
 - [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347)
+- [Dual-clip PPO](https://arxiv.org/abs/1912.09729)
 - [Multi-agent PPO (MAPPO)](https://arxiv.org/abs/2103.01955)
 - [Joint-ratio Policy Optimization (JRPO)](https://arxiv.org/abs/2302.07515)
 - [Multi-Agent Transformer (MAT)](https://arxiv.org/abs/2205.14953)
 
 Environments currently supported by OpenRL (for more details, please refer to [Gallery](./Gallery.md)):
 - [Gymnasium](https://gymnasium.farama.org/)
+- [MuJoCo](https://github.com/deepmind/mujoco)
 - [MPE](https://github.com/openai/multiagent-particle-envs)
 - [Super Mario Bros](https://github.com/Kautenja/gym-super-mario-bros)
 - [Gym Retro](https://github.com/openai/retro)
diff --git a/README_zh.md b/README_zh.md
index c6b204b2..d2730aa4 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -56,12 +56,14 @@ OpenRL是一个开源的通用强化学习研究框架，支持单智能体、
 
 OpenRL目前支持的算法(更多详情请参考 [Gallery](Gallery.md)):
 - [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347)
+- [Dual-clip PPO](https://arxiv.org/abs/1912.09729)
 - [Multi-agent PPO (MAPPO)](https://arxiv.org/abs/2103.01955)
 - [Joint-ratio Policy Optimization (JRPO)](https://arxiv.org/abs/2302.07515)
 - [Multi-Agent Transformer (MAT)](https://arxiv.org/abs/2205.14953)
 
 OpenRL目前支持的环境(更多详情请参考 [Gallery](Gallery.md)):
 - [Gymnasium](https://gymnasium.farama.org/)
+- [MuJoCo](https://github.com/deepmind/mujoco)
 - [MPE](https://github.com/openai/multiagent-particle-envs)
 - [Super Mario Bros](https://github.com/Kautenja/gym-super-mario-bros)
 - [Gym Retro](https://github.com/openai/retro)
diff --git a/docs/images/cartpole.png b/docs/images/cartpole.png
new file mode 100644
index 00000000..c0f89a74
Binary files /dev/null and b/docs/images/cartpole.png differ
diff --git a/docs/images/cartpole_trained.gif b/docs/images/cartpole_trained.gif
deleted file mode 100644
index 97a7cb7b..00000000
Binary files a/docs/images/cartpole_trained.gif and /dev/null differ
diff --git a/docs/images/mujoco.png b/docs/images/mujoco.png
new file mode 100644
index 00000000..ef72f001
Binary files /dev/null and b/docs/images/mujoco.png differ
diff --git a/examples/cartpole/README.md b/examples/cartpole/README.md
index 67e5a0db..50e663f3 100644
--- a/examples/cartpole/README.md
+++ b/examples/cartpole/README.md
@@ -4,4 +4,11 @@ Users can train CartPole via:
 
 ```shell
 python train_ppo.py
+```
+
+
+To train with [Dual-clip PPO](https://arxiv.org/abs/1912.09729):
+
+```shell
+python train_ppo.py --config dual_clip_ppo.yaml
 ```
\ No newline at end of file
diff --git a/examples/cartpole/dual_clip_ppo.yaml b/examples/cartpole/dual_clip_ppo.yaml
new file mode 100644
index 00000000..8e682729
--- /dev/null
+++ b/examples/cartpole/dual_clip_ppo.yaml
@@ -0,0 +1,2 @@
+dual_clip_ppo: true
+dual_clip_coeff: 3.0
\ No newline at end of file
diff --git a/examples/cartpole/train_ppo.py b/examples/cartpole/train_ppo.py
index dd1ba632..5c089850 100644
--- a/examples/cartpole/train_ppo.py
+++ b/examples/cartpole/train_ppo.py
@@ -1,6 +1,7 @@
 """"""
 import numpy as np
 
+from openrl.configs.config import create_config_parser
 from openrl.envs.common import make
 from openrl.modules.common import PPONet as Net
 from openrl.runners.common import PPOAgent as Agent
@@ -10,7 +11,12 @@ def train():
     # create environment, set environment parallelism to 9
     env = make("CartPole-v1", env_num=9)
     # create the neural network
-    net = Net(env)
+    cfg_parser = create_config_parser()
+    cfg = cfg_parser.parse_args()
+    net = Net(
+        env,
+        cfg=cfg,
+    )
     # initialize the trainer
     agent = Agent(net)
     # start training, set total number of training steps to 20000
@@ -34,7 +40,8 @@ def evaluation(agent):
         action, _ = agent.act(obs, deterministic=True)
         obs, r, done, info = env.step(action)
         step += 1
-        print(f"{step}: reward:{np.mean(r)}")
+        if step % 50 == 0:
+            print(f"{step}: reward:{np.mean(r)}")
     env.close()
 
 
diff --git a/examples/mujoco/README.md b/examples/mujoco/README.md
new file mode 100644
index 00000000..6cf1b17b
--- /dev/null
+++ b/examples/mujoco/README.md
@@ -0,0 +1,9 @@
+## Installation
+
+`pip install mujoco`
+
+## Usage
+
+```shell
+python train_ppo.py
+```
\ No newline at end of file
diff --git a/examples/mujoco/train_ppo.py b/examples/mujoco/train_ppo.py
new file mode 100644
index 00000000..21b294c0
--- /dev/null
+++ b/examples/mujoco/train_ppo.py
@@ -0,0 +1,65 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# Copyright 2023 The OpenRL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""""""
+import numpy as np
+
+from openrl.envs.common import make
+from openrl.modules.common import PPONet as Net
+from openrl.runners.common import PPOAgent as Agent
+
+
+def train():
+    # create environment, set environment parallelism to 9
+    env = make("InvertedPendulum-v4", env_num=9)
+    # create the neural network
+    net = Net(env)
+    # initialize the trainer
+    agent = Agent(net)
+    # start training, set total number of training steps to 30000
+    agent.train(total_time_steps=30000)
+    env.close()
+    return agent
+
+
+def evaluation(agent):
+    # begin to test
+    # Create an environment for testing and set the number of environments to interact with to 9. Rendering is disabled (render_mode=None).
+    env = make("InvertedPendulum-v4", render_mode=None, env_num=9, asynchronous=False)
+
+    # The trained agent sets up the interactive environment it needs.
+    agent.set_env(env)
+    # Initialize the environment and get initial observations and environmental information.
+    obs, info = env.reset()
+
+    done = False
+    step = 0
+    total_reward = 0
+    while not np.any(done):
+        # Based on environmental observation input, predict next action.
+        action, _ = agent.act(obs, deterministic=True)
+        obs, r, done, info = env.step(action)
+        step += 1
+        if step % 100 == 0:
+            print(f"{step}: reward:{np.mean(r)}")
+        total_reward += np.mean(r)
+    env.close()
+    print(f"total reward: {total_reward}")
+
+
+if __name__ == "__main__":
+    agent = train()
+    evaluation(agent)
diff --git a/openrl/configs/config.py b/openrl/configs/config.py
index 0722b278..c5bd83fc 100644
--- a/openrl/configs/config.py
+++ b/openrl/configs/config.py
@@ -460,7 +460,6 @@ def create_config_parser():
     )
     parser.add_argument(
         "--dual_clip_ppo",
-        action="store_true",
         default=False,
         help="by default False, use dual-clip ppo.",
     )
diff --git a/openrl/envs/vec_env/base_venv.py b/openrl/envs/vec_env/base_venv.py
index 1511d350..a5c36348 100644
--- a/openrl/envs/vec_env/base_venv.py
+++ b/openrl/envs/vec_env/base_venv.py
@@ -23,6 +23,7 @@
 import gymnasium as gym
 import numpy as np
 
+from openrl.envs.vec_env.utils.numpy_utils import single_random_action
 from openrl.envs.vec_env.utils.util import tile_images
 
 IN_COLAB = "google.colab" in sys.modules
@@ -257,9 +258,10 @@ def random_action(self):
         """
         Get a random action from the action space
         """
+
         return np.array(
             [
-                [[self.action_space.sample()] for _ in range(self.agent_num)]
+                [single_random_action(self.action_space) for _ in range(self.agent_num)]
                 for _ in range(self.parallel_env_num)
             ]
         )
diff --git a/openrl/envs/vec_env/utils/numpy_utils.py b/openrl/envs/vec_env/utils/numpy_utils.py
index d3246e92..2f7875fa 100644
--- a/openrl/envs/vec_env/utils/numpy_utils.py
+++ b/openrl/envs/vec_env/utils/numpy_utils.py
@@ -20,6 +20,7 @@
     "concatenate",
     "create_empty_array",
     "iterate_action",
+    "single_random_action",
 ]
 
 
@@ -53,7 +54,7 @@ def _iterate_discrete(space, actions):
 @iterate_action.register(MultiDiscrete)
 @iterate_action.register(MultiBinary)
 def _iterate_base(space, actions):
-    raise NotImplementedError("Not implemented yet.")
+    return iter(actions)
 
 
 @iterate_action.register(Tuple)
@@ -205,3 +206,20 @@ def _create_empty_array_dict(space, n=1, agent_num=1, fn=np.zeros):
 @create_empty_array.register(Space)
 def _create_empty_array_custom(space, n=1, agent_num=1, fn=np.zeros):
     return None
+
+
+@singledispatch
+def single_random_action(space: Space) -> Union[tuple, dict, np.ndarray]:
+    raise ValueError(
+        f"Space of type `{type(space)}` is not a valid `gymnasium.Space` instance."
+    )
+
+
+@single_random_action.register(Discrete)
+def _single_random_action_discrete(space):
+    return [space.sample()]
+
+
+@single_random_action.register(Box)
+def _single_random_action_box(space):
+    return space.sample()
diff --git a/setup.py b/setup.py
index d8ea5af0..8b7ce1aa 100644
--- a/setup.py
+++ b/setup.py
@@ -38,6 +38,7 @@ def get_install_requires() -> list:
         "imageio",
         "opencv-python",
         "pygame",
+        "mujoco",
     ]
 
 
diff --git a/tests/test_examples/test_train_mujoco.py b/tests/test_examples/test_train_mujoco.py
new file mode 100644
index 00000000..cfd14db5
--- /dev/null
+++ b/tests/test_examples/test_train_mujoco.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# Copyright 2023 The OpenRL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""""""
+
+import os
+import sys
+
+import numpy as np
+import pytest
+
+from openrl.envs.common import make
+from openrl.modules.common import PPONet as Net
+from openrl.runners.common import PPOAgent as Agent
+
+
+@pytest.fixture(scope="module", params=[""])
+def config(request):
+    from openrl.configs.config import create_config_parser
+
+    cfg_parser = create_config_parser()
+    cfg = cfg_parser.parse_args(request.param.split())
+    return cfg
+
+
+@pytest.mark.unittest
+def test_train_mujoco(config):
+    env = make("InvertedPendulum-v4", env_num=9)
+    agent = Agent(Net(env, cfg=config))
+    agent.train(total_time_steps=30000)
+
+    agent.set_env(env)
+    obs, info = env.reset()
+    done = False
+    total_reward = 0
+    while not np.any(done):
+        action, _ = agent.act(obs, deterministic=True)
+        obs, r, done, info = env.step(action)
+        total_reward += np.mean(r)
+    assert total_reward >= 900, "InvertedPendulum-v4 should be solved."
+    env.close()
+
+
+if __name__ == "__main__":
+    sys.exit(pytest.main(["-sv", os.path.basename(__file__)]))
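For context, the `dual_clip_ppo` / `dual_clip_coeff` options introduced above refer to the dual-clip PPO objective of Ye et al., 2019 (https://arxiv.org/abs/1912.09729): for samples with negative advantage, the usual clipped surrogate is additionally bounded from below by `dual_clip_coeff * advantage`, so a single sample with a very large probability ratio cannot dominate the update (the coefficient must be greater than 1). The sketch below is illustrative only; it is not OpenRL's internal loss implementation, and the function and argument names are assumptions.

```python
import torch


def dual_clip_ppo_loss(ratio, advantages, clip_eps=0.2, dual_clip_coeff=3.0):
    """Minimal sketch of the dual-clip PPO surrogate loss.

    ratio:      pi_new(a|s) / pi_old(a|s), shape (batch,)
    advantages: advantage estimates, shape (batch,)
    """
    # standard PPO clipped surrogate (an objective to be maximized)
    surr = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # dual clip: when the advantage is negative, bound the surrogate from
    # below by dual_clip_coeff * advantage (requires dual_clip_coeff > 1)
    dual_surr = torch.max(surr, dual_clip_coeff * advantages)
    objective = torch.where(advantages < 0, dual_surr, surr)
    # negate because optimizers minimize
    return -objective.mean()
```

With `dual_clip_coeff: 3.0` as in `examples/cartpole/dual_clip_ppo.yaml`, the per-sample surrogate for negative-advantage transitions never falls below three times that advantage.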