Add RL (PaddlePaddle#163)
ceci3 authored Apr 22, 2020
1 parent 823ca6b commit d21b2c6
Showing 16 changed files with 1,394 additions and 3 deletions.
54 changes: 54 additions & 0 deletions docs/zh_cn/api_cn/custom_rl_controller.md
@@ -0,0 +1,54 @@
# How to Customize a Reinforcement Learning Controller Externally

First, import the necessary dependencies:
```python
### Import paddle.fluid, the RL controller base class, and the registry class
import paddle.fluid as fluid
from paddleslim.common.RL_controller.utils import RLCONTROLLER
from paddleslim.common.RL_controller import RLBaseController
```

Register the custom RL controller with PaddleSlim via the decorator. After inheriting from the base class, you need to override its `next_tokens` and `update` methods. Note: this example only illustrates the essential steps and cannot be run as-is; for the complete code, please refer to [here]().

```python
### Note: the class name must be in all uppercase letters
@RLCONTROLLER.register
class LSTM(RLBaseController):
    def __init__(self, range_tables, use_gpu=False, **kwargs):
        ### range_tables specifies the value range of the tokens
        self.range_tables = range_tables
        ### use_gpu indicates whether the controller is trained on GPU
        self.use_gpu = use_gpu
        ### Define any parameters required by the RL algorithm
        ...
        ### Build the programs: _build_program constructs two programs,
        ### pred_program and learn_program, and initializes their parameters
        self._build_program()
        self.place = fluid.CUDAPlace(0) if self.use_gpu else fluid.CPUPlace()
        self.exe = fluid.Executor(self.place)
        self.exe.run(fluid.default_startup_program())

        ### Store the parameters in a dict. The server maintains and updates this
        ### dict centrally, because several clients may update the same parameters
        ### concurrently, so this step is required. Since pred_program and
        ### learn_program share the same parameters, only the parameters of
        ### learn_program need to be put into the dict.
        self.param_dicts = {}
        self.param_dicts.update(
            {self.learn_program: self.get_params(self.learn_program)})

    def next_tokens(self, states, params_dict):
        ### Assign the parameter dict fetched from the server to the program used here
        self.set_params(self.pred_program, params_dict, self.place)
        ### Build the input from states
        self.num_archs = states
        feed_dict = self._create_input()
        ### Fetch the current tokens
        actions = self.exe.run(self.pred_program, feed=feed_dict, fetch_list=self.tokens)
        ...
        return actions

    def update(self, rewards, params_dict=None):
        ### Assign the parameter dict fetched from the server to the program used here
        self.set_params(self.learn_program, params_dict, self.place)
        ### Build the input from the states in `next_tokens` and the rewards in `update`
        feed_dict = self._create_input(is_test=False, actual_rewards=rewards)
        ### Compute the loss of the current step
        loss = self.exe.run(self.learn_program, feed=feed_dict, fetch_list=[self.loss])
        ### Fetch the parameters of the current program and return them; the client
        ### sends this round's parameters to the server for the update
        params_dict = self.get_params(self.learn_program)
        return params_dict
```
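
Once registered, the controller can be selected by name through the `RLNAS` interface. The following is a minimal usage sketch, assuming the registered class is the `LSTM` controller defined above and that `key` is matched against the registered class name, as in the RLNAS examples in the API docs:

```python
### Minimal usage sketch (assumption: `key` selects the registered controller
### by its class name, as in the RLNAS examples below).
from paddleslim.nas import RLNAS

config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)

archs = rlnas.next_archs(1)   # the controller samples one architecture
rlnas.reward(0.5)             # feed back the reward for the sampled tokens
```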
154 changes: 153 additions & 1 deletion docs/zh_cn/api_cn/nas_api.rst
@@ -1,4 +1,4 @@
SA-NAS
NAS
========

Configuration of search space parameters
@@ -160,3 +160,155 @@ SANAS (Simulated Annealing Neural Architecture Search) is based on simulated annealing
sanas = SANAS(configs=config)
print(sanas.current_info())
RLNAS
------

.. py:class:: paddleslim.nas.RLNAS(key, configs, use_gpu=False, server_addr=("", 8881), is_server=True, is_sync=False, save_controller=None, load_controller=None, **kwargs)
`Source code <>`_

RLNAS (Reinforcement Learning Neural Architecture Search) is a model architecture search algorithm based on reinforcement learning.

**Parameters:**

- **key(str)** - Name of the reinforcement learning controller to use. PaddleSlim currently supports `LSTM` and `DDPG`; for a custom RL controller please refer to `Custom RL Controller <>`_.
- **configs(list<tuple>)** - Search space configuration list in the format ``[(key, {input_size, output_size, block_num, block_mask})]`` or ``[(key)]`` (the MobileNetV2, MobileNetV1 and ResNet search spaces use the same structure as the original networks, so only ``key`` needs to be specified). ``input_size`` and ``output_size`` are the sizes of the input and output feature maps, ``block_num`` is the number of blocks in the searched network, and ``block_mask`` is a list of 0s and 1s, where 0 denotes a block without downsampling and 1 a block with downsampling. More search space configurations provided by PaddleSlim can be found in `Search Space <../search_space.md>`_.
- **use_gpu(bool)** - Whether to train the controller on GPU. Default: False.
- **server_addr(tuple)** - Address of the controller in RLNAS, including the server IP address and port. If the IP address is None or "", the local IP is used. Default: ("", 8881).
- **is_server(bool)** - Whether the current instance starts a server. Default: True.
- **is_sync(bool)** - Whether to update the controller in synchronous mode; this only makes a difference with multiple clients. Default: False.
- **save_controller(str|None)** - Directory in which to save the controller checkpoint; if None, no checkpoint is saved. Default: None.
- **load_controller(str|None)** - Directory from which to load the controller checkpoint; if None, no checkpoint is loaded. Default: None.
- **\*\*kwargs** - Additional arguments determined by the specific RL algorithm; see the note for the additional arguments of `LSTM` and `DDPG`.

.. note::

Additional arguments for the `LSTM` algorithm:

- lstm_num_layers(int, optional): Number of stacked LSTM layers in the controller. Default: 1.
- hidden_size(int, optional): Hidden size of the LSTM. Default: 100.
- temperature(float, optional): Whether to apply temperature averaging when computing each token. Default: None.
- tanh_constant(float, optional): Whether to apply a tanh activation when computing each token and multiply the result by `tanh_constant`. Default: None.
- decay(float, optional): Smoothing rate of the rewards baseline recorded in the LSTM. Default: 0.99.
- weight_entropy(float, optional): Whether to add the weighted cross-entropy of the token computation to the received rewards when updating the controller parameters. Default: None.
- controller_batch_size(int, optional): Batch size of the controller, i.e. how many tokens are obtained each time the controller runs. Default: 1.


Additional arguments for the `DDPG` algorithm:
Note: to use the `DDPG` algorithm, parl must be installed: pip install parl

- obs_dim(int): Dimension of the observation.
- model(class, optional): The model used by the DDPG algorithm, usually a class containing an actor_model and a critic_model. It must implement two methods: policy, which returns the policy, and value, which returns the Q value. You can refer to the default `model <>`_ to implement your own. Default: `default_ddpg_model`.
- actor_lr(float, optional): Learning rate of the actor network. Default: 1e-4.
- critic_lr(float, optional): Learning rate of the critic network. Default: 1e-3.
- gamma(float, optional): Discount factor applied to received rewards. Default: 0.99.
- tau(float, optional): Decay factor used when softly syncing the models' parameters to the target_model in DDPG. Default: 0.001.
- memory_size(int, optional): Size of the replay memory that stores history in DDPG. Default: 10.
- reward_scale(float, optional): Scale factor applied to rewards when they are stored in the replay memory. Default: 0.1.
- controller_batch_size(int, optional): Batch size of the controller, i.e. how many tokens are obtained each time the controller runs. Default: 1.
- actions_noise(class, optional): Noise added to the action obtained from DDPG; no noise is added when set to False or None. Default: default_noise.
..
**Returns:**
An instance of the RLNAS class.

**Example:**

.. code-block:: python
from paddleslim.nas import RLNAS
config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)
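
The algorithm-specific arguments listed in the note are passed through as ``**kwargs``. A minimal sketch (the hyper-parameter values below are illustrative only; for the ``DDPG`` controller, parl must be installed and ``obs_dim`` is required):

.. code-block:: python

    from paddleslim.nas import RLNAS

    config = [('MobileNetV2Space')]
    ### LSTM controller with its optional hyper-parameters spelled out
    rlnas = RLNAS(
        key='lstm',
        configs=config,
        lstm_num_layers=1,
        hidden_size=100,
        controller_batch_size=1)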
.. py:method:: next_archs(obs=None)
Get the next batch of model architectures.
**Parameters:**
- **obs(int|np.array)** - Number of architectures to fetch, or the observations of the current models.
**Returns:**
A list of model architecture instances.
**Example:**
.. code-block:: python
import paddle.fluid as fluid
from paddleslim.nas import RLNAS
config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)
input = fluid.data(name='input', shape=[None, 3, 32, 32], dtype='float32')
archs = rlnas.next_archs(1)
for arch in archs:
output = arch(input)
input = output
print(output)
.. py:method:: reward(rewards, **kwargs)
Return the rewards of the current model architectures to the controller.
**Parameters:**
- **rewards(float|list<float>)** - Rewards of the current models; higher is better.
- **\*\*kwargs** - Additional arguments depending on the specific RL algorithm.
**Example:**
.. code-block:: python
import paddle.fluid as fluid
from paddleslim.nas import RLNAS
config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)
rlnas.next_archs(1)
rlnas.reward(1.0)
.. note::
The reward step must be executed after `next_archs`.
..
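
Taken together, ``next_archs`` and ``reward`` form the search loop. A minimal sketch of one possible loop (the reward below is only a placeholder; in practice each sampled architecture is trained and evaluated to produce its score):

.. code-block:: python

    from paddleslim.nas import RLNAS

    config = [('MobileNetV2Space')]
    rlnas = RLNAS(key='lstm', configs=config)

    for step in range(3):
        archs = rlnas.next_archs(1)
        ### train and evaluate the sampled architectures here,
        ### then report the resulting score as the reward
        score = 0.1 * step
        rlnas.reward(score)
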
.. py:method:: final_archs(batch_obs)
Get the final model architectures. Typically, after the controller has been trained, dozens of architectures are fetched for full experiments.

**Parameters:**

- **batch_obs(int|np.array)** - Number of architectures to fetch, or the observations of the current models.

**Returns:**
A list of model architecture instances.

**Example:**

.. code-block:: python
import paddle.fluid as fluid
from paddleslim.nas import RLNAS
config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)
archs = rlnas.final_archs(10)
.. py:method:: tokens2arch(tokens)
Get the actual model architecture from a set of tokens; this is usually used to convert the best searched tokens into the architecture for final training. The tokens are a list that is mapped onto the search space to build the corresponding network; a given set of tokens corresponds to exactly one network architecture.

**Parameters:**

- **tokens(list)** - A set of tokens. The length and value range of the tokens depend on the search space.

**Returns:**
A list of model architecture instances built from the given tokens.

**Example:**

.. code-block:: python
import paddle.fluid as fluid
from paddleslim.nas import RLNAS
config = [('MobileNetV2Space')]
rlnas = RLNAS(key='lstm', configs=config)
input = fluid.data(name='input', shape=[None, 3, 32, 32], dtype='float32')
tokens = ([0] * 25)
archs = rlnas.tokens2arch(tokens)[0]
print(archs(input))
157 changes: 157 additions & 0 deletions paddleslim/common/RL_controller/DDPG/DDPGController.py
@@ -0,0 +1,157 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
import parl
from parl import layers
from paddle import fluid
from ..utils import RLCONTROLLER, action_mapping
from ...controller import RLBaseController
from .ddpg_model import DefaultDDPGModel as default_ddpg_model
from .noise import AdaptiveNoiseSpec as default_noise
from parl.utils import ReplayMemory

__all__ = ['DDPG']


class DDPGAgent(parl.Agent):
def __init__(self, algorithm, obs_dim, act_dim):
assert isinstance(obs_dim, int)
assert isinstance(act_dim, int)
self.obs_dim = obs_dim
self.act_dim = act_dim
super(DDPGAgent, self).__init__(algorithm)

# Attention: In the beginning, sync target model totally.
self.alg.sync_target(decay=0)

def build_program(self):
self.pred_program = fluid.Program()
self.learn_program = fluid.Program()

with fluid.program_guard(self.pred_program):
obs = layers.data(
name='obs', shape=[self.obs_dim], dtype='float32')
self.pred_act = self.alg.predict(obs)

with fluid.program_guard(self.learn_program):
obs = layers.data(
name='obs', shape=[self.obs_dim], dtype='float32')
act = layers.data(
name='act', shape=[self.act_dim], dtype='float32')
reward = layers.data(name='reward', shape=[], dtype='float32')
next_obs = layers.data(
name='next_obs', shape=[self.obs_dim], dtype='float32')
terminal = layers.data(name='terminal', shape=[], dtype='bool')
_, self.critic_cost = self.alg.learn(obs, act, reward, next_obs,
terminal)

def predict(self, obs):
obs = np.expand_dims(obs, axis=0)
act = self.fluid_executor.run(self.pred_program,
feed={'obs': obs},
fetch_list=[self.pred_act])[0]
return act

def learn(self, obs, act, reward, next_obs, terminal):
feed = {
'obs': obs,
'act': act,
'reward': reward,
'next_obs': next_obs,
'terminal': terminal
}
critic_cost = self.fluid_executor.run(self.learn_program,
feed=feed,
fetch_list=[self.critic_cost])[0]
self.alg.sync_target()
return critic_cost


@RLCONTROLLER.register
class DDPG(RLBaseController):
def __init__(self, range_tables, use_gpu=False, **kwargs):
self.use_gpu = use_gpu
self.range_tables = range_tables - np.asarray(1)
self.act_dim = len(self.range_tables)
self.obs_dim = kwargs.get('obs_dim')
        self.model = kwargs.get('model', default_ddpg_model)
        self.actor_lr = kwargs.get('actor_lr', 1e-4)
        self.critic_lr = kwargs.get('critic_lr', 1e-3)
        self.gamma = kwargs.get('gamma', 0.99)
        self.tau = kwargs.get('tau', 0.001)
        self.memory_size = kwargs.get('memory_size', 10)
        self.reward_scale = kwargs.get('reward_scale', 0.1)
        self.batch_size = kwargs.get('controller_batch_size', 1)
        self.actions_noise = kwargs.get('actions_noise', default_noise)
self.action_dist = 0.0
self.place = fluid.CUDAPlace(0) if self.use_gpu else fluid.CPUPlace()

model = self.model(self.act_dim)

if self.actions_noise:
self.actions_noise = self.actions_noise()

algorithm = parl.algorithms.DDPG(
model,
gamma=self.gamma,
tau=self.tau,
actor_lr=self.actor_lr,
critic_lr=self.critic_lr)
self.agent = DDPGAgent(algorithm, self.obs_dim, self.act_dim)
self.rpm = ReplayMemory(self.memory_size, self.obs_dim, self.act_dim)

self.pred_program = self.agent.pred_program
self.learn_program = self.agent.learn_program
self.param_dict = self.get_params(self.learn_program)

def next_tokens(self, obs, params_dict, is_inference=False):
batch_obs = np.expand_dims(obs, axis=0)
self.set_params(self.pred_program, params_dict, self.place)
actions = self.agent.predict(batch_obs.astype('float32'))
### add noise to action
if self.actions_noise and is_inference == False:
actions_noise = np.clip(
np.random.normal(
actions, scale=self.actions_noise.stdev_curr),
-1.0,
1.0)
self.action_dist = np.mean(np.abs(actions_noise - actions))
else:
actions_noise = actions
actions_noise = action_mapping(actions_noise, self.range_tables)
return actions_noise

def _update_noise(self, actions_dist):
self.actions_noise.update(actions_dist)

def update(self, rewards, params_dict, obs, actions, obs_next, terminal):
self.set_params(self.learn_program, params_dict, self.place)
self.rpm.append(obs, actions, self.reward_scale * rewards, obs_next,
terminal)
if self.actions_noise:
self._update_noise(self.action_dist)
        if self.rpm.size() > self.memory_size:
            ### sample a batch from the replay memory and update the agent
            obs, actions, rewards, obs_next, terminal = self.rpm.sample_batch(
                self.batch_size)
            self.agent.learn(obs, actions, rewards, obs_next, terminal)
params_dict = self.get_params(self.learn_program)
return params_dict
15 changes: 15 additions & 0 deletions paddleslim/common/RL_controller/DDPG/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .DDPGController import *