From 4c2d7c3d319186a57676b699bb851afd7b736fdb Mon Sep 17 00:00:00 2001
From: ForFishes <2282912238@qq.com>
Date: Wed, 1 Jun 2022 02:37:10 +0000
Subject: [PATCH] fix mp/pp

---
 .../model_parallel_cn.rst    | 284 ++++++------------
 .../pipeline_parallel_cn.rst | 246 ++++++---------
 2 files changed, 190 insertions(+), 340 deletions(-)

diff --git a/docs/guides/06_distributed_training/model_parallel_cn.rst b/docs/guides/06_distributed_training/model_parallel_cn.rst
index 1dd0aa6e909..2864742514f 100644
--- a/docs/guides/06_distributed_training/model_parallel_cn.rst
+++ b/docs/guides/06_distributed_training/model_parallel_cn.rst
@@ -3,84 +3,73 @@
 张量模型并行
 =======================
 
-通常来讲,训练更大规模的网络模型可以在多种任务上取得更好的效果,如自然语言处理类
-任务的准确率。然而,训练更大规模的网络模型会消耗更多的显存资源,甚至是超过单个设
-备的显存容量,从而导致模型无法训练。模型并行通过将网络中的张量(Tensor)切分到不
-同的设备,从而降低单个设备的显存消耗,使得超大规模模型训练成为可能。本文主要介绍
-飞桨模型并行的基本原理和使用方法。
+通常来讲,训练更大规模的网络模型可以在多种任务上取得更好的效果,如自然语言处理类任务的准确率。然而,训练更大规模的网络模型会消耗更多的显存资源,甚至是超过单个设备的显存容量,从而导致模型无法训练。模型并行通过将网络中的张量(Tensor)切分到不同的设备,从而降低单个设备的显存消耗,使得超大规模模型训练成为可能。本文主要介绍飞桨模型并行的基本原理和使用方法。
 
 一、原理介绍
 ----------------------
 
-自2017年提出以来, `Transformer `__ 及其
-变种模型成为自然语言类任务的常用模型,并于近年来被应用到图像视觉领域。
-Transformer模型的基础结构是由Attention和MLP组成的Encoder和Decoder,以及
-Embedding,如下图所示[1]。其中Attention和MLP的底层实现均为矩阵乘法运算,而Embedding是一种
-查找表实现。
+张量模型并行需要解决两个问题:参数如何切分到不同设备(切分方式);以及切分后,如何保证数学一致性(数学等价)。本文以NLP中的Transformer结构为例,介绍张量模型并行的切分方式和随机性控制。


1.1 切分方法
^^^^^^^^^^^^^^^^^^^^^^^^^^


自2017年提出以来, `Transformer `__ 及其变种模型成为自然语言类任务的常用模型,并于近年来被应用到图像视觉领域。Transformer模型的基础结构是由Attention和MLP组成的Encoder和Decoder,以及Embedding,如下图所示[1]。其中Attention和MLP的底层实现均为矩阵乘法运算,而Embedding是一种查找表实现。虽然这三种层的实现方式各不相同,但总体上看,核心思想都是利用分块矩阵的计算原理,将其参数切分到不同的设备[2]。下面详细介绍这三种层的切分方式。
 
 .. image:: ./images/transformer_overview.png
   :width: 200
   :alt: transformer overview
   :align: center
 
-对于Embedding操作,可以将其理解为一种查找表操作。即,将输入看做索引,将Embedding参数
-看做查找表,根据该索引查表得到相应的输出,如下图(a)所示。当采用模型并行时,
-Embedding的参数被均匀切分到多个卡上。假设Embedding参数的维度为N*D,并采用K张卡执行模型
-并行,那么模型并行模式下每张卡上的Embedding参数的维度为N//K*D。当参数的维度N不能被卡
-数K整除时,最后一张卡的参数维度值为(N//K+N%K)*D。以下图(b)为例,Embedding参数的维度
-为8*D,采用2张卡执行模型并行,那么每张卡上Embedding参数的维度为4*D。
-为了便于说明,以下我们均假设Embedding的参数维度值D可以被模型并行的卡数D整除。此时,每张
-卡上Embeeding参数的索引值为[0, N/K),逻辑索引值为[k*N/K, (k+1)*N/K),其中k表示卡序号,
-0<=k`_。

@@ -418,32 +323,43 @@ Dropout。直观理解,模型并行下,所有卡上的Dropout算子构成一
 
 .. code-block:: bash
 
-   WARNING 2021-10-27 09:19:24,072 launch.py:381] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
-   launch train in GPU mode!
-   INFO 2021-10-27 09:19:24,074 launch_utils.py:525] Local start 2 processes. 
First process distributed environment info (Only For Debug): - +=======================================================================================+ - | Distributed Envs Value | - +---------------------------------------------------------------------------------------+ - | PADDLE_TRAINER_ID 0 | - | PADDLE_CURRENT_ENDPOINT 127.0.0.1:10129 | - | PADDLE_TRAINERS_NUM 2 | - | PADDLE_TRAINER_ENDPOINTS 127.0.0.1:10129,127.0.0.1:13182 | - | PADDLE_RANK_IN_NODE 0 | - | PADDLE_LOCAL_DEVICE_IDS 0 | - | PADDLE_WORLD_DEVICE_IDS 0,1 | - | FLAGS_selected_gpus 0 | - | FLAGS_selected_accelerators 0 | - +=======================================================================================+ - -日志信息位于log目录下, 需要注意的是模型并行的loss与单卡模型的loss在小数点后三位是能够精确对齐的,然后两张卡上对应的loss应该是一样的: + LAUNCH INFO 2022-05-31 02:35:16,954 ----------- Configuration ---------------------- + LAUNCH INFO 2022-05-31 02:35:16,954 devices: None + LAUNCH INFO 2022-05-31 02:35:16,954 elastic_level: -1 + LAUNCH INFO 2022-05-31 02:35:16,954 elastic_timeout: 30 + LAUNCH INFO 2022-05-31 02:35:16,954 gloo_port: 6767 + LAUNCH INFO 2022-05-31 02:35:16,954 host: None + LAUNCH INFO 2022-05-31 02:35:16,954 job_id: default + LAUNCH INFO 2022-05-31 02:35:16,955 legacy: False + LAUNCH INFO 2022-05-31 02:35:16,955 log_dir: log + LAUNCH INFO 2022-05-31 02:35:16,955 log_level: INFO + LAUNCH INFO 2022-05-31 02:35:16,955 master: None + LAUNCH INFO 2022-05-31 02:35:16,955 max_restart: 3 + LAUNCH INFO 2022-05-31 02:35:16,955 nnodes: 1 + LAUNCH INFO 2022-05-31 02:35:16,955 nproc_per_node: None + LAUNCH INFO 2022-05-31 02:35:16,955 rank: -1 + LAUNCH INFO 2022-05-31 02:35:16,955 run_mode: collective + LAUNCH INFO 2022-05-31 02:35:16,955 server_num: None + LAUNCH INFO 2022-05-31 02:35:16,955 servers: + LAUNCH INFO 2022-05-31 02:35:16,955 trainer_num: None + LAUNCH INFO 2022-05-31 02:35:16,955 trainers: + LAUNCH INFO 2022-05-31 02:35:16,955 training_script: test.py + LAUNCH INFO 2022-05-31 02:35:16,955 training_script_args: [] + LAUNCH INFO 2022-05-31 02:35:16,955 with_gloo: 1 + LAUNCH INFO 2022-05-31 02:35:16,955 -------------------------------------------------- + LAUNCH INFO 2022-05-31 02:35:16,956 Job: default, mode collective, replicas 1[1:1], elastic False + LAUNCH INFO 2022-05-31 02:35:16,957 Run Pod: jbvsbv, replicas 2, status ready + LAUNCH INFO 2022-05-31 02:35:16,984 Watching Pod: jbvsbv, replicas 2, status running + +日志信息位于log目录下, loss的输出信息: .. 
code-block:: bash

-   mp_loss: 0.0 single_loss: 0.0
-   mp_loss: -0.14513375 single_loss: -0.14513376
-   mp_loss: -0.2902736 single_loss: -0.2902736
-   mp_loss: -0.43542737 single_loss: -0.43542737
-   mp_loss: -0.5806184 single_loss: -0.5806184
+   loss [0.0282112]
+   loss [-0.05410034]
+   loss [0.01392444]
+   loss [0.01289728]
+   loss [0.06050334]
 
 四、参考文献
 -----------------------
diff --git a/docs/guides/06_distributed_training/pipeline_parallel_cn.rst b/docs/guides/06_distributed_training/pipeline_parallel_cn.rst
index 01cb0e56eb7..c71b8039bc7 100644
--- a/docs/guides/06_distributed_training/pipeline_parallel_cn.rst
+++ b/docs/guides/06_distributed_training/pipeline_parallel_cn.rst
@@ -29,8 +29,7 @@
   :alt: pipeline_timeline2
   :align: center
 
-如上图所示先进行前向计算,再进行反向计算,这种方式我们称之为 F-the-B 模式。不难看出这种 F-then-B 模式由于缓存了多个 micro-batch 的中间变量和梯度,显存的实际利用率并不高。接下来我们介绍一种前向计算和反向计算交叉进行的方式,即 1F1B 模型。
-在 1F1B 模式下,前向计算和反向计算交叉进行,可以及时释放不必要的中间变量。我们以下图1F1B中 stage4 的 F42(stage4的第2个 micro-batch 的前向计算)为例,F42 在计算前,F41 的反向 B41(stage4的第1个 micro-batch 的反向计算)已经计算结束,即可释放 F41 的中间变量,从而 F42 可以复用 F41 中间变量的显存。1F1B 方式相比 F-then-B 方式峰值显存可以节省37.5%,对比朴素流水线并行峰值显存明显下降,设备资源利用率显著提升。
+如上图所示先进行前向计算,再进行反向计算,这种方式我们称之为 F-then-B 模式。不难看出这种 F-then-B 模式由于缓存了多个 micro-batch 的中间变量和梯度,显存的实际利用率并不高。接下来我们介绍一种前向计算和反向计算交叉进行的方式,即 1F1B 模式。在 1F1B 模式下,前向计算和反向计算交叉进行,可以及时释放不必要的中间变量。我们以下图1F1B中 stage4 的 F42(stage4的第2个 micro-batch 的前向计算)为例,F42 在计算前,F41 的反向 B41(stage4的第1个 micro-batch 的反向计算)已经计算结束,即可释放 F41 的中间变量,从而 F42 可以复用 F41 中间变量的显存。1F1B 方式相比 F-then-B 方式峰值显存可以节省37.5%,对比朴素流水线并行峰值显存明显下降,设备资源利用率显著提升。
 
 .. image:: ./images/pipeline-4.png
   :width: 600
@@ -66,10 +65,10 @@
    import paddle.distributed as dist
    import random
 
-然后构造一个普通的AlexNet模型, 作为对比
 
-.. code-block:: python
 
+构建一个可以运行流水线的模型,模型的layer需要被LayerDesc或者继承了LayerDesc的SharedLayerDesc包裹,这里因为不需要共享参数,所以就使用LayerDesc
 
+.. code-block:: python
     class ReshapeHelp(Layer):
         def __init__(self, shape):
             super(ReshapeHelp, self).__init__()
@@ -78,48 +77,6 @@
 
         def forward(self, x):
             return x.reshape(shape=self.shape)
 
-
-    class AlexNet(Layer):
-        def __init__(self, num_classes=10):
-            super(AlexNet, self).__init__()
-            self.features = Sequential(
-                nn.Conv2D(
-                    1, 64, kernel_size=11, stride=4, padding=5),
-                nn.ReLU(),
-                nn.MaxPool2D(
-                    kernel_size=2, stride=2),
-                nn.Conv2D(
-                    64, 192, kernel_size=5, padding=2),
-                nn.ReLU(),
-                nn.MaxPool2D(
-                    kernel_size=2, stride=2),
-                nn.Conv2D(
-                    192, 384, kernel_size=3, padding=1),
-                nn.ReLU(),
-                nn.Conv2D(
-                    384, 256, kernel_size=3, padding=1),
-                nn.ReLU(),
-                nn.Conv2D(
-                    256, 256, kernel_size=3, padding=1),
-                nn.ReLU(),
-                nn.MaxPool2D(
-                    kernel_size=2, stride=2), )
-
-
-        self.reshape_layer = ReshapeHelp(shape=[-1, 256])
-        self.classifier = nn.Linear(256, num_classes)
-        self.loss_fn = nn.loss.CrossEntropyLoss()
-
-    def forward(self, x, y):
-        x = self.features(x)
-        x = self.reshape_layer(x)
-        x = self.classifier(x)
-        return self.loss_fn(x, y)
-
-然后构建一个可以运行流水线的模型,模型的layer需要被LayerDesc或者继承了LayerDesc的SharedLayerDesc包裹,这里因为不需要共享参数,所以就使用LayerDesc
-
-.. code-block:: python
-
     class AlexNetPipeDesc(PipelineLayer):
         def __init__(self, num_classes=10, **kwargs):
             self.num_classes = num_classes
@@ -156,22 +113,22 @@
 
 .. 
code-block:: python - batch_size = 4 - micro_batch_size = 2 - - strategy = fleet.DistributedStrategy() - model_parallel_size = 1 - data_parallel_size = 1 - pipeline_parallel_size = 2 - strategy.hybrid_configs = { - "dp_degree": data_parallel_size, - "mp_degree": model_parallel_size, - "pp_degree": pipeline_parallel_size - } - strategy.pipeline_configs = { - "accumulate_steps": batch_size // micro_batch_size, - "micro_batch_size": micro_batch_size - } + batch_size = 4 + micro_batch_size = 2 + + strategy = fleet.DistributedStrategy() + model_parallel_size = 1 + data_parallel_size = 1 + pipeline_parallel_size = 2 + strategy.hybrid_configs = { + "dp_degree": data_parallel_size, + "mp_degree": model_parallel_size, + "pp_degree": pipeline_parallel_size + } + strategy.pipeline_configs = { + "accumulate_steps": batch_size // micro_batch_size, + "micro_batch_size": micro_batch_size + } fleet.init(is_collective=True, strategy=strategy) @@ -180,30 +137,20 @@ .. code-block:: python - def set_random_seed(seed, dp_id, rank_id): - """Set random seed for reproducability.""" - random.seed(seed) - np.random.seed(seed + dp_id) - paddle.seed(seed + dp_id + rank_id) - print("seed: ", seed) - print("rank_id: ", rank_id) - print("dp_id: ", dp_id) - hcg = fleet.get_hybrid_communicate_group() - world_size = hcg.get_model_parallel_world_size() - dp_id = hcg.get_data_parallel_rank() - pp_id = hcg.get_stage_id() - rank_id = dist.get_rank() - set_random_seed(1024, dp_id, rank_id) - -然后创建出普通模型以及对应的优化器 - -.. code-block:: python - - model_a = AlexNet(10) - scheduler_a = paddle.optimizer.lr.PiecewiseDecay( - boundaries=[2], values=[0.001, 0.002], verbose=False - ) - optimizer_a = paddle.optimizer.SGD(learning_rate=scheduler_a, parameters=model_a.parameters()) + def set_random_seed(seed, dp_id, rank_id): + random.seed(seed) + np.random.seed(seed + dp_id) + paddle.seed(seed + dp_id + rank_id) + print("seed: ", seed) + print("rank_id: ", rank_id) + print("dp_id: ", dp_id) + + hcg = fleet.get_hybrid_communicate_group() + world_size = hcg.get_model_parallel_world_size() + dp_id = hcg.get_data_parallel_rank() + pp_id = hcg.get_stage_id() + rank_id = dist.get_rank() + set_random_seed(1024, dp_id, rank_id) 然后创建出流水线并行的模型, @@ -215,29 +162,15 @@ fleet.distributed_optimizer(...):这一步则是为优化器添加分布式属 .. code-block:: python - model_b = AlexNetPipeDesc(num_stages=pipeline_parallel_size, topology=hcg._topo) - scheduler_b = paddle.optimizer.lr.PiecewiseDecay( - boundaries=[2], values=[0.001, 0.002], verbose=False - ) - optimizer_b = paddle.optimizer.SGD(learning_rate=scheduler_b, - parameters=model_b.parameters()) - model_b = fleet.distributed_model(model_b) - optimizer_b = fleet.distributed_optimizer(optimizer_b) - -流水线并行将模型按layers切分,为了能够和普通模型loss对齐,需要采用热启模式,先保存普通模型的参数,然后流水线并行模型加载相关参数 - -.. 
code-block:: python + model = AlexNetPipeDesc(num_stages=pipeline_parallel_size, topology=hcg._topo) + scheduler = paddle.optimizer.lr.PiecewiseDecay( + boundaries=[2], values=[0.001, 0.002], verbose=False + ) + optimizer = paddle.optimizer.SGD(learning_rate=scheduler, + parameters=model.parameters()) + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) - # 保存普通模型参数 - param_len = len(model_a.parameters()) - parameters = [] - for param in model_a.parameters(): - parameters.append(param.numpy()) - - - # 流水线并行模型加载参数 - for idx, param in enumerate(model_b.parameters()): - param.set_value(parameters[idx + pp_id * (param_len // 2)]) 创建mnist数据集 @@ -249,32 +182,26 @@ fleet.distributed_optimizer(...):这一步则是为优化器添加分布式属 开始训练 -model_b.train_batch(...):这一步主要就是执行1F1B的流水线并行方式 +model.train_batch(...):这一步主要就是执行1F1B的流水线并行方式 .. code-block:: python - for step_id, data in enumerate(train_reader()): - x_data = np.array([x[0] for x in data]).astype("float32").reshape( - batch_size, 1, 28, 28 - ) - y_data = np.array([x[1] for x in data]).astype("int64").reshape( - batch_size, 1 - ) - img = paddle.to_tensor(x_data) - label = paddle.to_tensor(y_data) - img.stop_gradient = True - label.stop_gradient = True - if step_id >= 5: - break - loss_a = model_a(img, label) - loss_a.backward() - optimizer_a.step() - optimizer_a.clear_grad() - scheduler_a.step() - - loss_b = model_b.train_batch([img, label], optimizer_b, scheduler_b) - - print("single_loss: ", loss_a.numpy(), "pp_loss: ", loss_b.numpy()) + for step_id, data in enumerate(train_reader()): + x_data = np.array([x[0] for x in data]).astype("float32").reshape( + batch_size, 1, 28, 28 + ) + y_data = np.array([x[1] for x in data]).astype("int64").reshape( + batch_size, 1 + ) + img = paddle.to_tensor(x_data) + label = paddle.to_tensor(y_data) + img.stop_gradient = True + label.stop_gradient = True + if step_id >= 5: + break + + loss = model.train_batch([img, label], optimizer, scheduler) + print("pp_loss: ", loss.numpy()) 运行方式(需要保证当前机器有两张GPU): @@ -289,34 +216,41 @@ model_b.train_batch(...):这一步主要就是执行1F1B的流水线并行方 .. code-block:: bash - WARNING 2021-10-21 14:47:54,245 launch.py:381] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode - launch train in GPU mode! - INFO 2021-10-21 14:47:54,246 launch_utils.py:525] Local start 2 processes. 
First process distributed environment info (Only For Debug): - +=======================================================================================+ - | Distributed Envs Value | - +---------------------------------------------------------------------------------------+ - | PADDLE_TRAINER_ID 0 | - | PADDLE_CURRENT_ENDPOINT 127.0.0.1:10101 | - | PADDLE_TRAINERS_NUM 2 | - | PADDLE_TRAINER_ENDPOINTS 127.0.0.1:10101,127.0.0.1:13727 | - | PADDLE_RANK_IN_NODE 0 | - | PADDLE_LOCAL_DEVICE_IDS 0 | - | PADDLE_WORLD_DEVICE_IDS 0,1 | - | FLAGS_selected_gpus 0 | - | FLAGS_selected_accelerators 0 | - +=======================================================================================+ + LAUNCH INFO 2022-05-31 02:47:23,595 ----------- Configuration ---------------------- + LAUNCH INFO 2022-05-31 02:47:23,596 devices: None + LAUNCH INFO 2022-05-31 02:47:23,596 elastic_level: -1 + LAUNCH INFO 2022-05-31 02:47:23,596 elastic_timeout: 30 + LAUNCH INFO 2022-05-31 02:47:23,596 gloo_port: 6767 + LAUNCH INFO 2022-05-31 02:47:23,596 host: None + LAUNCH INFO 2022-05-31 02:47:23,596 job_id: default + LAUNCH INFO 2022-05-31 02:47:23,596 legacy: False + LAUNCH INFO 2022-05-31 02:47:23,596 log_dir: log + LAUNCH INFO 2022-05-31 02:47:23,596 log_level: INFO + LAUNCH INFO 2022-05-31 02:47:23,596 master: None + LAUNCH INFO 2022-05-31 02:47:23,596 max_restart: 3 + LAUNCH INFO 2022-05-31 02:47:23,596 nnodes: 1 + LAUNCH INFO 2022-05-31 02:47:23,596 nproc_per_node: None + LAUNCH INFO 2022-05-31 02:47:23,596 rank: -1 + LAUNCH INFO 2022-05-31 02:47:23,596 run_mode: collective + LAUNCH INFO 2022-05-31 02:47:23,596 server_num: None + LAUNCH INFO 2022-05-31 02:47:23,596 servers: + LAUNCH INFO 2022-05-31 02:47:23,596 trainer_num: None + LAUNCH INFO 2022-05-31 02:47:23,596 trainers: + LAUNCH INFO 2022-05-31 02:47:23,596 training_script: pp.py + LAUNCH INFO 2022-05-31 02:47:23,596 training_script_args: [] + LAUNCH INFO 2022-05-31 02:47:23,596 with_gloo: 1 + LAUNCH INFO 2022-05-31 02:47:23,596 -------------------------------------------------- + LAUNCH INFO 2022-05-31 02:47:23,597 Job: default, mode collective, replicas 1[1:1], elastic False + LAUNCH INFO 2022-05-31 02:47:23,605 Run Pod: ldmpbt, replicas 2, status ready + LAUNCH INFO 2022-05-31 02:47:23,629 Watching Pod: ldmpbt, replicas 2, status running 日志信息位于log目录下: .. code-block:: bash - single_loss: [2.299683] pp_loss: [2.2996738] - single_loss: [2.287039] pp_loss: [2.2870412] - single_loss: [2.3449194] pp_loss: [2.3449283] - single_loss: [2.3162396] pp_loss: [2.3162327] - single_loss: [2.3100634] pp_loss: [2.310072] - -四、注意事项 ---------------------- + pp_loss: [2.3267765] + pp_loss: [2.3299088] + pp_loss: [2.2849925] + pp_loss: [2.2974687] + pp_loss: [2.3173313] -与静态图的流水线不一样的是每张卡都会输出loss,并且流水线loss的值是相等的,与普通模型的loss在小数点后三位应该是相等的。
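
上面的流水线并行启动日志显示 training_script 为 pp.py。假设将流水线并行示例代码保存为 pp.py,并且当前机器有两张GPU,一种可行的启动方式大致如下(仅为示意,脚本名与卡号请按实际情况调整):

.. code-block:: bash

    # 指定本次训练使用的两张GPU卡(此处假设为0号卡和1号卡)
    export CUDA_VISIBLE_DEVICES=0,1
    # 通过paddle.distributed.launch启动两个进程,分别负责流水线的两个stage
    python -m paddle.distributed.launch pp.py

训练过程中各进程的日志会写入log目录,可在其中查看每个进程输出的 pp_loss。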