diff --git a/docs/advanced_guide/index_cn.rst b/docs/advanced_guide/index_cn.rst deleted file mode 100644 index d866963d281..00000000000 --- a/docs/advanced_guide/index_cn.rst +++ /dev/null @@ -1,14 +0,0 @@ -######## -进阶指南 -######## - -如果您已经学会使用飞桨来完成常规任务,期望了解更多飞桨在工业部署方面的能力,请阅读: - - - - `预测与部署 <../advanced_guide/inference_deployment/index_cn.html>`_ :介绍如何应用训练好的模型进行预测 - -.. toctree:: - :hidden: - - inference_deployment/index_cn.rst - flags/flags_cn.rst diff --git a/docs/advanced_guide/index_en.rst b/docs/advanced_guide/index_en.rst deleted file mode 100644 index f44212dadfb..00000000000 --- a/docs/advanced_guide/index_en.rst +++ /dev/null @@ -1,19 +0,0 @@ -.. _user_guide_en_: - -#################### -Advanced User Guides -#################### - -.. todo:: - -So far you have already been familiar with PaddlePaddle. And the next expectation, read more on: - - - - `Deploy Inference Model `_ :How to deploy the trained network to perform practical inference - - -.. toctree:: - :hidden: - - inference_deployment/index_en.rst - flags/flags_en.rst diff --git a/docs/advanced_guide/performance_improving/amp/amp.md b/docs/advanced_guide/performance_improving/amp/amp.md deleted file mode 100644 index db053907dd2..00000000000 --- a/docs/advanced_guide/performance_improving/amp/amp.md +++ /dev/null @@ -1,171 +0,0 @@ -# 混合精度训练最佳实践 - -Automatic Mixed Precision (AMP) 是一种自动混合使用半精度(FP16)和单精度(FP32)来加速模型训练的技术。AMP 技术可方便用户快速将使用 FP32 训练的模型修改为使用混合精度训练,并通过黑白名单和动态`loss scaling`来保证训练时的数值稳定性进而避免梯度 Infinite 或者 NaN(Not a Number)。借力于新一代 NVIDIA GPU 中 Tensor Cores 的计算性能,PaddlePaddle AMP 技术在 ResNet50、Transformer 等模型上训练速度相对于 FP32 训练加速比可达 1.5~2.9。 - -### 半精度浮点类型 FP16 - -如图 1 所示,半精度(Float Precision16,FP16)是一种相对较新的浮点类型,在计算机中使用 2 字节(16 位)存储。在 IEEE 754-2008 标准中,它亦被称作 binary16。与计算中常用的单精度(FP32)和双精度(FP64)类型相比,FP16 更适于在精度要求不高的场景中使用。 - -
图 1. 半精度和单精度数据示意图
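下面补充一段简单的示意代码(非本文档原有内容,假设环境中已安装 NumPy),用于直观感受 FP16 与 FP32 在存储开销、数值精度和可表示范围上的差异:

```python
import numpy as np

# FP16 每个数只占 2 字节,FP32 占 4 字节
print(np.dtype(np.float16).itemsize)   # 2
print(np.dtype(np.float32).itemsize)   # 4

# FP16 的有效十进制位数约为 3~4 位,超出部分会被舍入
print(np.float16(3.14159265))          # 约 3.14,存在精度损失

# FP16 可表示的最大正数约为 65504,超出即溢出为 inf
print(np.finfo(np.float16).max)        # 65504.0
print(np.float16(1e5))                 # inf
```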
- -### 英伟达 GPU 的 FP16 算力 - -在使用相同的超参数下,混合精度训练使用半精度浮点(FP16)和单精度(FP32)浮点即可达到与使用纯单精度训练相同的准确率,并可加速模型的训练速度。这主要得益于英伟达推出的 Volta 及 Turing 架构 GPU 在使用 FP16 计算时具有如下特点: - -* FP16 可降低一半的内存带宽和存储需求,这使得在相同的硬件条件下研究人员可使用更大更复杂的模型以及更大的 batch size 大小。 -* FP16 可以充分利用英伟达 Volta 及 Turing 架构 GPU 提供的 Tensor Cores 技术。在相同的 GPU 硬件上,Tensor Cores 的 FP16 计算吞吐量是 FP32 的 8 倍。 - -### PaddlePaddle AMP 功能——牛刀小试 - -如前文所述,使用 FP16 数据类型可能会造成计算精度上的损失,但对深度学习领域而言,并不是所有计算都要求很高的精度,一些局部的精度损失对最终训练效果影响很微弱,却能使吞吐和训练速度带来大幅提升。因此,混合精度计算的需求应运而生。具体而言,训练过程中将一些对精度损失不敏感且能利用 Tensor Cores 进行加速的运算使用半精度处理,而对精度损失敏感部分依然保持 FP32 计算精度,用以最大限度提升访存和计算效率。 - -为了避免对每个具体模型人工地去设计和尝试精度混合的方法,PaddlePaddle 框架提供自动混合精度训练(AMP)功能,解放"炼丹师"的双手。在 PaddlePaddle 中使用 AMP 训练是一件十分容易的事情,用户只需要增加一行代码即可将原有的 FP32 训练转变为 AMP 训练。下面以`MNIST`为例介绍 PaddlePaddle AMP 功能的使用示例。 - -**MNIST 网络定义** - -```python -import paddle.fluid as fluid - -def MNIST(data, class_dim): - conv1 = fluid.layers.conv2d(data, 16, 5, 1, act=None, data_format='NHWC') - bn1 = fluid.layers.batch_norm(conv1, act='relu', data_layout='NHWC') - pool1 = fluid.layers.pool2d(bn1, 2, 'max', 2, data_format='NHWC') - conv2 = fluid.layers.conv2d(pool1, 64, 5, 1, act=None, data_format='NHWC') - bn2 = fluid.layers.batch_norm(conv2, act='relu', data_layout='NHWC') - pool2 = fluid.layers.pool2d(bn2, 2, 'max', 2, data_format='NHWC') - fc1 = fluid.layers.fc(pool2, size=64, act='relu') - fc2 = fluid.layers.fc(fc1, size=class_dim, act='softmax') - return fc2 -``` - -针对 CV(Computer Vision)类模型组网,为获得更高的训练性能需要注意如下三点: - -* `conv2d`、`batch_norm`以及`pool2d`等需要将数据布局设置为`NHWC`,这样有助于使用 TensorCore 技术加速计算过程1。 -* Tensor Cores 要求在使用 FP16 加速卷积运算时 conv2d 的输入/输出通道数为 8 的倍数2,因此设计网络时推荐将 conv2d 层的输入/输出通道数设置为 8 的倍数。 -* Tensor Cores 要求在使用 FP16 加速矩阵乘运算时矩阵行数和列数均为 8 的倍数3,因此设计网络时推荐将 fc 层的 size 参数设置为 8 的倍数。 - - -**FP32 训练** - -为了训练 MNIST 网络,还需要定义损失函数来更新权重参数,此处使用的优化器是 SGDOptimizer。为了简化说明,这里省略了迭代训练的相关代码,仅体现损失函数及优化器定义相关的内容。 - -```python -import paddle -import numpy as np - -data = fluid.layers.data( - name='image', shape=[None, 28, 28, 1], dtype='float32') -label = fluid.layers.data(name='label', shape=[None, 1], dtype='int64') - -out = MNIST(data, class_dim=10) -loss = fluid.layers.cross_entropy(input=out, label=label) -avg_loss = fluid.layers.mean(loss) - -sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3) -sgd.minimize(avg_loss) -``` - -**AMP 训练** - -与 FP32 训练相比,用户仅需使用 PaddlePaddle 提供的`fluid.contrib.mixed_precision.decorate` 函数将原来的优化器 SGDOptimizer 进行封装,然后使用封装后的优化器(mp_sgd)更新参数梯度即可完成向 AMP 训练的转换,代码如下所示: - -```python -sgd = SGDOptimizer(learning_rate=1e-3) -# 此处只需要使用 fluid.contrib.mixed_precision.decorate 将 sgd 封装成 AMP 训练所需的 -# 优化器 mp_sgd,并使用 mp_sgd.minimize(avg_loss)代替原来的 sgd.minimize(avg_loss)语句即可。 -mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd) -mp_sgd.minimize(avg_loss) -``` - -运行上述混合精度训练 python 脚本时为得到更好的执行性能可配置如下环境参数,并保证 cudnn 版本在 7.4.1 及以上。 - -```shell -export FLAGS_conv_workspace_size_limit=1024 # MB,根据所使用的 GPU 显存容量及模型特点设置数值,值越大越有可能选择到更快的卷积算法 -export FLAGS_cudnn_exhaustive_search=1 # 使用穷举搜索方法来选择快速卷积算法 -export FLAGS_cudnn_batchnorm_spatial_persistent=1 # 用于触发 batch_norm 和 relu 的融合 -``` - -上述即为最简单的 PaddlePaddle AMP 功能使用方法。ResNet50 模型的 AMP 训练示例可[点击此处](https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/image_classification/README.md#%E6%B7%B7%E5%90%88%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83)查看,其他模型使用 PaddlePaddle AMP 的方法也与此类似。若 AMP 训练过程中出现连续的 loss nan 等不收敛现象,可尝试使用[check nan inf 工具](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/flags/check_nan_inf_cn.html#span-id-speed-span)进行调试。 - - -### PaddlePaddle AMP 功能——进阶使用 - -上一小节所述均为默认 
AMP 训练行为,用户当然也可以改变一些默认的参数设置来满足特定的模型训练场景需求。接下来的章节将介绍 PaddlePaddle AMP 功能使用中用户可配置的参数行为,即进阶使用技巧。 - -#### 自定义黑白名单 - -PaddlePaddle AMP 功能实现中根据 FP16 数据类型计算稳定性和加速效果在框架内部定义了算子(Op)的黑白名单。具体来说,将对 FP16 计算友好且能利用 Tensor Cores 的 Op 归类于白名单,将使用 FP16 计算会导致数值不稳定的 Op 归类于黑名单,将对 FP16 计算没有多少影响的 Op 归类于灰名单。然而,框架开发人员不可能考虑到所有的网络模型情况,尤其是那些特殊场景中使用到的模型。用户可以在使用`fluid.contrib.mixed_precision.decorate` 函数时通过指定自定义的黑白名单列表来改变默认的 FP16 计算行为。 - -```python -sgd = SGDOptimizer(learning_rate=1e-3) -# list1 是白名单 op 列表,list2 是黑名单 op 列表,list3 是黑名单 var_name 列表(凡是以这些黑名单 var_name 为输入或输出的 op 均会被视为黑名单 op) -amp_list = AutoMixedPrecisionLists(custom_white_list=list1, custom_black_list=list2, custom_black_varnames=list3) -mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd, amp_list) -mp_sgd.minimize(avg_loss) -``` - -#### 自动 loss scaling - -为了避免梯度 Infinite 或者 NAN,PaddlePaddle AMP 功能支持根据训练过程中梯度的数值自动调整 loss scale 值。用户在使用`fluid.contrib.mixed_precision.decorate` 函数时也可以改变与 loss scaling 相关的参数设置,示例如下: - -```python -sgd = SGDOptimizer(learning_rate=1e-3) -mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd, - amp_lists=None, - init_loss_scaling=2**8, - incr_every_n_steps=500, - decr_every_n_nan_or_inf=4, - incr_ratio=2.0, - decr_ratio=0.5, - use_dynamic_loss_scaling=True) -mp_sgd.minimize(avg_loss) -``` - -`init_loss_scaling `、`incr_every_n_steps` 以及`decr_every_n_nan_or_inf`等参数控制着自动 loss scaling 的行为。它们仅当 `use_dynamic_loss_scaling`设置为 True 时有效。下面详述这些参数的意义: - -* init_loss_scaling(float):初始 loss scaling 值。 -* incr_every_n_steps(int):每经过 incr_every_n_steps 个连续的正常梯度值才会增大 loss scaling 值。 -* decr_every_n_nan_or_inf(int):每经过 decr_every_n_nan_or_inf 个连续的无效梯度值(nan 或者 inf)才会减小 loss scaling 值。 -* incr_ratio(float):每次增大 loss scaling 值的扩增倍数,其为大于 1 的浮点数。 -* decr_ratio(float):每次减小 loss scaling 值的比例系数,其为小于 1 的浮点数。 - -### 多卡 GPU 训练的优化 - -PaddlePaddle AMP 功能对多卡 GPU 训练进行了深度优化。如图 2 所示,优化之前的参数梯度更新特点:梯度计算时虽然使用的是 FP16 数据类型,但是不同 GPU 卡之间的梯度传输数据类型仍为 FP32。 - -
图 2. 不同 GPU 卡之间传输梯度使用 FP32 数据类型(优化前)
为了降低 GPU 多卡之间的梯度传输带宽,我们将梯度传输提前至`Cast`操作之前,而每个 GPU 卡在得到对应的 FP16 梯度后再执行`Cast`操作将其转变为 FP32 类型,具体操作详见图 3。这一优化在训练大模型时对减少带宽占用尤其有效,如多卡训练 BERT-Large 模型。
图 3. 不同 GPU 卡之间传输梯度使用 FP16 数据类型(优化后)
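下面用一段粗略的估算代码(非本文档原有内容,参数量为假设值,仅作数量级示意)说明把卡间梯度传输从 FP32 改为 FP16 后通信量大约减半的效果:

```python
# 估算单次梯度聚合需要传输的数据量(与框架实际实现无关,仅作示意)
num_params = 25_000_000              # 假设模型共有 2500 万个参数
bytes_fp32 = num_params * 4          # FP32 梯度每个占 4 字节
bytes_fp16 = num_params * 2          # FP16 梯度每个占 2 字节

print("FP32 梯度传输量约 %.1f MB" % (bytes_fp32 / 1024 / 1024))   # 约 95.4 MB
print("FP16 梯度传输量约 %.1f MB" % (bytes_fp16 / 1024 / 1024))   # 约 47.7 MB
```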
- -### 训练性能对比(AMP VS FP32) - -PaddlePaddle AMP 技术在 ResNet50、Transformer 等模型上训练速度相对于 FP32 训练上均有可观的加速比,下面是 ResNet50 和 ERNIE Large 模型的 AMP 训练相对于 FP32 训练的加速效果。 - - - - - - - -
图 4. Paddle AMP 训练加速效果(横坐标为卡数,如 8*8 代表 8 机 8 卡)
从图 4 所示的图表可以看出,ResNet50 的 AMP 训练相对于 FP32 训练的加速比可达 $2.8 \times$ 以上,而 ERNIE Large 的 AMP 训练相对于 FP32 训练的加速比亦可达 $1.7 \times \sim 2.1 \times$。

### 参考文献

* Mixed Precision Training
* 使用自动混合精度加速 PaddlePaddle 训练
* Tensor Layouts In Memory: NCHW vs NHWC
* Channels In And Out Requirements
* Matrix-Matrix Multiplication Requirements
diff --git a/docs/advanced_guide/performance_improving/index_cn.rst b/docs/advanced_guide/performance_improving/index_cn.rst deleted file mode 100644 index b50f091f8c7..00000000000 --- a/docs/advanced_guide/performance_improving/index_cn.rst +++ /dev/null @@ -1,16 +0,0 @@ -######## -性能调优 -######## - -.. toctree:: - :maxdepth: 1 - - singlenode_training_improving/training_best_practice.rst - singlenode_training_improving/memory_optimize.rst - device_switching/device_switching.md - amp/amp.md - multinode_training_improving/cpu_train_best_practice.rst - multinode_training_improving/dist_training_gpu.rst - multinode_training_improving/gpu_training_with_recompute.rst - inference_improving/paddle_tensorrt_infer.md - analysis_tools/index_cn.rst diff --git a/docs/advanced_guide/performance_improving/index_en.rst b/docs/advanced_guide/performance_improving/index_en.rst deleted file mode 100644 index f57e2a3d060..00000000000 --- a/docs/advanced_guide/performance_improving/index_en.rst +++ /dev/null @@ -1,12 +0,0 @@ -############### -Practice Improving -############### - -.. toctree:: - :maxdepth: 1 - - singlenode_training_improving/memory_optimize_en.rst - multinode_training_improving/cpu_train_best_practice_en.rst - multinode_training_improving/gpu_training_with_recompute_en.rst - inference_improving/paddle_tensorrt_infer_en.md - analysis_tools/index_en.rst diff --git a/docs/advanced_guide/performance_improving/inference_improving/paddle_xpu_infer_cn.md b/docs/advanced_guide/performance_improving/inference_improving/paddle_xpu_infer_cn.md deleted file mode 100644 index 8818e259155..00000000000 --- a/docs/advanced_guide/performance_improving/inference_improving/paddle_xpu_infer_cn.md +++ /dev/null @@ -1,120 +0,0 @@ -# 使用昆仑预测 - -百度的昆仑芯⽚是一款⾼性能的 AI SoC 芯⽚,⽀持推理和训练。昆仑芯⽚采⽤百度的先进 AI 架构,⾮常适合常⽤的深度学习和机器学习算法的云端计算需求,并能适配诸如⾃然语⾔处理、⼤规模语⾳识别、⾃动驾驶、⼤规模推荐等多种终端场景的计算需求。 - -Paddle Inference 集成了[Paddle-Lite 预测引擎](https://www.paddlepaddle.org.cn/lite/develop/demo_guides/kunlunxin_xpu.html)在昆仑 xpu 上进行预测部署。 - -## 编译注意事项 - -请确保编译的时候设置了 WITH_LITE=ON,且 XPU_SDK_ROOT 设置了正确的路径。 - -## 使用介绍 - -在使用 Predictor 时,我们通过配置 Config 中的接口,在 XPU 上运行。 - -```c++ -config->EnableLiteEngine( - precision_mode=PrecisionType::kFloat32, - zero_copy=false, - passes_filter={}, - ops_filter={}, -) -``` - -- **`precision_mode`**,类型:`enum class PrecisionType {kFloat32 = 0, kHalf, kInt8,};`, 默认值为`PrecisionType::kFloat32`。指定 lite 子图的运行精度。 -- **`zero_copy`**,类型:bool,lite 子图与 Paddle 之间的数据传递是否是零拷贝模式。 -- **`passes_filter`**,类型:`std::vector`,默认为空,扩展借口,暂不使用。 -- **`ops_filer`**,类型:`std::vector`,默认为空,显示指定哪些 op 不使用 lite 子图运行。 - -Python 接口如下: - -```python -config.enable_lite_engine( - precision_mode=PrecisionType.Float32, - zero_copy=False, - passes_filter=[], - ops_filter=[] -) -``` - -### Python demo - -因目前 Paddle-Inference 目前未将 xpu sdk 打包到 whl 包内,所以需要用户下载 xpu sdk,并加入到环境变量中,之后会考虑解决该问题。 - -下载[xpu_tool_chain](https://paddle-inference-dist.bj.bcebos.com/inference_demo/xpu_tool_chain.tgz),解压后将 shlib 加入到 LD_LIBRARY_PATH - -``` -tar xzf xpu_tool_chain.tgz -``` -``` -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/output/XTDK/shlib/:$PWD/output/XTDK/runtime/shlib/ -``` - -下载[resnet50](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50.tar.gz)模型,并解压,运行如下命令将会调用预测引擎 - -```bash -python resnet50_subgraph.py --model_file ./ResNet50/model --params_file ./ResNet50/params -``` - -resnet50_subgraph.py 的内容是: - -``` -import argparse -import time -import numpy as np -from paddle.inference import Config, PrecisionType -from paddle.inference import 
create_predictor - -def main(): - args = parse_args() - - config = set_config(args) - - predictor = create_predictor(config) - - input_names = predictor.get_input_names() - input_handle = predictor.get_input_handle(input_names[0]) - - fake_input = np.ones((args.batch_size, 3, 224, 224)).astype("float32") - input_handle.reshape([args.batch_size, 3, 224, 224]) - input_handle.copy_from_cpu(fake_input) - - for i in range(args.warmup): - predictor.run() - - start_time = time.time() - for i in range(args.repeats): - predictor.run() - - output_names = predictor.get_output_names() - output_handle = predictor.get_output_handle(output_names[0]) - output_data = output_handle.copy_to_cpu() - end_time = time.time() - print(output_data[0, :10]) - print('time is: {}'.format((end_time-start_time)/args.repeats * 1000)) - -def parse_args(): - parser = argparse.ArgumentParser() - parser.add_argument("--model_dir", type=str, help="model dir") - parser.add_argument("--model_file", type=str, help="model filename") - parser.add_argument("--params_file", type=str, help="parameter filename") - parser.add_argument("--batch_size", type=int, default=1, help="batch size") - parser.add_argument("--warmup", type=int, default=0, help="warmup") - parser.add_argument("--repeats", type=int, default=1, help="repeats") - parser.add_argument("--math_thread_num", type=int, default=1, help="math_thread_num") - - return parser.parse_args() - -def set_config(args): - config = Config(args.model_file, args.params_file) - config.enable_lite_engine(PrecisionType.Float32, True) - # use lite xpu subgraph - config.enable_xpu(10 * 1024 * 1024) - # use lite cuda subgraph - # config.enable_use_gpu(100, 0) - config.set_cpu_math_library_num_threads(args.math_thread_num) - return config - -if __name__ == "__main__": - main() -``` diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst b/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst deleted file mode 100644 index 6df05163486..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst +++ /dev/null @@ -1,161 +0,0 @@ -.. _api_guide_cpu_training_best_practice: - -#################### -分布式 CPU 训练优秀实践 -#################### - -提高 CPU 分布式训练的训练速度,主要要从四个方面来考虑: -1)提高训练速度,主要是提高 CPU 的使用率;2)提高通信速度,主要是减少通信传输的数据量;3)提高数据 IO 速度;4)更换分布式训练策略,提高分布式训练速度。 - -提高 CPU 的使用率 -============= - -提高 CPU 使用率主要依赖 :code:`ParallelExecutor`,可以充分利用多个 CPU 的计算能力来加速计算。 - -简单实例用法: - -.. code-block:: python - - # 配置执行策略,主要是设置线程数 - exec_strategy = fluid.ExecutionStrategy() - exec_strategy.num_threads = 8 - - # 配置构图策略,对于 CPU 训练而言,应该使用 Reduce 模式进行训练 - build_strategy = fluid.BuildStrategy() - if int(os.getenv("CPU_NUM")) > 1: - build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce - - pe = fluid.ParallelExecutor( - use_cuda=False, - loss_name=avg_cost.name, - main_program=main_program, - build_strategy=build_strategy, - exec_strategy=exec_strategy) - -以上参数中: - -- :code:`num_threads` : 模型训练使用的线程数,最好和训练所在机器的物理 CPU 核数接近 -- :code:`reduce_strategy` : 对于 CPU 训练而言,应该选择 fluid.BuildStrategy.ReduceStrategy.Reduce - - -通用环境变量配置: - -- :code:`CPU_NUM` :模型副本 replica 的个数,最好和 num_threads 一致 - - -提高通信速度 -========== - -要减少通信数据量,提高通信速度,主要是使用稀疏更新 ,目前支持 :ref:`api_guide_sparse_update` 的主要是 :ref:`cn_api_fluid_layers_embedding` 。 - -.. 
code-block:: python - - data = fluid.layers.data(name='ids', shape=[1], dtype='int64') - fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True) - -以上参数中: - -- :code:`is_sparse` : 配置 embedding 使用稀疏更新,如果 embedding 的 dict_size 很大,而每次数据 data 很少,建议使用 sparse 更新方式。 - - -提高数据 IO 速度 -========== - -要提高 CPU 分布式的数据 IO 速度,可以首先考虑使用 dataset API 进行数据读取。 dataset 是一种多生产者多消费者模式的数据读取方法,默认情况下耦合数据读取线程与训练线程,在多线程的训练中,dataset 表现出极高的性能优势。 - -API 接口介绍可以参考: :ref:`cn_api_distributed_QueueDataset` - -结合实际的网络,比如 CTR-DNN 模型,引入的方法可以参考:https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn - -最后使用 :code:`train_from_dataset` 接口来进行网络的训练: - -.. code-block:: python - - dataset = fluid.DatasetFactory().create_dataset() - exe = fluid.Executor(fluid.CPUPlace()) - exe.run(fluid.default_startup_program()) - exe.train_from_dataset(program=fluid.default_main_program(),dataset=dataset) - - -更换分布式训练策略 -========== - -CPU 分布式训练速度进一步提高的核心在于选择合适的分布式训练策略,比如定义通信策略、编译策略、执行策略等等。paddlepaddle 于 v1.7 版本发布了 :code:`DistributedStrategy` 功能,可以十分灵活且方便的指定分布式运行策略。 - -首先需要在代码中引入相关库: - -.. code-block:: python - - from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet - import paddle.fluid.incubate.fleet.base.role_maker as role_maker - from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory - -然后指定 CPU 分布式运行的训练策略,目前可选配置有四种:同步训练(Sync)、异步训练(Async)、半异步训练(Half-Async)以及 GEO 训练。 - - -通过如下代码引入上述策略的默认配置,并进行 CPU 分布式训练: - -.. code-block:: python - - # step1: 引入 CPU 分布式训练策略 - # 同步训练策略 - strategy = DistributedStrategyFactory.create_sync_strategy() - # 半异步训练策略 - strategy = DistributedStrategyFactory.create_half_async_strategy() - # 异步训练策略 - strategy = DistributedStrategyFactory.create_async_strategy() - # GEO 训练策略 - strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400) - - # step2: 定义节点角色 - role = role_maker.PaddleCloudRoleMaker() - fleet.init(role) - - # step3: 分布式训练 program 构建 - optimizer = fluid.optimizer.SGD(learning_rate) # 以 SGD 优化器为例 - optimizer = fleet.distributed_optimizer(optimizer, strategy) - optimizer.minimize(loss) - - # step4.1: 启动参数服务器节点(Server) - if fleet.is_server(): - fleet.init_server() - fleet.run_server() - - # step4.2: 启动训练节点(Trainer) - elif fleet.is_worker(): - fleet.init_worker() - exe.run(fleet.startup_program) - # Do training - exe.run(fleet.main_program) - fleet.stop_worker() - - -paddlepaddle 支持对训练策略中的细节进行调整: - -- 创建 compiled_program 所需的 build_strategy 及 exec_strategy 可以直接基于 strategy 获得 - -.. code-block:: python - - compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel( - loss_name=loss.name, - build_strategy=strategy.get_build_strategy(), - exec_strategy=strategy.get_execute_strategy()) - - -- 自定义训练策略细节,支持对 DistributeTranspilerConfig、TrainerRuntimeConfig、ServerRuntimeConfig、fluid.ExecutionStrategy、fluid.BuildStrategy 进行自定义配置。以 DistributeTranspilerConfig 为例,修改方式如下所示: - -.. 
code-block:: python - - strategy = DistributedStrategyFactory.create_sync_strategy() - - # 方式一(推荐): - config = strategy.get_program_config() - config.min_block_size = 81920 - - - # 方式二:调用 set_program_config 修改组网相关配置,支持 DistributeTranspilerConfig 和 dict 两种数据类型 - config = DistributeTranspilerConfig() - config.min_block_size = 81920 - # config = dict() - # config['min_block_size'] = 81920 - strategy.set_program_config(config) diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst b/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst deleted file mode 100644 index 350606f34ea..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst +++ /dev/null @@ -1,164 +0,0 @@ -.. _api_guide_cpu_training_best_practice_en: - -###################################################### -Best practices of distributed training on CPU -###################################################### - -To improve the training speed of CPU distributed training, we must consider two aspects: - -1. Improve the training speed mainly by improving utilization rate of CPU; -2. Improve the communication speed mainly by reducing the amount of data transmitted in the communication; -3. Improve the data IO speed by dataset API; -4. Improve the distributed training speed by changing distributed training strategy. - -Improve CPU utilization -============================= - -The CPU utilization mainly depends on :code:`ParallelExecutor`, which can make full use of the computing power of multiple CPUs to speed up the calculation. - -For detailed API usage, please refer to :ref:`api_fluid_ParallelExecutor` . A simple example: - -.. code-block:: python - - # Configure the execution strategy, mainly to set the number of threads - exec_strategy = fluid.ExecutionStrategy() - exec_strategy.num_threads = 8 - - # Configure the composition strategy, for CPU training, you should use the Reduce mode for training. - build_strategy = fluid.BuildStrategy() - if int(os.getenv("CPU_NUM")) > 1: - build_strategy.reduce_strategy=fluid.BuildStrategy.ReduceStrategy.Reduce - - pe = fluid.ParallelExecutor( - use_cuda=False, - loss_name=avg_cost.name, - main_program=main_program, - build_strategy=build_strategy, - exec_strategy=exec_strategy) - -Among the parameters above: - -- :code:`num_threads` : the number of threads used by the model training. It is preferably close to the number of the physical CPU cores of the machine where the training is performed. -- :code:`reduce_strategy` : For CPU training, you should choose fluid.BuildStrategy.ReduceStrategy.Reduce - - -Configuration of general environment variables: - -- :code:`CPU_NUM`: The number of replicas of the model, preferably the same as num_threads - - -Improve communication speed -============================== - -To reduce the amount of communication data and improve communication speed is achieved mainly by using sparse updates, the current support for `sparse update <../layers/sparse_update_en.html>`_ is mainly :ref:`api_fluid_layers_embedding`. - -.. code-block:: python - - data = fluid.layers.data(name='ids', shape=[1], dtype='int64') - fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True) - -Among the parameters above: - -- :code:`is_sparse`: Use sparse updates to configure embedding. 
If the dict_size of embedding is large but the number of data are very small each time, it is recommended to use the sparse update method. - - -Improve data IO speed -============================== - -To improve the CPU's distributed training speed, you can first consider using the dataset API as data reader. Dataset is a multi producer and multi consumer data reading method. By default, data reading thread and training thread are coupled. In multi-threaded training, dataset shows a high performance advantage. - -Refer to this page for API introduction: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html - -Combined with the actual model CTR-DNN, you can learn more about how to use dataset: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn - -Using :code:`train_from_dataset` for network training. - -.. code-block:: python - - dataset = fluid.DatasetFactory().create_dataset() - exe = fluid.Executor(fluid.CPUPlace()) - exe.run(fluid.default_startup_program()) - exe.train_from_dataset(program=fluid.default_main_program(),dataset=dataset) - - -Change distributed training strategy -============================== - -The core of improving CPU distributed training speed is to choose appropriate distributed training strategy, such as defining communication strategy, compiling strategy, executing strategy and so on. PaddlePaddle released :code:`DistributedStrategy` API in V1.7 version , which can be very flexible and convenient to specify distributed operation strategy. - -First, we need to introduce relevant libraries into the code: - -.. code-block:: python - - from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet - import paddle.fluid.incubate.fleet.base.role_maker as role_maker - from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory - -At present, there are four kinds of training strategies: synchronous training, asynchronous, half asynchronous training and GEO training. - - -The default configuration of the above policy is introduced by the following code: - -.. code-block:: python - - # step1: get distributed strategy - # Sync - strategy = DistributedStrategyFactory.create_sync_strategy() - # Half-Async - strategy = DistributedStrategyFactory.create_half_async_strategy() - # Async - strategy = DistributedStrategyFactory.create_async_strategy() - # GEO - strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400) - - # step2: define role of node - role = role_maker.PaddleCloudRoleMaker() - fleet.init(role) - - # step3: get distributed training program - optimizer = fluid.optimizer.SGD(learning_rate) # 以 SGD 优化器为例 - optimizer = fleet.distributed_optimizer(optimizer, strategy) - optimizer.minimize(loss) - - # step4.1: run parameter server node - if fleet.is_server(): - fleet.init_server() - fleet.run_server() - - # step4.2: run worker node - elif fleet.is_worker(): - fleet.init_worker() - exe.run(fleet.startup_program) - # Do training - exe.run(fleet.main_program) - fleet.stop_worker() - -PaddlePaddle supports adjusting the details of the training strategy: - -- The build_strategy and exec_strategy which used to create compiled_program can generate from strategy: - -.. 
code-block:: python - - compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel( - loss_name=loss.name, - build_strategy=strategy.get_build_strategy(), - exec_strategy=strategy.get_execute_strategy()) - - -- Training strategy details can be customized, Paddlepaddle supports customized configuration of distributetranspierconfig, trainerruntimeconfig, serverruntimeconfig, fluid.executionstrategy and fluid.buildstrategy. Take distributetranspillerconfig as an example. The modification method is as follows: - -.. code-block:: python - - strategy = DistributedStrategyFactory.create_sync_strategy() - - # Mode 1 (recommended): - config = strategy.get_program_config() - config.min_block_size = 81920 - - - # Mode 2 - config = DistributeTranspilerConfig() - config.min_block_size = 81920 - # config = dict() - # config['min_block_size'] = 81920 - strategy.set_program_config(config) diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/dist_training_gpu.rst b/docs/advanced_guide/performance_improving/multinode_training_improving/dist_training_gpu.rst deleted file mode 100644 index 32188831ca9..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/dist_training_gpu.rst +++ /dev/null @@ -1,133 +0,0 @@ -.. _best_practice_dist_training_gpu: - -##################### -分布式 GPU 训练优秀实践 -##################### - -开始优化您的 GPU 分布式训练任务 ---------------------------- - -PaddlePaddle Fluid 支持在现代 GPU [#]_ 服务器集群上完成高性能分布式训练。通常可以通过以下方法优化在多机多卡环境训练性能,建议在进行性能优化时,检查每项优化点并验证对应提升,从而提升最终的性能。 - -一个简单的验证当前的训练程序是否需要进一步优化性能的方法,是查看 GPU 的计算利用率 [#]_ ,通常用 :code:`nvidia-smi` 命令查看。如果 GPU 利用率较低,则可能存在较大的优化空间。下面主要从数据准备、训练策略设置和训练方式三个方面介绍 GPU 分布式训练中常用的优化方法。 - -1、数据准备 -=========== - -数据读取的优化在 GPU 训练中至关重要,尤其在不断增加 batch_size 提升吞吐时,计算对 reader 性能会有更高对要求,优化 reader 性能需要考虑的点包括: - - - 使用 :code:`DataLoader` 。参考 `这里 `_ 使用 DataLoader,并建议开启 :code:`use_double_buffer` 。 - - reader 返回 uint8 类型数据。图片在解码后一般会以 uint8 类型存储,如果在 reader 中转换成 float 类型数据,会将数据体积扩大 4 倍。直接返回 uint8 数据,然后在 GPU 上转化成 float 类型进行训练可以提升数据读取效率。 - - 减少 reader 初始化时间 (infinite read)。在训练任务开始执行第一轮训练时,reader 开始不断异步地从磁盘或其他存储中读取数据并执行预处理,然后将处理好的数据填充到队列中供计算使用。从 0 开始填充这个队列直到数据可以源源不断供给计算,需要一定时间的预热。所以,如果每轮训练都重新填充队列,会产生一些时间的开销。所以,在使用 DataLoader 时,可以让 reader 函数不断地产生数据,直到训练循环结束: - - .. code-block:: python - :linenos: - - def infinite_reader(file_path): - while True: - with open(file_path) as fn: - for line in fn: - yield process(line) - - def train(): - ... - for pass_id in xrange(NUM_PASSES): - if pass_id == 0: - data_loader.start() - for batch_id in (iters_per_pass): - exe.run() - data_loader.reset() - - -另外,可以使用 DALI 库提升数据处理性能。DALI 是 NVIDIA 开发的数据加载库,更多内容请参考 `官网文档 `_ 。飞桨中如何结合使用 DALI 库请参考 `使用示例 `_ 。 - -2、训练策略设置 -=========== - -训练参数设置表 - -.. 
csv-table:: - :header: "选项", "类型", "默认值", "说明" - :widths: 3, 3, 3, 5 - - ":code:`num_threads`", "int", "1", "CPU 线程数" - ":code:`nccl_comm_num`", "int", "1", "nccl 通信器数量" - ":code:`fuse_all_reduce_ops`", "bool", "False", "多卡训练时,将 AllReduce 操纵进行融合" - ":code:`use_hierarchical_allreduce` ", "bool", "False", "分级式 reduce" - ":code:`num_iteration_per_drop_scope`", "int", "1", "scope drop 频率,设置每隔几个 batch 的迭代之后执行一次清理 scope" - ":code:`fetch_frequency`", "int", "1", "fetch 的刷新频率" - ":code:`fuse_bn_act_ops`", "bool", "False", "是否开启 batch normalization 和激活函数的融合" - ":code:`fuse_elewise_add_act_ops`", "bool", "False", "是否开启 elementwise add 函数和激活函数的融合" - -说明: - -- 关于设置合适的 CPU 线程数 :code:`num_threads` 和 nccl 通信器数量 :code:`nccl_comm_num` 。PaddlePaddle Fluid 使用“线程池” [#]_ 模型调度并执行 Op,Op 在启动 GPU 计算之前,通常需要 CPU 的协助,然而如果 Op 本身占用时间很小,“线程池”模型下又会带来额外的调度开销。使用多进程模式时,如果神经网络的计算图 [#]_ 节点间有较高的并发度,即使每个进程只在一个 GPU 上运行,使用多个线程可以更大限度的提升 GPU 利用率。nccl 通信器数量 :code:`nccl_comm_num` 可以加快 GPU 之间的通信效率,建议单机设置为 1,多机设置为 2。针对 CPU 线程数 :code:`num_threads` ,建议单机设置为 1,多机设置为 :code:`nccl_comm_num` +1。 -- 关于 AllReduce 融合 :code:`fuse_all_reduce_ops` ,默认情况下会将同一 layer 中参数的梯度的 AllReduce 操作合并成一个,比如对于 :code:`fluid.layers.fc` 中有 Weight 和 Bias 两个参数,打开该选项之后,原本需要两次 AllReduce 操作,现在只用一次 AllReduce 操作。此外,为支持更大粒度的参数梯度融合,Paddle 提供了 :code:`FLAGS_fuse_parameter_memory_size` 和 :code:`FLAGS_fuse_parameter_groups_size` 两个环境变量选项。用户可以指定融合 AllReduce 操作之后,每个 AllReduce 操作的梯度字节数,比如希望每次 AllReduce 调用传输 16MB 的梯度,:code:`export FLAGS_fuse_parameter_memory_size=16` ,经验值为总通信量的十分之一。可以指定每次 AllReduce 操作的最大层数,即到达该层数就进行 AllReduce,如指定 50 层 :code:`export FLAGS_fuse_parameter_groups_size=50` 。注意:目前不支持 sparse 参数梯度。 -- 关于使用分级式 reduce :code:`use_hierarchical_allreduce` 。对于多机模式,针对小数据量的通信,Ring AllReduce 通信效率低,采用 Hierarchical AllReduce 可以解决该问题。 -- 关于降低 scope drop 频率 :code:`num_iteration_per_drop_scope` 和 fetch 频率 :code:`fetch_frequency` 。减少 scope drop 和 fetch 频率,可以减少频繁的变量内存申请、释放和拷贝,从而提升性能。 -- 关于操作融合:通过参数融合可以提升训练性能。 - -设置这些参数可以参考: - -.. code-block:: python - :linenos: - - dist_strategy = DistributedStrategy() - dist_strategy.nccl_comm_num = 2 #建议多机设置为 2,单机设置为 1 - exec_strategy = fluid.ExecutionStrategy() - exe_st.num_threads = 3 #建议多机设置为 nccl_comm_num+1,单机设置为 1 - exec_strategy.num_iteration_per_drop_scope = 30 #scope drop 频率 - dist_strategy.exec_strategy = exec_strategy - dist_strategy.fuse_all_reduce_ops = True #AllReduce 是否融合 - ... - with fluid.program_guard(main_prog, startup_prog): #组网 - params = model.params - optimizer = optimizer_setting(params) - dist_optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) - dist_optimizer.minimize(avg_cost) - ... - for pass_id in range(PASS_NUM): - batch_id = 0 - while True: - if batch_id % fetch_frequency == 0: #fetch 频率 - fetched = exe.run(main_prog, fetch_list) - else: - exe.run([]) - - -3、训练方式 -=========== - -1、Local SGD - -GPU 多机多卡同步训练过程中存在慢 trainer 现象,即每步中训练快的 trainer 的同步通信需要等待训练慢的 trainer。由于每步中慢 trainer 的 rank 具有随机性,因此我们使用局部异步训练的方式——LocalSGD,通过多步异步训练(无通信阻塞)实现慢 trainer 时间均摊,从而提升同步训练性能。Local SGD 训练方式主要有三个参数,分别是: - -.. 
csv-table:: - :header: "选项", "类型", "可选值", "说明" - :widths: 3, 3, 3, 5 - - ":code:`use_local_sgd`", "bool", "False/True", "是否开启 Local SGD,默认不开启" - ":code:`local_sgd_is_warm_steps`", "int", "大于 0", "训练多少轮之后才使用 Local SGD 方式训练" - ":code:`local_sgd_steps`", "int", "大于 0", "Local SGD 的步长" - -说明: - -- Local SGD 的 warmup 步长 :code:`local_sgd_is_warm_steps` 影响最终模型的泛化能力,一般需要等到模型参数稳定之后在进行 Local SGD 训练,经验值可以将学习率第一次下降时的 epoch 作为 warmup 步长,之后再进行 Local SGD 训练。 -- Local SGD 步长 :code:`local_sgd_steps` ,一般该值越大,通信次数越少,训练速度越快,但随之而来的时模型精度下降。经验值设置为 2 或者 4。 - -具体的 Local SGD 的训练代码可以参考:https://github.com/PaddlePaddle/PaddleFleetX/tree/old_develop/deprecated/examples/local_sgd/resnet - - -2、使用混合精度训练 - -V100 GPU 提供了 `Tensor Core `_ 可以在混合精度计算场景极大的提升性能。使用混合精度计算的例子可以参考:https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#using-mixed-precision-training - -目前 Paddle 只提供在两个模型(ResNet, BERT)的混合精度计算实现并支持 static loss scaling,其他模型使用混合精度也可以参考以上的实现完成验证。 - -附录 ----- - -.. [#] 现代 GPU:指至少支持运行 `CUDA `_ 版本 7.5 以上的 GPU -.. [#] GPU 利用率:这里指 GPU 计算能力被使用部分所占的百分比 -.. [#] https://en.wikipedia.org/wiki/Thread_pool -.. [#] https://en.wikipedia.org/wiki/Data-flow_diagram diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_low_bandwidth_dgc.md b/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_low_bandwidth_dgc.md deleted file mode 100644 index ef1017c61c4..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_low_bandwidth_dgc.md +++ /dev/null @@ -1,123 +0,0 @@ -# 低配网络的分布式 GPU 训练 - -## 1. 背景 - 大规模分布式训练需要较高的网络带宽以便进行梯度的聚合更新,这限制了多节点训练时的可扩展性同时也需要昂贵的高带宽设备。在低带宽云网络等环境下进行分布式训练会变得更加糟糕。现有[Deep Gradient Compression](https://arxiv.org/abs/1712.01887)研究表明,分布式 SGD 中有 99.9%的梯度交换都是冗余的,可以使用深度梯度压缩选择重要梯度进行通信来减少通信量,降低对通信带宽的依赖。Paddle 目前实现了 DGC 的稀疏通信方式,可有效在低配网络下进行 GPU 分布式训练。下面将介绍 DGC 稀疏通信方式的使用方法、适用场景及基本原理。 - -## 2. 使用方法 -`注意:使用 DGC 请使用 1.6.2 及其之后版本,之前版本存在有若干 bug。` -DGC 稀疏通信算法以 DGCMomentumOptimizer 接口的形式提供,目前只支持 GPU 多卡及 GPU 多机分布式,由于现有 fuse 策略会造成 DGC 失效,所以使用 DGC 时需设置`strategy.fuse_all_reduce_ops=False`关闭 fuse。DGC 只支持 Momentum 优化器,使用时把当前代码中的 Momentum 优化器替换为 DGCMomentumOptimizer,并添加 DGC 所需参数即可。如下代码所示,其中 rampup_begin_step 表示从第几步开始使用 DGC,更详细参数可见[api 文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/DGCMomentumOptimizer_cn.html#dgcmomentumoptimizer)。 -``` python -import paddle.fluid as fluid -# optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9) -# 替换 Momentum 优化器,添加 DGC 所需参数 -optimizer = fluid.optimizer.DGCMomentumOptimizer( - learning_rate=0.001, momentum=0.9, rampup_begin_step=0) -optimizer.minimize(cost) -``` -在 fleet 中我们提供了[DGC 的示例](https://github.com/PaddlePaddle/PaddleFleetX/tree/old_develop/deprecated/examples/dgc_example)。示例中以数字手写体识别为例,将程序移植为分布式版本(注:DGC 亦支持单机多卡),再加上 DGC 优化器。可参照此示例将单机单卡程序迁移到 DGC。在单机单卡迁移到 DGC 过程中,一般需要先对齐多机 Momentum 的精度,再对齐 DGC 的精度。 - -## 3. 调参&适用场景 -### 3.1 预热调参 -对于正常的训练,使用 DGC 一般需进行预热训练,否则可能会有精度损失。如下图是 ResNet50 模型 Imagenet 数据集的训练结果,未进行预热训练的 DGC 最终损失了约 0.3%的精度。 -
-![DGC Resnet50 acc1](images/dgc_resnet50_acc1.png) -
- -预热训练调参可参照论文的设置。对图像分类,论文在 Cifar10 和 ImageNet 数据集上共 164 和 90 个 epochs 的训练中都采用了 4 个 epochs 的预热训练。在语言模型 PTB 数据集上,在共 40 个 epochs 的训练中选择了 1 个 epoch 进行预热训练。在语音识别 AN4 数据集上,80 个 epochs 中选择 1 个 epoch 进行预热训练。 -论文中使用了 75%, 93.75%, 98.4375%, 99.6%, 99.9%稀疏度逐渐提升的策略。由于 paddle 稀疏梯度聚合通信使用了 AllGather,通信量会随卡数增加而增长,所以在卡数较多时不推荐较低稀疏度的预热训练。如 75%稀疏度时每张卡会选择 25%的梯度进行通信,卡数为 32 时通信量是正常 dense 通信的 32\*(1-0.75)=8 倍,所以前几个 epoch 使用正常的 dense 通信为佳。可参照如下写法 -``` python -# 1. 以 1252 个 step 为一个 epoch,前 2 个 epochs 使用正常 dense 通信,后 3 个 epochs 逐步提升稀疏度为 99.9% -optimizer = fluid.optimizer.DGCMomentumOptimizer( - learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*2, - rampup_step=1252*3, sparsity=[0.984375, 0.996, 0.999]) -# 2. 前面 4 个 epochs 都使用 dense 通信,之后默认 0.999 稀疏度运行 -optimizer = fluid.optimizer.DGCMomentumOptimizer( - learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*4) -``` -对于 Fine-tuning 训练,现测试可无需预热训练,从第 0 个 epoch 直接使用 DGC 即可。 -``` python -# 从第 0 步开始 DGC 稀疏通信 -optimizer = fluid.optimizer.DGCMomentumOptimizer( - learning_rate=0.001, momentum=0.9, rampup_begin_step=0) -``` -### 3.2 适用场景 -DGC 稀疏通信在低带宽通信瓶颈时会有较大的性能提升,但在单机多卡及 RDMA 网络通信并非瓶颈情况下,并不会带来性能上的提升。同时由于 AllGather 的通信量会随卡数的增多而增大,所以 DGC 的多机训练规模也不宜过大。故 DGC 适用于低配网络,同时节点规模不宜过大,如>128 张卡。在云网络或高带宽网络设备昂贵时,DGC 可有效降低训练成本。 - -## 4. 原理 -本节原理部分基本来自[Deep Gradient Compression](https://arxiv.org/abs/1712.01887)论文,本文进行了部分理解翻译,英文较好者建议直接阅读论文。 -### 4.1 梯度稀疏 -DGC 的基本思路是通过只传送重要梯度,即只发送大于给定阈值的梯度来减少通信带宽的使用。为避免信息的丢失,DGC 会将剩余梯度在局部累加起来,最终这些梯度会累加大到足以传输。 -换个角度,从理论依据上来看,局部梯度累加等同于随时间推移增加 batch size,(DGC 相当于每一个梯度有自己的 batch size)。设定 $F(w)$ 为需要优化的 loss 函数,则有着 N 个训练节点的同步分布式 SGD 更新公式如下 -$$ -F(w)=\\frac{1}{\|\\chi\|}\\sum\_{x\\in\\chi}f(x, w), \\qquad w\_{t+1}=w\_{t}-\\eta\\frac{1}{N b}\\sum\_{k=0}^{N}\\sum\_{x\\in\\mathcal{B}\_{k,t}}\\nabla f\\left(x, w\_{t}\\right) \\tag{1} -$$ -其中$\chi$是训练集,$w$是网络权值,$f(x, w)$是每个样本$x \in \chi$的 loss,$\eta$是学习率,N 是训练节点个数,$\mathcal{B}_{k, t}$代表第$k$个节点在第$t$个迭代时的 minibatch,大小为 b。 -考虑权重的第 i 个值,在 T 次迭代后,可获得 -$$ -w\_{t+T}^{(i)}=w\_{t}^{(i)}-\\eta T \\cdot \\frac{1}{N b T} \\sum\_{k=1}^{N}\\left(\\sum\_{\\tau=0}^{T-1} \\sum\_{x \\in \\mathcal{B}\_{k, t+\\tau}} \\nabla^{(i)} f\\left(x, w\_{t+\\tau}\\right)\\right) \\tag{2} -$$ -等式 2 表明局部梯度累加可以被认为 batch size 从$Nb$增大为$NbT$,其中 T 是$w^{(i)}$两次更新的稀疏通信间隔。 -### 4.2 局部梯度累加改进 -正常情况,稀疏更新会严重影响收敛性。DGC 中采用动量修正(Momentum Correction)和局部梯度裁减(local gradient clipping)来解决这个问题。 -#### 4.2.1 动量修正 -有着 N 个节点分布式训练中 vanilla momentum SGD 公式, -$$ -u\_{t}=m u\_{t-1}+\\sum\_{k=1}^{N}\\left(\\nabla\_{k, t}\\right), \\quad w\_{t+1}=w\_{t}-\\eta u\_{t} \\tag{3} -$$ -其中$m$是动量因子,$N$是节点数,$\nabla_{k, t}=\frac{1}{N b} \sum_{x \in \mathcal{B}_{k, t}} \nabla f\left(x, w_{t}\right)$。 -考虑第 i 个权重$w^{(i)}$,在 T 次迭代后,权重更新公式如下, -$$ -w\_{t+T}^{(i)}=w\_{t}^{(i)}-\\eta\\left[\\cdots+\\left(\\sum\_{\\tau=0}^{T-2} m^{\\tau}\\right) \\nabla\_{k, t+1}^{(i)}+\\left(\\sum\_{\\tau=0}^{T-1} m^{\\tau}\\right) \\nabla\_{k, t}^{(i)}\\right] \\tag{4} -$$ -如果直接应用动量 SGD 到稀疏梯度更新中,则有公式, -$$ -v_{k, t}=v_{k, t-1}+\\nabla_{k, t}, \\quad u_{t}=m u_{t-1}+\\sum_{k=1}^{N} \\operatorname{sparse}\\left(v_{k, t}\\right), \\quad w_{t+1}=w_{t}-\\eta u_{t} \\tag{5} -$$ -其中$v_k$是训练节点 k 上的局部梯度累加项,一旦$v_k$大于某一阈值,则会在第二项中压缩梯度进行动量更新,并使用 sparse()函数获得 mask 清空大于阈值的梯度。 -$w^{(i)}$在 T 次稀疏更新后的权重为, -$$ -w_{t+T}^{(i)}=w_{t}^{(i)}-\\eta\\left(\\cdots+\\nabla_{k, t+1}^{(i)}+\\nabla_{k, t}^{(i)}\\right) \\tag{6} -$$ -相比传统动量 SGD,方程 6 缺失了累积衰减因子$\sum_{\tau=0}^{T-1} m^{\tau}$,会导致收敛精度的损失。如下图 A,正常梯度更新从 A 点到 B 点,但是方程 6 则从 A 点到 C 点。当稀疏度很高时,会显著降低模型性能,所以需要在方程 5 基础上对梯度进行修正。 -
(图 (a) 未做动量修正、图 (b) 做了动量修正的更新路径对比示意图,原图缺失)
-若将方程 3 中速度项$u_t$当作“梯度”,则方程 3 第二项可认为是在”梯度“$u_t$上应用传统 SGD,前面已经证明了局部梯度累加在传统 SGD 上是有效的。因此,可以使用方程 3 局部累加速度项$u_t$而非累加真实的梯度$\nabla_{k, t}$来修正方程 5, -$$ -u_{k, t}=m u_{k, t-1}+\\nabla_{k, t}, \\quad v_{k, t}=v_{k, t-1}+u_{k, t}, \\quad w_{t+1}=w_{t}-\\eta \\sum_{k=1}^{N} \\operatorname{sparse}\\left(v_{k, t}\\right) \\tag{7} -$$ -修正后,如上图(b),方程可正常从 A 点到 B 点。除了传统动量方程修正,论文还给出了 Nesterov 动量 SGD 的修正方程。 -#### 4.2.2 局部梯度修剪 -梯度修剪是防止梯度爆炸的常用方法。这方法由 Pascanu 等人在 2013 年提出,当梯度的 l2-norms 和大于给定阈值时,就对梯度 rescale。正常梯度修剪在梯度聚合后使用,而 DGC 因为每个节点独立的进行局部梯度累加,所以 DGC 在使用$G_t$累加前对其进行局部梯度修剪。阈值缩放为原来的$N^{-1/2}$ -$$ -thr_{G^{k}}=N^{-1 / 2} \\cdot thr_{G} \\tag{8} -$$ -### 4.3 克服迟滞效应 -因为推迟了较小梯度更新权重的时间,所以会有权重陈旧性问题。稀疏度为 99.9%时大部分参数需 600 到 1000 步更新一次。迟滞效应会减缓收敛并降低模型精度。DGC 中采用动量因子掩藏和预热训练来解决这问题。 -#### 4.3.1 动量因子掩藏 -DGC 中使用下面方程来掩藏动量因子减缓陈旧性问题。 -$$ -Mask \\leftarrow\\left|v_{k, t}\\right|>t h r, \\quad v_{k, t} \\leftarrow v_{k, t} \\odot \\neg Mask, \\quad u_{k, t} \\leftarrow u_{k, t} \\odot \\neg Mask \\tag{9} -$$ -此掩码可以停止延迟梯度产生的动量,防止陈旧梯度把权重引入错误的方向。 - -#### 4.3.2 预热训练 -在训练初期,梯度变动剧烈,需要及时更新权重,此时迟滞效应影响会很大。为此 DGC 采用预热训练的方法,在预热期间使用更小的学习率来减缓网络的变化速度,并使用较小的稀疏度来减少需推迟更新的梯度数量。预热期间会线性增大学习率,指数型增加稀疏度到最终值。 - -### 4.4 正则化(Weight Decay)项修正 -Paddle 框架以 Weight Decay 的形式实现正则化。以 L2Decay 为例,公式(3)中传统 momentum 添加 weight decay 后公式为 -$$ -G_{t}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+\\lambda w_{t}, \\quad u_{t}=m u_{t-1}+G_{t}, \\quad w_{t+1}=w_{t}-\\eta u_{t} \\tag{10} -$$ -其中$\lambda$为 Weight Decay 系数,$G_{t}$为添加 L2Decay 项之后的聚合梯度。由于在公式 7 中进行了局部动量修正,所以按照相同思路在局部梯度上运用修正的 Weight Decay 项。如下公式在局部梯度上添加局部 Weight Decay 项即可。 -$$ -\\nabla_{k, t}=\\nabla_{k, t}+\\frac{\\lambda}{N} w_{t} \\tag{11} -$$ -在模型实际训练中,通常会设置 weight decay 的系数$\lambda=10^{-4}$,在卡数较多如 4 机 32 卡的情况下局部 weight decay 系数为$\frac{\lambda}{N}=\frac{10^{-4}}{32}=3.125*10^{-6}$,在数值精度上偏低,测试训练时会损失一定精度。为此还需对局部 weight decay 项进行数值修正。如下公式, -$$ -\\nabla_{k, t}^{'}=N \\nabla_{k, t}+\\lambda w_{t}, \\quad -G_{t}^{'}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}^{'}\\right)=N\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+N\\lambda w_{t}, \\quad -G_{t}=\\frac{G_{t}^{'}}{N}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+\\lambda w_{t} \\tag{12} -$$ -具体做法为对局部梯度乘以卡数求得$\nabla_{k, t}^{'}$,此时$\lambda$项则无需除以卡数,聚合梯度求得$G_{t}^{'}$再对聚合梯度除以卡数得到$G_{t}$即可。 diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute.rst b/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute.rst deleted file mode 100644 index ad16707dd87..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute.rst +++ /dev/null @@ -1,160 +0,0 @@ - -重计算:大 Batch 训练特性 -============= - -背景 ---------- - -随着训练数据规模的逐渐增加,训练更大、更深的深度学习模型成为一个主流趋势。目前的深度学习模型训练,通常要求保留前向计算的隐层结果,并且需要保存结果的数量会随着模型层数的增加线性增加,这对于目前能够使用的 AI 芯片的内存大小是个挑战。Forward Recomputation Backpropagation(FRB)可以在额外增加少量计算的情况下,显著增加模型的层数和宽度,同时也可以显著提升模型训练的 batch 大小。 - -原理 ---------- - -我们知道,深度学习网络的一次训练迭代包含三个步骤: - -- **前向计算**:运行前向算子(Operator) 来计算中间隐层(Variable)的值 -- **反向计算**:运行反向算子来计算参数(Parameter)的梯度 -- **优化**:应用优化算法以更新参数值 - -在前向计算过程中,前向算子会输出大量的中间计算结果,在 Paddle 中,使用 -Variable 来存储这些隐层的中间结果。当模型层数加深时,其数量可达成千上万个, -占据大量的内存。Paddle 的 `显存回收机制 `_ -会及时清除无用的中间结果,以节省存储。 -然而,有些中间结果是反向算子的输入,这些 Variable 必须存储在内存中,直到相应的反向算子计算完毕。 - -举个简单的例子, 我们定义一个由 mul 算子构成的网络,其前向计算为: - -.. math:: - - y = W_1 * x - - z = W_2 * y - -其中 :math:`x, y, z` 为向量, :math:`W_1, W_2` 为矩阵。容易知道,求 :math:`W_2` 梯度的反向计算为: - -.. 
math:: - W_{2}^{'} = z^{'} / y - -可以看到反向计算中用到了前向计算生成的变量 :math:`y` ,因此变量 :math:`y` 必须存储在内存中,直到这个反向算子计算完毕。当模型加深时,我们会有大量的“ :math:`y` ”,占据了大量的内存。 - -Forward Recomputation Backpropagation(FRB)的思想是将深度学习网络切分为 k 个部分(segments)。对每个 segment 而言:前向计算时,除了小部分必须存储在内存中的 Variable 外(我们后续会讨论这些特殊 Variable),其他中间结果都将被删除;在反向计算中,首先重新计算一遍前向算子,以获得中间结果,再运行反向算子。简而言之,FRB 和普通的网络迭代相比,多计算了一遍前向算子。 - -我们把切分网络的变量叫做 checkpoints。 -那么问题来了,如何选择 checkpoints 呢?自从 FRB 方法提出以来 \ :sup:`[1], [2]`,大量学者在研究这一关键问题。 -我们知道深度学习网络通常是由一个个模块串联得到的,比如 ResNet-50 由 16 个 block 串联而成, -Bert-Large 由 24 个 transformer 串联而成,以两个子模块中间的变量作为切分点就是一个很好的选择。 -对于非串联的网络(比如含有大量 shortcut 结构的网络),FRB 也支持对其做切分, -只是可能多耗费一点内存(用于存储 shortcut 的 Variable)。 -Mitsuru Kusumoto \ :sup:`[3]` 等提出了一种基于动态规划的算法, -可以根据指定的内存自动搜索合适的 checkpoints,支持各种各样的网络结构。 - -下图是由 4 个 fc Layer、3 个 relu Layer、1 个 sigmoid Layer 和 1 个 log-loss Layer 串联而成的一个网络:最左侧为其前向计算流程、中间是普通的前向计算和反向计算流程、最右侧为添加 FRB 后的前向计算和反向计算流程。其中方框代表算子(Operator),红点代表前向计算的中间结果、蓝点代表 checkpoints。 - -.. image:: images/recompute.png - -注:该例子完整代码位于 `source `_ - -添加 FRB 后,前向计算中需要存储的中间 Variable 从 4 个(红点)变为 2 个(蓝点), -从而节省了这部分内存。当然了,重计算的部分也产生了新的中间变量, -这就需要根据实际情况来做权衡了。这个例子里的网络比较浅,通常来讲, -对层数较深的网络,FRB 节省的内存要远多于新增加的内存。 - -使用方法 ---------- - -我们实现了基于 Paddle 的 FRB 算法,叫做 RecomputeOptimizer, -您可以根据其 `源码 `_ -与 -`文档 `_ -更深入地了解这一算法。我们为用户提供了两个使用 RecomputeOptimizer 的方法: -直接调用与 Fleet API 中使用。在单机单卡或者 CPU 训练中建议您直接调用 RecomputeOptimizer, -在多卡训练或者多机训练任务上建议您在 Fleet API 中使用 Recompute。 - -**1. 直接调用** - -直接调用 RecomputeOptimizer 非常简单,首先要定义一个经典的 Optimizer,比如 Adam; -然后在外面包一层 RecomputeOptimizer;最后设置 checkpoints 即可。 - -.. code-block:: python - - import paddle.fluid as fluid - # 定义网络 - def mlp(input_x, input_y, hid_dim=128, label_dim=2): - print(input_x) - fc_1 = fluid.layers.fc(input=input_x, size=hid_dim) - prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax') - cost = fluid.layers.cross_entropy(input=prediction, label=input_y) - sum_cost = fluid.layers.reduce_mean(cost) - return sum_cost, fc_1, prediction - input_x = fluid.layers.data(name="x", shape=[32], dtype='float32') - input_y = fluid.layers.data(name="y", shape=[1], dtype='int64') - cost, fc_1, pred = mlp(input_x, input_y) - # 定义 RecomputeOptimizer - sgd = fluid.optimizer.Adam(learning_rate=0.01) - sgd = fluid.optimizer.RecomputeOptimizer(sgd) - # 设置 checkpoints - sgd._set_checkpoints([fc_1, pred]) - # 运行优化算法 - sgd.minimize(cost) - -Recompute 原则上适用于所有 Optimizer。 - -**2. 在 Fleet API 中使用 Recompute** - -`Fleet API `_ -是基于 Fluid 的分布式计算高层 API。在 Fleet API 中添加 RecomputeOptimizer -仅需要 2 步: - -- 设置 dist_strategy.forward_recompute 为 True; - -- 设置 dist_strategy.recompute_checkpoints。 - -.. code-block:: python - - from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy - dist_strategy = DistributedStrategy() - dist_strategy.forward_recompute = True - dist_strategy.recompute_checkpoints=checkpoints - optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) - optimizer.minimize(loss) - -为了帮助您快速地用 Fleet API 使用 Recompute 任务,我们提供了一些例子, -并且给出了这些例子的计算速度、效果和显存节省情况: - -- 用 Recompute 做 Bert Fine-tuning: `source `_ - -- 用 Recompute 做目标检测:开发中. 
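在上述示例之外,这里再补充一个选取 checkpoints 的示意写法(非官方示例,接口用法沿用上文的 RecomputeOptimizer 说明,网络结构为假设的串联全连接 block,仅作说明):对于由多个相同子模块串联而成的网络,可以把每个子模块的输出收集起来作为 checkpoints。

.. code-block:: python

    import paddle.fluid as fluid

    def stacked_mlp(x, num_blocks=4, hid_dim=128):
        # 把每个 block 的输出作为一个切分点(checkpoint)
        checkpoints = []
        h = x
        for i in range(num_blocks):
            h = fluid.layers.fc(input=h, size=hid_dim, act='relu')
            checkpoints.append(h)
        pred = fluid.layers.fc(input=h, size=2, act='softmax')
        return pred, checkpoints

    x = fluid.layers.data(name="x", shape=[32], dtype='float32')
    y = fluid.layers.data(name="y", shape=[1], dtype='int64')
    pred, checkpoints = stacked_mlp(x)
    cost = fluid.layers.reduce_mean(fluid.layers.cross_entropy(input=pred, label=y))

    sgd = fluid.optimizer.Adam(learning_rate=0.01)
    sgd = fluid.optimizer.RecomputeOptimizer(sgd)
    sgd._set_checkpoints(checkpoints)   # 以各 block 的输出作为 checkpoints
    sgd.minimize(cost)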
- -Q&A -------- - -- **是否支持带有随机性的 Op?** - - 目前 Paddle 中带随机性的 Op 有:dropout,Recompute 支持 - dropout Operator,可以保证重计算与初次计算结果保持一致。 - -- **有没有更多 Recompute 的官方例子?** - - 更多 Recompute 的例子将更新在 `examples `_ - 和 `Fleet `_ 库下,欢迎关注。 - -- **有没有添加 checkpoints 的建议?** - - 我们建议将子网络连接部分的变量添加为 checkpoints,即: - 如果一个变量能将网络完全分为前后两部分,那么建议将其加入 checkpoints。 - checkpoints 的数目会影响内存的消耗:如果 checkpoints 很少, - 那么 Recompute 起的作用有限;如果 checkpoints 数量过多, - 那么 checkpoints 本身占用的内存量就较大,内存消耗可能不降反升。 - - 我们后续会添加一个估算内存用量的工具, - 可以对每个 Operator 运算前后的显存用量做可视化, - 帮助用户定位问题。 - -[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin . Training deep nets with sublinear memory cost. -arXiv preprint, arXiv:1604.06174, 2016. - -[2] Audrunas Gruslys , Rémi Munos , Ivo Danihelka , Marc Lanctot , and Alex Graves. Memory efficient -backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125 4133, -2016. - -[3] Kusumoto, Mitsuru, et al. "A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation." arXiv preprint arXiv:1905.11722 (2019). diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute_en.rst b/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute_en.rst deleted file mode 100644 index d9a556bb74c..00000000000 --- a/docs/advanced_guide/performance_improving/multinode_training_improving/gpu_training_with_recompute_en.rst +++ /dev/null @@ -1,196 +0,0 @@ - -Recompute: Training with bigger batch size -============= - -Context ---------- - -As the amount of training data increases, training deeper neural network models becomes more and more popular. Current deep-learning training usually keeps the hidden layer outputs in memory during the forward propagation, -and the number of outputs increases linearly with -the increase of the number of model layers, -which becomes a challenge of the memory size -for common devices. - - -Theory ---------- - -As we know, a training process of a deep-learning network contains 3 steps: - -- **Forward Propagation**:Running forward operators and generate temporary variables as output -- **Backward Propagation**:Running backward operators to compute gradients of parameters -- **Optimization**:Applying optimization algorithm to update parameters - -When the model becomes deeper, the number of temporary variables -generated in the forward propagation process can reach tens -of thousands, occupying a large amount of memory. -The `Garbage Collection mechanism `_ -in Paddle can delete useless variables for the sake of saving memory. -However, some variables serve as inputs of backward operators, -they must be kept in memory until particular operator finish. - -Take a simple example, define a network contains two `mul` operators, -the forward propagation works as follows: - -.. math:: - - y = W_1 * x - - z = W_2 * y - -where :math:`x, y, z` are vectors, :math:`W_1, W_2` are matrix。It is easy to conduct that the gradient of :math:`W_2` is: - -.. math:: - W_{2}^{'} = z^{'} / y - -We can see that :math:`y` is used in the backward propagation process, -thus it must be kept in the memory during the whole forward propagation. -When network grows deeper, more 'y's need to be stored, -adding more requirements to the memory. - -Forward Recomputation Backpropagation(FRB) splits a deep network to k segments. 
-For each segment, in forward propagation, -most of the temporary variables are erased in time, -except for some special variables (we will talk about that later); -in backward propagation, the forward operators will be recomputed -to get these temporary variables before running backward operators. -In short, FBR runs forward operators twice. - -But how to split the network? A deep learning network usually consists -of connecting modules in series: -ResNet-50 contains 16 blocks and Bert-Large contains 24 transformers. -It is a good choice to treat such modules as segments. -The variables among segments are -called as checkpoints. - -The following picture is a network with 4 fc layers, 3 relu layers, -1 sigmoid layer and 1 log-loss layer in series. -The left column is the forward propagation, -the middle column is the normal backward propagation, -and the right column is the FRB. -Rectangular boxes represent the operators, red dots represent -the intermediate variables in forward computation, blue dots -represent checkpoints and arrows represent the dependencies between operators. - -.. image:: images/recompute.png - -Note: the complete source code of this example: `source `_ - -After applying FBR, the forward computation only needs to store -2 variables (the blue dots) instead of 4 variables (the red -dots), saving the corresponding memories. It is notable that -recomputing operators generate new intermediate variables at the same time, -a trade-off needs to be considered in this situation. -While according to our experiments, -FBR usually saves rather than increase the memory load. - -Usage ---------- - -We have implemented the FRB algorithm named "RecomputeOptimizer" -based on Paddle. More information about this algorithm can -be learned by the `source code `_ -and the -`document `_ -of RecomputeOptimizer. - -There are 2 methods to apply RecomputeOptimizer in your Paddle -program: call RecomputeOptimizer directly or use it with Fleet -API. For single-GPU card training or CPU training, we recommend -directly calling; For multi-GPU training, we -recommend using with Fleet API. - -**1. Directly calling** - -Calling RecomputeOptimizer is very easy: first, define a classic -optimizer, such as Adam; second, wrap it with RecomputeOptimizer; -third, set the checkpoints. - -.. code-block:: python - - import paddle.fluid as fluid - # Define the network - def mlp(input_x, input_y, hid_dim=128, label_dim=2): - print(input_x) - fc_1 = fluid.layers.fc(input=input_x, size=hid_dim) - prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax') - cost = fluid.layers.cross_entropy(input=prediction, label=input_y) - sum_cost = fluid.layers.reduce_mean(cost) - return sum_cost, fc_1, prediction - input_x = fluid.layers.data(name="x", shape=[32], dtype='float32') - input_y = fluid.layers.data(name="y", shape=[1], dtype='int64') - cost, fc_1, pred = mlp(input_x, input_y) - # define RecomputeOptimizer - sgd = fluid.optimizer.Adam(learning_rate=0.01) - sgd = fluid.optimizer.RecomputeOptimizer(sgd) - # set checkpoints - sgd._set_checkpoints([fc_1, pred]) - # apply optimization - sgd.minimize(cost) - -In principle, recompute is for all kinds of optimizers in Paddle. - -**2. Using Recompute in Fleet API** - -`Fleet API `_ -is a high-level API for distributed training in Fluid. Adding -RecomputeOptimizer to Fluid takes two steps: - -- set dist_strategy.forward_recompute to True - -- set dist_strategy.recompute_checkpoints - -.. 
code-block:: python - - from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy - dist_strategy = DistributedStrategy() - dist_strategy.forward_recompute = True - dist_strategy.recompute_checkpoints=checkpoints - optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) - optimizer.minimize(loss) - -We supply some examples of using recompute in Fleet API for users. -We also post corresponding training speed, -test results and memory usages of these examples for reference. - - -- Fine-tuning Bert Large model with recomputing: `source `_ - -- Training object detection models with recomputing:developing. - -Q&A -------- - -- **Does RecomputeOptimizer support operators with random outputs?** - -We currently found that the dropout operator has random results -and RecomputeOptimizer is able to keep the outputs of -first-computation and recomputation consistent. - - -- **Are there more official examples of Recompute?** - - More examples will be updated at `examples `_ -and `Fleet `_ . Feel free to -raise issues if you get any problem with these examples. - -- **How should I set checkpoints?** - -The position of checkpoints is important: -we suggest setting the variable between the sub-model as checkpoints, -that is, set a variable as a checkpoint if it -can separate the network into two parts without short-cut connections. -The number of checkpoints is also important: -too few checkpoints will reduce the memory saved by recomputing while -too many checkpoints will occupy a lot of memory themselves. -We will add a tool to estimate the memory usage with specific checkpoints, -helping users to choose checkpointing variables. - -[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin . Training deep nets with sublinear memory cost. -arXiv preprint, arXiv:1604.06174, 2016. - -[2] Audrunas Gruslys , Rémi Munos , Ivo Danihelka , Marc Lanctot , and Alex Graves. Memory efficient -backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125 4133, -2016. - -[3] Kusumoto, Mitsuru, et al. "A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation." arXiv preprint arXiv:1905.11722 (2019). 
diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_resnet50_acc1.png b/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_resnet50_acc1.png deleted file mode 100644 index 6fe02f64a5e..00000000000 Binary files a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_resnet50_acc1.png and /dev/null differ diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_with_momentum_correction.png b/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_with_momentum_correction.png deleted file mode 100644 index 22f169ab479..00000000000 Binary files a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_with_momentum_correction.png and /dev/null differ diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_without_momentum_correction.png b/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_without_momentum_correction.png deleted file mode 100644 index 533a4c293df..00000000000 Binary files a/docs/advanced_guide/performance_improving/multinode_training_improving/images/dgc_without_momentum_correction.png and /dev/null differ diff --git a/docs/advanced_guide/performance_improving/multinode_training_improving/images/recompute.png b/docs/advanced_guide/performance_improving/multinode_training_improving/images/recompute.png deleted file mode 100644 index 11e9778305c..00000000000 Binary files a/docs/advanced_guide/performance_improving/multinode_training_improving/images/recompute.png and /dev/null differ diff --git a/docs/dev_guides/api_contributing_guides/new_cpp_op_en.md b/docs/dev_guides/api_contributing_guides/new_cpp_op_en.md deleted file mode 100755 index b9d52690770..00000000000 --- a/docs/dev_guides/api_contributing_guides/new_cpp_op_en.md +++ /dev/null @@ -1,478 +0,0 @@ -# How to write a new operator - - -## Background - -Here are the base types needed. For details, please refer to the design docs. - -- `class OpProtoAndCheckerMaker`: Describes an Operator's input, output, attributes and description, mainly used to interface with Python API. -- `framework::OperatorBase`: Operator (Op)base class. -- `framework::OpKernel`: Base class for Op computation kernel. -- `framework::OperatorWithKernel`: Inherited from OperatorBase, describing an operator with computation kernels. - - -Operators can be categorized into two groups: operator with kernel(s) and operator without kernel(s). An operator with kernel(s) inherits from `OperatorWithKernel` while the one without kernel(s) inherits from `OperatorBase`. This tutorial focuses on implementing operators with kernels. In short, an operator includes the following information: - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Information | Where is it defined |
| --- | --- |
| OpProtoMake definition | `.cc` files; a backward Op does not need an OpProtoMake interface. |
| Op definition | `.cc` files |
| Kernel implementation | The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu` files. |
| Registering the Op | Ops are registered in `.cc` files; for kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation. |
- - -New Operator implementations are added to the list [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable).** The system will use the naming scheme to automatically build operators and their corresponding Python extensions.** - - -Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to introduce the writing of an Operator with Kernel. - - -## Implementing C++ Types - - -### Defining ProtoMaker - -Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output. - -First, define `ProtoMaker` to describe the Operator's input, output, and additional comments: - -```cpp -class MulOpMaker : public framework::OpProtoAndCheckerMaker { - public: - MulOpMaker(OpProto *proto, OpAttrChecker *op_checker) - : OpProtoAndCheckerMaker(proto, op_checker) { - AddInput("X", "(Tensor), 2D tensor of size (M x K)"); - AddInput("Y", "(Tensor), 2D tensor of size (K x N)"); - AddOutput("Out", "(Tensor), 2D tensor of size (M x N)"); - AddComment(R"DOC( -Two Element Mul Operator. -The equation is: Out = X * Y -)DOC"); - } -}; -``` - -[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127)is inherited from`framework::OpProtoAndCheckerMaker`, consisting of 2 variables in the constructor: - - - `framework::OpProto` stores Operator input and variable attribute, used for generating Python API interfaces. - - `framework::OpAttrChecker` is used to validate variable attributes. - -The constructor utilizes `AddInput` to add input parameter, `AddOutput` to add output parameter, and `AddComment` to add comments for the Op, so that the corresponding information will be added to `OpProto`. - -The code above adds two inputs `X` and `Y` to `MulOp`, an output `Out`, and their corresponding descriptions. Names are given in accordance to Paddle's [naming convention](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md). - - -An additional example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) is implemented as follows: - - -```cpp -template -class ScaleOpMaker : public framework::OpProtoAndCheckerMaker { - public: - ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker) - : OpProtoAndCheckerMaker(proto, op_checker) { - AddInput("X", "(Tensor) Input tensor of scale operator."); - AddOutput("Out", "(Tensor) Output tensor of scale operator."); - AddComment(R"DOC( -Scale operator -$$Out = scale*X$$ -)DOC"); - AddAttr("scale", - "(float, default 1.0)" - "The scaling factor of the scale operator.") - .SetDefault(1.0); - } -}; -``` - -Note `AddAttr("scale", "...").SetDefault(1.0);` adds `scale`constant as an attribute, and sets the default value to 1.0. - - -### Defining the GradProtoMaker class - -Each Op must have a corresponding GradProtoMaker. If GradProtoMaker corresponding to the forward Op is not customized, Fluid provides DefaultGradProtoMaker. The default registration will use all input and output, including Input, Output, Output@Grad and so on. Using unnecessary variables will cause waste of memory. -The following example defines ScaleOp's GradProtoMaker. 
- -```cpp -class ScaleGradMaker : public framework::SingleGradOpDescMaker { - public: - using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; - - std::unique_ptr Apply() const override { - auto *grad_op = new framework::OpDesc(); - grad_op->SetType("scale"); - grad_op->SetInput("X", OutputGrad("Out")); - grad_op->SetOutput("Out", InputGrad("X")); - grad_op->SetAttr("scale", GetAttr("scale")); - return std::unique_ptr(grad_op); - } -}; -``` - -### Defining Operator - -The following code defines the interface for MulOp: - -```cpp -class MulOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(const framework::InferShapeContext &ctx) const override { - //never use Input or Output if you want a to get a LoDTensor. - auto dim0 = ctx.Input("X")->dims(); - auto dim1 = ctx.Input("Y")->dims(); - PADDLE_ENFORCE_EQ(dim0.size(), 2, - "input X(%s) should be a tensor with 2 dims, a matrix", - ctx.op_.Input("X")); - PADDLE_ENFORCE_EQ(dim1.size(), 2, - "input Y(%s) should be a tensor with 2 dims, a matrix", - ctx.op_.Input("Y")); - PADDLE_ENFORCE_EQ( - dim0[1], dim1[0], - "First matrix's width must be equal with second matrix's height."); - ctx.Output("Out")->Resize({dim0[0], dim1[1]}); - } -}; -``` - -[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L24) is inherited from `OperatorWithKernel`. Its `public` member - -```cpp -using framework::OperatorWithKernel::OperatorWithKernel; -``` - -expresses an operator constructor using base class `OperatorWithKernel`, alternatively written as - -```cpp -MulOp(const std::string &type, const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : OperatorWithKernel(type, inputs, outputs, attrs) {} -``` - -`InferShape` interface needs to be re-written.`InferShape` is a const method and cannot modify Op's member variables. Its constant member `const framework::InferShapeContext &ctx` can be used to extract input, output, and attributes. Its functions are - - - 1). validate and error out early: it checks input data dimensions and types. - - 2). configures the tensor shape in the output. - -Usually `OpProtoMaker` and `Op` definitions are written in `.cc` files, which also include the registration methods introduced later. - - -### Defining OpKernel - -`MulKernel` is derived from `framework::OpKernel`, which includes the following templates: - -- `typename DeviceContext` denotes device context type. When different devices, namely the CPU and the CUDA, share the same kernel, this template needs to be added. If they don't share kernels, this must not be added. An example of a non-sharing kernel is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43). - -- `typename T` denotes data type, such as `float` or `double`. - -`MulKernel` types need to rewrite the interface for `Compute`. - -- `Compute` takes one input parameter: `const framework::ExecutionContext& context`. -- Compared with `InferShapeContext`, `ExecutionContext` includes device types, and can similarly extract input, output, and attribute variables. -- `Compute` function implements the computation logics of an `OpKernel`. - -The input and output of Op can be obtained by `ExecutionContext::Input()` and `ExecutionContext::Output()` respectively. 
- -**Note:** If the input/output variable type of op is `LoDTensor` (In Fluid, all Tensors are LoDTensor type by default), please write `ExecutionContext::Input()` and `ExecutionContext:: Output()`, do not write `ExecutionContext::Input()` and `ExecutionContext::Output()`. Because if the actual variable type is `SelectedRows`, the `Input()` and `Output()` methods will specialize the `SelectedRows` type to `Tensor`, causing a potential error. - - -`MulKernel`'s implementation of `Compute` is as follows: - -```cpp -template -class MulKernel : public framework::OpKernel { -public: -void Compute(const framework::ExecutionContext& context) const override { - auto* X = context.Input("X"); - auto* Y = context.Input("Y"); - auto* Z = context.Output("Out"); - Z->mutable_data(context.GetPlace()); - auto& device_context = context.template device_context(); - math::matmul(*X, false, *Y, false, 1, Z, 0, device_context); -} -}; -``` - -Note that **different devices (CPU, CUDA)share one Op definition; whether or not they share the same `OpKernel` depends on whether functions called by `Compute`can support both devices.** - -`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc). - -To ease the writing of `OpKernel` compute, and for reusing code cross-device, [`Eigen-unsupported Tensor`](https://bitbucket.org) module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md). - - -This concludes the forward implementation of an operator. Next its operation and kernel need to be registered in a `.cc` file. - -The definition of its corresponding backward operator, if applicable, is similar to that of an forward operator. **Note that a backward operator does not include a `ProtoMaker`**. - - - -### Registering Operator and OpKernel - -- In `.cc` files, register forward and backward operator classes and the CPU kernel. - - ```cpp - namespace ops = paddle::operators; - REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker, - paddle::framework::DefaultGradOpDescMaker) - REGISTER_OPERATOR(mul_grad, ops::MulGradOp) - REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel); - REGISTER_OP_CPU_KERNEL(mul_grad, - ops::MulGradKernel); - ``` - - In that code block, - - - `REGISTER_OPERATOR` registers the `ops::MulOp` class, with the type named `mul`. Its `ProtoMaker` is `ops::MulOpMaker`. Register `ops::MulOpGrad` as type named `mul_grad`. - - `REGISTER_OP_CPU_KERNEL` registers `ops::MulKernel` class and specializes template parameters as type `paddle::platform::CPUPlace` and `float`, and also registers `ops::MulGradKernel`. - - -- Registering CUDA Kernel in `.cu` files - - Note that if CUDA Kernel is implemented using the `Eigen unsupported` module, then on top of `.cu`, a macro definition `#define EIGEN_USE_GPU` is needed, such as - - ```cpp - // if use Eigen unsupported module before include head files - #define EIGEN_USE_GPU - - namespace ops = paddle::operators; - REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel); - REGISTER_OP_CUDA_KERNEL(mul_grad, - ops::MulGradKernel); - - ``` - - -### Compilation - -In folder `build/paddle/fluid/operators`, run the following commands to compile. 
- -``` -make mul_op -``` - - -## Python Binding - -The system will automatically bind the new op to Python and link it to a generated library. - - -## Unit Tests - -Unit tests for an operator include - -1. comparing a forward operator's implementations on different devices (CPU, CUDA) - -2. comparing a backward operator's implementation on different devices (CPU, CUDA) - -3. a gradient test for the backward operator. - -Here, we introduce the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/tree/develop/test/legacy_test/test_mul_op.py). - - - -### Unit Test for Forward Operators - -The Op unit test is inherited from `OpTest`. More specific unit tests are done in `TestMulOp`. To test the Operator, you need to: - -1. Define input, output, and related property parameters in the `setUp` function. -2. Generate random input data. -3. Implement the same calculation logic as the forward operator in the Python script to get the output, which is to be compared with the output of the forward operator calculation. -4. The backward calculation has been automatically integrated into the test framework and the corresponding interface can be called directly. - - ```python - import unittest - import numpy as np - from op_test import OpTest - - - class TestMulOp(OpTest): - def setUp(self): - self.op_type = "mul" - self.inputs = { - 'X': np.random.random((32, 84)).astype("float32"), - 'Y': np.random.random((84, 100)).astype("float32") - } - self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])} - - def test_check_output(self): - self.check_output() - - def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5) - - def test_check_grad_ingore_x(self): - self.check_grad( - ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X")) - - def test_check_grad_ingore_y(self): - self.check_grad( - ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y')) - ``` - - -The code above first loads required packages. In addition, we have - -- `self.op_type = "mul" ` defines the type that is identical to what the operator's registered type. -- `self.inputs` defines input, with type `numpy.array` and initializes it. -- `self.outputs` defines output and completes the same operator computation in the Python script, and returns its result from the Python script. - - -### Unit Test for Backward Operators - -In the backward operator test: - -- `check_grad` is called in `test_check_grad_normal` to use numerical methods to detect gradient correctness and stability. -- The first parameter `["X", "Y"]` : specifies gradient check for the input variables `X`, `Y`. -- The second parameter `"Out"` : specifies the final output target variable `Out` of the forward network. -- The third parameter `max_relative_error`: specifies the maximum error value that can be tolerated when checking gradients. -- The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` branches are used to test cases where only one input gradient needs to be calculated. - - - -### Compiling and Running - - -Any new unit testing file of the format `test_*.py` added to the directory `test/legacy_test` is automatically added to the project to compile. - -Note that **running unit tests requires compiling the entire project** and requires compiling with flag `WITH_TESTING` on i.e. `cmake paddle_dir -DWITH_TESTING=ON`. 
- -After successfully compiling the project, run the following command to run unit tests: - -```bash -make test ARGS="-R test_mul_op -V" -``` - -Or, - -```bash -ctest -R test_mul_op -``` - - - -## Remarks - -- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OPERATOR(B, ...)` in `A_op.cc` will cause unit testing failures. -- If the operator does not implement a CUDA kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail. -- If multiple operators rely on some shared methods, a file NOT named `*_op.*` can be created to store them, such as `gather.h`. - - - - - -### PADDLE_ENFORCE Usage Note - -To check the validity of data when implementing Op, you need to use macro definitions such as PADDLE_ENFORCE and PADDLE_ENFORCE_EQ. The basic format is as follows: - -``` -PADDLE_ENFORCE (expression, error message) -PADDLE_ENFORCE_EQ (comparison object A, comparison object B, error message) -``` - -If the expression is true, or the comparison object A=B, the check will be passed, otherwise the program will be terminated and the corresponding error message will be fed back to the user. -In order to ensure that the feedbacks are user-friendly and easy to understand, developers need to pay attention to how to use them. - - - -#### General Principles - -Any place where PADDLE_ENFORCE and PADDLE_ENFORCE_EQ are used must have a properly detailed explanation of the comments! **Error message** can't be empty! - - - -#### Error Message Standard - -1. [required] Where does it go wrong? Why is it wrong? - - - For example: `ValueError: Mismatched label shape` - -2. [optional] What is the expected input? What is the actual input? - - - For example: `Expected labels dimension=1. Received 4.` - -3. [optional] Can you come up with a suggestion? - - - For example: `Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimatorand/or the shape of your label.Otherwise, check the shape of your label.` - -If it is not necessary or concise description is enough to clearly express the above points, just write based on actual needs. - - - -#### Typical Problems - - -1.No error message exists or error message is too short to provide effective notification to the user. - - Problem example 1: Absent message - ``` - PADDLE_ENFORCE(ctx->HasInput("X"), ""); - ``` - Problem example 2: The prompt message is too short - ``` - PADDLE_ENFORCE(i != nullptr, "i must be set"); // What is i? - ``` - -2.Using developer-defined variable abbreviations in error messages is not easy to understand. - - Example of the problem: - ``` - PADDLE_ENFORCE(forward_pd != nullptr, - "Fail to find eltwise_fwd_pd in device context"); //eltwise_fwd_pduser may not be understood - ``` - -3.The OP internally calls the illegal interface: If Op appears inside Output = ShareDataWith(Input) - Example of the problem: - ```cpp - auto *out = ctx.Output("Out"); - auto *in = ctx.Input("X"); - out->ShareDataWith(*in); - ``` - - If there is Output = ShareDataWith(Input) inside Op, it will equivalently indicate a hidden edge in the operator graph, which connects Input and Output. This edge cannot be expressed in graph analysis, causing error based on graph optimization. - -4.Performance of OP implementation. It called eigen's broadcast, chop and other operations, the performance will be over several times worse than the handwritten cuda kernel. 
At this point, the implementation of cpu can reuse eigen, and the gpu implementation can implement cuda kernel. - - - - -#### Special Instructions for OP InferShape Check Message - -- Check input and output variables, please follow the following format -`Input(variable name) of OP name operator should not be null.` - - The correct example: - ``` - PADDLE_ENFORCE(ctx->HasInput("Input"), - "Input(Input) of LSTMP operator should not be null."); - ``` - -- Backward Op input and output check, to write the name of the backward Op - - The correct example: - ``` - PADDLE_ENFORCE(ctx->HasInput("X"), - "Input(X) of LoDResetGrad opreator should not be null."); - ``` diff --git a/docs/dev_guides/api_contributing_guides/new_cpp_op_notes_en.md b/docs/dev_guides/api_contributing_guides/new_cpp_op_notes_en.md deleted file mode 100644 index d2a12943d7a..00000000000 --- a/docs/dev_guides/api_contributing_guides/new_cpp_op_notes_en.md +++ /dev/null @@ -1,183 +0,0 @@ -# Notes on operator development - -## Building logic of Fluid's op -### 1.Building logic of Fluid's op -All Ops in Fluid are derived from `OperatorBase` , and all Ops are stateless. Each Op contains only four variable members: type, inputs, outputs, and attribute. - -The core method of Op is Run. The Run method requires two resources: data resources and computing resources. These two resources are obtained respectively from `Scope` and `Place`. Inside the framework, there is a global `DeviceContextPool`, which is used to record the mapping relationship between `Place` and `DeviceContext`, which means each `Place` has only one `DeviceContext` corresponding to it, and `DeviceContext` stores the computing resources of the current device. For example, for GPU, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. All the internal calculations (data copy and CUDA Kernel, etc.) of Op must be done in `DeviceContext`. - -The Fluid framework is designed to run on a variety of devices and third-party libraries, and some Op implementations may vary on different the devices or third-party libraries. Therefore, Fluid introduced the OpKernel's approach, which means an Op can have multiple OpKernels. Such Ops are derived from `OperatorWithKernel`, and the representative of such Ops is conv, the OpKernels of conv_op are: `GemmConvKernel`, `CUDNNConvOpKernel`, `ConvMKLDNNOpKernel`, and each OpKernel has two data types, double and float. Ops that do not need OpKernel inclue `WhileOp` and so on. - -Operator inheritance diagram: -![op_inheritance_relation_diagram](./op_inheritance_relation_diagram.png) - -For further information, please refer to: [multi_devices](https://github.com/PaddlePaddle/docs/blob/develop/docs/design/multi_devices) , [scope](https://github.com/PaddlePaddle/docs/blob/develop/docs/design/concepts/scope.md) , [Developer's_Guide_to_Paddle_Fluid](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md) - -### 2.Op's registration logic -The registration entries for each Operator include: - ```C++ - OpCreator creator_; - GradOpMakerFN grad_op_maker_; - proto::OpProto* proto_{nullptr}; - OpAttrChecker* checker_{nullptr}; - InferVarTypeFN infer_var_type_; - InferShapeFN infer_shape_; - ``` - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Registration Entry | Type | Description | Usage |
|:---|:---|:---|:---|
| proto::OpProto | Class | Stores the Op's inputs, outputs, attributes, and type | Called at compile time |
| GradOpMakerFN | Functor | Returns the set of OpDescs of the reverse Op(s) corresponding to the current Op, because the reverse of a forward Op may consist of multiple Ops | Called at compile time |
| OpAttrChecker | Class | Checks the Op's attributes | Called at compile time |
| InferVarTypeFN | Functor | Infers the type of the output Var, such as LoDTensor or SelectedRows | Called at compile time |
| InferShapeFN | Functor | Infers the shape of the output | Usage differs between compile time and runtime: at compile time it is called on the Python side; if the Op is derived from OperatorWithKernel, at runtime it is called in op.run |
| OpCreator | Functor | Creates a new OperatorBase on each call | Called at runtime |
- -Usually you need to call REGISTER_OPERATOR when you make comments on Op, which is: - ``` - REGISTER_OPERATOR(op_type, - OperatorBase - Op_maker_and_checker_maker, - Op_grad_opmaker, - Op_infer_var_shape, - Op_infer_var_type) - ``` - -**Note:** - -1. For all Op, the first three parameters are required, op_type specifies the name of op, OperatorBase is the object instance of this Op, op_maker_and_checker_maker is the maker of op and the checker of attr in op. -2. If the Op has a reverse, it must have op_grad_opmaker, because in backward, the reverse Op's Maker will be obtained from the forward Op. -3. The framework provides a default op_grad_opmaker:`DefaultGradOpDescMaker`, which will use the input and output of the forward Op as the input of the reverse Op, and the gradients of the input to forward Op's as the output of the reverse Op, and copy the attributes of the forward Op to it. **Note:** DefaultGradOpDescMaker will take all the input and output of the forward Op as the reverse Op input. Even if this input is not necessary, the absence of this will prevent us from doing memory optimization for the unused variables. -4. The framework does not provide a default op_infer_var_shape method. If the Op has no OpKernel, you usually need to add the corresponding op_infer_var_shape method. If the Op has OpKernel, you need to implement the `InferShape` method of `OperatorWithKernel`. You don't need to provide the op_infer_var_shape method. For details, refer to [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc), [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc). -5. The framework does not provide a default op_infer_var_type method, the user needs to add op_infer_var_type according to the actual situation. Strictly speaking, every Op should register an InferVarType, and op_infer_var_type infers the type and dtype of the output Var according to the type and dtype of the input Var. **Note:** In the Python-side LayerHelper, the create_variable_for_type_inference operation returns a Variable which is a LoDTensor. The C++-side InferVarType can modify the type and dtype of the `Variable`. - - -For more details, please refer to: [How to write a new Op](new_op_en.html) - -## Notes on Writing an Op -### 1. input and output types supported by Op -The input and output of Fluid's Ops are `Variable`. In design, `Variable` can store any type. Op's input and output `Variable` may be of any type, and usually the `Variable` stores `LoDTensor` and `SelectedRows` . - -**Note:** - -- `context.Input("Input")` often appears in the code. It does not mean that the `Variable` of "Input" is `Tensor`, but indicates that the `Tensor` is obtained from `LoDTensor` in the `Variable` of the "Input". If the `Variable` of "Input" is `SelectedRows`, an error will be reported. -- If "Input" is `SelectedRows`, `context->GetInputDim("Input")` will return `var->Get().GetCompleteDims()` instead of Dim of `Tensor` in `SelectedRows` . - -### 2. Do not modify the input data inside Op. -Never make any modification of the input data inside Op, as there may be other Ops that need to read this input. - -### 3. The data type needs to be registered for OpKernel -Currently all OpKernel are required to register double and float data types. - -### 4.Op compatibility issue -The modification of Op needs to consider the compatibility problem. 
Please ensure that the previous model can be loaded and run normally after the modification of Op which means that the model trained by the old version can be loaded and run with Paddle inference library of new version. **So developers should ensure that the Input, Output and Attribute of OPs cannot be modified (except for documents) or deleted. And developers can add Input, Output and Attribute, but the added Input and Output must be set to be dispensable, and the default value of added Attribute must be set. For more details, please refer to [OP Input/Output/Attribute Compatibility Modification](https://github.com/PaddlePaddle/Paddle/wiki/OP-Input-Output-Attribute-Compatibility-Modification(English-Version))**. - -### 5.Call ShareDataWith -The function of ShareDataWith is to make the two Tensors share the underlying buffer. When calling this operation, special attention should be paid. In the Op, the ShareDataWith cannot be applied to the output of Op. In other words, the Tensor of the Op output must be from Malloc. - -### 6. Sparse gradient parameter's update method -At present, the sparse gradient will first merge the gradient when updating, which is to add up the gradients of the same parameter, and then update the parameters and additional parameters (such as velocity). - -### 7. (Video) Memory optimization -If the reverse of Op does not require all of the input and output of the forward op as its input, please do not use `DefaultGradOpDescMaker`, which will prevent Memory/Video Memory optimization for unused variables. - -### 8. Calls made on Hybrid device -Since the GPU is executed asynchronously, the GPU side may not be actually executed after the CPU call returns. Therefore, if you create a temporary variable in Op that you need to use at the GPU runtime, when the GPU starts running, the temporary variable may have been released on the CPU side, which may cause GPU calculation errors. - -Some of the synchronous and asynchronous operations in the GPU: -``` -The following device operations are asynchronous with respect to the host: - Kernel launches; - Memory copies within a single device's memory; - Memory copies from host to device of a memory block of 64 KB or less; - Memory copies performed by functions that are suffixed with Async; - Memory set function calls. -``` - -Note on cudaMemCpy and cudaMemCpyAsync: - -- If the data transfer is from the GPU side to the CPU side with non-pinned memory , the data transfer will be synchronous, even if an asynchronous copy operation is called. -- If the data is transferred from the CPU side to the CPU side, the data transfer will be synchronous, even if an asynchronous copy operation is called. - -For more information, please refer to: [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) , [API synchronization behavior](https://Docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior) - -## Op Performance Optimization -### 1. Selection of third-party libraries -In the process of writing Op, the operations provided by high-performance libraries (such as cudnn, mkldnn, mklml, eigen, etc.) are preferred, but the benchmark must be done. Some operations in the library may be slower in deep learning tasks. Because the operations provided in high-performance libraries (such as eigen, etc.) are more generalized and in terms of performance, they may not be sufficient. 
Usually the amount of data in the deep learning model is small, so in some cases some of the high-performance libraries may be compromised to a slower speed. For example, all Op (forward and reverse) of the Elementwise set. The Elementwise operation is called relatively frequently in the model. Especially Elementwise_add, which is used to add offset to many operations. In the previous implementation, Elementwise_op directly calls the Eigen library. Since the Elementwise operation needs to broadcast the data in many cases, and the experiment finds that the Eigen library is slower to broadcast, whose reason is in this PR[#6229](https://github.com/PaddlePaddle/Paddle/pull/6229). - -### 2.Op performance optimization -The calculation speed of Op is related to the amount of data input. For some Op, different calculation methods can be selected according to the attribute parameters in Op and Shape of the input data. For example, concat_op, when axis>=1, in the process of concatenating multiple tensors, you need to make many copies for each tensor. If it is on GPU, you need to call cudaMemCopy. Relative to the CPU, the GPU is an external device. So each time the GPU is called, there will a certain overhead. And when more times of copying are required, the overhead is more prominent. At present, the implementation of concat_op will select different calling methods according to the Shape and axis values of the input data. If there are a relatively large number of input tensors, and the axis is not equal to 0, the multiple copy operations will be converted into a CUDA Kernel to complete the process; if input tensor are less, and the axis is equal to 0, direct copy will be used. The relevant experiment is described in this PR ([#8669](https://github.com/PaddlePaddle/Paddle/pull/8669)) . - -Since the call of CUDA Kernel has a certain overhead, multiple calls of the CUDA Kernel in Op may affect the execution speed of Op. For example, the previous sequence_expand_op contains many CUDA Kernels. Usually, these CUDA Kernels process a small amount of data, so frequent calls to such Kernels will affect the calculation speed of Op. In this case, it is better to combine these small CUDA Kernels into one. This idea is used in the optimization of the sequence_expand_op procedure (related PR[#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)). The optimized sequence_expand_op is about twice as fast as the previous implementation, the relevant experiments are introduced in the PR ([#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)). - -Reduce the number of copy and sync operations between the CPU and the GPU. For example, the fetch operation will update the model parameters and get a loss after each iteration, and the copy of the data from the GPU to the Non-Pinned-Memory CPU is synchronous, so frequent fetching for multiple parameters will reduce the model training speed. - -## Op numerical stability -### 1. Some Ops have numerical stability problems -The main reason for numerical stability is that when the program is run multiple times, the order in which the floating-point data is processed may be different, resulting in different final calculation results. The GPU is accelerated by multi-threaded parallel computing, so it is commonplace that the order of operations on floating-point numbers is not fixed. 
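A minimal, framework-independent illustration of why accumulation order changes floating-point results (a sketch in plain Python/NumPy; the values are chosen only to make the float32 rounding visible):

```python
import numpy as np

# Summing the same three float32 numbers in two different orders gives two
# different answers, because 1.0 is below the float32 resolution around 1e8.
# Parallel reductions on a GPU effectively regroup the sum from run to run.
x, y, z = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

print((x + y) + z)  # 0.0 -- the 1.0 is absorbed when added to 1e8 first
print((x + z) + y)  # 1.0 -- cancelling the large terms first preserves it
```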
- -At present, it is found that the result of the convolution operation in cudnn, MaxPooling in cudnn, CudaAtomicXX in CUDA, and aggregation of parameter gradients in Reduce mode of ParallelExecutor are not certain. - -For this purpose, some FLAGS is added to the Fluid. For example, FLAGS_cudnn_deterministic is used to force cudnn to use the deterministic algorithm, and FLAGS_cpu_deterministic to force the CPU-side calculation to use the deterministic method. - -## Other -### 1. Error message -The Enforce prompt message cannot be empty and needs to be written, because the error message can analyze the cause of the error more quickly and conveniently. - -### 2.Op's mathematical formula -If Op has a mathematical formula, be sure to write the mathematical formula in the code and display it in the Doc of the Python API, because the user may need to understand how Paddle implements Op when comparing the calculation results among different frameworks. - -**Note:** The formula preview must be done before the merge to the develop branch. Example: [dynamic_lstmp](../../../api/layers/nn.html#dynamic-lstmp). - -### 3. The order of parameters in the Python-side Op interface -The order of the parameters in the Python API is generally ranked by importance, taking fc as an example: -``` -def fc(input, - size, - num_flatten_dims=1, - param_attr=None, - bias_attr=None, - act=None, - is_test=False, - name=None) -``` diff --git a/docs/faq/train_cn.md b/docs/faq/train_cn.md index 153b9eabe65..b8bc46bf421 100644 --- a/docs/faq/train_cn.md +++ b/docs/faq/train_cn.md @@ -110,7 +110,7 @@ export FLAGS_fast_eager_deletion_mode=1 export FLAGS_fraction_of_gpu_memory_to_use=0 ``` -详细请参考官方文档[存储分配与优化](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/singlenode_training_improving/memory_optimize.html) 调整相关配置。 +详细请参考官方文档[存储分配与优化](https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/api_contributing_guides/new_cpp_op_cn.html#xiancunyouhua) 调整相关配置。 此外,建议您使用[AI Studio 学习与 实训社区训练](https://aistudio.baidu.com/aistudio/index),获取免费 GPU 算力,提升您的训练效率。 @@ -130,7 +130,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0 ##### 问题:如何处理变长 ID 导致程序内存占用过大的问题? -+ 答复:请先参考[显存分配与优化文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/singlenode_training_improving/memory_optimize.html) 开启存储优化开关,包括显存垃圾及时回收和 Op 内部的输出复用输入等。若存储空间仍然不够,建议: ++ 答复:请先参考[显存分配与优化文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/api_contributing_guides/new_cpp_op_cn.html#xiancunyouhua) 开启存储优化开关,包括显存垃圾及时回收和 Op 内部的输出复用输入等。若存储空间仍然不够,建议: 1. 降低 `batch_size` ; 2. 对 index 进行排序,减少 padding 的数量。 diff --git a/docs/guides/performance_improving/analysis_tools/benchmark_cn.md b/docs/guides/performance_improving/analysis_tools/benchmark_cn.md deleted file mode 100644 index d98936676f7..00000000000 --- a/docs/guides/performance_improving/analysis_tools/benchmark_cn.md +++ /dev/null @@ -1,90 +0,0 @@ -如何进行基准测试 -=============== -本文介绍如何给深度学习框架做基准测试。基准测试主要包含验证模型的精度和性能两方面,下文包含搭建测试环境,选择基准测试模型,验证测试结果等几方面内容。 - -验证深度学习框架,可分为训练和测试两个阶段, 验证指标略有不同,本文只介绍训练阶段的指标验证。训练阶段关注的是模型训练集上的精度,训练集是完备的,因此关注大 batch\_size 下的训练速度,关注吞吐量,例如图像模型常用的 batch\_size=128, 多卡情况下会加大;预测阶段关注的是在测试集上的精度,线上服务测试数据不能提前收集,因此关注小 batch\_size 下的预测速度,关注延迟,例如预测服务常用的 batch\_size=1, 4 等。 - -[Fluid](https://github.com/PaddlePaddle/Paddle>)是 PaddlePaddle 从 0.11.0 版本开始引入的设计,本文的基准测试在该版本上完成。 - - -环境搭建 -======== - -基准测试中模型精度和硬件、框架无关,由模型结构和数据共同决定;性能方面由测试硬件和框架性能决定。框架基准测试为了对比框架之间的差异,控制硬件环境,系统库等版本一致。下文中的对比实验都在相同的硬件条件和系统环境条件下进行. 
- - -不同架构的 GPU 卡性能差异巨大,在验证模型在 GPU 上训练性能时,可使用 NVIDIA 提供的命令:```nvidia-smi``` 检验当前使用的 GPU 型号,如果测试多卡训练性能,需确认硬件连接是 [nvlink](https://zh.wikipedia.org/zh/NVLink)或 [PCIe](https://zh.wikipedia.org/zh-hans/PCI_Express)。 同样地,CPU 型号会极大影响模型在 CPU 上的训练性能。可读取`/proc/cpuinfo`中的参数,确认当前正在使用的 CPU 型号。 - -下载 GPU 对应的 Cuda Tool Kit 和 Cudnn,或者使用 NVIDIA 官方发布的 nvidia-docker 镜像 [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), 镜像内包含了 Cuda 和 Cudnn,本文采用这种方式。 Cuda Tool Kit 包含了 GPU 代码使用到的基础库,影响在此基础上编译出的 Fluid 二进制运行性能。 - -准备好 Cuda 环境后,从 github 上下载 Paddle 代码并编译,会生成对应的最适合当前 GPU 的 sm\_arch 二进制[sm\_arch](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html)。另外,cudnn 对卷积类任务影响巨大,在基准测试中需要小版本一致,例如 Cudnn7.0.2 与 Cudnn7.1.4 在 Resnet 上有 5%以上差异。 - - -选择基准模型 -============ - -对框架做基准测试,需要覆盖不同训练任务和不同大小的模型,本文中选取了图像和 NLP 的最为常用的 5 个模型。 - -任务种类| 模型名称| 网络结构| 数据集 -:---:|:--:|:---:|:---: -图像生成| CycleGAN| GAN| horse2zebra -图像分类| SE-ResNeXt50| Resnet-50| image-net -语义分割| DeepLab_V3+| ResNets| cityscapes -自然语言| Bert| Transformer| Wikipedia -机器翻译| Transformer| Attention| Wikipedia - -CycleGAN, SE-ResNeXt50, DeepLab_V3+属于 CNN 模型, Bert, Transformer 是一种比传统 RNN 模型更好的 NLP 模型。 -[benchmark](https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid) -基准模型测试脚本中,均跳过了前几个 batch 的训练过程,原因是加载数据和分配显存受系统当前运行情况影响,会导致统计性能不准确。运行完若干个轮次后,统计对应指标。 - - -基准模型的数据的选择方面,数据量大且验证效果多的公开数据集为首选。图像模型 CycleGAN 选择了 horse2zebra 数据集,SE-ResNeXt50 选择了[image-net](http://www.image-net.org/challenges/LSVRC/2012/nnoupb)数据集,图像大小预处理为和 Imagenet 相同大小,因此性能可直接对比。 -NLP 模型的公开且影响力大数据集较少,Bert 和 Transformer 模型都选择了[Wikipedia](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)数据集。 - - -注意,图像模型每条样本大小相同,图像经过变换后大小一致,因此经过的计算路径基本相同,计算速度和显存占用波动较小,可以从若干个 batch 的数据中采样得到当前的训练性能数据。而 NLP 模型由于样本长度不定,计算路径和显存占用也不相同,因此只能完整运行若干个轮次后,统计速度和显存消耗。 -显存分配是特别耗时的操作,因此 Fluid 默认会占用所有可用显存空间形成显存池,用以加速计算过程中的显存分配。如果需要统计模型真实显存消耗,可设置环境变量`FLAGS_fraction_of_gpu_memory_to_use=0.0`,观察最大显存开销。 - - -测试过程 -======== - -- GPU 单机单卡测试 - -本教程使用了 Cuda9, Cudnn7.0.1。来源为:```nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04``` - -``` - nvidia-docker run -it --name CASE_NAME --security-opt seccomp=unconfined -v $PWD/benchmark:/benchmark -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu paddlepaddle/paddle:latest-dev /bin/bash -``` -在单卡上测试,设置 CUDA 的环境变量使用一块 GPU,``CUDA_VISIBLE_DEVICES=0`` -然后代码中设置为使用 CUDAPlace,如果使用 Paddle 代码库中的脚本,只需要命令行参数传入 use_gpu=True 即可。 - -``` - >>> import paddle.fluid as fluid - >>> place = fluid.CUDAPlace(0) // 0 指第 0 块 GPU -``` - -测试结果 -======== - -本教程对比相同环境下的 Fluid1.4, PyTorch1.1.0 和 TensorFlow1.12.0 的性能表现。 -硬件环境为 CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, GPU: Tesla v100(volta) 21729MiB x 1, Nvidia-Driver 384.66。 -系统环境为 Ubuntu 16.04.3 LTS, 本文中采用了 docker 环境,系统版本为 nvidia-docker17.05.0-ce。 -测试的 Fluid 版本为[v.1.4.1](https://github.com/PaddlePaddle/Paddle/tree/v1.4.1) 。 -TensorFlow 版本为[v.1.12.0-rc2](https://github.com/tensorflow/tensorflow/tree/v1.12.0-rc2)。 -PyTorch 版本为[v.1.1.0](https://github.com/pytorch/pytorch/tree/v1.1.0)。 -使用的脚本和配置见[benchmark](https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid) 。 -SE-ResNeXt50 对比的框架是 PyTorch,因为 TensorFlow 上没有对应的模型。 -图表中统计单位为 samples/秒。 - - - -- GPU 单机单卡测试结果 - - Model|Fluid GPU| TensorFlow/PyTorch GPU - :---:|:--:|:---: - CycleGAN| 7.3 samples/s| 6.1 samples/s - SE-ResNeXt50| 169.4 samples/s | 153.1 samples/s - DeepLab_V3+| 12.8 samples/s | 6.4 samples/s - Bert| 4.0 samples/s | 3.4 samples/s - Transformer| 4.9 samples/s | 4.7 samples/s diff --git a/docs/guides/performance_improving/analysis_tools/cpu_profiling_cn.md 
b/docs/guides/performance_improving/analysis_tools/cpu_profiling_cn.md deleted file mode 100644 index 8b9c492f162..00000000000 --- a/docs/guides/performance_improving/analysis_tools/cpu_profiling_cn.md +++ /dev/null @@ -1,183 +0,0 @@ -# CPU 性能调优 - -此教程会介绍如何使用 Python 的 cProfile 包、Python 库 yep、Google perftools 来进行性能分析 (profiling) 与调优(performance tuning)。 - -Profling 指发现性能瓶颈。系统中的瓶颈可能和程序员开发过程中想象的瓶颈相去甚远。Tuning 指消除瓶颈。性能优化的过程通常是不断重复地 profiling 和 tuning。 - -PaddlePaddle 用户一般通过调用 Python API 编写深度学习程序。大部分 Python API 调用用 C++ 写的 libpaddle.so。所以 PaddlePaddle 的性能分析与调优分为两个部分: - -* Python 代码的性能分析 -* Python 与 C++ 混合代码的性能分析 - - -## Python 代码的性能分析 - -### 生成性能分析文件 - -Python 标准库中提供了性能分析的工具包,[cProfile](https://docs.python.org/2/library/profile.html)。生成 Python 性能分析的命令如下: - -```bash -python -m cProfile -o profile.out main.py -``` - -其中 `main.py` 是我们要分析的程序,`-o`标识了一个输出的文件名,用来存储本次性能分析的结果。如果不指定这个文件,`cProfile`会打印到标准输出。 - -### 查看性能分析文件 - -`cProfile` 在 main.py 运行完毕后输出`profile.out`。我们可以使用[`cprofilev`](https://github.com/ymichael/cprofilev)来查看性能分析结果。`cprofilev`是一个 Python 的第三方库。使用它会开启一个 HTTP 服务,将性能分析结果以网页的形式展示出来: - -```bash -cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py -``` - -其中`-a`标识 HTTP 服务绑定的 IP。使用`0.0.0.0`允许外网访问这个 HTTP 服务。`-p`标识 HTTP 服务的端口。`-f`标识性能分析的结果文件。`main.py`标识被性能分析的源文件。 - -用 Web 浏览器访问对应网址,即可显示性能分析的结果: - -``` - ncalls tottime percall cumtime percall filename:lineno(function) - 1 0.284 0.284 29.514 29.514 main.py:1() - 4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run) - 4696 12.040 0.003 12.040 0.003 {built-in method run} - 1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14() -``` - -每一列的含义是: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 列名 | 含义 |
|:---|:---|
| ncalls | 函数的调用次数 |
| tottime | 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间 |
| percall | tottime 的每次调用平均时间 |
| cumtime | 函数总时间。包含这个函数调用其他函数的时间 |
| percall | cumtime 的每次调用平均时间 |
| filename:lineno(function) | 文件名,行号,函数名 |
- - -### 寻找性能瓶颈 - -通常`tottime`和`cumtime`是寻找瓶颈的关键指标。这两个指标代表了某一个函数真实的运行时间。 - -将性能分析结果按照 tottime 排序,效果如下: - -```text - 4696 12.040 0.003 12.040 0.003 {built-in method run} - 300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader) - 107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__) - 4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) - 1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1() -``` - -可以看到最耗时的函数是 C++端的`run`函数。这需要联合我们第二节`Python`与`C++`混合代码的性能分析来进行调优。而`sync_with_cpp`函数的总共耗时很长,每次调用的耗时也很长。于是我们可以点击`sync_with_cpp`的详细信息,了解其调用关系。 - -```text -Called By: - - Ordered by: internal time - List reduced from 4497 to 2 due to restriction <'sync_with_cpp'> - -Function was called by... - ncalls tottime cumtime -/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) -/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone) - 1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward) - - -Called: - - Ordered by: internal time - List reduced from 4497 to 2 due to restriction <'sync_with_cpp'> -``` - -通常观察热点函数间的调用关系,和对应行的代码,就可以了解到问题代码在哪里。当我们做出性能修正后,再次进行性能分析(profiling)即可检查我们调优后的修正是否能够改善程序的性能。 - - - -## Python 与 C++混合代码的性能分析 - -### 生成性能分析文件 - -C++的性能分析工具非常多。常见的包括`gprof`, `valgrind`, `google-perftools`。但是调试 Python 中使用的动态链接库与直接调试原始二进制相比增加了很多复杂度。幸而 Python 的一个第三方库`yep`提供了方便的和`google-perftools`交互的方法。于是这里使用`yep`进行 Python 与 C++混合代码的性能分析 - -使用`yep`前需要安装`google-perftools`与`yep`包。ubuntu 下安装命令为 - -```bash -apt update -apt install libgoogle-perftools-dev -pip install yep -``` - -安装完毕后,我们可以通过 - -```bash -python -m yep -v main.py -``` - -生成性能分析文件。生成的性能分析文件为`main.py.prof`。 - -命令行中的`-v`指定在生成性能分析文件之后,在命令行显示分析结果。我们可以在命令行中简单的看一下生成效果。因为 C++与 Python 不同,编译时可能会去掉调试信息,运行时也可能因为多线程产生混乱不可读的性能分析结果。为了生成更可读的性能分析结果,可以采取下面几点措施: - -1. 编译时指定`-g`生成调试信息。使用 cmake 的话,可以将 CMAKE_BUILD_TYPE 指定为`RelWithDebInfo`。 -2. 编译时一定要开启优化。单纯的`Debug`编译性能会和`-O2`或者`-O3`有非常大的差别。`Debug`模式下的性能测试是没有意义的。 -3. 
运行性能分析的时候,先从单线程开始,再开启多线程,进而多机。毕竟单线程调试更容易。可以设置`OMP_NUM_THREADS=1`这个环境变量关闭 openmp 优化。 - -### 查看性能分析文件 - -在运行完性能分析后,会生成性能分析结果文件。我们可以使用[`pprof`](https://github.com/google/pprof)来显示性能分析结果。注意,这里使用了用`Go`语言重构后的`pprof`,因为这个工具具有 web 服务界面,且展示效果更好。 - -安装`pprof`的命令和一般的`Go`程序是一样的,其命令如下: - -```bash -go get github.com/google/pprof -``` - -进而我们可以使用如下命令开启一个 HTTP 服务: - -```bash -pprof -http=0.0.0.0:3213 `which python` ./main.py.prof -``` - -这行命令中,`-http`指开启 HTTP 服务。`which python`会产生当前 Python 二进制的完整路径,进而指定了 Python 可执行文件的路径。`./main.py.prof`输入了性能分析结果。 - -访问对应的网址,我们可以查看性能分析的结果。结果如下图所示: - -![result](./pprof_1.png) - - -### 寻找性能瓶颈 - -与寻找 Python 代码的性能瓶颈类似,寻找 Python 与 C++混合代码的性能瓶颈也是要看`tottime`和`cumtime`。而`pprof`展示的调用图也可以帮助我们发现性能中的问题。 - -例如下图中, - -![kernel_perf](./pprof_2.png) - -在一次训练中,乘法和乘法梯度的计算占用 2%-4%左右的计算时间。而`MomentumOp`占用了 17%左右的计算时间。显然,`MomentumOp`的性能有问题。 - -在`pprof`中,对于性能的关键路径都做出了红色标记。先检查关键路径的性能问题,再检查其他部分的性能问题,可以更有次序的完成性能的优化。 diff --git a/docs/guides/performance_improving/analysis_tools/cpu_profiling_en.md b/docs/guides/performance_improving/analysis_tools/cpu_profiling_en.md deleted file mode 100644 index 216694965b3..00000000000 --- a/docs/guides/performance_improving/analysis_tools/cpu_profiling_en.md +++ /dev/null @@ -1,224 +0,0 @@ -# Tune CPU performance - -This tutorial introduces techniques we use to profile and tune the -CPU performance of PaddlePaddle. We will use Python packages -`cProfile` and `yep`, and Google's `perftools`. - -Profiling is the process that reveals performance bottlenecks, -which could be very different from what's in the developers' mind. -Performance tuning is done to fix these bottlenecks. Performance optimization -repeats the steps of profiling and tuning alternatively. - -PaddlePaddle users program AI applications by calling the Python API, which calls -into `libpaddle.so.` written in C++. In this tutorial, we focus on -the profiling and tuning of - -1. the Python code and -1. the mixture of Python and C++ code. - -## Profiling the Python Code - -### Generate the Performance Profiling File - -We can use Python standard -package, [`cProfile`](https://docs.python.org/2/library/profile.html), -to generate Python profiling file. For example: - -```bash -python -m cProfile -o profile.out main.py -``` - -where `main.py` is the program we are going to profile, `-o` specifies -the output file. Without `-o`, `cProfile` would outputs to standard -output. - -### Look into the Profiling File - -`cProfile` generates `profile.out` after `main.py` completes. We can -use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into -the details: - -```bash -cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py -``` - -where `-a` specifies the HTTP IP, `-p` specifies the port, `-f` -specifies the profiling file, and `main.py` is the source file. - -Open the Web browser and points to the local IP and the specifies -port, we will see the output like the following: - -``` - ncalls tottime percall cumtime percall filename:lineno(function) - 1 0.284 0.284 29.514 29.514 main.py:1() - 4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run) - 4696 12.040 0.003 12.040 0.003 {built-in method run} - 1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14() -``` - -where each line corresponds to Python function, and the meaning of -each column is as follows: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| column | meaning |
|:---|:---|
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |
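If you prefer to inspect these columns without the `cprofilev` web view, the standard-library `pstats` module can print the same table sorted by any of them; a minimal sketch (assuming the `profile.out` file generated above):

```python
import pstats

# Load the cProfile output and print the ten entries with the largest
# tottime (self time), mirroring the sort discussed in the next section.
stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("tottime").print_stats(10)
```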
- -### Identify Performance Bottlenecks - -Usually, `tottime` and the related `percall` time is what we want to -focus on. We can sort above profiling file by tottime: - -```text - 4696 12.040 0.003 12.040 0.003 {built-in method run} - 300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader) - 107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__) - 4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) - 1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1() -``` - -We can see that the most time-consuming function is the `built-in -method run`, which is a C++ function in `libpaddle.so`. We will -explain how to profile C++ code in the next section. At this -moment, let's look into the third function `sync_with_cpp`, which is a -Python function. We can click it to understand more about it: - -``` -Called By: - - Ordered by: internal time - List reduced from 4497 to 2 due to restriction <'sync_with_cpp'> - -Function was called by... - ncalls tottime cumtime -/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) -/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone) - 1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward) - - -Called: - - Ordered by: internal time - List reduced from 4497 to 2 due to restriction <'sync_with_cpp'> -``` - -The lists of the callers of `sync_with_cpp` might help us understand -how to improve the function definition. - -## Profiling Python and C++ Code - -### Generate the Profiling File - -To profile a mixture of Python and C++ code, we can use a Python -package, `yep`, that can work with Google's `perftools`, which is a -commonly-used profiler for C/C++ code. - -In Ubuntu systems, we can install `yep` and `perftools` by running the -following commands: - -```bash -apt update -apt install libgoogle-perftools-dev -pip install yep -``` - -Then we can run the following command - -```bash -python -m yep -v main.py -``` - -to generate the profiling file. The default filename is -`main.py.prof`. - -Please be aware of the `-v` command line option, which prints the -analysis results after generating the profiling file. By examining the - the print result, we'd know that if we stripped debug -information from `libpaddle.so` at build time. The following hints -help make sure that the analysis results are readable: - -1. Use GCC command line option `-g` when building `libpaddle.so` so to - include the debug information. The standard building system of - PaddlePaddle is CMake, so you might want to set - `CMAKE_BUILD_TYPE=RelWithDebInfo`. - -1. Use GCC command line option `-O2` or `-O3` to generate optimized - binary code. It doesn't make sense to profile `libpaddle.so` - without optimization, because it would anyway run slowly. - -1. Profiling the single-threaded binary file before the - multi-threading version, because the latter often generates tangled - profiling analysis result. 
You might want to set environment - variable `OMP_NUM_THREADS=1` to prevents OpenMP from automatically - starting multiple threads. - -### Examining the Profiling File - -The tool we used to examine the profiling file generated by -`perftools` is [`pprof`](https://github.com/google/pprof), which -provides a Web-based GUI like `cprofilev`. - -We can rely on the standard Go toolchain to retrieve the source code -of `pprof` and build it: - -```bash -go get github.com/google/pprof -``` - -Then we can use it to profile `main.py.prof` generated in the previous -section: - -```bash -pprof -http=0.0.0.0:3213 `which python` ./main.py.prof -``` - -Where `-http` specifies the IP and port of the HTTP service. -Directing our Web browser to the service, we would see something like -the following: - -![result](./pprof_1.png) - -### Identifying the Performance Bottlenecks - -Similar to how we work with `cprofilev`, we'd focus on `tottime` and -`cumtime`. - -![kernel_perf](./pprof_2.png) - -We can see that the execution time of multiplication and the computing -of the gradient of multiplication takes 2% to 4% of the total running -time, and `MomentumOp` takes about 17%. Obviously, we'd want to -optimize `MomentumOp`. - -`pprof` would mark performance critical parts of the program in -red. It's a good idea to follow the hints. diff --git a/docs/guides/performance_improving/analysis_tools/host_memory_profiling_en.md b/docs/guides/performance_improving/analysis_tools/host_memory_profiling_en.md deleted file mode 100644 index b1dbee1bd45..00000000000 --- a/docs/guides/performance_improving/analysis_tools/host_memory_profiling_en.md +++ /dev/null @@ -1,87 +0,0 @@ -# Heap Memory Profiling and Optimization - -Any computer program has the danger of memory leak. Generally, **Memory Leak** is caused by the unreleased heap memory allocated by the program. As the memory occupied by the program becomes larger and larger, it will affect the stability of the program, which may make the running speed slower or give rise to OoM(Out of Memory). It even compromises the stability of the machine in use, and leads to *downtime* . - - -There are many memory leak analysis tools at present. Typical ones include, [valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro), [gperftools](https://gperftools.github.io/gperftools/). - -Because Fluid runs in C++ core driven by Python, It is very difficult for valgrind to analyze directly. You need to compile the debug version and dedicated Python version with valgrind support, and most of the output information is Python's own symbols and call information. In addition, valgrind will make the program run very slowly, so it is not recommended. - -Here we mainly introduce the use of [gperftools](https://gperftools.github.io/gperftools/) . - -gperftool mainly supports four functions: - -- thread-caching malloc -- heap-checking using tcmalloc -- heap-profiling using tcmalloc -- CPU profiler - -Paddle also provides a [tutorial on CPU performance analysis](./cpu_profiling_en.html) based on gperftool. - -For the analysis for heap, we mainly use thread-caching malloc and heap-profiling using tcmalloc. - -## Environment - -This tutorial is based on the Docker development environment paddlepaddle/paddle:latest-dev provided by paddle, based on the Ubuntu 16.04.4 LTS environment. 
- -## Manual - -- Install google-perftools - -``` -apt-get install libunwind-dev -apt-get install google-perftools -``` - -- Install pprof - -``` -go get -u github.com/google/pprof -``` - -- Configure Running Environment - -``` -export PPROF_PATH=/root/gopath/bin/pprof -export PPROF_BINARY_PATH=/root/gopath/bin/pprof -export LD_PRELOAD=/usr/lib/libtcmalloc.so.4 -``` - -- Use heap profile to run python program. The essence of it is to get a snapshot of the heap allocation periodically. - -``` -# HEAPPROFILE sets the directory and file prefix of the generated heap analysis file -# HEAP_PROFILE_ALLOCATION_INTERVAL Sets how many storage dumps are allocated for each dump, default 1GB -env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py -``` - -As the program runs, a lot of files will be generated in the perf_log folder as follows: - -``` --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0001.heap --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0002.heap --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0003.heap --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0004.heap --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0005.heap --rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0006.heap -``` - -- Analyze the heap files with pprof. There are two modes of analysis: - - Complete mode. An analysis of the current heap is performed, showing some of the call paths for the current allocation of memory. - - ``` - pprof --pdf python test.log.0012.heap - ``` - The command above will generate a file of profile00x.pdf, which can be opened directly, for example, [memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf). As demonstrated in the chart below, during the running of the CPU version fluid, the module CPUAllocator is allocated with most memory. Other modules are allocated with relatively less memory, so they are ignored. It is very inconvenient for inspecting memory leak for memory leak is a chronic process which cannot be inspected in this picture. - ![result](https://user-images.githubusercontent.com/3048612/40964027-a54033e4-68dc-11e8-836a-144910c4bb8c.png) - - - Diff mode. You can do diff on the heap at two moments, which removes some modules whose memory allocation has not changed, and displays the incremental part. - ``` - pprof --pdf --base test.log.0010.heap python test.log.1045.heap - ``` - The generated result: [`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf) - - As shown from the figure: The structure of ProgramDesc has increased by 200MB+ between the two versions, so there is a large possibility that memory leak happens here, and the final result does prove a leak here. - - ![result](https://user-images.githubusercontent.com/3048612/40964057-b434d5e4-68dc-11e8-894b-8ab62bcf26c2.png) - ![result](https://user-images.githubusercontent.com/3048612/40964063-b7dbee44-68dc-11e8-9719-da279f86477f.png) diff --git a/docs/guides/performance_improving/analysis_tools/index_cn.rst b/docs/guides/performance_improving/analysis_tools/index_cn.rst deleted file mode 100644 index 1e1a5d1cbe2..00000000000 --- a/docs/guides/performance_improving/analysis_tools/index_cn.rst +++ /dev/null @@ -1,18 +0,0 @@ -.. _api_guide_analysis_tools: - -############### -性能优化分析及工具 -############### - -.. 
toctree:: - :hidden: - - cpu_profiling_cn.md - host_memory_profiling_cn.md - timeline_cn.md - -本模块介绍 Fluid 使用过程中的调优方法,包括: - -- `CPU 性能调优 `_:介绍如何使用 cProfile 包、yep 库、Google perftools 进行性能分析与调优 -- `堆内存分析和优化 `_:介绍如何使用 gperftool 进行堆内存分析和优化,以解决内存泄漏的问题 -- `Timeline 工具简介 `_ :介绍如何使用 Timeline 工具进行性能分析和调优 diff --git a/docs/guides/performance_improving/analysis_tools/index_en.rst b/docs/guides/performance_improving/analysis_tools/index_en.rst deleted file mode 100644 index abacd2fb5fe..00000000000 --- a/docs/guides/performance_improving/analysis_tools/index_en.rst +++ /dev/null @@ -1,18 +0,0 @@ -####################################### -Performance Profiling and Optimization -####################################### - -.. toctree:: - :hidden: - - - cpu_profiling_en.md - host_memory_profiling_en.md - timeline_en.md - -This section illustrates how to optimize performance of Fluid: - - -- `CPU profiling `_:How to use cProfile, yep, and Google perftools to profile and optimize model performance -- `Heap Memory Profiling and Optimization `_:Use gperftool to perform Heap Memory Profiling and Optimization to solve memory leaks. -- `How to use timeline tool to do profiling `_ :How to use timeline tool to do profile and optimization diff --git a/docs/guides/performance_improving/analysis_tools/nvvp1.png b/docs/guides/performance_improving/analysis_tools/nvvp1.png deleted file mode 100644 index 1af23ac3c52..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/nvvp1.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/nvvp2.png b/docs/guides/performance_improving/analysis_tools/nvvp2.png deleted file mode 100644 index 177c9db708d..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/nvvp2.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/nvvp3.png b/docs/guides/performance_improving/analysis_tools/nvvp3.png deleted file mode 100644 index d8f393667d6..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/nvvp3.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/nvvp4.png b/docs/guides/performance_improving/analysis_tools/nvvp4.png deleted file mode 100644 index 51f2f3e1832..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/nvvp4.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/pprof_1.png b/docs/guides/performance_improving/analysis_tools/pprof_1.png deleted file mode 100644 index 8e9edbf3776..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/pprof_1.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/pprof_2.png b/docs/guides/performance_improving/analysis_tools/pprof_2.png deleted file mode 100644 index 172ba20399b..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/pprof_2.png and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/timeline.jpeg b/docs/guides/performance_improving/analysis_tools/timeline.jpeg deleted file mode 100644 index 38ec3f80c98..00000000000 Binary files a/docs/guides/performance_improving/analysis_tools/timeline.jpeg and /dev/null differ diff --git a/docs/guides/performance_improving/analysis_tools/timeline_cn.md b/docs/guides/performance_improving/analysis_tools/timeline_cn.md deleted file mode 100644 index ef5b98d65e1..00000000000 --- a/docs/guides/performance_improving/analysis_tools/timeline_cn.md +++ /dev/null @@ -1,77 +0,0 @@ -# timeline 
工具简介 - -## 本地使用 - -1. 在训练的主循环外加上`profiler.start_profiler(...)`和`profiler.stop_profiler(...)`。运行之后,代码会在`/tmp/profile`目录下生成一个 profile 的记录文件。 - - **提示:** - 请不要在 timeline 记录信息时运行太多次迭代,因为 timeline 中的记录数量和迭代次数是成正比的。 - - ```python - import numpy as np - import paddle - import paddle.fluid as fluid - from paddle.fluid import profiler - - place = fluid.CPUPlace() - - def reader(): - for i in range(100): - yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')], - - main_program = fluid.Program() - startup_program = fluid.Program() - - with fluid.program_guard(main_program, startup_program): - data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2]) - data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3]) - out = fluid.layers.fc(input=[data_1, data_2], size=2) - # ... - - feeder = fluid.DataFeeder([data_1, data_2], place) - exe = fluid.Executor(place) - exe.run(startup_program) - pass_num = 10 - - for pass_id in range(pass_num): - for batch_id, data in enumerate(reader()): - if pass_id == 0 and batch_id == 5: - profiler.start_profiler("All") - elif pass_id == 0 and batch_id == 10: - profiler.stop_profiler("total", "/tmp/profile") - outs = exe.run(program=main_program, - feed=feeder.feed(data), - fetch_list=[out]) - - ``` - -1. 运行`python paddle/tools/timeline.py`来处理`/tmp/profile`,这个程序默认会生成一个`/tmp/timeline`文件,你也可以用命令行参数来修改这个路径,请参考[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py)。 -```python -python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline -``` - -1. 打开 chrome 浏览器,访问,用`load`按钮来加载生成的`timeline`文件。 - - -1. 结果如下图所示,可以放大来查看 timeline 的细节信息。 - - ![chrome timeline](./timeline.jpeg) - -## 分布式使用 -一般来说,分布式的训练程序都会有两种程序:pserver 和 trainer。我们提供了把 pserver 和 trainer 的 profile 日志用 timeline 来显示的方式。 - -1. trainer 打开方式与[本地使用](#local)部分的第 1 步相同 - -1. pserver 可以通过加两个环境变量打开 profile,例如: -``` -FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py -``` - -3. 把 pserver 和 trainer 的 profile 文件生成一个 timeline 文件,例如: -``` -python /paddle/tools/timeline.py - --profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1 - --timeline_path ./dist.timeline -``` - -4. 在 chrome 中加载 dist.timeline 文件,方法和[本地使用](#local)第 4 步相同。 diff --git a/docs/guides/performance_improving/analysis_tools/timeline_en.md b/docs/guides/performance_improving/analysis_tools/timeline_en.md deleted file mode 100644 index fb51802a168..00000000000 --- a/docs/guides/performance_improving/analysis_tools/timeline_en.md +++ /dev/null @@ -1,79 +0,0 @@ -# How to use timeline tool to do profile - -## Local - -1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` to the main training loop. After run, the code will generate a profile record file `/tmp/profile`. **Warning**: Please do not run too many batches when use profiler to record timeline information, for the profile record will grow with the batch number. 
- - ```python - - import numpy as np - import paddle - import paddle.fluid as fluid - from paddle.fluid import profiler - - place = fluid.CPUPlace() - - def reader(): - for i in range(100): - yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')], - - main_program = fluid.Program() - startup_program = fluid.Program() - - with fluid.program_guard(main_program, startup_program): - data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2]) - data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3]) - out = fluid.layers.fc(input=[data_1, data_2], size=2) - # ... - - feeder = fluid.DataFeeder([data_1, data_2], place) - exe = fluid.Executor(place) - exe.run(startup_program) - pass_num = 10 - - for pass_id in range(pass_num): - for batch_id, data in enumerate(reader()): - if pass_id == 0 and batch_id == 5: - profiler.start_profiler("All") - elif pass_id == 0 and batch_id == 10: - profiler.stop_profiler("total", "/tmp/profile") - outs = exe.run(program=main_program, - feed=feeder.feed(data), - fetch_list=[out]) - - ``` - -2. Run `python paddle/tools/timeline.py` to process `/tmp/profile`, it will generate another -file `/tmp/timeline` by default. You can change the path by cmd parameter, please take a look at -[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py) for details. -```python -python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline -``` - -3. Open chrome and visit , use `load` button to load the generated `timeline` file. - - - - -4. The result timeline should be like: - - ![chrome timeline](./timeline.jpeg) - -## Distributed -This tool can support distributed train programs(pserver and trainer) too. - -1. Open traniner profiler just like how to use in [local](#local). - -2. Open pserver profiler: add two environment variables, e.g.: -``` -FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py -``` - -3. Merge pservers' and trainers' profiler file, e.g.: -``` -python /paddle/tools/timeline.py - --profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1 - --timeline_path ./dist.timeline -``` - -4. 
Load `dist.timeline` in chrome just like the [fourth step in Local](#local_step_4) diff --git a/docs/guides/performance_improving/device_switching.md b/docs/guides/performance_improving/device_switching.md deleted file mode 100644 index 3c15919c503..00000000000 --- a/docs/guides/performance_improving/device_switching.md +++ /dev/null @@ -1,199 +0,0 @@ -# 运行时设备切换 - -Paddle 提供了[fluid.CUDAPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CUDAPlace_cn.html)以及[fluid.CPUPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CPUPlace_cn.html)用于指定运行时的设备。这两个接口用于指定全局的设备,从 1.8 版本开始,Paddle 提供了[device_guard](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/device_guard_cn.html)接口,用于指定部分 OP 的运行设备,此教程会介绍 device_guard 的使用场景,以及如何使用该接口对模型进行优化。 - -如果使用了`fluid.CUDAPlace`设置了全局的执行设备,框架将尽可能地将 OP 设置在 GPU 上执行,因此有可能会遇到显存不够的情况。`device_guard`可以用于设置 OP 的执行设备,如果将部分层设置在 CPU 上运行,就能够充分利用 CPU 大内存的优势,避免显存超出。 - -有时尽管指定了全局的执行设备为 GPU,但框架在自动分配 OP 执行设备时,可能会将部分 OP 设置在 CPU 上执行。另外,个别 OP 会将输出存储在 CPU 上。在以上的场景中,常常会发生不同设备间的数据传输,可能会影响模型的性能。使用`device_guard`可以避免模型运行中不必要的数据传输。在下面的内容中,将会详细介绍如何通过[profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/profiler_cn.html)工具分析数据传输开销,以及如何使用`device_guard`避免不必要的数据传输,从而提升模型性能。 - -## 如何避免显存超出 - -下面示例代码中的`embedding`层,其参数`size`包含两个元素,第一个元素为`vocab_size` (词表大小), 第二个为`emb_size`(`embedding`层维度)。实际场景中,词表可能会非常大。示例代码中,词表大小被设置为 10000000。如果在 GPU 模式下运行,该层创建的权重矩阵的大小为(10000000, 150),仅这一层就需要 5.59G 的显存,如果词表大小继续增加,极有可能会导致显存超出。 - -```python -import paddle.fluid as fluid - -data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64') -label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32') -emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32') -out = fluid.layers.l2_normalize(x=emb, axis=-1) - -cost = fluid.layers.square_error_cost(input=out, label=label) -avg_cost = fluid.layers.mean(cost) -sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001) -sgd_optimizer.minimize(avg_cost) - -place = fluid.CUDAPlace(0) -exe = fluid.Executor(place) -exe.run(fluid.default_startup_program()) -result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost]) -``` - -`embedding`是根据`input`中的`id`信息从`embedding`矩阵中查询对应`embedding`信息,在 CPU 上进行计算,其速度也是可接受的。因此,可以参考如下代码,使用`device_guard`将`embedding`层设置在 CPU 上,以利用 CPU 内存资源。那么,除了`embedding`层,其他各层都会在 GPU 上运行。 - -```python -import paddle.fluid as fluid - -data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64') -label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32') -with fluid.device_guard("cpu"): - emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32') -out = fluid.layers.l2_normalize(x=emb, axis=-1) - -cost = fluid.layers.square_error_cost(input=out, label=label) -avg_cost = fluid.layers.mean(cost) -sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001) -sgd_optimizer.minimize(avg_cost) - -place = fluid.CUDAPlace(0) -exe = fluid.Executor(place) -exe.run(fluid.default_startup_program()) -result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost]) -``` - -在显存足够的情况下,可不必进行这样的设置。 - -## 如何减少数据传输 -### 使用 profile 工具确认是否发生了数据传输 -首先对模型的性能数据进行分析,找到发生数据传输的原因。如下列代码所示,可以利用[profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/profiler_cn.html)工具进行分析。 - -```python -import paddle.fluid as fluid -import paddle.fluid.compiler as compiler -import paddle.fluid.profiler as profiler - -data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32') -data2 = 
fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32') -shape = fluid.layers.shape(data2) -shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4]) -out = fluid.layers.crop_tensor(data1, shape=shape) -place = fluid.CUDAPlace(0) -exe = fluid.Executor(place) -exe.run(fluid.default_startup_program()) -compiled_prog = compiler.CompiledProgram(fluid.default_main_program()) -with profiler.profiler('All', 'total') as prof: - for i in range(10): - result = exe.run(program=compiled_prog, fetch_list=[out]) -``` - -在程序运行结束后,将会自动地打印出 profile report。在下面的 profile report 中,可以看到 `GpuMemCpy Summary`中给出了 2 项数据传输的调用耗时。在 OP 执行过程中,如果输入 Tensor 所在的设备与 OP 执行的设备不同,就会发生`GpuMemcpySync`,通常我们可以直接优化的就是这一项。进一步分析,可以看到`slice`和`crop_tensor`执行中都发生了`GpuMemcpySync`。尽管我们在程序中设置了 GPU 模式运行,但是框架中有些 OP,例如 shape,会将输出结果放在 CPU 上。 - -```text --------------------------> Profiling Report <------------------------- - -Note! This Report merge all thread info into one. -Place: All -Time unit: ms -Sorted by total time in descending order in the same thread - -Total time: 26.6328 - Computation time Total: 13.3133 Ratio: 49.9884% - Framework overhead Total: 13.3195 Ratio: 50.0116% - -------------------------- GpuMemCpy Summary ------------------------- - -GpuMemcpy Calls: 30 Total: 1.47508 Ratio: 5.5386% - GpuMemcpyAsync Calls: 10 Total: 0.443514 Ratio: 1.66529% - GpuMemcpySync Calls: 20 Total: 1.03157 Ratio: 3.87331% - -------------------------- Event Summary ------------------------- - -Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio. -FastThreadedSSAGraphExecutorPrepare 10 9.16493 9.152509 (0.998645) 0.012417 (0.001355) 0.025192 8.85968 0.916493 0.344122 -shape 10 8.33057 8.330568 (1.000000) 0.000000 (0.000000) 0.030711 7.99849 0.833057 0.312793 -fill_constant 20 4.06097 4.024522 (0.991025) 0.036449 (0.008975) 0.075087 0.888959 0.203049 0.15248 -slice 10 1.78033 1.750439 (0.983212) 0.029888 (0.016788) 0.148503 0.290851 0.178033 0.0668471 - GpuMemcpySync:CPU->GPU 10 0.45524 0.446312 (0.980388) 0.008928 (0.019612) 0.039089 0.060694 0.045524 0.0170932 -crop_tensor 10 1.67658 1.620542 (0.966578) 0.056034 (0.033422) 0.143906 0.258776 0.167658 0.0629515 - GpuMemcpySync:GPU->CPU 10 0.57633 0.552906 (0.959357) 0.023424 (0.040643) 0.050657 0.076322 0.057633 0.0216398 -Fetch 10 0.919361 0.895201 (0.973721) 0.024160 (0.026279) 0.082935 0.138122 0.0919361 0.0345199 - GpuMemcpyAsync:GPU->CPU 10 0.443514 0.419354 (0.945526) 0.024160 (0.054474) 0.040639 0.059673 0.0443514 0.0166529 -ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.341999 0.341999 (1.000000) 0.000000 (0.000000) 0.028436 0.057134 0.0341999 0.0128413 -eager_deletion 30 0.287236 0.287236 (1.000000) 0.000000 (0.000000) 0.005452 0.022696 0.00957453 0.010785 -ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.047864 0.047864 (1.000000) 0.000000 (0.000000) 0.003668 0.011592 0.0047864 0.00179718 -InitLocalVars 1 0.022981 0.022981 (1.000000) 0.000000 (0.000000) 0.022981 0.022981 0.022981 0.000862883 -``` -### 通过 log 查看发生数据传输的具体位置 - -以上的示例程序比较简单,我们只用看 profile report 就能知道具体是哪些算子发生了数据传输。但是当模型比较复杂时,可能需要去查看更加详细的调试信息,可以打印出运行时的 log 去确定发生数据传输的具体位置。依然以上述程序为例,执行`GLOG_vmodule=operator=3 python test_case.py`,会得到如下 log 信息,会发现发生了 2 次数据传输: - -- `shape`输出的结果在 CPU 上,在`slice`运行时,`shape`的输出被拷贝到 GPU 上 -- `slice`执行完的结果在 GPU 上,当`crop_tensor`执行时,它会被拷贝到 CPU 上。 - -```text -I0406 14:56:23.286592 17516 operator.cc:180] CUDAPlace(0) Op(shape), inputs:{Input[fill_constant_1.tmp_0:float[1, 3, 5, 5]({})]}, outputs:{Out[shape_0.tmp_0:int[4]({})]}. 
-I0406 14:56:23.286628 17516 eager_deletion_op_handle.cc:107] Erase variable fill_constant_1.tmp_0 on CUDAPlace(0) -I0406 14:56:23.286725 17516 operator.cc:1210] Transform Variable shape_0.tmp_0 from data_type[int]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[int]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN] -I0406 14:56:23.286763 17516 scope.cc:169] Create variable shape_0.tmp_0 -I0406 14:56:23.286784 17516 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0) -I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0) -I0406 14:56:23.287099 17516 operator.cc:180] CUDAPlace(0) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_0.tmp_0:int[4]({})], StartsTensor[], StartsTensorList[]}, outputs:{Out[slice_0.tmp_0:int[4]({})]}. -I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0) -I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace -I0406 14:56:23.287473 17516 operator.cc:180] CUDAPlace(0) Op(crop_tensor), inputs:{Offsets[], OffsetsTensor[], Shape[slice_0.tmp_0:int[4]({})], ShapeTensor[], X[fill_constant_0.tmp_0:float[1, 3, 8, 8]({})]}, outputs:{Out[crop_tensor_0.tmp_0:float[1, 3, 5, 5]({})]}. -``` - -### 使用 device_guard 避免不必要的数据传输 - -在上面的例子中,`shape`输出的是一个 1-D 的 Tensor,因此对于`slice`而言计算量很小。这种情况下如果将`slice`设置在 CPU 上运行,就可以避免 2 次数据传输。修改后的程序如下: - -```python -import paddle.fluid as fluid -import paddle.fluid.compiler as compiler -import paddle.fluid.profiler as profiler - -data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32') -data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32') -shape = fluid.layers.shape(data2) -with fluid.device_guard("cpu"): - shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4]) -out = fluid.layers.crop_tensor(data1, shape=shape) -place = fluid.CUDAPlace(0) -exe = fluid.Executor(place) -exe.run(fluid.default_startup_program()) -compiled_prog = compiler.CompiledProgram(fluid.default_main_program()) -with profiler.profiler('All', 'total') as prof: - for i in range(10): - result = exe.run(program=compiled_prog, fetch_list=[out]) -``` -再次观察 profile report 中`GpuMemCpy Summary`的内容,可以看到`GpuMemCpySync`已经被消除。在实际的模型中,若`GpuMemCpySync` 调用耗时占比较大,并且可以通过设置`device_guard`避免,那么就能够带来一定的性能提升。 - -```text --------------------------> Profiling Report <------------------------- - -Note! This Report merge all thread info into one. -Place: All -Time unit: ms -Sorted by total time in descending order in the same thread - -Total time: 14.5345 - Computation time Total: 4.47587 Ratio: 30.7948% - Framework overhead Total: 10.0586 Ratio: 69.2052% - -------------------------- GpuMemCpy Summary ------------------------- - -GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447% - GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447% - -------------------------- Event Summary ------------------------- - -Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio. 
-FastThreadedSSAGraphExecutorPrepare 10 7.70113 7.689066 (0.998433) 0.012064 (0.001567) 0.032657 7.39363 0.770113 0.529852 -fill_constant 20 2.62299 2.587022 (0.986287) 0.035968 (0.013713) 0.071097 0.342082 0.13115 0.180466 -shape 10 1.93504 1.935040 (1.000000) 0.000000 (0.000000) 0.026774 1.6016 0.193504 0.133134 -Fetch 10 0.880496 0.858512 (0.975032) 0.021984 (0.024968) 0.07392 0.140896 0.0880496 0.0605797 - GpuMemcpyAsync:GPU->CPU 10 0.457033 0.435049 (0.951898) 0.021984 (0.048102) 0.037836 0.071424 0.0457033 0.0314447 -crop_tensor 10 0.705426 0.671506 (0.951916) 0.033920 (0.048084) 0.05841 0.123901 0.0705426 0.0485346 -slice 10 0.324241 0.324241 (1.000000) 0.000000 (0.000000) 0.024299 0.07213 0.0324241 0.0223084 -eager_deletion 30 0.250524 0.250524 (1.000000) 0.000000 (0.000000) 0.004171 0.016235 0.0083508 0.0172365 -ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.047794 0.047794 (1.000000) 0.000000 (0.000000) 0.003344 0.014131 0.0047794 0.00328831 -InitLocalVars 1 0.034629 0.034629 (1.000000) 0.000000 (0.000000) 0.034629 0.034629 0.034629 0.00238254 -ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.032231 0.032231 (1.000000) 0.000000 (0.000000) 0.002952 0.004076 0.0032231 0.00221755 -``` - -### 总结 - -- 使用 profile 工具对模型进行分析,看是否存在 GpuMemcpySync 的调用耗时。若存在,则进一步分析发生数据传输的原因。 -- 可以通过 profile report 找到发生 GpuMemcpySync 的 OP。如果需要,可以通过打印 log,找到 GpuMemcpySync 发生的具体位置。 -- 尝试使用`device_guard`设置部分 OP 的运行设备,来减少 GpuMemcpySync 的调用。 -- 最后可以通过比较修改前后模型的 profile report,或者其他用来衡量性能的指标,确认修改后是否带来了性能提升。 diff --git a/docs/guides/performance_improving/memory_optimize.rst b/docs/guides/performance_improving/memory_optimize.rst deleted file mode 100644 index f4a892e1dd8..00000000000 --- a/docs/guides/performance_improving/memory_optimize.rst +++ /dev/null @@ -1,156 +0,0 @@ -.. _api_guide_memory_optimize: - -########### -存储分配与优化 -########### - -1. PaddlePaddle 的显存分配策略 -=========================== - -1.1. 显存自增长 AutoGrowth 策略 --------------------------- -自 1.6+的版本起,PaddlePaddle 支持显存自增长 AutoGrowth 策略,按需分配显存,且已于 1.7+版本中默认开启,方便用户在同一张 GPU 卡上同时运行多个任务。 - -由于原生的 CUDA 系统调用 :code:`cudaMalloc` 和 :code:`cudaFree` 均是同步操作,非常耗时。 -因此显存自增长 AutoGrowth 策略会缓存已分配到的显存,供后续分配使用,具体方式为: - -- 在前几次显存分配时,框架会调用 :code:`cudaMalloc` 按需分配,但释放时不会调用 :code:`cudaFree` 返回给 GPU,而是在框架内部缓存起来。 - -- 在随后的显存分配时,框架会首先检查缓存的显存中是否有合适的块,若有则从中分割出所需的显存空间返回,否则才调用 :code:`cudaMalloc` 直接从 GPU 中分配。随后的显存释放亦会缓存起来供后续分配使用。 - -因此,显存自增长 AutoGrowth 策略会在前几个 batch 训练时分配较慢(因为频繁调用 :code:`cudaMalloc` ),在随后训练过程中基本不会影响模型训练速度。 - -1.2. 显存预分配策略 ----------------- - -除了显存自增长 AutoGrowth 策略以外,PaddlePaddle 还提供了显存预分配策略。显存预分配策略是 PaddlePaddle 1.7 版本前的默认显存分配策略。 - -显存预分配策略会在第一次分配时分配很大 chunk_size 的显存块,随后的显存分配大多从预分配的显存块中切分获得。 -其中,chunk_size 由环境变量 :code:`FLAGS_fraction_of_gpu_memory_to_use` 确定,chunk_size 的计算公式为: - -.. code-block:: python - - chunk_size = FLAGS_fraction_of_gpu_memory_to_use * 单张 GPU 卡的当前可用显存值 - -:code:`FLAGS_fraction_of_gpu_memory_to_use` 的默认值为 0.92,即框架预先分配显卡 92%的当前可用显存值。 - -显存预分配策略分配显存的具体方式为: - -- 在分配 requested_size 大小的显存时, - - 若 requested_size <= chunk_size,则框架会预先分配 chunk_size 大小的显存池 chunk,并从 chunk 中分出 requested_size 大小的块返回。之后每次申请显存都会从 chunk 中分配。 - - 若 requested_size > chunk_size,则框架会直接调用 :code:`cudaMalloc` 分配 requested_size 大小的显存返回。 - -- 在释放 free_size 大小的显存时, - - 若 free_size <= chunk_size,则框架会将该显存放回预分配的 chunk 中,而不是直接返回给 CUDA。 - - 若 free_size > chunk_size,则框架会直接调用 :code:`cudaFree` 将显存返回给 CUDA。 - -若你的 GPU 卡上有其他任务占用显存,你可以适当将 :code:`FLAGS_fraction_of_gpu_memory_to_use` 减少,保证框架能预分配到合适的显存块,例如: - -.. 
code-block:: shell - - export FLAGS_fraction_of_gpu_memory_to_use=0.4 # 预先 40%的 GPU 显存 - -若 :code:`FLAGS_fraction_of_gpu_memory_to_use` 设为 0,则每次显存分配和释放均会调用 :code:`cudaMalloc` 和 :code:`cudaFree` ,会严重影响性能,不建议你使用。 -只有当你想测量网络的实际显存占用量时,你可以设置 :code:`FLAGS_fraction_of_gpu_memory_to_use` 为 0,观察 nvidia-smi 显示的显存占用情况。 - -1.3. 显存分配策略的选择方式 ------------------------ -自 1.6+版本起,PaddlePaddle 同时支持显存自增长 AutoGrowth 策略和显存预分配策略,并通过环境变量 :code:`FLAGS_allocator_strategy` 控制。 - -选择显存自增长 AutoGrowth 的方式为: - -.. code-block:: shell - - export FLAGS_allocator_strategy=auto_growth # 选择显存自增长 AutoGrowth 策略 - -选择显存预分配策略的方式为: - -.. code-block:: shell - - export FLAGS_allocator_strategy=naive_best_fit # 选择显存预分配策略 - -此外,自 1.7.2+版本起,PaddlePaddle 提供了环境变量 :code:`FLAGS_gpu_memory_limit_mb` ,用于控制单个任务进程可分配的最大显存,单位是 MB。默认值是 0,表示没有限制,可分配全部显存。如果设置为大于 0 的值,则会在分配的显存超过限制时报错,即使此时系统还存在空闲的显存空间。 - -2. PaddlePaddle 的存储优化策略 -=========================== - -PaddlePaddle 提供了多种通用存储优化方法,优化你的网络的存储占用(包括显存和内存)。 - -2.1. GC 策略: 存储垃圾及时回收 -------------------------- - -GC(Garbage Collection)的原理是在网络运行阶段及时释放无用变量的存储空间,达到节省存储空间的目的。GC 适用于使用 Executor,ParallelExecutor 做模型训练/预测的场合,但不适用于 C++预测库接口。 - -**GC 策略已于 1.6+版本中默认开启。** - -GC 策略由三个环境变量控制: - - -- :code:`FLAGS_eager_delete_tensor_gb` - -GC 策略的使能开关,double 类型,在<1.6 的版本中默认值为-1,在 1.6+版本中默认值为 0。GC 策略会积攒一定大小的存储垃圾后再统一释放,:code:`FLAGS_eager_delete_tensor_gb` 控制的是存储垃圾的阈值,单位是 GB。**建议用户设置** :code:`FLAGS_eager_delete_tensor_gb=0` 。 - -若 :code:`FLAGS_eager_delete_tensor_gb=0` ,则一旦有存储垃圾则马上回收,最为节省存储空间。 - -若 :code:`FLAGS_eager_delete_tensor_gb=1` ,则存储垃圾积攒到 1G 后才触发回收。 - -若 :code:`FLAGS_eager_delete_tensor_gb<0` ,则 GC 策略关闭。 - - -- :code:`FLAGS_memory_fraction_of_eager_deletion` - -GC 策略的调节 flag,double 类型,默认值为 1,范围为[0,1],仅适用于使用 ParallelExecutor 的场合。 -GC 内部会根据变量占用的存储空间大小,对变量进行降序排列,且仅回收前 :code:`FLAGS_memory_fraction_of_eager_deletion` 大的变量的存储空间。**建议用户维持默认值**,即 :code:`FLAGS_memory_fraction_of_eager_deletion=1` 。 - -若 :code:`FLAGS_memory_fraction_of_eager_deletion=0.6` ,则表示仅回收存储占用 60%大的变量的存储空间。 - -若 :code:`FLAGS_memory_fraction_of_eager_deletion=0` ,则表示不回收任何变量的存储空间,GC 策略关闭。 - -若 :code:`FLAGS_memory_fraction_of_eager_deletion=1` ,则表示回收所有变量的存储空间。 - - -- :code:`FLAGS_fast_eager_deletion_mode` - -快速 GC 策略的开关,bool 类型,默认值为 True,表示使用快速 GC 策略。快速 GC 策略会不等待 CUDA Kernel 结束直接释放显存。**建议用户维持默认值**,即 :code:`FLAGS_fast_eager_deletion_mode=True` 。 - - -2.2. Inplace 策略: Op 内部的输出复用输入 ----------------------------------- - -Inplace 策略的原理是 Op 的输出复用 Op 输入的存储空间。例如,reshape 操作的输出和输入可复用同一片存储空间。 - -Inplace 策略适用于使用 ParallelExecutor 的场合,通过 :code:`BuildStrategy` 设置。此策略不支持使用 Executor+Program 做单卡训练、使用 C++预测库接口等场合。 - -**Inplace 策略已于 1.6+版本中默认开启。** - -具体方式为: - -.. code-block:: python - - build_strategy = fluid.BuildStrategy() - build_strategy.enable_inplace = True # 开启 Inplace 策略 - - compiled_program = fluid.CompiledProgram(train_program, build_strategy=build_strategy) - - -在<1.6 的版本中,由于设计上的一些问题,在开启 Inplace 策略后,必须保证后续 exe.run 中 fetch_list 的变量是 persistable 的,即假如你后续需要 fetch 的变量为 loss 和 acc,则必须设置: - -.. code-block:: python - - loss.persistable = True - acc.persistable = True - - -**在 1.6+的版本中,无需设置 fetch 变量为 persistable。** - - -3. 
存储优化 Best Practice -======================= - -我们推荐你的最佳存储优化策略为: - -- 开启 GC 策略:设置 :code:`FLAGS_eager_delete_tensor_gb=0` 。 - -- 开启 Inplace 策略:设置 :code:`build_strategy.enable_inplace = True` ,并在<1.6 版本中设置 fetch_list 中的 :code:`var.persistable = True` 。 - -**在 1.6+的版本中,上述最佳策略均已默认打开,无需手动配置,亦无需设置 fetch_list 变量为 persistable。** diff --git a/docs/guides/performance_improving/memory_optimize_en.rst b/docs/guides/performance_improving/memory_optimize_en.rst deleted file mode 100644 index 0a2eceb9bca..00000000000 --- a/docs/guides/performance_improving/memory_optimize_en.rst +++ /dev/null @@ -1,176 +0,0 @@ -.. _api_guide_memory_optimize_en: - -########### -Memory Allocation and Optimization -########### - -1. Memory Allocation Strategy -=========================== - -1.1. AutoGrowth Strategy --------------------------- - -Since version 1.6+, PaddlePaddle supports the AutoGrowth strategy, which allocates memory on demand. -AutoGrowth strategy has been enabled by default in version 1.7+, making it convenient for users to -run multiple tasks on the same GPU card at the same time. - -Because the native CUDA system calls :code:`cudaMalloc` and :code:`cudaFree` are synchronous operations, -which are very time-consuming, the AutoGrowth strategy will cache the allocated memory for subsequent allocation. -The specific methods are as follows: - -- In the first few memory allocations, PaddlePaddle framework will call :code:`cudaMalloc` and allocate memory on demand. When releasing the allocated memory, it will not call :code:`cudaFree` to return the memory to GPU, but cache the memory inside the framework. - -- In the subsequent allocations, PaddlePaddle framework will first check if there is a fit block (block size larger than the required memory size) in the cached memory. If there is, it will split the required memory from the fit block and return. Otherwise, it will call :code:`cudaMalloc` to allocate memory from GPU. The allocated memory are also cached when being released for subsequent allocation. - -Therefore, the AutoGrowth strategy may slow the speed in the first few batches of model training, -but will not affect the speed in the subsequent training process. - -1.2. Pre-Allocation Strategy ----------------- - -In addition to the AutoGrowth strategy, paddlepaddle also provides a Pre-Allocation strategy, -which is the default memory allocation strategy before paddlepaddle 1.7. - -The Pre-Allocation strategy allocates a large size chunk at the first allocation, and the subsequent memory allocation is mostly obtained from the pre allocated memory chunk. -Among them, the chunk size is determined by the environment variable :code:`FLAGS_fraction_of_gpu_memory_to_use`, and the calculation formula of chunk size is: - -.. code-block:: python - - chunk_size = FLAGS_fraction_of_gpu_memory_to_use * number of current available memory of a single GPU card - -The default value of :code:`FLAGS_fraction_of_gpu_memory_to_use` is 0.92, that is, the framework will pre allocates -92% of the currently available memory of the GPU card. - -The specific way of Pre-Allocation strategy to allocate GPU memory is: - -- When allocating memory of requested_size, - - If requested_size <= chunk_size, the framework will first allocate a memory chunk of chunk_size, then split a block of requested_size and return the block. Every subsequent memory allocation will be performed on the chunk. - - If requested_size > chunk_size, the framework will call :code:`cudaMalloc` to allocate memory block of requested_size and return. 
- -- When freeing memory of requested_size, - - If free_size <= chunk_size, the framework will put the memory block back into the pre-allocated chunk, instead of returning back to GPU. - - If free_size > chunk_size, the framework will call :code:`cudaFree` and return the memory back to GPU. - -If there are other tasks on your GPU card that occupy the memory, you can appropriately decrease :code:`FLAGS_fraction_of_gpu_memory_to_use` -to ensure that the framework can pre-allocate the memory block of appropriate size, for example - -.. code-block:: shell - - export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% memory of a single GPU card - -If :code:`FLAGS_fraction_of_gpu_memory_to_use` is set to 0, the framework will call :code:`cudaMalloc` and :code:`cudaFree` every time the memory is allocated and released, which will seriously affect the performance and is not recommended. Only when you want to measure the actual memory usage of the network, you could set :code:`FLAGS_fraction_of_gpu_memory_to_use` to 0, and observe the memory usage of command nvidia-smi display. - -1.3. Configuration of memory allocation strategy ------------------------ -Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation Strategy, and control the strategy used in framework by -the environment variable :code:`FLAGS_allocator_strategy`. - -Use AutoGrowth strategy: - -.. code-block:: shell - - export FLAGS_allocator_strategy=auto_growth # Use AutoGrowth strategy - -Use Pre-Allocation strategy: - -.. code-block:: shell - - export FLAGS_allocator_strategy=naive_best_fit # Use Pre-Allocation strategy - -Plus, since version 1.7.2+, PaddlePaddle provides an environment variable :code:`FLAGS_gpu_memory_limit_mb`, which controls the maximum gpu memory limit that the process can allocate. -If it is equal to 0, there would be no limit and all gpu memory would be available to the process. If it is larger than 0, the process would raise out of memory error if the allocated -memory exceeds the limit even though there is available memory on the gpu card. The unit is MB and default value is 0. - -2. Memory Optimization Strategy -=========================== - -Paddlepaddle provides several general memory optimization methods to optimize the memory usage of your network (including general memory and GPU memory). - -2.1. GC Strategy: memory garbage eager collection -------------------------- - -The principle of GC(Garbage Collection)is to release the memory space of useless variables eagerly during network running, -in order to save memory space. GC is suitable for training and inference using Executor or ParallelExecutor, but it is not suitable for C++ inference library. - -**Since version 1.6+, GC Strategy is enabled by default.** - -GC Strategy is controlled by 3 environment variable: - - -- :code:`FLAGS_eager_delete_tensor_gb` - -Variable to enable GC, its data type is double. The default value is -1 in PaddlePaddle with version < 1.6, -and is 0 in PaddlePaddle with version >= 1.6. GC Strategy will cache a certain amount of memory garbage and release it uniformly. -:code:`FLAGS_eager_delete_tensor_gb` means the threshold of cached memory garbage, the unit of which is GB. **It is recommended to set** :code:`FLAGS_eager_delete_tensor_gb=0`. - -If :code:`FLAGS_eager_delete_tensor_gb=0`, once there is memory garbage, it will be collected immediately to save memory. 
- -If :code:`FLAGS_eager_delete_tensor_gb=1`, the memory garbage is collected when the cached amount of garbage reaches 1GB. - -If :code:`FLAGS_eager_delete_tensor_gb<0`, GC Strategy is disabled. - - -- :code:`FLAGS_memory_fraction_of_eager_deletion` - -Variable to control GC Strategy, its data type is double. The default value is 1, range [0,1]. It is only suitable for ParallelExecutor. -GC will sort the variables in descending order according to the memory space occupied by the variables, -and only collect the memory space of top :code:`FLAGS_memory_fraction_of_eager_deletion` variables. -**It is recommended to remain default value**, that is :code:`FLAGS_memory_fraction_of_eager_deletion=1`. - -If :code:`FLAGS_memory_fraction_of_eager_deletion=0.6`, top 60% variables will be collected. - -If :code:`FLAGS_memory_fraction_of_eager_deletion=0`, no variable will be collected, GC Strategy is disabled. - -If :code:`FLAGS_memory_fraction_of_eager_deletion=1`, all variables will be collected. - - -- :code:`FLAGS_fast_eager_deletion_mode` - -Variable to enable fast GC Strategy, its type is bool. The default value is True, which means use fast GC Strategy. -Fast GC Strategy will collect the memory garbage immediately instead of waiting for CUDA Kernel finish. **It is recommended to remain default value**, that is :code:`FLAGS_fast_eager_deletion_mode=True`. - - -2.2. Inplace Strategy: output reuses input inside operator ----------------------------------- - -The principle of Inplace strategy is that the output of some operators can reuses the memory space of input. -For example, the output and input of operator :code:`reshape` can reuse the same memory space. - -Inplace Strategy is suitable for ParallelExecutor, which can be set through :code:`BuildStrategy`. -The Strategy is not suitable for Executor+Program or C++ inference library. - -**Since version 1.6+, Inplace Strategy is enabled by default.** - -The specific way of Inplace strategy is: - -.. code-block:: python - - build_strategy = fluid.BuildStrategy() - build_strategy.enable_inplace = True # Enable Inplace Strategy - - compiled_program = fluid.CompiledProgram(train_program, build_strategy=build_strategy) - - -In PaddlePaddle with version < 1.6, due to of some design problems, when the Inplace Strategy is enabled, -the variable in fetch_list in the subsequent :code:`exe.run` must be persistent. -That is, if you the variables you want to fetch are loss and acc, you must set: - -.. code-block:: python - - loss.persistable = True - acc.persistable = True - - -**Since version 1.6+, setting variables in fetch_list to persistable is not needed.** - - -3. Memory Optimization Best Practice -======================= - -We recommend the best memory optimization strategy as: - -- Enable GC strategy:set :code:`FLAGS_eager_delete_tensor_gb=0`. - -- Enable Inplace strategy:set :code:`build_strategy.enable_inplace = True`, and set variables in fetch_list to persistable using :code:`var.persistable = True` when the version of PaddlePaddle < 1.6. 
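A minimal sketch that combines the two recommendations above (the GC flag is assumed to be exported before the framework is initialized; the tiny program below exists only to give :code:`CompiledProgram` something to compile):

.. code-block:: python

    import os

    # GC strategy: collect memory garbage eagerly.
    # The flag must be set before paddle.fluid is imported/initialized.
    os.environ['FLAGS_eager_delete_tensor_gb'] = '0'

    import paddle.fluid as fluid

    train_program = fluid.Program()
    startup_program = fluid.Program()
    with fluid.program_guard(train_program, startup_program):
        x = fluid.layers.data(name='x', shape=[1], dtype='float32')
        y = fluid.layers.fc(input=x, size=1)

    # Inplace strategy: allow operator outputs to reuse input memory.
    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True

    compiled_program = fluid.CompiledProgram(
        train_program, build_strategy=build_strategy)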
- -**Since version 1.6+, the above optimal strategy have been enabled by default and setting variables in fetch_list to persistable is not needed.** diff --git a/docs/guides/performance_improving/paddle_tensorrt_infer.md b/docs/guides/performance_improving/paddle_tensorrt_infer.md deleted file mode 100644 index 2890eceb4ab..00000000000 --- a/docs/guides/performance_improving/paddle_tensorrt_infer.md +++ /dev/null @@ -1,209 +0,0 @@ -# 使用 Paddle-TensorRT 库预测 - -NVIDIA TensorRT 是一个高性能的深度学习预测库,可为深度学习推理应用程序提供低延迟和高吞吐量。PaddlePaddle 采用子图的形式对 TensorRT 进行了集成,即我们可以使用该模块来提升 Paddle 模型的预测性能。该模块依旧在持续开发中,目前支持的模型如下表所示: - -|分类模型|检测模型|分割模型| -|---|---|---| -|mobilenetv1|yolov3|ICNET| -|resnet50|SSD|| -|vgg16|mask-rcnn|| -|resnext|faster-rcnn|| -|AlexNet|cascade-rcnn|| -|Se-ResNext|retinanet|| -|GoogLeNet|mobilenet-SSD|| -|DPN||| - -在这篇文档中,我们将会对 Paddle-TensorRT 库的获取、使用和原理进行介绍。 - -**Note:** - -1. 从源码编译时,TensorRT 预测库目前仅支持使用 GPU 编译,且需要设置编译选项 TENSORRT_ROOT 为 TensorRT 所在的路径。 -2. Windows 支持需要 TensorRT 版本 5.0 以上。 -3. Paddle-TRT 目前仅支持固定输入 shape。 -4. 下载安装 TensorRT 后,需要手动在`NvInfer.h`文件中为`class IPluginFactory`和`class IGpuAllocator`分别添加虚析构函数: - ``` c++ - virtual ~IPluginFactory() {}; - virtual ~IGpuAllocator() {}; - ``` - -## 内容 -- [Paddle-TRT 使用介绍](#Paddle-TRT 使用介绍) -- [Paddle-TRT 样例编译测试](#Paddle-TRT 样例编译测试) -- [Paddle-TRT INT8 使用](#Paddle-TRT_INT8 使用) -- [Paddle-TRT 子图运行原理](#Paddle-TRT 子图运行原理) -- [Paddle-TRT 性能测试](#Paddle-TRT 性能测试) - -## Paddle-TRT 使用介绍 - -在使用 AnalysisPredictor 时,我们通过配置 AnalysisConfig 中的接口 - -``` c++ -config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, - batch_size /* max_batch_size*/, - 3 /* min_subgraph_size*/, - AnalysisConfig::Precision::kFloat32 /* precision*/, - false /* use_static*/, - false /* use_calib_mode*/); -``` -的方式来指定使用 Paddle-TRT 子图方式来运行。 -该接口中的参数的详细介绍如下: - -- **`workspace_size`**,类型:int,默认值为 1 << 20。指定 TensorRT 使用的工作空间大小,TensorRT 会在该大小限制下筛选合适的 kernel 执行预测运算。 -- **`max_batch_size`**,类型:int,默认值为 1。需要提前设置最大的 batch 大小,运行时 batch 大小不得超过此限定值。 -- **`min_subgraph_size`**,类型:int,默认值为 3。Paddle-TRT 是以子图的形式运行,为了避免性能损失,当子图内部节点个数大于`min_subgraph_size`的时候,才会使用 Paddle-TRT 运行。 -- **`precision`**,类型:`enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, 默认值为`AnalysisConfig::Precision::kFloat32`。指定使用 TRT 的精度,支持 FP32(kFloat32),FP16(kHalf),Int8(kInt8)。若需要使用 Paddle-TRT int8 离线量化校准,需设定`precision`为 `AnalysisConfig::Precision::kInt8`, 且设置`use_calib_mode` 为 true。 -- **`use_static`**,类型:bool, 默认值为 false。如果指定为 true,在初次运行程序的时候会将 TRT 的优化信息进行序列化到磁盘上,下次运行时直接加载优化的序列化信息而不需要重新生成。 -- **`use_calib_mode`**,类型:bool, 默认值为 false。若要运行 Paddle-TRT int8 离线量化校准,需要将此选项设置为 true。 - -**Note:** Paddle-TRT 目前只支持固定 shape 的输入,不支持变化 shape 的输入。 - -## Paddle-TRT 样例编译测试 - -1. 下载或编译带有 TensorRT 的 paddle 预测库,参考[安装与编译 C++预测库](../../inference_deployment/inference/build_and_install_lib_cn.html)。 -2. 从[NVIDIA 官网](https://developer.nvidia.com/nvidia-tensorrt-download)下载对应本地环境中 cuda 和 cudnn 版本的 TensorRT,需要登陆 NVIDIA 开发者账号。 -3. 
下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)并解压,进入`sample/paddle-TRT`目录下。 - - `paddle-TRT` 文件夹目录结构如下: - - ``` - paddle-TRT - ├── CMakeLists.txt - ├── mobilenet_test.cc - ├── fluid_generate_calib_test.cc - ├── fluid_int8_test.cc - ├── mobilenetv1 - │ ├── model - │ └── params - ├── run.sh - └── run_impl.sh - ``` - - - `mobilenet_test.cc` 为使用 paddle-TRT 预测的 C++源文件 - - `fluid_generate_calib_test.cc` 为使用 TRT int8 离线量化校准的 C++源文件 - - `fluid_int8_test.cc` 为使用 TRT 执行 int8 预测的 C++源文件 - - `mobilenetv1` 为模型文件夹 - - `run.sh` 为预测运行脚本文件 - - 在这里假设样例所在的目录为 `SAMPLE_BASE_DIR/sample/paddle-TRT` - -4. 配置编译与运行脚本 - - 编译运行预测样例之前,需要根据运行环境配置编译与运行脚本`run.sh`。`run.sh`的选项与路径配置的部分如下: - - ```shell - # 设置是否开启 MKL、GPU、TensorRT,如果要使用 TensorRT,必须打开 GPU - WITH_MKL=ON - WITH_GPU=ON - USE_TENSORRT=ON - - # 按照运行环境设置预测库路径、CUDA 库路径、CUDNN 库路径、TensorRT 路径、模型路径 - LIB_DIR=YOUR_LIB_DIR - CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR - CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR - TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR - MODEL_DIR=YOUR_MODEL_DIR - ``` - - 按照实际运行环境配置`run.sh`中的选项开关和所需 lib 路径。 - -5. 编译与运行样例 - - -## Paddle-TRT INT8 使用 - -1. Paddle-TRT INT8 简介 - 神经网络的参数在一定程度上是冗余的,在很多任务上,我们可以在保证模型精度的前提下,将 Float32 的模型转换成 Int8 的模型。目前,Paddle-TRT 支持离线将预训练好的 Float32 模型转换成 Int8 的模型,具体的流程如下: - - 1) **生成校准表**(Calibration table):我们准备 500 张左右的真实输入数据,并将数据输入到模型中去,Paddle-TRT 会统计模型中每个 op 输入和输出值的范围信息,并将其记录到校准表中,这些信息有效减少了模型转换时的信息损失。 - - 2) 生成校准表后,再次运行模型,**Paddle-TRT 会自动加载校准表**,并进行 INT8 模式下的预测。 - -2. 编译测试 INT8 样例 - 将`run.sh`文件中的`mobilenet_test`改为`fluid_generate_calib_test`,运行 - - ``` shell - sh run.sh - ``` - - 即可执行生成校准表样例,在该样例中,我们随机生成了 500 个输入来模拟这一过程,在实际业务中,建议大家使用真实样例。运行结束后,在 `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` 模型目录下会多出一个名字为 trt_calib_*的文件,即校准表。 - - 生成校准表后,将带校准表的模型文件拷贝到特定地址 - - ``` shell - cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib - ``` - - 将`run.sh`文件中的`fluid_generate_calib_test`改为`fluid_int8_test`,将模型路径改为`SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`,运行 - - ``` shell - sh run.sh - ``` - - 即可执行 int8 预测样例。 - -## Paddle-TRT 子图运行原理 - - PaddlePaddle 采用子图的形式对 TensorRT 进行集成,当模型加载后,神经网络可以表示为由变量和运算节点组成的计算图。Paddle TensorRT 实现的功能是对整个图进行扫描,发现图中可以使用 TensorRT 优化的子图,并使用 TensorRT 节点替换它们。在模型的推断期间,如果遇到 TensorRT 节点,Paddle 会调用 TensorRT 库对该节点进行优化,其他的节点调用 Paddle 的原生实现。TensorRT 在推断期间能够进行 Op 的横向和纵向融合,过滤掉冗余的 Op,并对特定平台下的特定的 Op 选择合适的 kernel 等进行优化,能够加快模型的预测速度。 - -下图使用一个简单的模型展示了这个过程: - -**原始网络** -
**转换的网络**
- - - 我们可以在原始模型网络中看到,绿色节点表示可以被 TensorRT 支持的节点,红色节点表示网络中的变量,黄色表示 Paddle 只能被 Paddle 原生实现执行的节点。那些在原始网络中的绿色节点被提取出来汇集成子图,并由一个 TensorRT 节点代替,成为转换后网络中的`block-25` 节点。在网络运行过程中,如果遇到该节点,Paddle 将调用 TensorRT 库来对其执行。 - -## Paddle-TRT 性能测试 - -### 测试环境 -- CPU:Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz GPU:Tesla P4 -- TensorRT4.0, CUDA8.0, CUDNNV7 -- 测试模型 ResNet50,MobileNet,ResNet101, Inception V3. - -### 测试对象 -**PaddlePaddle, PyTorch, TensorFlow** - -- 在测试中,PaddlePaddle 使用子图优化的方式集成了 TensorRT, 模型[地址](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)。 -- PyTorch 使用了原生的实现, 模型[地址 1](https://github.com/pytorch/vision/tree/master/torchvision/models)、[地址 2](https://github.com/marvis/pytorch-mobilenet)。 -- 对 TensorFlow 测试包括了对 TF 的原生的测试,和对 TF—TRT 的测试,**对 TF—TRT 的测试并没有达到预期的效果,后期会对其进行补充**, 模型[地址](https://github.com/tensorflow/models)。 - - -#### ResNet50 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|4.64117 |16.3|10.878| -|5|6.90622| 22.9 |20.62| -|10|7.9758 |40.6|34.36| - -#### MobileNet - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1| 1.7541 | 7.8 |2.72| -|5| 3.04666 | 7.8 |3.19| -|10|4.19478 | 14.47 |4.25| - -#### ResNet101 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|8.95767| 22.48 |18.78| -|5|12.9811 | 33.88 |34.84| -|10|14.1463| 61.97 |57.94| - - -#### Inception v3 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|15.1613 | 24.2 |19.1| -|5|18.5373 | 34.8 |27.2| -|10|19.2781| 54.8 |36.7| diff --git a/docs/guides/performance_improving/paddle_tensorrt_infer_en.md b/docs/guides/performance_improving/paddle_tensorrt_infer_en.md deleted file mode 100644 index 0acc384ab2a..00000000000 --- a/docs/guides/performance_improving/paddle_tensorrt_infer_en.md +++ /dev/null @@ -1,200 +0,0 @@ -# Use Paddle-TensorRT Library for inference - -NVIDIA TensorRT is a is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference application. -Subgraph is used in PaddlePaddle to preliminarily integrate TensorRT, which enables TensorRT module to enhance inference performance of paddle models. The module is still under development. Currently supported models are as following: - -|classification|detection|segmentation| -|---|---|---| -|mobilenetv1|yolov3|ICNET| -|resnet50|SSD|| -|vgg16|mask-rcnn|| -|resnext|faster-rcnn|| -|AlexNet|cascade-rcnn|| -|Se-ResNext|retinanet|| -|GoogLeNet|mobilenet-SSD|| -|DPN||| - -We will introduce the obtaining, usage and theory of Paddle-TensorRT library in this documentation. - -**Note:** - -1. When compiling from source, TensorRT library currently only supports GPU compilation, and you need to set the compilation option TensorRT_ROOT to the path where tensorrt is located. -2. Windows support requires TensorRT version 5.0 or higher. -3. Paddle-TRT currently only supports fixed input shape. -4. 
After downloading and installing tensorrt, you need to manually add virtual destructors for `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file: - ``` c++ - virtual ~IPluginFactory() {}; - virtual ~IGpuAllocator() {}; - ``` - -## Paddle-TRT interface usage - -When using AnalysisPredictor, we enable Paddle-TRT by setting - -``` c++ -config->EnableTensorRtEngine(1 << 20 /* workspace_size*/, - batch_size /* max_batch_size*/, - 3 /* min_subgraph_size*/, - AnalysisConfig::Precision::kFloat32 /* precision*/, - false /* use_static*/, - false /* use_calib_mode*/); -``` -The details of this interface is as following: - -- **`workspace_size`**: type:int, default is 1 << 20. Sets the max workspace size of TRT. TensorRT will choose kernels under this constraint. -- **`max_batch_size`**: type:int, default is 1. Sets the max batch size. Batch sizes during runtime cannot exceed this value. -- **`min_subgraph_size`**: type:int, default is 3. Subgraph is used to integrate TensorRT in PaddlePaddle. To avoid low performance, Paddle-TRT is only enabled when th number of nodes in th subgraph is more than `min_subgraph_size`. -- **`precision`**: type:`enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, default is `AnalysisConfig::Precision::kFloat32`. Sets the precision of TRT, supporting FP32(kFloat32), FP16(kHalf), Int8(kInt8). Using Paddle-TRT int8 calibration requires setting `precision` to `AnalysisConfig::Precision::kInt8`, and `use_calib_mode` to true. -- **`use_static`**: type:bool, default is false. If set to true, Paddle-TRT will serialize optimization information to disk, to deserialize next time without optimizing again. -- **`use_calib_mode`**: type:bool, default is false. Using Paddle-TRT int8 calibration requires setting this option to true. - -**Note:** Paddle-TRT currently only supports fixed input shape. - -## Paddle-TRT example compiling test - -1. Download or compile Paddle Inference with TensorRT support, refer to [Install and Compile C++ Inference Library](../../inference_deployment/inference/build_and_install_lib_en.html). -2. Download NVIDIA TensorRT(with consistent version of cuda and cudnn in local environment) from [NVIDIA TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) with an NVIDIA developer account. -3. Download [Paddle Inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress, and enter `sample/paddle-TRT` directory. - - `paddle-TRT` directory structure is as following: - - ``` - paddle-TRT - ├── CMakeLists.txt - ├── mobilenet_test.cc - ├── fluid_generate_calib_test.cc - ├── fluid_int8_test.cc - ├── mobilenetv1 - │ ├── model - │ └── params - ├── run.sh - └── run_impl.sh - ``` - - - `mobilenet_test.cc` is the c++ source code of inference using Paddle-TRT - - `fluid_generate_calib_test.cc` is the c++ source code of inference using Paddle-TRT int8 calibration to generate calibration table - - `fluid_int8_test.cc` is the c++ source code of inference using Paddle-TRT int8 - - `mobilenetv1` is the model dir - - `run.sh` is the script for running inference - - Here we assume that the current directory is `SAMPLE_BASE_DIR/sample/paddle-TRT`. - - ``` shell - # set whether to enable MKL, GPU or TensorRT. 
Enabling TensorRT requires WITH_GPU being ON - WITH_MKL=ON - WITH_GPU=OFF - USE_TENSORRT=OFF - - # set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir - LIB_DIR=YOUR_LIB_DIR - CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR - CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR - TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR - MODEL_DIR=YOUR_MODEL_DIR - ``` - - Please configure `run.sh` depending on your environment. - -4. Build and run the sample. - - ``` shell - sh run.sh - ``` - -## Paddle-TRT INT8 usage - -1. Paddle-TRT INT8 introduction - The parameters of the neural network are redundant to some extent. In many tasks, we can turn the Float32 model into Int8 model on the premise of precision. At present, Paddle-TRT supports to turn the trained Float32 model into Int8 model off line. The specific processes are as follows: - - 1)**Create the calibration table**. We prepare about 500 real input data, and input the data to the model. Paddle-TRT will count the range information of each op input and output value in the model, and record in the calibration table. The information can reduce the information loss during model transformation. - - 2)After creating the calibration table, run the model again, **Paddle-TRT will load the calibration table automatically**, and conduct the inference in the INT8 mode. - -2. compile and test the INT8 example - - change the `mobilenet_test` in `run.sh` to `fluid_generate_calib_test` and run - - ``` shell - sh run.sh - ``` - - We generate 500 input data to simulate the process, and it's suggested that you use real example for experiment. After the running period, there will be a new file named trt_calib_* under the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` model directory, which is the calibration table. - - Then copy the model dir with calibration infomation to path - - ``` shell - cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib - ``` - - change `fluid_generate_calib_test` in `run.sh` to `fluid_int8_test`, and change model dir path to `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib` and run - - ``` shell - sh run.sh - ``` - -## Paddle-TRT subgraph operation principle - - Subgraph is used to integrate TensorRT in PaddlePaddle. After model is loaded, neural network can be represented as a computing graph composed of variables and computing nodes. Functions Paddle TensorRT implements are to scan the whole picture, discover subgraphs that can be optimized with TensorRT and replace them with TensorRT nodes. During the inference of model, Paddle will call TensorRT library to optimize TensorRT nodes and call native library of Paddle to optimize other nodes. During the inference, TensorRT can integrate Op horizonally and vertically to filter redundant Ops and is able to choose appropriate kernel for specific Op in specific platform to speed up the inference of model. - - -A simple model expresses the process : - -**Original Network** -
**Transformed Network**
- - We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in network and yellow nodes represent nodes which can only be operated by native functions in Paddle. Green nodes in original network are extracted to compose subgraph which is replaced by a single TensorRT node to be transformed into `block-25` node in network. When such nodes are encountered during the runtime, TensorRT library will be called to execute them. - -## Paddle-TRT benchmark - -### Test Environment -- CPU:Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz GPU:Tesla P4 -- TensorRT 4.0, CUDA 8.0, CUDNN V7 -- models: ResNet50,MobileNet,ResNet101, Inception V3. - -### Test set -**PaddlePaddle, PyTorch, TensorFlow** - -- PaddlePaddle integrates TensorRT with subgraph, model[link](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)。 -- PyTorch uses original kernels, model[link1](https://github.com/pytorch/vision/tree/master/torchvision/models), [link2](https://github.com/marvis/pytorch-mobilenet)。 -- We tested TF original and TF-TRT**对 TF—TRT 的测试并没有达到预期的效果,后期会对其进行补充**, model[link](https://github.com/tensorflow/models)。 - - -#### ResNet50 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|4.64117 |16.3|10.878| -|5|6.90622| 22.9 |20.62| -|10|7.9758 |40.6|34.36| - -#### MobileNet - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1| 1.7541 | 7.8 |2.72| -|5| 3.04666 | 7.8 |3.19| -|10|4.19478 | 14.47 |4.25| - -#### ResNet101 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|8.95767| 22.48 |18.78| -|5|12.9811 | 33.88 |34.84| -|10|14.1463| 61.97 |57.94| - - -#### Inception v3 - -|batch_size|PaddlePaddle(ms)|PyTorch(ms)|TensorFlow(ms)| -|---|---|---|---| -|1|15.1613 | 24.2 |19.1| -|5|18.5373 | 34.8 |27.2| -|10|19.2781| 54.8 |36.7| diff --git a/docs/guides/performance_improving/quantization.md b/docs/guides/performance_improving/quantization.md deleted file mode 100644 index c07c0b00d6f..00000000000 --- a/docs/guides/performance_improving/quantization.md +++ /dev/null @@ -1,169 +0,0 @@ -# 飞桨模型量化 - -深度学习技术飞速发展,在很多任务和领域都超越传统方法。但是,深度学习模型通常需要较大的存储空间和计算量,给部署应用带来了不小挑战。 - -模型量化作为一种常见的模型压缩方法,使用整数替代浮点数进行存储和计算,可以减少模型存储空间、加快推理速度、降低计算内存,助力深度学习应用的落地。 - -飞桨提供了模型量化全流程解决方案,首先使用 PaddleSlim 产出量化模型,然后使用 Paddle Inference 和 Paddle Lite 部署量化模型。 - -
图 1. 飞桨模型量化全流程解决方案
- -## 产出量化模型 - -飞桨模型量化全流程解决方案中,PaddleSlim 负责产出量化模型。 - -PaddleSlim 支持三种模型量化方法:动态离线量化方法、静态离线量化方法和量化训练方法。这三种量化方法的特点如下图。 - -
图 2. 量化方法概述
- -动态离线量化方法不需要使用样本数据,也不会对模型进行训练。在模型产出阶段,动态离线量化方法将模型权重从浮点数量化成整数。在模型部署阶段,将权重从整数反量化成浮点数,使用浮点数运算进行预测推理。这种方式主要减少模型存储空间,对权重读取费时的模型有一定加速作用,对模型精度影响较小。 - -静态离线量化方法要求有少量无标签样本数据,需要执行模型的前向计算,不会对模型进行训练。在模型产出阶段,静态离线量化方法使用样本数据执行模型的前向计算,同时对量化 OP 的输入输出进行采样,然后计算量化信息。在模型部署阶段,使用计算好的量化信息对输入进行量化,基于整数运算进行预测推理。静态离线量化方法可以减少模型存储空间、加快模型推理速度、降低计算内存,同时量化模型只存在较小的精度损失。 - -量化训练方法要求有大量有标签样本数据,需要对模型进行较长时间的训练。在模型产出阶段,量化训练方法使用模拟量化的思想,在模型训练过程中会更新权重,实现拟合、减少量化误差的目的。在模型部署阶段,量化训练方法和静态离线量化方法一致,采用相同的预测推理方式,在存储空间、推理速度、计算内存三方面实现相同的收益。更重要的是,量化训练方法对模型精度只有极小的影响。 - - -根据使用条件和压缩目的,大家可以参考下图选用不同的模型量化方法产出量化模型。 - -
图 3. 选择量化方法
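下面给出一个产出量化模型的粗略示意（以静态离线量化为例，假设使用 PaddleSlim 提供的 `quant_post_static` 接口；模型路径、输入形状均为示意值，参数细节请以 PaddleSlim 文档为准）：

```python
import numpy as np
import paddle
from paddleslim.quant import quant_post_static

paddle.enable_static()
exe = paddle.static.Executor(paddle.CPUPlace())

# 少量无标签校准数据的生成器（示意：实际应使用真实样本，形状需与模型输入一致）
def sample_generator():
    for _ in range(32):
        yield [np.random.random([3, 224, 224]).astype('float32')]

quant_post_static(
    executor=exe,
    model_dir='/PATH/TO/FP32/MODEL',             # 待量化的 FP32 推理模型目录（示意路径）
    quantize_model_path='/PATH/TO/QUANT/MODEL',  # 量化模型保存目录（示意路径）
    sample_generator=sample_generator,
    batch_nums=10)
```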
- -产出量化模型的使用方法、Demo 和 API,请参考[PaddleSlim 文档](https://paddleslim.readthedocs.io/zh_CN/latest/index.html)。 - -## 部署量化模型 - -飞桨模型量化全流程解决方案中,Paddle Inference 负责在服务器端(X86 CPU 和 Nvidia GPU)部署量化模型,Paddle Lite 负责在移动端(ARM CPU)上部署量化模型。 - -X86 CPU 和 Nvidia GPU 上支持部署 PaddleSlim 静态离线量化方法和量化训练方法产出的量化模型。 -ARM CPU 上支持部署 PaddleSlim 动态离线量化方法、静态离线量化方法和量化训练方法产出的量化模型。 - -因为动态离线量化方法产出的量化模型主要是为了压缩模型体积,主要应用于移动端部署,所以在 X86 CPU 和 Nvidia GPU 上暂不支持这类量化模型。 - -### NV GPU 上部署量化模型 - -使用 PaddleSlim 静态离线量化方法和量化训练方法产出量化模型后,可以使用 Paddle Inference 在 Nvidia GPU 上部署该量化模型。 - -Nvidia GPU 上部署常规模型的流程是:准备 TensorRT 环境、配置 Config、创建 Predictor、执行。Nvidia GPU 上部署量化模型和常规模型大体相似,需要改动的是:指定 TensorRT 配置时将 precision_mode 设置为 paddle_infer.PrecisionType.Int8,将 use_calib_mode 设为 False。 - -``` -config.enable_tensorrt_engine( - workspace_size=1<<30, - max_batch_size=1, - min_subgraph_size=5, - precision_mode=paddle_infer.PrecisionType.Int8, - use_static=False, - use_calib_mode=False) -``` - -Paddle Inference 的详细说明,请参考[文档](https://paddleinference.paddlepaddle.org.cn/product_introduction/summary.html)。 - -Nvidia GPU 上部署量化模型的详细说明,请参考[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)。 - -### X86 CPU 上部署量化模型 - -使用 PaddleSlim 静态离线量化方法和量化训练方法产出量化模型后,可以使用 Paddle Inference 在 X86 CPU 上部署该量化模型。 - -X86 CPU 上部署量化模型,首先检查 X86 CPU 支持指令集,然后转换量化模型,最后在 X86 CPU 上执行预测。 - -Paddle Inference 的详细说明,请参考[文档](https://paddle-inference.readthedocs.io/en/latest/#)。 - -X86 CPU 上部署量化模型的详细说明,请参考[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)。 - -1)检查 X86 CPU 支持指令集 - -大家可以在命令行中输入 lscpu 查看本机支持指令。 - -在支持 avx512、不支持 avx512_vnni 的 X86 CPU 上(如:SkyLake, Model name:Intel(R) Xeon(R) Gold X1XX),量化模型性能为原始模型性能的 1.5 倍左右。 - -在支持 avx512 和 avx512_vnni 的 X86 CPU 上(如:Casecade Lake, Model name: Intel(R) Xeon(R) Gold X2XX),量化模型的精度和性能最高,量化模型性能为原始模型性能的 3~3.7 倍。 - -2)转换量化模型 - -下载[转换脚本](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py)到本地. 
-``` -wget https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py -``` - -使用脚本转换量化模型,比如: -``` -python save_quant_model.py \ - --quant_model_path=/PATH/TO/PADDLESLIM/GENERATE/MODEL \ - --int8_model_save_path=/PATH/TO/SAVE/CONVERTED/MODEL -``` - -3)执行预测 - -准备预测库,加载转换后的量化模型,创建 Predictor,进行预测。 - -注意,在 X86 CPU 预测端部署量化模型,必须开启 MKLDNN,不要开启 IrOptim(模型已经转换好)。 - -4)数据展示 - -**图像分类 INT8 模型在 Intel(R) Xeon(R) Gold 6271 上精度** - -| Model | FP32 Top1 Accuracy | INT8 Top1 Accuracy | Top1 Diff | FP32 Top5 Accuracy | INT8 Top5 Accuracy | Top5 Diff | -|:------------:|:------------------:|:------------------:|:---------:|:------------------:|:------------------:|:---------:| -| MobileNet-V1 | 70.78% | 70.74% | -0.04% | 89.69% | 89.43% | -0.26% | -| MobileNet-V2 | 71.90% | 72.21% | 0.31% | 90.56% | 90.62% | 0.06% | -| ResNet101 | 77.50% | 77.60% | 0.10% | 93.58% | 93.55% | -0.03% | -| ResNet50 | 76.63% | 76.50% | -0.13% | 93.10% | 92.98% | -0.12% | -| VGG16 | 72.08% | 71.74% | -0.34% | 90.63% | 89.71% | -0.92% | -| VGG19 | 72.57% | 72.12% | -0.45% | 90.84% | 90.15% | -0.69% | - -**图像分类 INT8 模型在 Intel(R) Xeon(R) Gold 6271 单核上性能** - -| Model | FP32 (images/s) | INT8 (images/s) | Ratio (INT8/FP32) | -|:------------:|:---------------:|:---------------:|:-----------------:| -| MobileNet-V1 | 74.05 | 216.36 | 2.92 | -| MobileNet-V2 | 88.60 | 205.84 | 2.32 | -| ResNet101 | 7.20 | 26.48 | 3.68 | -| ResNet50 | 13.23 | 50.02 | 3.78 | -| VGG16 | 3.47 | 10.67 | 3.07 | -| VGG19 | 2.83 | 9.09 | 3.21 | - -**Ernie INT8 模型在 Intel(R) Xeon(R) Gold 6271 的精度结果** - -| Model | FP32 Accuracy | INT8 Accuracy | Accuracy Diff | -| :---: | :-----------: | :-----------: | :-----------: | -| Ernie | 80.20% | 79.44% | -0.76% | - - -**Ernie INT8 模型在 Intel(R) Xeon(R) Gold 6271 上单样本耗时** - -| Threads | FP32 Latency (ms) | INT8 Latency (ms) | Ratio (FP32/INT8) | -| :--------: | :---------------: | :---------------: | :---------------: | -| 1 thread | 237.21 | 79.26 | 2.99X | -| 20 threads | 22.08 | 12.57 | 1.76X | - - -### ARM CPU 上部署量化模型 - -Paddle Lite 可以在 ARM CPU 上部署 PaddleSlim 动态离线量化方法、静态离线量化方法和量化训练方法产出的量化模型。 - -Paddle Lite 部署量化模型的方法和常规非量化模型完全相同,主要包括使用 opt 工具进行模型优化、执行预测。 - -Paddle Lite 的详细说明,请参考[文档](https://www.paddlepaddle.org.cn/lite)。 - -Paddle Lite 部署动态离线量化方法产出的量化模型,请参考[文档](https://www.paddlepaddle.org.cn/lite/develop/user_guides/quant/quant_post_dynamic.html)。 - -Paddle Lite 部署静态离线量化方法产出的量化模型,请参考[文档](https://www.paddlepaddle.org.cn/lite/develop/user_guides/quant/quant_post_static.html)。 - -Paddle Lite 部署量化训练方法产出的量化模型,请参考[文档](https://www.paddlepaddle.org.cn/lite/develop/user_guides/quant_aware.html)。 - -**模型量化前后性能对比** - -| 骁龙 855 | armv7(ms) | armv7(ms) | armv7(ms) | armv8(ms) | armv8(ms) | armv8(ms) | -|:------:|:---------:|:---------: | :-------: | :--------:| :--------:| :--------:| -| threads num| 1 | 2 | 4 | 1 | 2 | 4 | -| mobilenet_v1_fp32 | 32.19 | 18.75 | 11.02 | 29.50 | 17.50 | 9.58 | -| mobilenet_v1_int8 | 19.00 | 10.93 | 5.97 | 13.08 | 7.68 | 3.98 | -| mobilenet_v2_fp32 | 23.77 | 14.23 | 8.52 | 19.98 | 12.19 | 7.44 | -| mobilenet_v2_int8 | 17.68 | 10.49 | 5.93 | 12.76 | 7.70 | 4.36 |
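作为上文 X86 CPU 部署量化模型中“执行预测”一步的补充，下面给出一个使用 Paddle Inference Python API 创建 Predictor 并执行预测的最小示意（模型路径、输入名和输入形状均为示意值，请按实际模型调整）：

```python
import numpy as np
import paddle.inference as paddle_infer

# 加载已转换好的 INT8 量化模型（示意路径）
config = paddle_infer.Config("/PATH/TO/SAVE/CONVERTED/MODEL")
config.disable_gpu()
config.enable_mkldnn()          # X86 CPU 上部署量化模型必须开启 MKLDNN
config.switch_ir_optim(False)   # 模型已转换好，不再开启 IrOptim
config.set_cpu_math_library_num_threads(1)

predictor = paddle_infer.create_predictor(config)

# 构造示例输入并执行一次预测
input_name = predictor.get_input_names()[0]
input_handle = predictor.get_input_handle(input_name)
fake_input = np.random.random([1, 3, 224, 224]).astype("float32")
input_handle.reshape(fake_input.shape)
input_handle.copy_from_cpu(fake_input)
predictor.run()
output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
result = output_handle.copy_to_cpu()
```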