2.4.0 Release Note

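1. Important Updates

The new dynamic graph architecture officially takes effect: The new dynamic graph framework significantly improves scheduling performance. Scheduling performance improves by more than 50% for over 90% of APIs, and model performance improves by more than 5% for over 50% of the models in the PaddlePaddle suites. The functional architecture is clearer, and secondary development capability and experience are significantly enhanced.
Comprehensively improved dynamic-static unification in PaddlePaddle: The dynamic-to-static feature supports richer Python syntax, and Python syntax coverage reaches 90%. The syntax transcription logic has been heavily optimized to fully support control flow syntax, giving a smoother one-step conversion to static graphs. With the newly upgraded static graph executor, dynamic-to-static training accelerates better, and tests on key models show performance close to the best static graph level. Dynamic-to-static is also more extensible: it newly supports merged export and inference of multiple functions, and users can use the PHI operator library for secondary development and flexible deployment, which effectively supports custom decoding for the U2++ model in the speech domain.
New sparse computing APIs: Added 55 sparse APIs under paddle.sparse.* that cover mainstream sparse computing scenarios. They have been applied to sparse training and inference deployment for tasks such as 3D point cloud object detection and Sparse Transformers. In high-sparsity scenarios they are 105.75% faster than DenseTensor, and 4.01% to 58.55% faster than the sparse computation of comparable products. Computation on several sparse Tensor formats (SparseCoo, SparseCsr, and others) is supported, which greatly reduces GPU memory usage. The APIs are used in the same way as their dense Tensor counterparts, so the experience stays consistent.
Large-scale graph neural network GPU training engine: Heterogeneous hierarchical storage across SSD, host memory, and GPU memory breaks the GPU memory bottleneck and supports all-GPU storage and training of ultra-large graphs. Random walk, sampling, and training are integrated into an all-GPU solution, which trains more than 10x faster than a traditional distributed CPU solution at the same cost.
Environment adaptation: Added pre-compiled installation packages for CUDA 11.7, and added support for running on Ubuntu 22.04 and later.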

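Forward-looking notices

PaddlePaddle will drop support for Python 3.6 in version 2.5.
PaddlePaddle will gradually deprecate the Python-side APIs under the paddle.fluid namespace; some APIs under this namespace will be removed outright in version 2.5.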

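2. Incompatible Upgrades

The pre-compiled installation package for CUDA 10.1 is no longer provided.
Tensor.clear_gradient(bool set_to_zero) no longer accepts values passed through kwargs; the set_to_zero bool must be passed as a positional argument.
To use GPU memory more efficiently, dynamic graph mode now retains by default only the gradients of forward leaf-node variables, such as the gradients of network parameters during training, and no longer retains the gradients of non-leaf nodes by default. To keep the gradient of a specific Tensor, call Tensor.retain_grads() before running the backward pass.
paddle.autograd.PyLayer no longer supports tuple inputs. To pass a group of Tensors, pass a list of Tensor.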

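3. Training Framework (including distributed training)

（1）New APIs and enhanced API functions

New sparse computing APIs: paddle.sparse
- Added 55 sparse APIs that cover mainstream sparse computing scenarios. They have been applied to sparse training and inference deployment for tasks such as 3D point cloud object detection and Sparse Transformers. In high-sparsity scenarios they are 105.75% faster than DenseTensor, and 4.01% to 58.55% faster than the sparse computation of comparable products. Computation on several sparse Tensor formats (SparseCoo, SparseCsr, and others) is supported, which greatly reduces GPU memory usage. The APIs are used in the same way as their dense Tensor counterparts, so the experience stays consistent. #45849, #46694, #45086, #41857, #42935, #43475, #43668, #43966, #44022, #44346, #44432, #44451, #44743, #42013, #43520, #41434, #42130, #41276, #41857, #41356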
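New audio APIs: paddle.audio
- Added feature extraction APIs such as MFCC, Spectrogram, and LogMelSpectrogram with GPU support. Processing is more than 15x faster than the CPU implementation, which can substantially raise GPU utilization during speech model training. #45424
- Added basic feature extraction APIs such as window functions and the discrete cosine transform, so that users can build custom speech feature extraction. #45424
- Added a speech I/O module that provides 2 audio I/O backends and supports 6 codecs for convenient loading of speech data. #45939
- Added the TESS and ESC50 speech classification datasets, so that users can easily build classic speech classification models. #45939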
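New graph learning APIs: paddle.geometric
- Graph learning is becoming a key technology in machine learning. The new paddle.geometric module provides a better modeling and training experience for graph learning.
  - Message passing: Message passing is the foundation of graph modeling, so 7 graph learning message passing APIs were added to make graph modeling more convenient. Among them, 3 new fused message passing operators substantially reduce GPU memory usage during graph model training. In dense graph scenarios, GCN-family models save more than 50% of GPU memory and train more than 20% faster. #44848, #44580, #43174, #44970
  - Graph sampling: Graph sampling is the performance bottleneck of graph model training. A new high-performance graph sampling operator supports highly concurrent sampling. GraphSage sampling becomes more than 32x faster, and model training more than 12x faster. #44970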
New vision APIs
- paddle.vision adds object detection operators: paddle.vision.distribute_fpn_proposals(#43736), paddle.vision.generate_proposals(#43611), paddle.vision.matrix_nms(#44357), paddle.vision.prior_box and paddle.vision.box_coder(#47282).
Other new APIs
- Added iinfo(#45321), count_nonzero(#44169), nanmedian(#42385), remainder_ (#45266), take(#44741), triu_indices(#45168), sgn(#44568), bucketize(#44195), nanquantile(#41343), frac(#41226), logcumsumexp(#42267), pairwise_distance(#44161), heaviside(#41872), logspace(#41261), corrcoef(#40690)
- Added RReLU(#41823), CyclicLR(#40698), OneCycleLR(#41825), Softmax2D(#40910), SoftMarginLoss(#42364), MultiLabelSoftMarginLoss(#41183), TripletMarginLoss(#40487), TripletMarginWithDistanceLoss(#40545), CosineEmbeddingLoss and cosine_embedding_loss(#41680), PixelUnshuffle(#40728), ChannelShuffle(#40743)
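Enhanced API functions
- Added support for large batch_size computation in BatchNorm1D. #43072
Improved collective communication distributed training APIs
- Improved the fleet.init function by adding a log_level parameter, so that users can view logs while a job is running. #45909
- Added the paddle.distributed.fleet.recompute_sequential and paddle.distributed.fleet.recompute_hybrid interfaces to make the recompute feature easier to use. #45348
- Added the paddle.distributed.fleet.layers.mpu package to make tensor parallelism easier to use. #45803
- Added the communication APIs paddle.distributed.destroy_process_group, paddle.distributed.isend, paddle.distributed.irecv, and paddle.distributed.all_to_all_single, improving the completeness and usability of communication. #43918
- Added the paddle.distributed.stream communication package, which performs 5% to 10% better than the base version. #46023 #45282
- Communication APIs now support more data types such as Char/Byte/Bool, improving the completeness and usability of communication. #45574 #45440
- The asynchronous parameter of the communication APIs was changed from use_calc_stream to sync_op, which makes the interface semantics easier to read. #46493
Enhanced high-level APIs
- The ResNeXt vision model in the high-level API was refactored to reuse the ResNet code. #40588
- Improved the implementations of the Inceptionv3, MobileNetv1, MobileNetv2, and ShuffleNetv2 vision models in the high-level API. #40431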

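（2）New functions and important upgrades

The new dynamic graph architecture is officially launched: Scheduling performance is greatly improved over the original architecture. Scheduling performance improves by more than 50% for over 90% of APIs, and model performance improves by more than 5% for over 50% of the models in the PaddlePaddle suites. The new architecture is clear and loosely coupled, so the cost of learning and developing extension modules such as Hook and PyLayer on top of it drops significantly. #37550, #37574, #37813, #37926, #39192, #37599, #37406, #37466, #37599, #40945, #39989
Higher-order automatic differentiation: To better support scientific computing and similar scenarios, PaddlePaddle further improved its higher-order automatic differentiation capability. Trial features and APIs for forward and reverse higher-order automatic differentiation are now available under paddle.incubate.autograd (they are still incubating, so the features and API signatures may change). If you plan to implement related models or explore the automatic differentiation mechanism yourself, please read the usage notes and limitations of higher-order automatic differentiation carefully. The specific upgrades are:
1. Upgraded the static graph higher-order differentiation mechanism. Through a basic operator system and program transformation, it supports higher-order forward and reverse differentiation and is connected to the compiler and distributed features. #41919, #41201
2. Added the forward and reverse higher-order automatic differentiation APIs paddle.incubate.autograd.forward_grad and paddle.incubate.autograd.grad. #43354
3. Added 18 higher-order automatic differentiation operators: sin, cos, exp, erf, abs, log, cast, where, equal, not_equal, greater_than, greater_equal, elementwise_pow, square, elementwise_max, gelu, reduce_mean, size. #46184, #46024, #45888, #45338, #44345
4. Fixed defects in existing operators such as elementwise_div, reduce_sum, and p_norm. #46514, #46184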
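Generic heterogeneous parameter server architecture:
- Upgraded the parameter server GPUGraph infrastructure for large-scale production use: Storing and training large-scale graph neural networks on traditional CPUs is costly, unstable, and slow. To address this, a pure GPU graph training engine (PGLBox) was built. Using heterogeneous hierarchical storage across SSD, host memory, and GPU memory, it supports training of ultra-large graph models. At equal cost, training performance is more than 10x that of the CPU graph training engine, and the task failure rate drops to a very low level. #44594
- Large-scale federated parameter server architecture: For large-scale personalized recommendation, large-scale federated parameter server training was developed on top of the heterogeneous PS infrastructure, supporting horizontal and vertical federation with hundreds of billions of parameters. It has two features. First, users' private parameters are updated locally while public parameters are updated remotely, and users can flexibly configure how parameters are split into private and public. Second, a new central scheduling node, Coordinator, was added; users can extend its base class to customize the Client selection strategy. #42682, #44864, #44327
Adaptive parallelism
- Designed and released a complete automatic parallelism interface system that supports automatic dynamic-to-static distributed training, automatic distributed data loading, automatic distributed saving and loading, automatic parameter conversion, custom sharding annotations, and custom execution processes. Users only need to build a single-machine network to obtain automatic distributed training, with support for data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism. #45776, #46552, #44202, #45840, #45518, #40528, #42838, #43093, #43312, #45053.
- Improved the underlying adaptive parallelism mechanism, including an upgraded design and implementation of the distributed cost model to better evaluate sharding strategies, native distributed attributes added to the Program IR, and richer Cluster functionality. #40457, #42601, #42727, #42874, #43114, #44095, #44146, #44701, #44973, #45002, #45118, #45237, #42576, #41722, #44150, #44989, #44951, #44963.
- Added automatic tuning of Sharding stage 1/2/3 under data parallelism, which automatically selects the Sharding stage strategy with the highest throughput while satisfying the GPU memory constraint. #43782.
Training hardware access, plug-in solution: Added custom Runtime/Kernel/CCL/Graph/Pass solutions. Hardware vendors can choose which modules to implement based on their hardware characteristics.
ONNX format export
- Support exporting quantized models. Exported ONNX models loaded for inference with TensorRT or ONNXRuntime achieve an inference speedup of 1.5x to 4x. #856, #782
- Added export of large models greater than 2GB. #942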

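（3）Function optimization

Comprehensive improvement of dynamic-to-static analysis, conversion, and extension capabilities
- To improve the success rate and experience of dynamic-to-static conversion, the transcription logic for control flow syntax was restructured. The core syntax was upgraded to a JIT (just-in-time) paradigm so that the transcription is equivalent to the Python code, and support for break, return, continue, and similar syntax was completed. #43666, #43846, #43848, #43880, #43957, #43328, #43348, #43998, #44465, #44504, #43713, #43864, #43967, #44155, #44487, #44527, #45105, #45900
- To support flexible deployment of custom decoding in speech and similar scenarios, the jit.save/load interfaces were extended to support merged export of multiple user functions. A new JITLayer component supports function-style invocation, and together with the C++ API of the PHI operator library it enables custom inference deployment. #44283, #41783, #43607, #43754, #43758, #43798, #44010, #44351, #44465, #44504, #44597, #44738, #44984, #46249
- To unify the dynamic and static behavior of APIs, 20 operators were upgraded so that an Op's attribute information can be variable in static graphs. This keeps dynamic and static behavior consistent and raises the success rate of dynamic-to-static conversion. The operators are pad2d, depthwise_conv2d_transpose, conv2d_transpose, adaptive_avg_pool2d, reverse, bincount, multinomial, reduce_sum, reduce_mean, reduce_prod, reduce_min, reduce_max, uniform, squeeze, max_unpool2d, dropout, cumsum, eye, argmin, and argmax. #44737, #45084, #45189, #45391, #45417, #45427, #45514, #45525, #45543, #45660, #46352, #46433, #45078, #45342, #45372, #45453, #45522, #45620
- To fix the occasional loss of the error stack during dynamic-to-static conversion, the logic of the error reporting module was optimized, improving the readability of the error stack and the debugging experience. #44054, #44083, #44781, #44996
- To fully support Python Type Hint syntax, a TypeHint syntax recognition and transcription module was added. #47121
PHI operator library covers all computation operators: Work continued on the highly reusable PHI operator library. The remaining operators and kernels associated with PaddlePaddle 2.x computation Python APIs were migrated to PHI and rewritten in a functional style. About 180 forward and backward operator CPU & GPU kernels and 170 Kunlun-specific operator kernels were added, further enlarging the set of kernel functions that can be reused when adding new operators. In addition, more than 100 C++ computation APIs were added that can be used in custom operators, making external extension development on PaddlePaddle easier. #44577, #44631, #44434, #44605, #44676, #44742, #44436, #45887, #45851, #45623, #45397, #45863
Standardized operator definitions and much simpler models: Operator definitions inherited from PaddlePaddle 1.x carried many redundant parameters that were costly to understand and adapt. The redundant parameters of about 150 high-frequency operators were cleaned up, and nearly all parameters unrelated to the mathematics were removed. As a result, stored inference models contain noticeably less information, typically with about 40% of attribute variables trimmed, which makes operator definitions clearer and model analysis and debugging easier. Stored inference models are also much smaller, typically by more than 70%, which makes PaddlePaddle models considerably more lightweight. #44310, #45613, #45684, #45708, #45758, #45786, #45772, #45845, #45984, #46218, #46553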

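（4）Performance optimization

AMP performance and accuracy optimization
- Added FP16 data type support to more operators, including the elementwise family, the compare family, strided_slice, set_value, uniform_random, and others. (#45504 #44405 #45496 #46641 #46906)
- Optimized the FP16 kernel implementation of the hard_swish operator so that no accuracy is lost. (35386)
- Added BF16 data type support to more operators, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, where, and others. #46364, #47177
Automatic tuning of single-machine training performance
- The Transpose OP supports automatic kernel selection, which searches for the best-performing kernel implementation for each model configuration and improves model performance. #43310 (connects the Transpose Op to the auto-tuning feature)
- Automatic AMP layout switching supports the new dynamic graph mode. Models such as ResNet50, TSM, and DeepLabV3 gain 9% to 21% in performance through automatic layout adjustment under the new dynamic graph. (#45409, #45751, #45826, #46880)
General GPU single-machine training performance optimization
- Optimized the cache scheme for the cuDNN algorithms of Conv operators and cached the results for all algorithm acquisition methods, greatly reducing the operators' CPU overhead. (#41891 #47197)
- Further optimized the GPU kernels and Python-side performance of several operators, including dist, poisson, depthwise_conv2d, transpose, eigh, broadcast-type computation, reduce-type computation, layer_norm, and cross_entropy, achieving better performance in more configurations. (#44946, #45057, #45160, #42491, #42704, #42853, #46287, #46362, #46490, #46412, #46623, #40051)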
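Collective communication distributed training performance optimization
- To improve pipeline parallel scheduling efficiency, the dynamic graph supports the Interleaving 1F1B scheduling strategy, which improves performance on the GPT-3 model by 3% to 4%. #45797, #45869, #45922, #46209, #45402, #45444, #45497, #45797, #45869, #45922, #46209, #46399, #46483, #46876, #47242, #47249, #47497, #47517
- To improve distributed training performance of the MLPerf BERT model, the DistributedFusedLamb distributed optimizer supports hierarchical AllReduce, which improves MLPerf BERT performance by 17% on 1024 DCU cards. #44821, #44843
- To reduce GPU memory usage under Data Parallel, a lazy buffer initialization strategy is supported for Tensor Fusion, which saves an amount of GPU memory equal to the size of the model parameters. #45631.
- The Data Parallel and Sharding distributed strategies support BF16 training. #46846, #47246
- To support strategies such as Sequence Parallel, the distributed pipeline parallel strategy supports the enable_partial_send_recv option, which allows transmitting tensors that have been split by sequence parallelism. #46992, #47083
- To improve sharding stage 2 performance, the optimizer's parameter broadcast in sharding stage 2 is overlapped with the forward pass of the next step, and multiple CUDA streams are used for communication. Training performance of the GPT 6.7B model on 16 cards improves by 11%. #46495, #46656, #47061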

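（5）Bug fixes

Dynamic-to-static
- Fixed an error in dynamic-to-static conversion when a Parameter has no gradient during multi-card training. #44485
- Fixed extra framework logs being printed to the terminal by mistake during dynamic-to-static conversion. #45754, #46800
- Fixed an error in dynamic-to-static training when control flow in the model contains Tensors that do not require gradients. #43034
- Fixed incorrect values computed during gradient aggregation in dynamic-to-static training. #44893
- Fixed an error in dynamic-to-static conversion for functions decorated with @staticmethod. #44983, #45268, #45277
- Fixed excessive GPU memory usage in some scenarios during dynamic-to-static training of models that contain control flow. #45380
- Fixed a shape inference error at the network construction stage of dynamic-to-static conversion when the model contains complex control flow. #45916, #46020
Error reporting mechanism fixes
- Replaced self.assertTrue(np.allclose(...)) with np.testing.assert_allclose to obtain more informative error messages. (#44947, #44988, #45213)
Collective communication distributed training
- Fixed several bugs in communication library initialization and in the communication process, improving runtime stability. #44964 #45100 #44758
- Fixed pipeline parallelism hanging easily, improving the usability of the strategy #47201; enhanced the pipeline feature to support unbalanced inputs #47199
- Fixed MP/PP strategies performing worse under the new dynamic graph than under the old dynamic graph. #47071
- Fixed a bug where the sharding stage 2 strategy incorrectly maintained the trainable attribute of parameters. #47240
- Fixed bugs in a series of OPs when the tensor numel exceeds INT32_MAX. #45711, #45741, #45897, #46158, #46767, #47191, #46045, #46160
- Fixed excessive GPU memory usage of the FusedAttention and FusedFeedForward OPs. #47236, #47235
- Fixed incorrect parameter updates in the multi_tensor_adam and multi_tensor_momentum OPs when the parameters passed in are a list of dict. #47352, #47372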

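4. Deployment Direction (Paddle Inference)

（1）New features

Backend graph engine integration optimization
- To reduce Paddle-TensorRT plugin code development, and to reduce the number of Paddle-TensorRT subgraphs and thus resource usage, a generic plugin mechanism was developed. It automatically provides a unified TensorRT plugin interface for the rich set of PHI operators in the framework, and effectively reduces GPU memory usage in most scenarios. #46970, #46179, #46580
- To let users customize operators in the framework and still run efficient Paddle-TensorRT inference, the feature was upgraded to support custom Paddle-TensorRT plugins in the framework. #46970
Inference library build system optimization, with size trimmable on demand
- Pre-compiled installation packages support TensorRT by default: The pre-compiled installation package for training and the one for deployment (Paddle Inference) were unified into a single package, and the build system was optimized so that the pre-compiled package supports TensorRT by default, reducing the switching cost of using Paddle-TensorRT. #46008, #45824, #46058
- Size trimmable on demand: The library can be trimmed according to the operators a model uses. #47033, #47049, #47047
Inference supports native AMP
- To make full use of GPU Tensor Core computing power and improve inference performance, a model precision conversion tool was developed, and Inference on GPU natively supports inference of mixed precision models. See the documentation for usage. #43814, #43881, #44057, #44307, #44457, #44866, #45050, #45346, #45379, #45406, #45882
- To improve inference performance under mixed precision, FP16 kernels were added for high-frequency operators that previously lacked FP16 support. This reduces the chance of cast operators being inserted because of mismatched input precision and improves inference performance. #44642, #45061, #44653, #45504, #45061, #44969, #44558, #44710, #43871, #44792
Compression and inference engine integration upgrade
- Upgraded the storage format of quantized models. The new format supports three deployment methods, Paddle Inference, PaddleLite, and Paddle2ONNX, and supports chip types including X86 CPU, NVIDIA GPU, and Arm CPU. (#46305 #462832 #46022)
- Added INT8 full quantization compatible with SoC/NPU chips, so that the resulting INT8 quantized models achieve the best inference speedup and accuracy on SoC/NPU chips.
Inference engine and PaddlePaddle compiler (CINN) integration upgrade
- Upgraded the interface module between the PaddlePaddle framework and the compiler, so that inference models can be optimized by the compiler through Paddle Inference. (#44499 #44708)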

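（2）Low-level optimization

GPU performance optimization
- Added TensorRT mappings for the matmul_v2, LSTM, reshape, fill_constant, swish, multiclass_nms3, bilinear_interp_v2, split, silu, and shuffle_channel operators and improved dynamic shape support. Performance of several key model types improves by 7% to 90%. (#46177, #44678, #44314, #44561, #45166, #44411, #43424, #44516)
- Added a constant folding pass for inference performance optimization, improving the performance of models such as SwinTransformer, HifiGAN, and FastSpeech2. (#45494)
- Added a cache for the conv_fusion workspace size, improving conv_fusion computation performance. (#45902)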
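To make the constant folding item above concrete, here is a minimal sketch of the general technique on a toy expression tree. It is not Paddle Inference's pass: the tuple-based node format, the OPS table, and the fold_constants name are assumptions made purely for illustration.

```python
import operator

# Hypothetical toy IR: a node is a number, a variable name (str),
# or a tuple (op_name, lhs, rhs). This is not Paddle's graph format.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def fold_constants(node):
    """Recursively replace sub-expressions whose operands are all constants."""
    if not isinstance(node, tuple):
        return node  # a constant or a variable: nothing to fold
    op, lhs, rhs = node
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)  # both sides known ahead of time: precompute
    return (op, lhs, rhs)


expr = ("add", ("mul", 2, 3), ("mul", "x", ("sub", 10, 4)))
folded = fold_constants(expr)
print(folded)
```

Only the parts that depend on the runtime input "x" remain in the folded tree, which is why such a pass can shorten the graph that has to execute at inference time.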
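Vision ViT model optimization
- Added a fusion pass for the Attention structure of ViT models, with support for the OSS plugin and automatic padding. ViT inference is 30% to 40% faster. #45019 #45506
Large model inference performance optimization
- To speed up inference of very large generative models and save GPU memory, an INT8 implementation (fused_multi_transformer_int8_op) was added for the multi-layer Transformer fusion operator (fused_multi_transformer_op), supporting quantized inference of generative models. Performance is optimized by combining matrix multiplication algorithm selection with fusion of the quantize and dequantize kernels. #46169
- To make fused_multi_transformer fusion easier to use for large model inference, a pass was added that matches and fuses the structure automatically.
CPU performance optimization
- Optimized the speech U2++ model: FP32 model inference is 35% faster, and INT8 model inference is 69% faster. (#47592 #47127 #47391 #47234 #47009 #47080)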

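（3）Bug fixes

The TensorRT workspace size setting supports int64. (#44469)
Paddle-TRT fully supports Op inputs that are weights. (#45545)
Paddle-TRT supports conv2d_transpose/conv3d_transpose with the output_padding attribute. (#45004)
Paddle-TRT has stronger dynamic shape support for strided_slice. (#46819)
Paddle-TRT reduces the GPU memory used by the runtime context in multi-threaded scenarios. (#45468)
Paddle-TRT no longer regenerates serialization files repeatedly when several models run in the same process and their initialization order changes. (#43942)
Fixed an occasional crash when a Predictor is initialized and run multiple times in the same process. (#45203)
Fixed abnormal inference accuracy of quantized models such as MobileNetV3_large, ERNIE 3.0-Medium, and bert. (#45416 #46283 #45920 #47573)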

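5. Environment Adaptation

The pre-compiled installation package for training and the one for deployment (Paddle Inference) were unified into a single package, and the build system was optimized so that the pre-compiled package supports TensorRT by default.
The pre-compiled installation package for CUDA 10.1 is no longer provided.
Added pre-compiled installation packages for CUDA 11.7.
Shorter source compilation time: inter-module dependencies were reduced, parallelism was increased, and the compilation speed of some modules was optimized, which together cut full compilation time by about 20 minutes.
PaddlePaddle can run on Windows 11, CentOS 8, Ubuntu 22.04, and Jetson 5.02, and the PaddlePaddle Linux installation package can run on Windows through WSL 2.
Fixed runtime errors of PaddlePaddle in glibc 2.34+ environments.
Optimized the C++, Python, and CMake code style across the whole repository, and introduced or upgraded the following code style checking tools.
- pre-commit upgraded from 1.10.4 to 2.17.0: #43103
- pylint changed from the default version to the pinned version 2.12.0: #43103
- remove-crlf upgraded from 1.0.1 to 1.1.14: #43103
- cpplint changed from the default version to the pinned version 1.6.0: #43175, #43978, #43673, #43679, #43695, #43733, #43740
- clang-format upgraded from 3.8 to 13.0: #42840, #43248, #43329, #43333, #43633, #43678
- Introduced the black tool for Python code style checking: #46014
- Introduced the cmakelint tool, version 1.4.2, for checking cmake files: #43222, #43406, #43414, #43428
- Introduced cmake-format, version 0.6.13, for automatic formatting of cmake files: #43057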

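6. Hardware Adaptation

Hygon DCU

Added Profiler support on DCU, which collects, aggregates, and displays performance data of model execution on DCU and shows DCU utilization at the kernel level.

Kunlunxin

Added Profiler support on Kunlunxin 2nd-generation chips, which collects, aggregates, and displays performance data of model execution on these chips and shows chip utilization at the kernel level.
Training/inference support for Kunlunxin 2nd-generation chips (Kunlunxin AI accelerator cards R200, R300, R200-8F, R200-8FS, RG800). A total of 51 models have been verified, including PPYOLOE, PP-OCR, ERNIE 3.0, PP-TSM, PP-TTS, DLRM, and PPO. Static graph and dynamic graph training, mixed precision training, and single-machine single-card and single-machine multi-card training are supported, covering 5 fields: intelligent vision, natural language processing, intelligent speech, intelligent recommendation, and reinforcement learning.

Cambricon

Training/inference support for Cambricon MLU chips (MLU370 series cards). Several models have been verified, including ResNet50, BERT, YoloV3, OCR-DB, and Deeplabv3. Static graph and dynamic graph training, mixed precision training, and single-machine single-card and single-machine multi-card training are supported.

Graphcore

Training/inference support for Graphcore IPU chips (including IPU Mk2 GC200 and Bow IPU). Models such as ResNet50 and BERT are supported, as are static graph and dynamic-to-static training modes and single-chip, single-machine, and multi-machine distributed training.
Added support for more operators.
Upgraded to Poplar SDK v3.0.0. #46892

Support training models in dynamic-to-static mode; added a new paddle.incubate.identity_loss op to assist graph construction. #43770
Support the native Paddle distributed training API paddle.distributed.launch. #43311
Support mixed precision training. #41733
Paddle Inference supports custom operators through PopART. #45235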

Intel

Migrated oneDNN operators: transpose2_grad(#46139), relu6_grad(#46501), gaussian_random(#46747, #45481), sgd and stack(#46374), concat+grad, expand+grad, fill_constant(#45863), slice, slice_grad, split, pad and pad3d(#46101), softmax_grad(#46257), Shape(#46051), Sum(#46239), Transpose2_grad(#46139), Cast, clip+grad and pool+grad(#45775), Reduce sum+grad, mean+grad, min and max(#45536), Relu and abs(#45397), Gelu(#45596), Scale(#45537)
Optimized the kernels of several operators such as fill_constant, fc, and conv
Added several pass fusion optimizations
Optimized the Adam-W CPU FP32 optimizer (#42522)
Optimized the pad3d fp32 onednn operator kernel implementation (#43990)
Improved concurrent execution of the matmul, FC, and lookup_v2 kernels (#44023, #44078, #44640, #44744, #45249)
The FC onednn operator kernel supports bf16 (#42758, #43154, #43109)
Added fusion of matrix multiplication and activation functions (#43519, #43198)
Support IR passes that generate int8 parameters for convolution operators (#44680, #42625)
Added pool/avg quantization and scales correction (#44186)
Added fusion of the matmul and elementwise onednn operator kernels (#45077)
Fixed QAT accuracy issues (#43693, #45936, #46378)
Migrated 42 oneDNN operator kernels to the PHI operator library (#46374, #46101, #45989, #45863, #45775, #45626, #45536, #46501, #46257, #45596, #45537, #45481, #45397, #46239, #46139, #46051)
Quantized the elementwise_sub and shape operator kernels (#42854, #44124)

Thanks to our Contributors

This release contains contributions from:

0x45f, Aganlengzi, Ainavo, Allen Guo, Asthestarsfalll, Aurelius84, Baibaifan, baoachun, BiynXu, Bo Zhang, BrilliantYuKaimin, cambriconhsq, caozhou, carryyu, ccrrong, ceci3, chalsliu, Chang Xu, Charles-hit, Chen Long, Chen Weihang, chenjian, chentianyu03, Chenxiao Niu, cifar10, crystal, csy0225, danleifeng, David Nicolas, dc-cheny, denglin-github, dongfangshenzhu, duanboqiang, duanyanhui, engineer, enzodechine, Fan Zhang, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, FlyingQianMM, freeliuzc, furnace, fuyou765, fwenguang, Ghost Screaming, gongweibao, Guanghua Yu, guguguzi, Guoxia Wang, Haipeng Wang, handiz, Haohongxiang, haosicheng, helen88, heliqi, hong, HongyuJia, houj04, huangxu96, Hui Zhang, Huihuang Zheng, huzhiqiang, Jacek Czaja, Jack Zhou, jack603047588, Jackwaterveg, jakpiase, james, Jiabin Yang, jiangcheng, Jiaqi Liu, JingZhuangzhuang, joanna.wozna.intel, JYChen, JZ-LIANG, Kaipeng Deng, kangguangli, kuizhiqing, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, lidanqing, LielinJiang, Ligoml, Lijunhui, lilong12, limingshu, Lin Manhui, Linjie Chen, liqitong-a, littletomatodonkey, liu zhengxi, Liu-xiandong, liutiexing, Liyulingyue, LiYuRio, Lux et Veritas, lyq, Matsumoto Ruko, MayYouBeProsperous, mengqingchun02, Ming-Xu Huang, ming1753, minghaoBD, moyan, mrcangye, Netpunk, niuliling123, Nyakku Shigure, OccupyMars2025, onecatcn, pangyoki, parap1uie-s, peachlcy, piotrekobi, Qi Li, QingshuChen, qipengh, Rayman, Regan Yue, RichardWooSJTU, risemeup1, Roc, ronnywang, Rui Li, Ruibiao Chen, seemingwang, Shang Zhizhou, shangliang Xu, ShenLiang, shentanyue, Shijie, ShiningZhang, shixingbo, shiyutang, Shuangchi He, Siming Dai, Sing_chan, Skr Bang, SmirnovKol, sneaxiy, sprouteer, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao CHANG, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, tiancaishaonvjituizi, tianshuo78520a, Tomasz Socha, TTerror, USTCKAY, Vigi Zhang, Walter, Wang Bojun, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, Wangzheee, 
WangZhen, wangzhen38, wawltor, wbn, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, whs, Wilber, WJJ1995, wuhuachaocoding, wuhuanzhou, wuyefeilin, XiaoguangHu, xiaoguoguo626807, xiaohemaikoo, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiayanming, Xingyuan Zhang, xiongkun, yang131313, yangguohao, YangZhou, Yanxing Shi, Yao Zihang, yaoxuefeng, yaozhixin, yeliang2258, Yilingyelu, Yiqun Liu, ykkk2333, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, Yulong Ao, Yulv-git, YUNSHEN XIE, Zhang Jun, Zhang Ting, Zhang Zheng, zhangbo9674, zhangbopd, zhangchunle, Zhangjingyu06, zhangkaihuo, zhangxiaoci, zhangyikun02, zhangzhenguo, Zhanlue Yang, zhaocaibei123, zhaoying9105, zhaoyingli, Zhen Wang, Zhengyang Song, zhiboniu, Zhong Hui, Zhou Wei, zhoutianzi666, zhupengyang, ziyoujiyi, zlsh80826, zmxdream, zn, Zuza Gawrysiak, zyfncg, 傅剑寒, 六个骨头, 津, 熊峻峰, 王明冬, 石晓伟

2.4.0 Release Note

1. Important Updates

The new dynamic graph architecture officially takes effect: The new dynamic graph framework significantly improves scheduling performance. Scheduling performance improves by more than 50% for over 90% of APIs, and model performance improves by more than 5% for over 50% of the models in the PaddlePaddle suites. The functional architecture is clearer, and secondary development capability and experience are significantly enhanced.
Comprehensively improved dynamic-static unification in PaddlePaddle: The dynamic-to-static feature supports richer Python syntax, and Python syntax coverage reaches 90%. The syntax transcription logic has been heavily optimized to fully support control flow syntax, giving a smoother one-step conversion to static graphs. With the newly upgraded static graph executor, dynamic-to-static training accelerates better, and tests on key models show performance close to the best static graph level. Dynamic-to-static is also more extensible: it newly supports merged export and inference of multiple functions, and users can use the PHI operator library for secondary development and flexible deployment, which effectively supports custom decoding for the U2++ model in the speech domain.
New sparse computing APIs: Added 55 sparse APIs under paddle.sparse.* that cover mainstream sparse computing scenarios. They have been applied to sparse training and inference deployment for tasks such as 3D point cloud object detection and Sparse Transformers. In high-sparsity scenarios they are 105.75% faster than DenseTensor, and 4.01% to 58.55% faster than the sparse computation of comparable products. Computation on several sparse Tensor formats (SparseCoo, SparseCsr, and others) is supported, which greatly reduces GPU memory usage. The APIs are used in the same way as their dense Tensor counterparts, so the experience stays consistent.
Large-scale graph neural network GPU training engine: Heterogeneous hierarchical storage across SSD, host memory, and GPU memory breaks the GPU memory bottleneck and supports all-GPU storage and training of ultra-large graphs. Random walk, sampling, and training are integrated into an all-GPU solution, which trains more than 10x faster than a traditional distributed CPU solution at the same cost.
Environment adaptation: Added pre-compiled installation packages for CUDA 11.7, and added support for running on Ubuntu 22.04 and later.

Forward-looking notices

PaddlePaddle will drop support for Python 3.6 in version 2.5.
PaddlePaddle will gradually deprecate the Python-side APIs under the paddle.fluid namespace; some APIs under this namespace will be removed outright in version 2.5.

2. Incompatible Upgrades

The pre-compiled installation package for CUDA 10.1 is no longer provided.
Tensor.clear_gradient(bool set_to_zero) no longer accepts values passed through kwargs; the set_to_zero bool must be passed as a positional argument.
To use GPU memory more efficiently, dynamic graph mode now retains by default only the gradients of forward leaf-node variables, such as the gradients of network parameters during training, and no longer retains the gradients of non-leaf nodes by default. To keep the gradient of a specific Tensor, call Tensor.retain_grads() before running the backward pass.
paddle.autograd.PyLayer no longer supports tuple inputs. To pass a group of Tensors, pass a list of Tensor.

3. Training framework (including the distributed feature)

（1）New APIs and enhanced API functions

New sparse computing APIs: paddle.sparse
- Added 55 sparse APIs that cover mainstream sparse computing scenarios. They have been applied to sparse training and inference deployment for tasks such as 3D point cloud object detection and Sparse Transformers. In high-sparsity scenarios they are 105.75% faster than DenseTensor, and 4.01% to 58.55% faster than the sparse computation of comparable products. Computation on several sparse Tensor formats (SparseCoo, SparseCsr, and others) is supported, which greatly reduces GPU memory usage. The APIs are used in the same way as their dense Tensor counterparts, so the experience stays consistent. #45849, #46694, #45086, #41857, #42935, #43475, #43668, #43966, #44022, #44346, #44432, #44451, #44743, #42013, #43520, #41434, #42130, #41276, #41857, #41356
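As background for the SparseCoo and SparseCsr formats mentioned above, the following plain-Python sketch shows what the two layouts store for the same matrix. It is an illustration of the data layout only, not Paddle's implementation; the helper names dense_to_coo and coo_to_csr are hypothetical. In Paddle itself such tensors would typically be built through the paddle.sparse creation functions, whose exact signatures should be checked in the API documentation.

```python
# Illustrative only: plain-Python COO and CSR layouts, not Paddle code.

def dense_to_coo(dense):
    """Return (rows, cols, values) for the non-zero entries of a 2-D list."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                values.append(v)
    return rows, cols, values


def coo_to_csr(rows, cols, values, num_rows):
    """Convert row-sorted COO triples to CSR (crows, cols, values)."""
    crows = [0] * (num_rows + 1)
    for r in rows:
        crows[r + 1] += 1             # count the non-zeros in each row
    for i in range(num_rows):
        crows[i + 1] += crows[i]      # prefix sum gives each row's start offset
    return crows, list(cols), list(values)


dense = [
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 3],
]
rows, cols, values = dense_to_coo(dense)
crows, csr_cols, csr_values = coo_to_csr(rows, cols, values, num_rows=len(dense))
print("COO:", rows, cols, values)
print("CSR:", crows, csr_cols, csr_values)
```

Only three of the twelve entries are stored in either layout, which is where the memory saving in high-sparsity scenarios comes from.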
New audio APIs: paddle.audio
- Added feature extraction APIs such as MFCC, Spectrogram, and LogMelSpectrogram with GPU support. Processing is more than 15x faster than the CPU implementation, which can substantially raise GPU utilization during speech model training. #45424
- Added basic feature extraction APIs such as window functions and the discrete cosine transform, so that users can build custom speech feature extraction. #45424
- Added a speech I/O module that provides 2 audio I/O backends and supports 6 codecs for convenient loading of speech data. #45939
- Added the TESS and ESC50 speech classification datasets, so that users can easily build classic speech classification models. #45939
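The window function and discrete cosine transform building blocks listed above can be sketched in plain Python as follows. This shows the underlying math only and does not call paddle.audio; the function names are hypothetical, the Hann window is the symmetric variant (a library may default to the periodic one), and the DCT-II is left unnormalized.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dct_ii(signal):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    total = len(signal)
    return [
        sum(x * math.cos(math.pi / total * (n + 0.5) * k) for n, x in enumerate(signal))
        for k in range(total)
    ]


window = hann_window(8)
frame = [1.0] * 8
windowed = [w * x for w, x in zip(window, frame)]  # taper the frame edges
coeffs = dct_ii(frame)                             # a constant frame has only a DC term
print(windowed)
print(coeffs)
```

In an MFCC-style pipeline, a window like this is applied to each frame before the spectrum is computed, and a DCT is applied to the log mel energies at the end.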
New graph learning APIs: paddle.geometric
- Graph learning is becoming a key technology in machine learning. The new paddle.geometric module provides a better modeling and training experience for graph learning.
  - Message passing: Message passing is the foundation of graph modeling, so 7 graph learning message passing APIs were added to make graph modeling more convenient. Among them, 3 new fused message passing operators substantially reduce GPU memory usage during graph model training. In dense graph scenarios, GCN-family models save more than 50% of GPU memory and train more than 20% faster. #44848, #44580, #43174, #44970
  - Graph sampling: Graph sampling is the performance bottleneck of graph model training. A new high-performance graph sampling operator supports highly concurrent sampling. GraphSage sampling becomes more than 32x faster, and model training more than 12x faster. #44970
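The message passing pattern described above (gather features from source nodes along each edge, then reduce them at the destination nodes) can be sketched in plain Python. This is a conceptual illustration, not the fused operator from paddle.geometric; the function name send_u_recv_sketch and the list-based data format are assumptions.

```python
def send_u_recv_sketch(x, src_index, dst_index, reduce_op="sum"):
    """Gather x[src] along each edge and reduce the messages at dst."""
    if reduce_op not in ("sum", "mean"):
        raise ValueError("this sketch only handles 'sum' and 'mean'")
    num_nodes, dim = len(x), len(x[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for src, dst in zip(src_index, dst_index):
        for d in range(dim):
            out[dst][d] += x[src][d]   # accumulate the message from src at dst
        counts[dst] += 1
    if reduce_op == "mean":
        for node, count in enumerate(counts):
            if count > 0:
                out[node] = [v / count for v in out[node]]
    return out


# Three nodes with 3-dimensional features and four directed edges src -> dst.
x = [[0.0, 2.0, 3.0], [1.0, 4.0, 5.0], [2.0, 6.0, 7.0]]
src_index = [0, 1, 2, 0]
dst_index = [1, 2, 1, 0]
out = send_u_recv_sketch(x, src_index, dst_index)
print(out)
```

A naive implementation materializes one message per edge before reducing; fusing the gather and the reduce, as in the loop above, avoids that per-edge buffer, which is the kind of saving the fused operators target.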
New vision APIs
- paddle.vision adds object detection operators: paddle.vision.distribute_fpn_proposals(#43736), paddle.vision.generate_proposals(#43611), paddle.vision.matrix_nms(#44357), paddle.vision.prior_box and paddle.vision.box_coder(#47282).
Other new APIs
- Added iinfo(#45321), count_nonzero(#44169), nanmedian(#42385), remainder_ (#45266), take(#44741), triu_indices(#45168), sgn(#44568), bucketize(#44195), nanquantile(#41343), frac(#41226), logcumsumexp(#42267), pairwise_distance(#44161), heaviside(#41872), logspace(#41261), corrcoef(#40690)
- Added RReLU(#41823), CyclicLR(#40698), OneCycleLR(#41825), Softmax2D(#40910), SoftMarginLoss(#42364), MultiLabelSoftMarginLoss(#41183), TripletMarginLoss(#40487), TripletMarginWithDistanceLoss(#40545), CosineEmbeddingLoss and cosine_embedding_loss(#41680), PixelUnshuffle(#40728), ChannelShuffle(#40743)
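For readers unfamiliar with some of the newly added math functions, the scalar sketch below illustrates the usual definitions of bucketize, frac, and heaviside. It is plain Python written for illustration and assumes the Paddle versions follow these common conventions; dtype handling, broadcasting, and edge cases should be checked against the API documentation.

```python
import bisect
import math


def bucketize(values, sorted_sequence, right=False):
    """Index of the bucket each value falls into, given sorted boundaries."""
    find = bisect.bisect_right if right else bisect.bisect_left
    return [find(sorted_sequence, v) for v in values]


def frac(x):
    """Fractional part that keeps the sign of x: x - trunc(x)."""
    return x - math.trunc(x)


def heaviside(x, y):
    """Step function: 0 for x < 0, 1 for x > 0, and y at x == 0."""
    if x < 0:
        return 0.0
    if x > 0:
        return 1.0
    return y


print(bucketize([1, 5, 9], [2, 4, 8]))
print(frac(-1.5), frac(2.25))
print(heaviside(-3.0, 0.5), heaviside(0.0, 0.5), heaviside(2.0, 0.5))
```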
Enhanced API functions
- Added support for large batch_size computation in BatchNorm1D. #43072
Improved collective communication distributed training APIs
- Improved the fleet.init function by adding a log_level parameter, so that users can view logs while a job is running. #45909
- Added the paddle.distributed.fleet.recompute_sequential and paddle.distributed.fleet.recompute_hybrid interfaces to make the recompute feature easier to use. #45348
- Added the paddle.distributed.fleet.layers.mpu package to make tensor parallelism easier to use. #45803
- Added the communication APIs paddle.distributed.destroy_process_group, paddle.distributed.isend, paddle.distributed.irecv, and paddle.distributed.all_to_all_single, improving the completeness and usability of communication. #43918
- Added the paddle.distributed.stream communication package, which performs 5% to 10% better than the base version. #46023 #45282
- Communication APIs now support more data types such as Char/Byte/Bool, improving the completeness and usability of communication. #45574 #45440
- The asynchronous parameter of the communication APIs was changed from use_calc_stream to sync_op, which makes the interface semantics easier to read. #46493
Enhanced high-level APIs
- The ResNeXt vision model in the high-level API was refactored to reuse the ResNet code. #40588
- Improved the implementations of the Inceptionv3, MobileNetv1, MobileNetv2, and ShuffleNetv2 vision models in the high-level API. #40431

（2）New functions and important upgrades

The new dynamic graph architecture is officially launched: Scheduling performance is greatly improved over the original architecture. Scheduling performance improves by more than 50% for over 90% of APIs, and model performance improves by more than 5% for over 50% of the models in the PaddlePaddle suites. The new architecture is clear and loosely coupled, so the cost of learning and developing extension modules such as Hook and PyLayer on top of it drops significantly. #37550, #37574, #37813, #37926, #39192, #37599, #37406, #37466, #37599, #40945, #39989
Higher-order automatic differentiation: To better support scientific computing and similar scenarios, PaddlePaddle further improved its higher-order automatic differentiation capability. Trial features and APIs for forward and reverse higher-order automatic differentiation are now available under paddle.incubate.autograd (they are still incubating, so the features and API signatures may change). If you plan to implement related models or explore the automatic differentiation mechanism yourself, please read the usage notes and limitations of higher-order automatic differentiation carefully. The specific upgrades are:
1. Upgraded the static graph higher-order differentiation mechanism. Through a basic operator system and program transformation, it supports higher-order forward and reverse differentiation and is connected to the compiler and distributed features. #41919, #41201
2. Added the forward and reverse higher-order automatic differentiation APIs paddle.incubate.autograd.forward_grad and paddle.incubate.autograd.grad. #43354
3. Added 18 higher-order automatic differentiation operators: sin, cos, exp, erf, abs, log, cast, where, equal, not_equal, greater_than, greater_equal, elementwise_pow, square, elementwise_max, gelu, reduce_mean, size. #46184, #46024, #45888, #45338, #44345
4. Fixed defects in existing operators such as elementwise_div, reduce_sum, and p_norm. #46514, #46184
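To illustrate what forward-mode differentiation means in the list above, here is a minimal dual-number sketch in plain Python. It is a first-order, scalar toy written for illustration and is not how paddle.incubate.autograd is implemented; the Dual class and the forward_derivative helper are hypothetical names.

```python
import math


class Dual:
    """A value paired with its tangent (derivative) for forward-mode autodiff."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)


def exp(x):
    e = math.exp(x.value)
    return Dual(e, e * x.tangent)


def forward_derivative(f, x):
    """Seed the input tangent with 1.0 and read the derivative off the output."""
    return f(Dual(x, 1.0)).tangent


def f(x):
    return sin(x) * exp(x) + 3 * x


print(forward_derivative(f, 0.5))
```

Each primitive propagates a tangent alongside its value, so the derivative is obtained in the same pass as the function value. Reverse mode instead records the computation and propagates gradients backwards from the output.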
Generic heterogeneous parameter server architecture:
- Upgraded the parameter server GPUGraph infrastructure for large-scale production use: Storing and training large-scale graph neural networks on traditional CPUs is costly, unstable, and slow. To address this, a pure GPU graph training engine (PGLBox) was built. Using heterogeneous hierarchical storage across SSD, host memory, and GPU memory, it supports training of ultra-large graph models. At equal cost, training performance is more than 10x that of the CPU graph training engine, and the task failure rate drops to a very low level. #44594
- Large-scale federated parameter server architecture: For large-scale personalized recommendation, large-scale federated parameter server training was developed on top of the heterogeneous PS infrastructure, supporting horizontal and vertical federation with hundreds of billions of parameters. It has two features. First, users' private parameters are updated locally while public parameters are updated remotely, and users can flexibly configure how parameters are split into private and public. Second, a new central scheduling node, Coordinator, was added; users can extend its base class to customize the Client selection strategy. #42682, #44864, #44327
Adaptive parallelism
- Designed and released a complete automatic parallelism interface system that supports automatic dynamic-to-static distributed training, automatic distributed data loading, automatic distributed saving and loading, automatic parameter conversion, custom sharding annotations, and custom execution processes. Users only need to build a single-machine network to obtain automatic distributed training, with support for data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism. #45776, #46552, #44202, #45840, #45518, #40528, #42838, #43093, #43312, #45053.
- Improved the underlying adaptive parallelism mechanism, including an upgraded design and implementation of the distributed cost model to better evaluate sharding strategies, native distributed attributes added to the Program IR, and richer Cluster functionality. #40457, #42601, #42727, #42874, #43114, #44095, #44146, #44701, #44973, #45002, #45118, #45237, #42576, #41722, #44150, #44989, #44951, #44963.
- Added automatic tuning of Sharding stage 1/2/3 under data parallelism, which automatically selects the Sharding stage strategy with the highest throughput while satisfying the GPU memory constraint. #43782.
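The selection rule described in the last item (highest throughput among the stages that fit in GPU memory) can be written down as a few lines of Python. This is only a sketch of the decision rule, not Paddle's tuner: the function name, the dictionary fields, and all numbers below are made up for illustration, and a real tuner would obtain memory and throughput figures from profiling or a cost model.

```python
def pick_sharding_stage(candidates, memory_limit_gb):
    """Return the candidate with the highest throughput that fits in memory."""
    feasible = [c for c in candidates if c["peak_memory_gb"] <= memory_limit_gb]
    if not feasible:
        raise ValueError("no sharding stage fits the memory limit")
    return max(feasible, key=lambda c: c["throughput"])


# Hypothetical measurements: higher stages shard more state, so they use less
# memory per card but pay more communication and run slower.
candidates = [
    {"stage": 1, "peak_memory_gb": 38.0, "throughput": 1200.0},
    {"stage": 2, "peak_memory_gb": 30.0, "throughput": 1100.0},
    {"stage": 3, "peak_memory_gb": 22.0, "throughput": 900.0},
]
best = pick_sharding_stage(candidates, memory_limit_gb=32.0)
print(best["stage"])
```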
Training hardware access, plug-in solution: Added custom Runtime/Kernel/CCL/Graph/Pass solutions. Hardware vendors can choose which modules to implement based on their hardware characteristics.
ONNX format export
- Support exporting quantized models. Exported ONNX models loaded for inference with TensorRT or ONNXRuntime achieve an inference speedup of 1.5x to 4x. #856, #782
- Added export of large models greater than 2GB. #942

（3）Function optimization

Comprehensive improvement of dynamic-to-static analysis, conversion, and extension capabilities
- To improve the success rate and experience of dynamic-to-static model conversion, the transcription logic for control flow syntax is reconstructed. The core syntax has been upgraded to the JIT (just-in-time) paradigm to achieve transcription equivalent to the Python code. Support for syntax such as break, return, and continue is improved. #43666 , #43846 , #43848 , #43880 , #43957 , #43328 , #43348 , #43998 , #44465 , #44504 , #43713 , #43864 , #43967 , #44155 , #44487 , #44527 , #45105 , #45900
- To support flexible deployment scenarios for custom speech decoding, the jit.save/load interface is extended to support merging and exporting multiple user functions. A new JITLayer component is added to support invoking class functions. Custom inference deployment is also implemented with the PHI operator library C++ API. #44283, #41783, #43607, #43754, #43758, #43798, #44010, #44351, #44465, #44504, #44597, #44738, #44984, #46249
- To unify dynamic and static API behaviors, 20 operators are upgraded so that their Op attributes can be variable in static graphs. This ensures consistent dynamic and static behaviors and improves the success rate of dynamic-to-static model conversion. The operators are pad2d, depthwise_conv2d_transpose, conv2d_transpose, adaptive_avg_pool2d, reverse, bincount, multinomial, reduce_sum, reduce_mean, reduce_prod, reduce_min, reduce_max, uniform, squeeze, max_unpool2d, dropout, cumsum, eye, argmin, and argmax. #44737, #45084, #45189, #45391, #45417, #45427, #45514, #45525, #45543, #45660, #46352, #46433, #45078, #45342, #45372, #45453, #45522, #45620
- To fix the occasional loss of the error stack in dynamic-to-static conversion, the logic of the error reporting module is optimized, improving the readability of the error stack and the debugging experience. #44054, #44083, #44781, #44996
- Add the TypeHint syntax recognition and transcription module to fully support Python Type Hint syntax. #47121
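To make the control flow transcription above more concrete: a static graph cannot jump out of a loop, so a `break` has to be expressed through the loop condition. The plain-Python sketch below shows one common way to do that with a boolean flag. It illustrates the idea only and is not the code Paddle generates; both function names are hypothetical.

```python
def first_index_over(values, threshold):
    """Original form: leaves the loop early with `break`."""
    i = 0
    while i < len(values):
        if values[i] > threshold:
            break
        i += 1
    return i


def first_index_over_flag(values, threshold):
    """Equivalent form without `break`: a flag is folded into the loop condition."""
    i = 0
    stop = False
    while i < len(values) and not stop:
        if values[i] > threshold:
            stop = True  # replaces `break`
        else:
            i += 1       # only advance when the loop is not stopping
    return i


print(first_index_over([1, 5, 2], 3), first_index_over_flag([1, 5, 2], 3))
```

Both functions return the index of the first element above the threshold, or `len(values)` when there is none, so the rewritten loop is behaviorally equivalent to the original.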
PHI operator library covers all arithmetic operators：Continue building the highly reusable operator library PHI. The remaining operators and related kernels associated with PaddlePaddle 2.x arithmetic Python APIs are migrated to the PHI operator library and rewritten in functional form. About 180 forward/backward operator CPU and GPU kernels and 170 Kunlun-specific arithmetic kernels are added, further enlarging the set of kernel functions that can be reused when new operators are added. In addition, more than 100 C++ arithmetic APIs are added. These APIs can be used in custom operators, making external extension development based on PaddlePaddle easier. #44577, #44631, #44434, #44605, #44676, #44742, #44436 , #45887, #45851, #45623, #45397, #45863
Standardized operator definitions significantly simplify models：Historical PaddlePaddle 1.x operator definitions contain many redundant parameters, which are costly to understand and adapt. The redundant parameters of about 150 high-frequency operators are cleaned up centrally, and parameters irrelevant to the mathematical definition are largely removed. As a result, the amount of information stored in PaddlePaddle inference models is significantly reduced: about 40% of attribute variables are removed overall, which makes operator definitions clearer and improves the experience of model analysis and debugging. The size of stored inference models is also reduced by more than 70%, making PaddlePaddle models significantly more lightweight. #44310 , #45613 , #45684 , #45708 , #45758 , #45786 , #45772 , #45845 , #45984 , #46218 , #46553

（4）Performance optimization

AMP performance and accuracy optimization
- Add FP16 data type support to more operators, including elementwise series operators, compare series operators, strided_slice, set_value, uniform_random, etc.（#45504 #44405 #45496 #46641, #46906 ）
- Optimize the FP16 kernel implementation of the hard_swish operator to avoid accuracy loss. （ 35386 ）
- Add BF16 data type support to more operators, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, where, etc. #46364, #47177
AutoTuning of single machine training performance
- The Transpose OP supports an automatic kernel selection mechanism, which searches for the best kernel implementation for each model configuration and improves model performance. #43310 (Transpose Op access AutoTuning function)
- AMP Layout auto-switching supports the new dynamic graph mode. For the ResNet50, TSM, and DeepLabV3 models, Layout AutoTuning improves performance by 9%-21% in the new dynamic graph. (#45409, #45751, #45826, #46880)
Generic performance optimization of GPU single machine training
- Optimize the cache scheme for the Conv operator's cuDNN algorithms and cache the results of every algorithm acquisition method, which significantly reduces the operator's CPU overhead.（#41891 #47197 ）
- Further optimize the GPU Kernel and Python side performance of multiple operators, including dist, poisson, depthwise_conv2d, transpose, eigh, broadcast computation, reduce computation, layer_norm, cross_entropy, etc. This can achieve better performance in more configuration scenarios. （#44946, #45057, #45160, #42491, #42704, #42853, #46287, #46362, #46490, #46412, #46623, #40051 ）
Performance optimization of distributed training for collective communications
- To improve pipeline parallel scheduling efficiency, support the Interleaving 1F1B scheduling policy in dynamic graphs. In the GPT-3 model, performance is improved by 3%-4%. #45797 , #45869 , #45922 , #46209 , #45402 , #45444 , #45497 , #45797 , #45869 , #45922, #46209, #46399 , #46483 , #46876 , #47242 , #47249 , #47497 , #47517
- To improve the distributed training performance of the MLPerf BERT model, the DistributedFusedLamb distributed optimizer supports hierarchical AllReduce. It improves MLPerf BERT performance by 17% on 1024 DCU cards. #44821 , #44843
- To reduce the GPU memory footprint when using DataParallel, a lazy initialization policy is supported for the Tensor Fusion buffer, reducing the GPU memory footprint by an amount equal to the size of the model parameters. #45631.
- Distributed parallel policies DataParallel and Sharding support BF16 training. #46846 , #47246
- To support the Sequence Parallel policy, distributed pipeline parallel supports the enable_partial_send_recv policy and can transmit tensors that have been sliced by sequence parallel. #46992 , #47083
- To improve the performance of the sharding stage 2 policy, overlap the sharding stage 2 optimizer's parameter broadcast with the next step's forward computation, and use multiple CUDA streams for communication. In the GPT 6.7B model, 16-card training performance is improved by 11%. #46495 , #46656 , #47061
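For readers unfamiliar with the hierarchical AllReduce mentioned in the list above, the sketch below simulates one common formulation on plain Python lists: reduce inside each node, all-reduce across nodes, then broadcast inside each node. The release note does not describe the exact scheme DistributedFusedLamb uses, so treat this as an assumption; `hierarchical_allreduce` is a hypothetical name and no real communication takes place.

```python
from typing import List


def hierarchical_allreduce(values: List[List[float]]) -> List[List[float]]:
    """values[node][local_rank] holds each rank's scalar contribution."""
    # Step 1: reduce within each node to one partial sum (fast intra-node links).
    node_sums = [sum(node) for node in values]
    # Step 2: all-reduce across node leaders only (slower inter-node links).
    total = sum(node_sums)
    # Step 3: each leader broadcasts the result to the ranks in its node.
    return [[total for _ in node] for node in values]


# Two nodes with two ranks each.
print(hierarchical_allreduce([[1.0, 2.0], [3.0, 4.0]]))
# [[10.0, 10.0], [10.0, 10.0]]
```

Every rank ends up with the same global sum as in a flat AllReduce, but only one value per node crosses the inter-node network in step 2.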

（5）Bug fix

Dynamic-to-static
- Fix an error in dynamic-to-static conversion during multi-card training when the model has Parameters that require no gradient. #44485
- Fix the bug where redundant framework logs are mistakenly printed to the terminal during dynamic-to-static conversion. #45754, #46800
- Fix an error in dynamic-to-static training when the control flow in the model contains a Tensor that does not require a gradient. #43034
- Fix incorrect values computed during gradient aggregation in dynamic-to-static training. #44893
- Fix an error in dynamic-to-static conversion when the function is decorated with @staticmethod. #44983, #45268, #45277
- Fix excessive GPU memory usage in some dynamic-to-static training scenarios. #45380
- Fix a shape inference error raised in the network construction phase of dynamic-to-static conversion when the model contains complex control flow. #45916, #46020
Fix the error report mechanism
- Replace self.assertTrue(np.allclose(...)) with np.testing.assert_allclose to get fuller error reporting information ( #44947, #44988, #45213)
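The difference between the two assertion styles is easy to see in isolation. The snippet below uses made-up arrays and only NumPy calls; it captures the failure message instead of letting the test abort so that both behaviors can be compared side by side.

```python
import numpy as np

expected = np.array([1.0, 2.0, 3.0])
actual = np.array([1.0, 2.0, 3.5])

# Old style: assertTrue(np.allclose(...)) only ever sees a bare boolean,
# so a failure reports nothing about which values differ.
old_style_passed = bool(np.allclose(actual, expected))

# New style: the AssertionError message describes the mismatch.
try:
    np.testing.assert_allclose(actual, expected, rtol=1e-5, atol=1e-8)
    message = ""
except AssertionError as err:
    message = str(err)

print(old_style_passed)  # False
print(message)           # mismatch report including the compared arrays
```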
Distributed training in collective communications
- Fix several bugs in communication library initialization and communication process, and enhance the system operation stability. #44964 #45100 #44758
- Fix frequent hangs in pipeline parallel and make the policy easier to use #47201; enhance the pipeline function to support unbalanced input. #47199
- Fix the bug that the performance of the new dynamic graph MP/PP policy is lower than in the old dynamic graph. #47071
- Fix the bug that the sharding stage 2 policy incorrectly maintains the parameter trainable property. #47240
- Fix bugs in a series of OPs when the Tensor numel is greater than INT32_MAX. #45711, #45741, #45897, #46158, #46767, #47191, #46045, #46160
- Fix excessive GPU memory usage in the FusedAttention and FusedFeedForward OPs. #47236, #47235
- Fix incorrect parameter updates in the multi_tensor_adam and multi_tensor_momentum OPs when the parameters passed in are a list of dict. #47352, #47372
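As background for the INT32_MAX fix listed above: one typical way such bugs arise is that an element count or index is stored in a signed 32-bit integer and wraps around. The release note does not state the root cause of each individual fix, so the sketch below only demonstrates the arithmetic; `numel` and `wrap_to_int32` are hypothetical helpers.

```python
import math

INT32_MAX = 2**31 - 1


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return math.prod(shape)


def wrap_to_int32(value):
    """Reinterpret an integer as a two's-complement signed 32-bit value."""
    return (value + 2**31) % 2**32 - 2**31


shape = (65536, 32768)   # 2**16 * 2**15 = 2**31 elements
n = numel(shape)
print(n > INT32_MAX)      # True
print(wrap_to_int32(n))   # -2147483648
```

A kernel that indexes with 32-bit integers would see a negative element count here, which is why such code paths need 64-bit indices.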

4. Deployment direction (Paddle Inference)

（1）New features

Optimize the back-end graph engine integration scheme
- To reduce Paddle-TensorRT plugin development work and the number of Paddle-TensorRT subgraphs, and thereby reduce resource usage, a generic plugin mechanism has been developed that automatically provides a unified TensorRT plugin interface for the many Phi operators in the framework. As a result, GPU memory usage is effectively reduced in most scenarios. #46970, #46179, #46580
- To make it easier for users to customize operators in the framework and run them efficiently with Paddle-TensorRT, custom Paddle-TensorRT plugins are now supported. #46970
Optimize the Inference library build system. The size can be pruned on demand
- Pre-compiled installer supports TensorRT by default: The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default, reducing the switching cost for users of Paddle-TensorRT. #46008, #45824, #46058
- The size can be pruned on demand: the library is pruned according to the operators used by the model. #47033 , #47049 , #47047
Inference supports native AMP
- To make full use of GPU Tensor Core computation capability and improve model inference performance, a model precision conversion tool has been developed, and Inference on GPU natively supports mixed precision models. For usage, refer to the documentation. #43814, #43881, #44057, #44307, #44457, #44866, #45050, #45346, #45379, #45406, #45882
- To improve the inference performance of mixed precision models, FP16 kernels are added for high-frequency operators that previously did not support FP16 computation. This reduces the chance that cast operators are inserted because of input precision mismatch, and improves inference performance. #44642, #45061, #44653, #45504, #45061, #44969, #44558, #44710, #43871, #44792
Upgrade the compression and inference engine
- Upgrade the quantization model storage format. The new format supports three deployment methods: Paddle Inference, Paddle Lite, and Paddle2ONNX. The supported chips include X86 CPU, NVIDIA GPU, and Arm CPU. （#46305, #462832, #46022 ）
- Add the INT8 full quantization function compatible with SoC/NPU chips. This can ensure the output INT8 quantization model has the best inference acceleration and precision on SoC/NPU chips.
- Upgrade the interface module between the PaddlePaddle framework and the compiler, so that inference models can use the compiler for optimization via Paddle Inference. （#44499 #44708 ）
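As background for the INT8 quantization items above, the sketch below shows symmetric per-tensor INT8 quantization, one common scheme. The release note does not say which scheme Paddle's full quantization uses, so this is an assumption for illustration; `quantize_int8` and `dequantize_int8` are hypothetical helpers written in plain Python.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the quantized integers."""
    return [x * scale for x in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored value differs from the original by at most scale / 2.
print(q, scale, restored)
```

The rounding step is where accuracy is lost, which is why quantized models are checked for both inference speed and precision.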

（2）Underlying optimization

GPU performance optimization
- Add TensorRT mappings for operators such as matmul_v2, LSTM, reshape, fill_constant, swish, multiclass_nms3, bilinear_interp_v2, split, silu, and shuffle_channel, and improve dynamic shape support. Performance improves by 7% to 90% on several categories of key models. (#46177, #44678, #44314, #44561, #45166, #44411, #43424, #44516)
- Add a constant folding PASS for inference performance optimization, which improves the performance of SwinTransformer, HifiGAN, FastSpeech2, and other models.（#45494)
- Add a cache for the conv_fusion workspace size to improve the computation performance of conv_fusion. (#45902)
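Constant folding, as used by the PASS mentioned above, evaluates subgraphs whose inputs are all known ahead of time so that they are not recomputed on every inference call. The toy sketch below applies the idea to a small expression tree. It is not Paddle's pass: the tuple-based graph format and `fold_constants` are hypothetical.

```python
# An expression is a number (constant), a string (runtime input),
# or a tuple (op_name, left, right).
_OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def fold_constants(expr):
    """Recursively replace operator nodes whose inputs are all constants."""
    if not isinstance(expr, tuple):
        return expr  # constant or runtime input: nothing to fold
    op, lhs, rhs = expr
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return _OPS[op](lhs, rhs)  # evaluated once, before inference
    return (op, lhs, rhs)


graph = ("add", "x", ("mul", 2, ("add", 3, 4)))
print(fold_constants(graph))  # ('add', 'x', 14)
```

Only the node that depends on the runtime input `x` survives; the constant subtree collapses to a single value.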
Vision ViT model optimization
- Add the ViT model Attention structure fusion PASS, with support for OSS Plugin and auto padding. ViT inference speed increases by 30%-40%. #45019 #45506
Inference performance optimization of large model
- To improve the inference speed of very large generative models and save GPU memory, add an INT8 implementation (fused_multi_transformer_int8_op) of the multi-layer Transformer fusion operator (fused_multi_transformer_op), which supports quantized inference of generative models. Performance is optimized by combining matrix multiplication algorithm selection with quantize/dequantize kernel fusion. #46169
- Add a Pass for automatic matching and fusion to make fused_multi_transformer fusion easier to use for large model inference.
CPU performance optimization
- Optimize the speech U2++ model. The FP32 model inference speed is improved by 35%. The INT8 model inference speed is improved by 69%. (#47592, #47127, #47391, #47234, #47009, #47080)

（3）Bug fix

TensorRT workspace size supports int64. （#44469 ）
In Paddle-TRT, fully support Op's input as weight.（#45545 ）
In Paddle-TRT, support conv2d_transpose/conv3d_transpose to have the output_padding attribute.（#45004 ）
In Paddle-TRT, enhance the strided_slice support for dynamic shape. （#46819 ）
In Paddle-TRT, optimize the GPU memory footprint of context when running in multi-thread scenarios.（#45468 ）
In Paddle-TRT, fix the bug of repeatedly generating serialization files when the initialization order changes while multiple models run in the same process.（#43942 ）
Fix the bug of occasional crashes when Predictor is initialized and run multiple times in the same process.（#45203 ）
Fix the bug of abnormal inference accuracy of quantization models such as MobileNetV3_large, ERNIE 3.0-Medium and BERT (#45416, #46283, #45920 #47573)

5. Environment adaptation

The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default.
The pre-compiled installer for CUDA 10.1 is discontinued.
Add the pre-compiled installer for CUDA 11.7.
Reduce source code compilation time: reduce inter-module dependencies, improve parallelism, and optimize the compilation speed of some modules. The full compilation time is reduced by about 20 minutes in total.
Support running PaddlePaddle on Windows 11, CentOS 8, Ubuntu 22.04, and Jetson 5.02 system environments. Support running the PaddlePaddle Linux installer on Windows by using the WSL 2 tool.
Fix a runtime error of PaddlePaddle in glibc 2.34+ environments.
Optimize the code style of C++, Python, CMake in the whole code repository. Introduce or upgrade the following code style checking tools.
- pre-commit is upgraded from 1.10.4 to 2.17.0： #43103
- pylint is changed from the default version to a specified version： #43103
- remove-crlf is upgraded from 1.0.1 to 1.1.14 ： #43103
- cpplint is changed from default version to specify as 1.6.0 ： #43175, #43978, #43673, #43679, #43695, #43733, #43740
- clang-format is upgraded from 3.8 to 13.0 ： #42840, #43248, #43329, #43333, #43633, #43678
- Introduce the black tool for python code style checking ：#46014
- Introduce the cmakelint tool for cmake file code checking. Version is 1.4.2 ： #43222, #43406, #43414, #43428
- Introduce cmake-format for automatic formatting of cmake files. Version is 0.6.13 ： #43057

6. Hardware adaptation

Hygon DCU

Add the Profiler function on DCU, to collect, count and display performance data of model running process on DCU, and support DCU occupancy display at kernel level.

Kunlunxin Chip

Add the Profiler function on second-generation Kunlunxin chips, to collect, count, and display performance data of the model running process, with support for occupancy display at kernel level.
Support training/inference on second-generation Kunlunxin chips (Kunlunxin AI accelerator cards R200, R300, R200-8F, R200-8FS, RG800). A total of 51 models, such as PPYOLOE, PP-OCR, ERNIE3.0, PP-TSM, PP-TTS, DLRM, and PPO, have been verified. Support static graph + dynamic graph training, mixed precision training, and single machine single card and single machine multi-card training. The verified models cover 5 fields: intelligent vision, natural language processing, intelligent speech, intelligent recommendation, and reinforcement learning.

Cambricon

Support the training/inference of Cambricon MLU chip (MLU370 series of boards): The ResNet50, BERT, YoloV3, OCR-DB, Deeplabv3 and many other models are verified. Support the static graph + dynamic graph training. Support mixed precision training. Support the single machine single card and single machine multi-card training.

Graphcore

Support the training/inference of Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU). Support ResNet50, BERT and other models. Support the static graph and dynamic-to-static mode training. Support the single chip, single machine, and multi-machine distributed training.
Add the support of more operators
Upgrade to Poplar SDK v3.0.0 #46892

Support training models in dynamic-to-static mode. Add a new paddle.incubate.identity_loss op to assist with network construction #43770
Support the Paddle native distributed training API: paddle.distributed.launch #43311
Support training models with mixed precision #41733
Paddle Inference supports custom operators by using PopART #45235

Intel

Migrate oneDNN operators: transpose2_grad(#46139), relu6_grad(#46501), gaussian_random(#46747, #45481), sgd and stack(#46374), concat+grad, expand+grad, fill_constant(#45863), slice, slice_grad, split, pad and pad3d(#46101), softmax_grad(#46257), Shape(#46051), Sum(#46239), Transpose2_grad(#46139), Cast, clip+grad and pool+grad(#45775), Reduce sum+grad, mean+grad, min and max(#45536), Relu and abs(#45397), Gelu(#45596), Scale(#45537)
Optimize kernels of fill_constant, fc, conv, and a number of operators
Add several Pass fusion optimizations
Optimize the Adam-W CPU FP32 optimizer (#42522)
Optimize pad3d fp32 onednn operator kernel implementation (#43990)
Optimize the concurrent execution of matmul, FC and lookup_v2 kernels (#44023, #44078, #44640, #44744, #45249)
FC onednn operator kernel supports bf16 ( #42758, #43154, #43109)
Add the fusion of matrix multiplication and activation functions (#43519, #43198)
Support convolution operator int8 parameter production IR passes ( #44680, #42625)
Add pool/avg quantization and scales correction (#44186)
Add the matmul and elementwise onednn operator kernel fusion (#45077)
Fix the QAT precision bug (#43693, #45936, #46378)
Migrate 42 oneDNN operator kernels to PHI operator library (#46374, #46101, #45989, #45863, #45775, #45626, #45536, #46501, #46257, #45596, #45537, #45481, #45397, #46239, #46139, #46051)
Quantize the elementwise_sub and shape operator kernels (#42854, #44124)

Thanks to our Contributors

This release contains contributions from:
