
Performance optimization for StreamSafeCudaAllocator #40718

Merged

Conversation


@From00 From00 commented Mar 18, 2022

PR types

Performance optimization

PR changes

Others

Describe

Background

PR #40460 optimized the performance of the Profiler's RecordEvent operations. Even after that optimization, enabling StreamSafeCUDAAllocator under the AutoGrowth strategy still adds roughly 2x overhead to GPU memory allocation, even when the multi-stream feature is not used (both StreamSafeCUDAAllocator and AutoGrowthBestFitAllocator contain RecordEvent instrumentation, so that optimization also benefits AutoGrowth, not just StreamSafeCUDAAllocator).

The extra allocation overhead mainly comes from: the read-write lock introduced by lazily creating Allocators, fetching the default stream from the DeviceContextPool, the spin lock that ProcessUnfreedAllocations uses for thread safety, and the heap allocation (new) and construction of the extra StreamSafeCUDAAllocation wrapper layer.

Since cross-stream memory operations account for only a small share of most models in practical use today, this PR optimizes StreamSafeCUDAAllocator's allocation performance for the no-cross-stream case. The goal: for Allocations that do not specify a stream (i.e., that use the default stream from the DeviceContext) and that are never later used across streams, introduce as little extra overhead as possible compared with the single-stream allocation path that existed before multi-stream support.

Optimizations

This PR makes two optimizations to StreamSafeCUDAAllocator's allocation path:

  1. Separate the default-stream and non-default-stream Allocators at the data-structure level. The Allocator for the default stream (the stream in the DeviceContext) is created eagerly at initialization; Allocators for non-default streams (streams passed in by the user) are created lazily. With this separation of storage and creation logic, the default-stream allocations that make up the vast majority neither need to fetch the stream from the DeviceContext nor pay the cost of the read-write lock.
  2. Check whether the unfreed_allocations_ list is empty without taking the lock, and only enter the spin-lock critical section to run ProcessUnfreedAllocations when the list actually contains unfreed Allocations.

Results

This PR reduces the extra cost of enabling StreamSafeCUDAAllocator on default-stream allocations from roughly 2x to roughly 55%.

Performance was measured with the following code:

TEST(StreamSafeCUDAAllocInterfaceTest, AllocInterfaceTest) {
  // Warm up the allocator with a large allocation, then release it.
  std::shared_ptr<Allocation> pre_allocation =
      AllocShared(platform::CUDAPlace(), 100000000);
  pre_allocation.reset();
  ProfilerStart("alloc.prof");  // start CPU profiling, output to alloc.prof
  for (int i = 0; i < 10000000; ++i) {
    // Allocate and immediately free a random-sized block on the default stream.
    size_t alloc_size = rand() % 10000000;
    std::shared_ptr<Allocation> allocation_implicit_stream =
        AllocShared(platform::CUDAPlace(), alloc_size);
    EXPECT_GE(allocation_implicit_stream->size(), alloc_size);
    allocation_implicit_stream.reset();
  }
  ProfilerStop();
}

With the optimizations in this PR applied, running the code above gives the following comparison before and after enabling multi-stream:

Before enabling multi-stream:
[profiler screenshot]

After enabling multi-stream:
[profiler screenshot]

The allocation operation (AllocFacade::Alloc) now only increases from 2.14s to 3.31s, an increase of 1.17s, i.e., roughly 55% overhead.
The remaining overhead mainly comes from:

  1. The heap allocation (new) for creating StreamSafeCUDAAllocation takes 0.52s, about 24% of the added overhead.
  2. With multi-stream enabled, AutoGrowthBestFitAllocator's allocation cost also rises from 1.83s to 2.09s, an increase of 0.26s, about 12% of the added overhead. This mostly comes from its new operations getting slower (0.58s to 0.83s) even though the code logic there is unchanged; presumably the extra new operations introduced by multi-stream also slow down AutoGrowth's own new calls.
  3. The StreamSafeCUDAAllocation constructor takes 0.14s, about 7% of the added overhead.
  4. StreamSafeCUDAAllocator's AllocateImpl logic takes 0.12s, about 6% of the added overhead.
  5. RecordEvent-related operations take 0.08s, about 4% of the added overhead.
  6. After the optimization, GetPrivate and GetAllocator add only an extra 0.03s, a small share.

On the deallocation side, the main cost is destructing and freeing the StreamSafeCUDAAllocation object, plus one extra deleter call and the CanBeFreed check; the latter two are a small share, and this PR does not optimize the deallocation path.

As the numbers show, the overhead remaining after optimization comes mostly from allocating, constructing, and destructing StreamSafeCUDAAllocation. This is inherent to the decorator pattern in which Allocators wrap one another layer by layer: unless that design is replaced wholesale, every new feature added as an Allocator layer unavoidably introduces another layer of construction and destruction per allocation, leaving little room for further optimization.

@paddle-bot-old

paddle-bot-old bot commented Mar 18, 2022

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@paddle-bot-old

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@From00 From00 changed the title from "Performance optimizatino for StreamSafeCudaAllocator" to "Performance optimization for StreamSafeCudaAllocator" on Mar 19, 2022
Contributor

@zhiqiu zhiqiu left a comment


LGTM

@From00 From00 merged commit d8bff98 into PaddlePaddle:develop Mar 23, 2022
@From00 From00 deleted the optimize-stream-safe-cuda-allocator branch April 4, 2022 12:29