
Performance optimization for StreamSafeCudaAllocator #40718

Merged

Conversation


@From00 From00 commented Mar 18, 2022

PR types

Performance optimization

PR changes

Others

Describe

Background

PR #40460 optimized the performance of the Profiler's RecordEvent operations. Even after that optimization, enabling StreamSafeCUDAAllocator under the AutoGrowth strategy still adds roughly 2x overhead to GPU memory allocation, even when the multi-stream feature is not used (both StreamSafeCUDAAllocator and AutoGrowthBestFitAllocator contain RecordEvent instrumentation, so that optimization also benefits AutoGrowth, not just StreamSafeCUDAAllocator).

The extra allocation overhead mainly comes from: the read-write lock introduced by lazily creating Allocators, fetching the default stream from the DeviceContextPool, the spin lock that ProcessUnfreedAllocations uses for thread safety, and the heap allocation (new) and construction of the extra StreamSafeCUDAAllocation wrapper layer.

Since cross-stream memory operations account for only a small share of most models in practical use today, this PR optimizes StreamSafeCUDAAllocator's allocation performance for the no-cross-stream case. The goal: for Allocations that do not specify a stream (i.e., that use the default stream from the DeviceContext) and that are never later used across streams, introduce as little extra overhead as possible compared with the single-stream allocation path that existed before multi-stream support.

Optimizations

This PR makes two optimizations to StreamSafeCUDAAllocator's allocation path:

  1. Separate the default-stream and non-default-stream Allocators at the data-structure level. The Allocator for the default stream (the stream in the DeviceContext) is created eagerly at initialization; Allocators for non-default streams (streams passed in by the user) are created lazily. With this separation of storage and creation logic, the default-stream allocations that make up the vast majority neither need to fetch the stream from the DeviceContext nor pay the cost of the read-write lock.
  2. Check whether the unfreed_allocations_ list is empty without taking the lock, and only enter the spin-lock critical section to run ProcessUnfreedAllocations when the list actually contains unfreed Allocations.

Results

This PR reduces the extra cost of enabling StreamSafeCUDAAllocator on default-stream allocations from roughly 2x to roughly 55%.

Performance was measured with the following code:

TEST(StreamSafeCUDAAllocInterfaceTest, AllocInterfaceTest) {
  // Warm up the allocator with a large allocation, then release it.
  std::shared_ptr<Allocation> pre_allocation =
      AllocShared(platform::CUDAPlace(), 100000000);
  pre_allocation.reset();
  ProfilerStart("alloc.prof");  // start CPU profiling, output to alloc.prof
  for (int i = 0; i < 10000000; ++i) {
    // Allocate and immediately free a random-sized block on the default stream.
    size_t alloc_size = rand() % 10000000;
    std::shared_ptr<Allocation> allocation_implicit_stream =
        AllocShared(platform::CUDAPlace(), alloc_size);
    EXPECT_GE(allocation_implicit_stream->size(), alloc_size);
    allocation_implicit_stream.reset();
  }
  ProfilerStop();
}

With the optimizations in this PR applied, running the code above gives the following comparison before and after enabling multi-stream:

Before enabling multi-stream:
[profiler screenshot]

After enabling multi-stream:
[profiler screenshot]

The allocation operation (AllocFacade::Alloc) now only increases from 2.14s to 3.31s, an increase of 1.17s, i.e., roughly 55% overhead.
The remaining overhead mainly comes from:

  1. The heap allocation (new) for creating StreamSafeCUDAAllocation takes 0.52s, about 24% of the added overhead.
  2. With multi-stream enabled, AutoGrowthBestFitAllocator's allocation cost also rises from 1.83s to 2.09s, an increase of 0.26s, about 12% of the added overhead. This mostly comes from its new operations getting slower (0.58s to 0.83s) even though the code logic there is unchanged; presumably the extra new operations introduced by multi-stream also slow down AutoGrowth's own new calls.
  3. The StreamSafeCUDAAllocation constructor takes 0.14s, about 7% of the added overhead.
  4. StreamSafeCUDAAllocator's AllocateImpl logic takes 0.12s, about 6% of the added overhead.
  5. RecordEvent-related operations take 0.08s, about 4% of the added overhead.
  6. After the optimization, GetPrivate and GetAllocator add only an extra 0.03s, a small share.

On the deallocation side, the main cost is destructing and freeing the StreamSafeCUDAAllocation object, plus one extra deleter call and the CanBeFreed check; the latter two are a small share, and this PR does not optimize the deallocation path.

As the numbers show, the overhead remaining after optimization comes mostly from allocating, constructing, and destructing StreamSafeCUDAAllocation. This is inherent to the decorator pattern in which Allocators wrap one another layer by layer: unless that design is replaced wholesale, every new feature added as an Allocator layer unavoidably introduces another layer of construction and destruction per allocation, leaving little room for further optimization.

@paddle-bot-old

paddle-bot-old bot commented Mar 18, 2022

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@paddle-bot-old

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@From00 From00 changed the title from "Performance optimizatino for StreamSafeCudaAllocator" to "Performance optimization for StreamSafeCudaAllocator" on Mar 19, 2022
Contributor

@zhiqiu zhiqiu left a comment


LGTM

@From00 From00 merged commit d8bff98 into PaddlePaddle:develop Mar 23, 2022
@From00 From00 deleted the optimize-stream-safe-cuda-allocator branch April 4, 2022 12:29