
optimize conv algo cache #41891

Merged (48 commits) on Aug 25, 2022
Conversation

phlrain (Collaborator) commented Apr 18, 2022

PR types

Performance optimization

PR changes

Others

Describe

Optimization target: the two cuDNN interfaces cudnnGetConvolutionForwardAlgorithm_v7 and cudnnGetConvolutionForwardWorkspaceSize both perform poorly; in re-scheduling scenarios they cause significant performance problems.

This PR builds on the auto-tune cache and makes the following upgrades:

  1. Cache the workspace size. The result returned by cudnnGetConvolutionForwardAlgorithm_v7 already contains the workspace size (the memory field), so it can be used directly; there is no need to query it again. The cache value is therefore upgraded from int64_t to DnnNode, which holds an int64_t and a size_t.
  2. Even when auto-tune is disabled, search results are now cached, reducing the number of calls to cudnnGetConvolutionForwardAlgorithm_v7. In a local experiment with 10,000,000 elements in the cache, an average lookup took 0.16 microseconds, while one search via cudnnGetConvolutionForwardAlgorithm_v7 costs roughly 70-100 microseconds; the lookup cost is far below the cost of a single cuDNN search.
  3. To prevent memory from blowing up, the cache holds at most 1,000,000 entries; beyond that it is forcibly cleared. Since an unordered_map consumes roughly 10x the memory of the data it actually stores, 1,000,000 entries of 16 bytes each (8 bytes for the int64_t, 8 bytes for the size_t) consume at most about 160 MB. Also, for most training jobs the number of cache entries is strongly tied to the shapes of the input images, so it is enumerable and does not grow without bound.
  4. Add groups and data_format fields to conv args. Without them, two configurations identical in every other attribute but differing in groups or data format collide, and such collisions make the conv kernel fail at execution.
  5. Redesign the cache's unordered_map. In the old design the hash key was computed externally and passed to the unordered_map, so if two different conv args produced the same key, the wrong algo and workspace were returned, causing runtime errors. In the new design the unordered_map's key is ConvCacheKey (it stores the same content as conv args; a separate struct is defined only to decouple compilation), with overloaded hash and equality functions to avoid collisions.

This optimization improves Mask R-CNN performance by roughly 20% at batch size (bs) = 1.
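The scheme in points 1, 3, and 5 can be sketched roughly as follows. This is a minimal illustration, not the actual Paddle code: the names DnnNode, ConvCacheKey, and CudnnAlgorithmsCache follow the PR description, but the member layout, the hash-combine constant, and the Find/Set interface are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Value cached per convolution configuration: the chosen algorithm id plus
// the workspace size already returned by cudnnGetConvolutionForwardAlgorithm_v7,
// so the size need not be queried again (point 1).
struct DnnNode {
  DnnNode() = default;
  DnnNode(int64_t a, size_t ws) : algo(a), workspace_size(ws) {}
  int64_t algo = 0;
  size_t workspace_size = 0;
};

// Key holding every attribute that distinguishes a convolution, including
// groups and data layout (point 4), to avoid collisions between configs.
struct ConvCacheKey {
  std::vector<int64_t> x_dims, w_dims, strides, paddings, dilations;
  int dtype = 0;
  int groups = 1;
  int64_t data_layout = 0;

  bool operator==(const ConvCacheKey& o) const {
    return x_dims == o.x_dims && w_dims == o.w_dims && strides == o.strides &&
           paddings == o.paddings && dilations == o.dilations &&
           dtype == o.dtype && groups == o.groups &&
           data_layout == o.data_layout;
  }
};

struct ConvCacheKeyHash {
  static void Combine(size_t& seed, size_t v) {
    // Boost-style hash combine; the constant here is an illustration choice.
    seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }
  size_t operator()(const ConvCacheKey& k) const {
    size_t seed = 0;
    for (auto d : k.x_dims) Combine(seed, std::hash<int64_t>()(d));
    for (auto d : k.w_dims) Combine(seed, std::hash<int64_t>()(d));
    for (auto d : k.strides) Combine(seed, std::hash<int64_t>()(d));
    for (auto d : k.paddings) Combine(seed, std::hash<int64_t>()(d));
    for (auto d : k.dilations) Combine(seed, std::hash<int64_t>()(d));
    Combine(seed, std::hash<int>()(k.dtype));
    Combine(seed, std::hash<int>()(k.groups));
    Combine(seed, std::hash<int64_t>()(k.data_layout));
    return seed;
  }
};

// The map owns the full key rather than a precomputed hash, so a hash
// collision degrades to an equality check, never to a wrong DnnNode (point 5).
class CudnnAlgorithmsCache {
 public:
  explicit CudnnAlgorithmsCache(size_t max_size = 1000000)
      : max_size_(max_size) {}

  bool Find(const ConvCacheKey& key, DnnNode* out) const {
    auto it = map_.find(key);
    if (it == map_.end()) return false;
    *out = it->second;
    return true;
  }

  void Set(const ConvCacheKey& key, const DnnNode& node) {
    // Hard cap to bound memory (point 3): clear everything when exceeded.
    if (map_.size() >= max_size_) map_.clear();
    map_[key] = node;
  }

 private:
  size_t max_size_;
  std::unordered_map<ConvCacheKey, DnnNode, ConvCacheKeyHash> map_;
};
```

Keeping the full key inside the map means two distinct configurations that happen to hash alike are still separated by the equality check, which is the collision fix point 5 describes.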

@paddle-bot-old commented:

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@phlrain changed the title from "optimize conv alog speed" to "optimize conv algo scache" on Apr 18, 2022
@phlrain changed the title from "optimize conv algo scache" to "optimize conv algo cache" on Apr 18, 2022
@JamesLim-sy (Contributor) left a comment:

Good Work.

    } else {
      result = FindAlgoHeuristic(args, ctx);
    }
    phi::autotune::DnnNode node(static_cast<int64_t>(result.algo),
                                result.workspace_size);
Contributor:

DnnNode duplicates much of SearchResult's functionality; it would be better to reuse SearchResult if possible. That said, on our side we will likely extend an AutoTuneResult type from DnnNode later.

Collaborator (Author):

I had a version that used SearchResult, but the template parameter T in SearchResult is cudnnConvolutionFwdAlgoPerf_t, which would make cache.h depend on gpu_info.h. Since cache.h is also used in CPU-only builds, that breaks compilation.

    * Value Range: int32, default=1000000
    * Example:
    */
    PADDLE_DEFINE_EXPORTED_int32(search_cache_max_number, 1000000,
Contributor:

I'd like to understand the validation range behind the 1,000,000 setting: does it cover the cuDNN versions in common use? The main concern is avoiding a version issue where the cost of Find exceeds the cost of interfaces like cudnnGetConvolutionForwardAlgorithm_v7.

Collaborator (Author):

To prevent memory from blowing up, the cache holds at most 1,000,000 entries; beyond that it is forcibly cleared. Since an unordered_map consumes roughly 10x the memory of the data it actually stores, 1,000,000 entries of 16 bytes each (8 bytes for the int64_t, 8 bytes for the size_t) consume at most about 160 MB. Also, for most training jobs the number of cache entries is strongly tied to the shapes of the input images, so it is enumerable and does not grow without bound.

The 1,000,000 setting is mainly about bounding memory usage. The entry count is strongly tied to the model's inputs and weights, and has little to do with the cuDNN version.
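As a quick sanity check, the 160 MB figure follows directly from the author's stated assumptions (16-byte entries, a roughly 10x unordered_map overhead factor). The constants below just restate that arithmetic for a typical 64-bit platform:

```cpp
#include <cstddef>
#include <cstdint>

// FLAGS_search_cache_max_number: the hard cap on cache entries.
constexpr std::size_t kMaxEntries = 1000000;

// One cached entry: int64_t algo + size_t workspace_size (8 + 8 bytes
// on a typical 64-bit platform).
constexpr std::size_t kEntryBytes = sizeof(std::int64_t) + sizeof(std::size_t);

// Assumed unordered_map overhead factor from the PR description (~10x
// the payload); this is the author's rough estimate, not a measurement.
constexpr std::size_t kOverheadFactor = 10;

// Worst-case memory bound: 1,000,000 * 16 * 10 = 160,000,000 bytes, i.e.
// about 160 MB.
constexpr std::size_t kMaxBytes = kMaxEntries * kEntryBytes * kOverheadFactor;
```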

JamesLim-sy previously approved these changes Apr 24, 2022
@JamesLim-sy (Contributor) left a comment:

LGTM

paddle-bot-old bot commented May 8, 2022

Sorry to inform you that b15a4be's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@JamesLim-sy (Contributor) left a comment:

Good work!

    };

    template <typename AlgorithmT>
    class CudnnAlgorithmsCache {
Contributor:

CudnnAlgorithmsCache is nominally specific to ConvCudnn. Could the template parameter AlgorithmT be dropped and replaced with DnnNode, for example in the Get method below:

    DnnNode Get(const ConvCacheKey& key) {
          ......
    };

Collaborator (Author):

done

    if (auto_tune_map_.find(key) == auto_tune_map_.end()) {
      AlgorithmsCacheMap cache;
      auto_tune_map_[key] = cache;
      if (algo_type == AlgorithmType::kTranspose) {
Contributor:

ConvTranspose is currently the only op that supports auto-tuning. To make it easier to add more ops later, suggest changing the condition
if (algo_type == AlgorithmType::kTranspose) { to
if (static_cast<size_t>(algo_type) >= static_cast<size_t>(AlgorithmType::kTranspose)) {
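The suggested condition can be illustrated with a minimal sketch. The enumerator names and values below are assumptions for illustration; only AlgorithmType::kTranspose appears in the quoted diff. The idea is that tunable types occupy the high end of the enum, so a single >= test admits future tunable ops without editing the condition each time:

```cpp
#include <cstddef>

// Hypothetical layout: non-tunable algorithm types come first, and every
// tunable type sits at or after kTranspose.
enum class AlgorithmType : std::size_t {
  kConvForward = 0,
  kConvBackwardData = 1,
  kConvBackwardFilter = 2,
  kTranspose = 3,
  // Future tunable ops would be appended here, e.g. kMatmul = 4.
};

// A >= comparison covers kTranspose and everything appended after it.
inline bool IsTunable(AlgorithmType t) {
  return static_cast<std::size_t>(t) >=
         static_cast<std::size_t>(AlgorithmType::kTranspose);
}
```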

Collaborator (Author):

done

    paddle::experimental::CppTypeToDataType<T>::Type());
    paddle::experimental::CppTypeToDataType<T>::Type(),
    group,
    static_cast<int64_t>(data_layout));  // (todo, hong) data layout is a
Contributor:

GetCacheKey has already been superseded by Convert2ConvCacheKey, so this can probably just be deleted.

Collaborator (Author):

done

    groups_(groups),
    data_layout_(data_layout) {}
    size_t hash_value() const {
      return ConvKey(x_dims_,
Contributor:

ConvKey has been superseded as well; suggest replacing this with the body of the original ConvKey directly:

      return GetKey(x_dims_,
                    w_dims_,
                    strides_,
                    paddings_,
                    dilations_,
                    dtype_,
                    groups_,
                    data_layout_);

Collaborator (Author):

done

@chenwhql (Contributor) left a comment:

LGTM

@JamesLim-sy (Contributor) left a comment:

LGTM

@luotao1 (Contributor) left a comment:

LGTM for 'self.assertTrue(np.allclose(...))' and 'self.assertTrue(np.array_equal(...))'. Please fix them in the next PR.

4 participants