optimize conv algo cache #41891
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
… try_to_fix_conv_speed
Good Work.
} else {
  result = FindAlgoHeuristic(args, ctx);
}
phi::autotune::DnnNode node(static_cast<int64_t>(result.algo),
                            result.workspace_size);
DnnNode overlaps heavily with SearchResult in functionality; it would be better to replace it with SearchResult if possible. That said, we will likely extend DnnNode into an AutoTuneResult type later.
I had a version that used SearchResult, but the template parameter T in SearchResult is cudnnConvolutionFwdAlgoPerf_t, which would make cache.h depend on gpu_info.h. Since cache.h is also used in CPU-only scenarios, that breaks compilation.
paddle/fluid/platform/flags.cc
Outdated
 * Value Range: int32, default=1000000
 * Example:
 */
PADDLE_DEFINE_EXPORTED_int32(search_cache_max_number, 1000000,
I'd like to understand the validation scope of the 1000000 setting here: does it cover the cuDNN versions in common use? The main concern is avoiding a situation where, due to version differences, the Find overhead ends up exceeding that of interfaces like cudnnGetConvolutionForwardAlgorithm_v7.
To keep memory from blowing up, the cache holds at most 1,000,000 entries; beyond that it is forcibly cleared. An unordered_map consumes roughly 10x the memory of the data it actually stores, and each of the 1,000,000 elements is 16 bytes (8 bytes for the int64_t plus 8 bytes for the size_t), so the maximum memory consumed is about 160 MB. Also, for most training jobs the number of cache entries is strongly tied to the input image shapes, so it is enumerable and will not grow without bound.
The 1,000,000 setting is mainly about memory usage. That count is strongly tied to the model's inputs and weights, and has little to do with the cuDNN version.
LGTM
Sorry to inform you that b15a4be's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Good work!
paddle/phi/kernels/autotune/cache.h
Outdated
};

template <typename AlgorithmT>
class CudnnAlgorithmsCache {
CudnnAlgorithmsCache is, by name, specific to ConvCudnn. Can the template parameter AlgorithmT be dropped and replaced with DnnNode? For example, the Get method below:

DnnNode Get(const ConvCacheKey& key) {
  ......
};
done
paddle/phi/kernels/autotune/cache.h
Outdated
if (auto_tune_map_.find(key) == auto_tune_map_.end()) {
  AlgorithmsCacheMap cache;
  auto_tune_map_[key] = cache;
  if (algo_type == AlgorithmType::kTranspose) {
At present only Conv and Transpose support auto-tuning. To make it easier to add more OPs later, I suggest changing the condition

if (algo_type == AlgorithmType::kTranspose) {

to

if (static_cast<size_t>(algo_type) >= static_cast<size_t>(AlgorithmType::kTranspose)) {
done
paddle::experimental::CppTypeToDataType<T>::Type());
paddle::experimental::CppTypeToDataType<T>::Type(),
group,
static_cast<int64_t>(data_layout));  // (todo,hong) data layout is a
GetCacheKey's functionality has been superseded by Convert2ConvCacheKey; I think this can simply be deleted.
done
paddle/phi/kernels/autotune/cache.h
Outdated
groups_(groups),
data_layout_(data_layout) {}
size_t hash_value() const {
  return ConvKey(x_dims_,
ConvKey's functionality has been superseded; I suggest using the original ConvKey function body directly here:

return GetKey(x_dims_,
              w_dims_,
              strides_,
              paddings_,
              dilations_,
              dtype_,
              groups_,
              data_layout_);
done
LGTM
LGTM
LGTM for 'self.assertTrue(np.allclose(...))' and 'self.assertTrue(np.array_equal(...))'. Please fix them in the next PR.
PR types
Performance optimization
PR changes
Others
Describe
Optimization target: the cudnnGetConvolutionForwardAlgorithm_v7 and cudnnGetConvolutionForwardWorkspaceSize interfaces provided by cuDNN both have relatively low performance; in scenarios with heavy re-scheduling they cause significant performance problems.
This PR builds several optimization upgrades on top of the auto-tune cache; the upgrades are as follows:
This optimization enables Mask R-CNN at bs=1 to gain roughly a 20% performance improvement.