add design doc for FPGA on Paddle #4027
doc/design/baidu_fpga/README.MD
Outdated
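We plan to use Baidu FPGA cloud hosts and the Polaris deep learning acceleration library to integrate FPGA into PaddlePaddle, so that it supports more heterogeneous hardware.
- Baidu FPGA cloud server: an FPGA compute instance provided in the Baidu Cloud environment, with which FPGA hardware-accelerated programs can be built quickly [1].
- Polaris: a high-performance, FPGA-based deep learning compute library developed by Baidu, which gives users convenient access to the functionality implemented on Baidu FPGA \[[2](#references)\].
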
Is the Polaris library open source?

The header files and .a files of the Polaris library will be published on GitHub.

Then could this be changed to:
Polaris: a high-performance, FPGA-based deep learning compute library developed by Baidu (to be open-sourced, including header files and .a files)?

OK.

doc/design/baidu_fpga/README.MD
Outdated
```C++
PADDLE_ENFORCE(!platform::is_fpga_place(ctx.GetPlace()),
               "It can not use FPGAPlace.");
```

Could the FPGA ops be written as FPGAXXX_op.cc, that is, kept apart from the existing XXX_op.cc? The main considerations are:
- If they are written together, it is hard to separate them out when compiling pure CPU code for embedded devices.
- It makes the FPGA code easier to maintain. TensorFlow's MKL-related ops are also written separately. The MKLDNNXXX_op to be added later will likewise be written as a separate op.

> Could the FPGA ops be written as FPGAXXX_op.cc?

That is fine, but if many ops are supported there will be a lot of FPGAXXX_op.cc files. Is that a problem?

No problem, that is clearer.

@luotao1 Taking lookup_table as an example, the current GPU registration code is
REGISTER_OP_GPU_KERNEL(lookup_table, ops::LookupTableCUDAKernel);
Is what we need to implement the following?
REGISTER_OP_FPGA_KERNEL(lookup_table, ops::LookupTableFPGAKernel);
Or is it:
REGISTER_OP_FPGA_KERNEL(fpga_lookup_table, ops::LookupTableFPGAKernel);
In other words, is this only a separation of code, or is the op itself a separate op?

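
The two registration options in the question above can be contrasted with a small sketch. It is a toy registry written only for illustration: `KernelRegistry`, `Device`, `SharedOpRegistry`, and `SeparateOpRegistry` are hypothetical names, not PaddlePaddle's actual registration machinery, and the kernels merely return a label.

```C++
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not PaddlePaddle's real registry.
enum class Device { kCPU, kGPU, kFPGA };
using Kernel = std::function<std::string()>;

class KernelRegistry {
 public:
  // Store a kernel under an (op type, device) key.
  void Register(const std::string& op, Device dev, Kernel kernel) {
    kernels_[std::make_pair(op, dev)] = std::move(kernel);
  }
  // True if a kernel was registered for this op type on this device.
  bool Has(const std::string& op, Device dev) const {
    return kernels_.count(std::make_pair(op, dev)) > 0;
  }
  // Run the kernel; throws std::out_of_range if none was registered.
  std::string Run(const std::string& op, Device dev) const {
    return kernels_.at(std::make_pair(op, dev))();
  }

 private:
  std::map<std::pair<std::string, Device>, Kernel> kernels_;
};

// Option 1: code separation only. The FPGA kernel joins the existing op type.
inline KernelRegistry SharedOpRegistry() {
  KernelRegistry r;
  r.Register("lookup_table", Device::kGPU, [] { return std::string("cuda"); });
  r.Register("lookup_table", Device::kFPGA, [] { return std::string("fpga"); });
  return r;
}

// Option 2: op separation. The FPGA kernel lives under its own op type.
inline KernelRegistry SeparateOpRegistry() {
  KernelRegistry r;
  r.Register("lookup_table", Device::kGPU, [] { return std::string("cuda"); });
  r.Register("fpga_lookup_table", Device::kFPGA,
             [] { return std::string("fpga"); });
  return r;
}
```

Under option 1 a model that names `lookup_table` can run on any device that has a kernel for it; under option 2 the model itself would have to name `fpga_lookup_table`, which is the cost that op separation carries.
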
One suggestion: would it be better to add XXX_op_fpga.cc for the FPGA logic rather than FPGAXXX_op.cc? The main consideration is that, for a given op, the CPU, GPU, and FPGA implementations would then sit next to each other in the directory listing.

@QingshuChen That would indeed be clearer.

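> Is this only a separation of code, or is the op itself a separate op?

I am not very familiar with FPGA, so I will use the current MKLDNN integration work as an example to explain my view:
- MKLDNNXXX_layer is written differently from the original Paddle layers, because the MKLDNN library stores data differently from Paddle's NCHW layout, so an MKLDNNMatrix class has to be defined first to manage all the data. As a result, the MKLDNN library cannot be called directly inside the original layers the way the MKLBlas library is.
- Is the FPGA data storage format the same as that of the refactored Tensor, or easy to convert to it?
- If the current lookup_table_op.cc fully meets FPGA's needs, that is, only a lookup_table_fpga_op.cc implementing LookupTableFPGAKernel needs to be added, then code separation is enough and op separation is not needed.
- Conversely, if FPGA has its own data storage format and would add a lot of logic to lookup_table_op.cc, then op separation is needed.

@QingshuChen XXX_op_fpga.cc and XXX_op_fpga.h are indeed a bit better.
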
@shijiaxin For now, treating it as a kernel registered on the existing op seems fine.
However, in TensorFlow's design the MKL part uses separate kernels and separate ops. It remains to be confirmed whether MKL has some pitfall here (possibly because MKL has its own data format that has to be converted manually when used). For GPU and FPGA, crossing devices should in both cases require explicitly adding a copy operator that is responsible for copying the data. See #4031 for reference.

> The format used is the same as the standard CPU and GPU format.

This could be emphasized in the document; if there is an official link describing the format, it would be best to include it. In that case XXX_op_fpga.cc can be added with XXXFPGAKernel implemented in it: code separation, with no need for op separation.

doc/design/baidu_fpga/README.MD
Outdated
We plan to develop on top of the PaddlePaddle refactoring in progress. The goals are:

- Baidu FPGA supports most common deep learning Operators.
- Baidu FPGA supports most common deep learning models.

By "common", do you mean that both image and NLP models are supported?

We plan to cover all of them.

This could be emphasized in the document: image, NLP, language, and so on. My list may not be complete.

OK, thx.

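**Note**: Because FPGA is less flexible than CPU and GPU, FPGA can support only most Operators, not all of them.

### Net
A Net contains a series of Operators, and at present all Operators in a Net can only run on the same device. Because FPGA is not flexible enough, some Operators may not support FPGA, so a Parallel-nn-like approach is needed in which some Operators run on FPGA and others run on CPU or GPU.
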
There is a discussion on multi-device execution in #4031. It is still in the design stage, so feel free to join the discussion.

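
As a rough illustration of the mixed-device Net described above, the sketch below places each op on FPGA when it is supported there, falls back to CPU otherwise, and inserts an explicit copy op at every device boundary. It is only a sketch under stated assumptions: `PlacedOp`, `PlaceOps`, and the `"copy"` op type are hypothetical names, and the actual multi-device design is still being worked out in #4031.

```C++
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical device tag; the real design would use Place types.
enum class Device { kCPU, kFPGA };

struct PlacedOp {
  std::string type;
  Device device;
};

// Place each op on FPGA when its type is in `fpga_supported`, otherwise fall
// back to CPU. Whenever two consecutive ops end up on different devices, an
// explicit "copy" op is inserted to move the data onto the next device.
inline std::vector<PlacedOp> PlaceOps(
    const std::vector<std::string>& ops,
    const std::set<std::string>& fpga_supported) {
  std::vector<PlacedOp> plan;
  for (const std::string& op : ops) {
    const Device dev =
        fpga_supported.count(op) > 0 ? Device::kFPGA : Device::kCPU;
    if (!plan.empty() && plan.back().device != dev) {
      plan.push_back({"copy", dev});  // copy the previous output onto `dev`
    }
    plan.push_back({op, dev});
  }
  return plan;
}
```

For a Net of three ops in which only the middle one lacks an FPGA kernel, the plan becomes five ops long: the FPGA op, a copy to CPU, the CPU op, a copy back to FPGA, and the final FPGA op.
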
doc/design/baidu_fpga/README.MD
Outdated
<div align="center">
<img src="image/overview.png" width=350><br/>
Figure 1. FPGA on Paddle.
</div>

This figure could be made a bit smaller.

OK.

Should this design doc also consider how FPGA unit tests would be integrated into TeamCity and Travis CI?

doc/design/baidu_fpga/README.MD
Outdated
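A Net contains a series of Operators, and at present all Operators in a Net can only run on the same device. Because FPGA is not flexible enough, some Operators may not support FPGA, so a Parallel-nn-like approach is needed in which some Operators run on FPGA and others run on CPU or GPU.

### UnitTest
Unit tests for FPGA-related code will be added to the modules that are modified. For example, once FPGAPlace is added, FPGA unit tests need to be added to place_test.cc.
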
Unit tests may be a problem, because the CI probably does not support FPGA devices yet.

We can provide an FPGA cloud host on which to run the CI tests. Would that work?

That works. But is one machine enough?
- TeamCity currently has three machines, and PRs still end up queuing when there are many of them.
- Presumably all PRs will have to pass the FPGA CI tests in the future.

Great! That works; a TeamCity instance may need to be set up on it.

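
To make the UnitTest item concrete, here is a minimal sketch of the kind of check that place_test.cc might gain once FPGAPlace exists. It assumes a simplified `Place` struct in place of the real variant-based type; `PlaceKind`, the helper constructors, and `TestFpgaPlace` are hypothetical, while `is_fpga_place` and `places_are_same_class` are the names that appear in the design doc excerpts.

```C++
#include <cassert>

// Simplified stand-in for paddle::platform::Place; the real type differs.
enum class PlaceKind { kCPU, kGPU, kFPGA };

struct Place {
  PlaceKind kind;
  int device;  // device index; unused for CPU
};

inline Place CPUPlace() { return {PlaceKind::kCPU, 0}; }
inline Place GPUPlace(int device) { return {PlaceKind::kGPU, device}; }
inline Place FPGAPlace(int device) { return {PlaceKind::kFPGA, device}; }

inline bool is_fpga_place(const Place& p) {
  return p.kind == PlaceKind::kFPGA;
}

// Two places are of the same class when they name the same kind of device,
// regardless of the device index.
inline bool places_are_same_class(const Place& a, const Place& b) {
  return a.kind == b.kind;
}

// Checks of the sort an FPGA case in place_test.cc could contain.
inline void TestFpgaPlace() {
  assert(is_fpga_place(FPGAPlace(0)));
  assert(!is_fpga_place(CPUPlace()));
  assert(!is_fpga_place(GPUPlace(0)));
  assert(places_are_same_class(FPGAPlace(0), FPGAPlace(1)));
  assert(!places_are_same_class(FPGAPlace(0), GPUPlace(0)));
}
```

Checks like these do not touch FPGA hardware, so they could run on the existing CI machines; only kernel-level tests would need the FPGA cloud host discussed above.
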
doc/design/baidu_fpga/README.MD
Outdated
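@@ -58,7 +57,7 @@ bool places_are_same_class(const Place &p1, const Place &p2) {
 The PaddlePaddle refactoring implements a DeviceContext class that manages resources on the corresponding device; at present only CPU and GPU versions are supported. We will add an FPGADeviceContext to manage the resources of FPGA devices.

 ### Operator
-The refactored PaddlePaddle is Operator-based: different devices (CPU, GPU) share one Op definition, and OpKernel provides the Compute method \[[3](#references)\]. We will add an FPGAKernel to the existing XXX_op.cc files to implement the FPGA compute logic. For operations that FPGA does not support, we add a check in the compute function:
+The refactored PaddlePaddle is Operator-based: different devices (CPU, GPU) share one Op definition, and OpKernel provides the Compute method \[[3](#references)\]. The FPGA data format is the same as that of CPU and GPU, so we plan to add XXX_op_FPGA.cc files that implement the FPGAKernel compute logic and register it on the existing Operator. For operations that FPGA does not support, we add a check in the compute function:
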
> add a check in the compute function

Is the check for ops that are not supported, or for ops that are not yet implemented? At present ops that do not support GPU have no such check either; by default, an op without a .cu file is an op that does not support GPU.

> At present ops that do not support GPU have no such check either; by default, an op without a .cu file is an op that does not support GPU.

Then can we treat the absence of XXX_op_FPGA.cc as meaning that FPGA is not supported?
Also, if there is no .cu file but the place passed to the op is GPUPlace, how does the program behave?

> Can we treat the absence of XXX_op_FPGA.cc as meaning that FPGA is not supported?

Yes. Currently the GPU code is compiled with nv_library only when a .cu file exists; when it does not exist, nothing is compiled.

> If there is no .cu file but the place passed to the op is GPUPlace, how does the program behave?

I do not quite follow. Do you mean that the previous op ran on the GPU and this op has no GPU implementation, so it cannot run?

What is currently passed to an op is a DeviceContext. If the CPU version is compiled, CUDADeviceContext is not built at all, and only a CPUDeviceContext can be created.
Conditional compilation could be considered: if WITH_FPGA=OFF, the FPGA-related code is not compiled in either.

For operations that FPGA does not support, simply not calling REGISTER_OP_FPGA_KERNEL should be enough to indicate that they are unsupported or unimplemented.
Implementing something like an IsSupportFPGA() function in OpBase would suffice.

@shijiaxin Yes, that will do.

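
The approach agreed on in this thread (support is implied by registration, and FPGA code is compiled only when enabled) could look roughly like the sketch below. It is an assumption-laden illustration: `PADDLE_WITH_FPGA` is a guessed macro name for the `WITH_FPGA=ON` build, and `FpgaKernelOps` and `RegisterFpgaKernel` are hypothetical helpers standing in for whatever REGISTER_OP_FPGA_KERNEL would expand to. `IsSupportFPGA` is written as a free function here rather than as a member of OpBase.

```C++
#include <cassert>
#include <set>
#include <string>

// Op types for which an FPGA kernel has been registered (hypothetical).
inline std::set<std::string>& FpgaKernelOps() {
  static std::set<std::string> ops;
  return ops;
}

// Stand-in for what a REGISTER_OP_FPGA_KERNEL-style macro might call.
// Returns true so it can initialize a static registrar variable.
inline bool RegisterFpgaKernel(const std::string& op_type) {
#ifdef PADDLE_WITH_FPGA  // assumed macro name for a WITH_FPGA=ON build
  FpgaKernelOps().insert(op_type);
#else
  (void)op_type;  // FPGA disabled at build time: nothing is registered
#endif
  return true;
}

// An op supports FPGA exactly when an FPGA kernel was registered for it,
// so no per-op check inside Compute is needed.
inline bool IsSupportFPGA(const std::string& op_type) {
  return FpgaKernelOps().count(op_type) > 0;
}
```

With this shape, an op that has no XXX_op_FPGA.cc never calls the registration helper and is therefore reported as unsupported, which mirrors how a missing .cu file is treated for GPU.
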
@QiJune Can the design doc be merged now?

We will integrate Polaris into PaddlePaddle as a third-party library; with Baidu FPGA cloud servers, FPGA-based PaddlePaddle applications can be built quickly. The overall framework is shown below:
<div align="center">
<img src="image/overview.png" width=280><br/>

I feel this figure may not be necessary.

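- Polaris: a high-performance, FPGA-based deep learning compute library developed by Baidu (to be open-sourced, including header files and .a files), which gives users convenient access to the functionality implemented on Baidu FPGA \[[2](#references)\].

We plan to develop on top of the PaddlePaddle refactoring in progress. The goals are:
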
A design doc does not need to talk about plans.

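I read this document carefully. It is a work plan, not a design. It is not suited to being posted as a PR; creating a GitHub Project would be more appropriate. One example is: https://github.com/PaddlePaddle/Paddle/projects/19
That example has three columns: todo, pending, done. Each card/task corresponds to a GitHub Issue and can have an owner.
In addition, I looked at the FPGA documentation at http://fpga.baidu.com/doc/v1.0/md_docs_int8.html . What it provides is a C++ API rather than a C API. The C++ API includes memory management for both CPU and FPGA:
auto c_encoded = AlloateCPU<float>(size_c);
auto a_fpga = AlloateFPGA<float>(size_a);
I do not know whether there will be efficiency problems when this design is merged with Paddle's own memory management design, because I cannot see how AllocateCPU is implemented.
I suggest making the Polaris interface library pure C, like the other BLAS implementations such as MKL, OpenBLAS, and cuBLAS. It also need not provide CPU memory management. That makes it easier to integrate into software systems at a higher level of abstraction (such as PaddlePaddle).
If the FPGA team starts a separate project that wraps the C API in a C++ or Python interface, that will also be more convenient.
Finally, I suggest open-sourcing the related software as early as possible, so that PaddlePaddle's comments and suggestions on Polaris can be filed as GitHub Issues. Problems found early can be anticipated and solved early, which ensures that Polaris and PaddlePaddle are compatible.
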
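
The suggestion above, a pure C allocation interface with ownership decided by the higher-level framework, can be sketched as follows. Everything here is an assumption made for illustration: `demo_device_malloc` and `demo_device_free` are placeholder names backed by host `malloc`, not Polaris functions, and `DeviceBuffer` is a hypothetical framework-side wrapper.

```C++
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counter used only so the sketch can show when buffers are released.
inline int& LiveDeviceBuffers() {
  static int live = 0;
  return live;
}

// Hypothetical C-style allocation interface, backed by host malloc for the
// sketch. These names are placeholders and are NOT Polaris functions.
inline void* demo_device_malloc(std::size_t bytes) {
  void* p = std::malloc(bytes);
  if (p != nullptr) ++LiveDeviceBuffers();
  return p;
}

inline void demo_device_free(void* p) {
  if (p != nullptr) --LiveDeviceBuffers();
  std::free(p);
}

// Framework-side ownership: the caller, not the library, decides lifetime.
struct DeviceDeleter {
  void operator()(void* p) const { demo_device_free(p); }
};
using DeviceBuffer = std::unique_ptr<void, DeviceDeleter>;

inline DeviceBuffer AllocDevice(std::size_t bytes) {
  return DeviceBuffer(demo_device_malloc(bytes));
}
```

Because the library side exposes only a malloc/free pair, the framework is free to route these calls through its own memory management layer or to wrap them in whatever ownership type it already uses.
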
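This example needs to be updated; our interface does in fact use a C interface. For memory management we provide polaris_malloc() and polaris_malloc_host() for allocation, and do not use C++ pointers.
Here we provide
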
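OK. I strongly suggest that the Polaris project set up a GitHub repo as soon as possible, so that discussions about Polaris like this one can take place there. I expect that the more people learn about Polaris, the more questions there will be, and there needs to be a place for open discussion. Only then will we know how to integrate Polaris into PaddlePaddle, and so create a Project in the PaddlePaddle repo for the integration work. PaddlePaddle is certainly willing to support FPGA. The question now is how to support it well. An efficient workflow needs to be established.
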
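@wangkuiyi Hello, our repo has been created; you can see it here. The header files and .a files are available now, and more detailed documentation and usage examples can be found in the Manual. Please take a look, and if you have any questions feel free to open an issue directly. (= Also, should we create the project in the Paddle repo now and work on it in parallel, or wait until you have become familiar with Polaris before creating it?
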
@zealoct OK. PaddlePaddle is discussing how to make use of multiple computation acceleration devices. A design doc will be sent out shortly.