add design doc for FPGA on Paddle #4027

Closed
wants to merge 3 commits into from
84 changes: 84 additions & 0 deletions doc/design/baidu_fpga/README.MD
@@ -0,0 +1,84 @@
# Baidu FPGA on PaddlePaddle: Design Doc

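We plan to integrate FPGAs into PaddlePaddle using Baidu FPGA cloud servers and the deep learning acceleration library Polaris, so that PaddlePaddle supports more kinds of heterogeneous hardware.
- Baidu FPGA cloud server: an FPGA compute instance in the Baidu Cloud environment, on which FPGA hardware-accelerated programs can be built quickly \[[1](#references)\].
- Polaris: a high-performance, FPGA-based deep learning compute library developed by Baidu, which lets users conveniently call the functions implemented on Baidu FPGAs \[[2](#references)\].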

> **Contributor:** Is the Polaris library open source?

> **Contributor Author:** The Polaris library will publish its header files and `.a` files on GitHub.

> **Contributor:** Could it then be changed to:
> Polaris: a high-performance, FPGA-based deep learning compute library developed by Baidu (to be open-sourced, including header files and `.a` files)?

> **Contributor Author:** OK.




We plan to build on the refactored PaddlePaddle, with the following goals:

> **Collaborator:** A design doc does not need to talk about plans.


- Baidu FPGAs support most common deep learning operators.
- Baidu FPGAs support most common deep learning models.

> **Contributor:** By "common", do you mean that both image and NLP models are supported?

> **Reply:** We plan to cover both.

> **Contributor:** You could emphasize this in the doc: it covers image, NLP, speech, and so on. I may not have listed everything.

> **Reply:** OK, thanks.



## Contents

- [Overview](#overview)
- [Actions](#actions)
  - [CMake](#cmake)
  - [Place](#place)
  - [Memory](#memory)
  - [DeviceContext](#devicecontext)
  - [Operator](#operator)
  - [Net](#net)
  - [UnitTest](#unittest)
  - [Python API](#python-api)
- [References](#references)

## Overview

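We will integrate Polaris into PaddlePaddle as a third-party library, so that FPGA-based PaddlePaddle applications can be built quickly on Baidu FPGA cloud servers. The overall architecture is shown below: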
<div align="center">
<img src="image/overview.png" width=350><br/>
Figure 1. FPGA on Paddle.
</div>

> **Contributor:** This figure could be made a bit smaller.


## Actions
We divide the integration work roughly into the following areas.

### CMake
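We will add a `WITH_FPGA` option to `CMakeLists.txt`; setting it to `ON` enables FPGA compilation. We will also create a `polaris.cmake` file in the `cmake/external` directory, which downloads the Polaris headers and libraries while PaddlePaddle is being built and places them in PaddlePaddle's `third_party` directory.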
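As a minimal sketch of how the `WITH_FPGA` switch could reach the C++ sources, assume (hypothetically) that turning the option `ON` makes CMake pass a compile definition such as `-DPADDLE_WITH_FPGA`. FPGA-only code can then be guarded by the preprocessor, so CPU-only builds never reference Polaris. The macro name and both helper functions are illustrative, not existing PaddlePaddle identifiers.

```cpp
#include <string>

// Hypothetical: assumes WITH_FPGA=ON makes CMake pass -DPADDLE_WITH_FPGA.
// Reports whether FPGA support was compiled into this binary.
inline bool IsFPGACompiledIn() {
#ifdef PADDLE_WITH_FPGA
  return true;
#else
  return false;
#endif
}

// Describes the build configuration, e.g. for a startup log line.
inline std::string BuildDescription() {
  return IsFPGACompiledIn() ? "CPU + FPGA (Polaris)" : "CPU only";
}
```

Built without the definition, `IsFPGACompiledIn()` returns `false`, and any code inside `#ifdef PADDLE_WITH_FPGA` blocks is dropped entirely.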

### Place
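The refactored PaddlePaddle uses the Place class to record which device data lives on; it currently supports CPUPlace and GPUPlace. We will add an FPGAPlace to `place.h` to mark data that lives on an FPGA device, with an `int` device id field so that multiple FPGA devices are supported. Following the CPUPlace and GPUPlace implementations, we will also provide helpers such as `is_fpga_place`.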
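A minimal sketch of the extended Place type, assuming Place is a variant over the per-device place structs. `std::variant` (C++17) stands in for PaddlePaddle's own variant type so the snippet is self-contained; the field name `device` and the helper signatures are assumptions modeled on the CPUPlace/GPUPlace style described above.

```cpp
#include <variant>

struct CPUPlace {};

struct GPUPlace {
  explicit GPUPlace(int d = 0) : device(d) {}
  int device;
};

// Proposed: marks data that lives on FPGA card `device`.
struct FPGAPlace {
  explicit FPGAPlace(int d = 0) : device(d) {}
  int device;  // lets one host address several FPGA cards
};

using Place = std::variant<CPUPlace, GPUPlace, FPGAPlace>;

inline bool is_cpu_place(const Place &p) {
  return std::holds_alternative<CPUPlace>(p);
}

inline bool is_gpu_place(const Place &p) {
  return std::holds_alternative<GPUPlace>(p);
}

inline bool is_fpga_place(const Place &p) {
  return std::holds_alternative<FPGAPlace>(p);
}
```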

**Note**: Some code in the refactored PaddlePaddle assumes that a device is either a GPU or a CPU, for example:
```C++
// Assumes every Place is either a GPUPlace or a CPUPlace.
bool places_are_same_class(const Place &p1, const Place &p2) {
  return is_gpu_place(p1) == is_gpu_place(p2);
}
```
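Once FPGAPlace is added, the logic of code like this must be changed, because an FPGAPlace would otherwise be treated as the same class as a CPUPlace.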
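One way to drop the CPU-or-GPU assumption is to compare which alternative each place holds instead of testing only for GPU. The sketch again uses `std::variant` as a stand-in for Place (an assumption, not PaddlePaddle's actual type):

```cpp
#include <variant>

struct CPUPlace {};
struct GPUPlace { int device = 0; };
struct FPGAPlace { int device = 0; };

using Place = std::variant<CPUPlace, GPUPlace, FPGAPlace>;

// Two places are of the same class iff they hold the same alternative.
// Unlike is_gpu_place(p1) == is_gpu_place(p2), this does not treat an
// FPGAPlace as if it were a CPUPlace.
bool places_are_same_class(const Place &p1, const Place &p2) {
  return p1.index() == p2.index();
}
```

With `boost::variant`, the equivalent comparison would use `which()` instead of `index()`.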

### Memory
The refactored PaddlePaddle implements a buddy memory allocator class for memory management, which currently has only CPU and GPU versions. We will add an FPGAAllocator to manage FPGA memory.
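A minimal sketch of what an FPGAAllocator might look like. The `SystemAllocator` interface below is a simplified, hypothetical one (the real CPU and GPU allocators beneath the buddy allocator have their own signatures), and `std::malloc`/`std::free` are placeholders for the Polaris device-memory calls, whose API this document does not specify.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical, simplified interface for the device-level allocators that
// a buddy allocator would draw large chunks from.
class SystemAllocator {
 public:
  virtual ~SystemAllocator() = default;
  virtual void *Alloc(std::size_t size) = 0;
  virtual void Free(void *p, std::size_t size) = 0;
};

class FPGAAllocator : public SystemAllocator {
 public:
  explicit FPGAAllocator(int device_id) : device_id_(device_id) {}

  void *Alloc(std::size_t size) override {
    if (size == 0) return nullptr;
    // Placeholder: a real version would allocate memory on FPGA card
    // device_id_ through the Polaris library instead of host malloc.
    void *p = std::malloc(size);
    if (p != nullptr) used_ += size;
    return p;
  }

  void Free(void *p, std::size_t size) override {
    if (p == nullptr) return;
    std::free(p);  // placeholder for the matching Polaris free call
    used_ -= size;
  }

  int device_id() const { return device_id_; }
  std::size_t Used() const { return used_; }  // bytes currently handed out

 private:
  int device_id_;
  std::size_t used_ = 0;
};
```

The buddy allocator itself would stay device-agnostic; only the allocator it is constructed with changes per device.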

### DeviceContext
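The refactored PaddlePaddle implements a DeviceContext class that manages the resources of the corresponding device; it currently has only CPU and GPU versions. We will add an FPGADeviceContext to manage FPGA device resources.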
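A minimal sketch of an FPGADeviceContext, assuming a simplified DeviceContext base with a virtual `Wait()`. Resource acquisition is only indicated in comments, because the Polaris API for opening a card is not specified here; all names are illustrative.

```cpp
// Simplified stand-in for PaddlePaddle's DeviceContext base class.
class DeviceContext {
 public:
  virtual ~DeviceContext() = default;
  // Blocks until all work queued on the device has finished.
  virtual void Wait() const {}
};

// Hypothetical minimal place type carrying the FPGA card index.
struct FPGAPlace {
  explicit FPGAPlace(int d = 0) : device(d) {}
  int device;
};

class FPGADeviceContext : public DeviceContext {
 public:
  explicit FPGADeviceContext(FPGAPlace place) : place_(place) {
    // A real implementation would open FPGA card place_.device and create
    // its Polaris handles here.
  }
  ~FPGADeviceContext() override {
    // ...and release those Polaris resources here.
  }
  // Device resources are owned uniquely, so copying is disabled.
  FPGADeviceContext(const FPGADeviceContext &) = delete;
  FPGADeviceContext &operator=(const FPGADeviceContext &) = delete;

  FPGAPlace GetPlace() const { return place_; }

  void Wait() const override {
    // Would synchronize with the FPGA card; nothing is queued in this sketch.
  }

 private:
  FPGAPlace place_;
};
```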

### Operator
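The refactored PaddlePaddle is operator-based: different devices (CPU, GPU) share a single Op definition, and an OpKernel provides the Compute method \[[3](#references)\]. We will add an FPGAKernel to the existing `XXX_op.cc` files to implement the FPGA compute logic. For operations the FPGA does not support, we add a check to the Compute function: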
```C++
// In the Compute of an operator that has no FPGA implementation:
PADDLE_ENFORCE(!platform::is_fpga_place(ctx.GetPlace()),
               "This operator does not support FPGAPlace.");
```

> **Contributor:** Could the FPGA ops be written as `FPGAXXX_op.cc`, i.e. kept out of the original `XXX_op.cc`? The main considerations:
>
> - If they are in the same file, they are hard to separate when compiling pure-CPU code for embedded devices.
> - It makes the FPGA code easier to maintain. TensorFlow's MKL-related ops are also written separately, and the `MKLDNNXXX_op` we add later will likewise be written as a separate op.

> **Contributor Author (@QingshuChen, Sep 12, 2017):**
>
> > Could the FPGA ops be written as `FPGAXXX_op.cc`?
>
> That is no problem, but if many ops are supported there will be quite a lot of `FPGAXXX_op.cc` files. Is that an issue?

> **Contributor:** No problem; it is clearer this way.

> **Reply:** @luotao1 Taking lookup_table as an example, the current GPU registration code is
> `REGISTER_OP_GPU_KERNEL(lookup_table, ops::LookupTableCUDAKernel);`
> Should we implement it like this:
> `REGISTER_OP_FPGA_KERNEL(lookup_table, ops::LookupTableFPGAKernel);`
> or like this:
> `REGISTER_OP_FPGA_KERNEL(fpga_lookup_table, ops::LookupTableFPGAKernel);`
> That is, is it only the code that is separated, or is the op itself separate?

> **Contributor Author:** A suggestion: would it be better to add `XXX_op_fpga.cc` for the FPGA logic instead of `FPGAXXX_op.cc`? The main consideration is that the CPU, GPU, and FPGA implementations of the same op would then sit next to each other in the directory listing.

> **Member:** @QingshuChen That would indeed be clearer.

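> **Contributor (@luotao1, Sep 12, 2017):** @shijiaxin
>
> > is it only the code that is separated, or is the op itself separate?
>
> Since I am not very familiar with FPGAs, I will use the current MKLDNN integration work as an example to explain my view:
>
> - `MKLDNNXXX_layer` is written differently from Paddle's original layers, because the MKLDNN library stores data differently from Paddle's NCHW layout, so an `MKLDNNMatrix` class had to be defined first to manage all the data. As a result, the MKLDNN library cannot be called directly from the original layer the way the MKL BLAS library can.
> - Is the FPGA data layout the same as the refactored Tensor layout, or easy to convert?
>   - If `lookup_table_op.cc` already fully meets the FPGA's needs, i.e. you only need to add a `lookup_table_fpga_op.cc` that implements `LookupTableFPGAKernel`, then only the code needs to be separated, not the op.
>   - Conversely, if the FPGA has its own data layout and would add a lot of logic to `lookup_table_op.cc`, then the op needs to be separated.
>
> @QingshuChen `XXX_op_fpga.cc` and `XXX_op_fpga.h` are indeed a bit better.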

> **Member (@QiJune, Sep 12, 2017):** @shijiaxin For now, registering it as a kernel on the existing op seems fine.
> Note, however, that in TensorFlow's design the MKL part is both a separate kernel and a separate op. This needs to be confirmed in case MKL has some pitfall (possibly because MKL has its own data format that must be converted manually when used). For GPU and FPGA, running across devices should require explicitly adding a copy operator to copy the data. See #4031.

> **Reply:** @luotao1 @QiJune The operations we support use the same data format as the standard CPU and GPU format, so the MKLDNN problem does not arise.

> **Contributor (@luotao1, Sep 12, 2017):**
>
> > use the same data format as the standard CPU and GPU format
>
> Please emphasize this in the doc; an official link describing the format would be best. In that case you can add `XXX_op_fpga.cc` and implement `XXXFPGAKernel` in it: the code is separated, but the op is not.


**Note**: Because FPGAs are less flexible than CPUs and GPUs, the FPGA can support most operators but not all of them.
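The review discussion above converges on registering an FPGA kernel (for example `LookupTableFPGAKernel` in an `XXX_op_fpga.cc` file) on the existing op rather than defining a separate op. The toy registry below illustrates that model: one op type, one kernel per device, and a clear error when the FPGA has no kernel for an operator. It is hypothetical and far simpler than the `REGISTER_OP_GPU_KERNEL`-style macros; `RegisterKernel`, `RunOp`, and `DeviceKind` are illustrative names.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class DeviceKind { kCPU, kGPU, kFPGA };

// A kernel here just returns a label; real kernels would take an
// execution context and compute on tensors.
using Kernel = std::function<std::string()>;

// Hypothetical registry keyed by (op type, device kind).
std::map<std::pair<std::string, DeviceKind>, Kernel> &Registry() {
  static std::map<std::pair<std::string, DeviceKind>, Kernel> registry;
  return registry;
}

void RegisterKernel(const std::string &op, DeviceKind kind, Kernel k) {
  Registry()[{op, kind}] = std::move(k);
}

// Runs the kernel registered for `op` on `kind`, failing loudly when that
// device has no kernel (the FPGA does not implement every operator).
std::string RunOp(const std::string &op, DeviceKind kind) {
  auto it = Registry().find({op, kind});
  if (it == Registry().end()) {
    throw std::runtime_error("operator " + op + " has no kernel for this place");
  }
  return it->second();
}
```

Both the CPU and the FPGA kernel are registered under the same op type `lookup_table`, which corresponds to "separate code, same op" in the discussion.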

### Net
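A Net contains a sequence of operators, and currently all operators in a Net must run on the same device. Because FPGAs are less flexible, some operators may not support the FPGA, so we need a mechanism similar to Parallel-nn in which some operators run on the FPGA and others run on the CPU or GPU.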

> **Member:** There is a discussion about multi-device execution in #4031. It is still at the design stage, and you are welcome to join it.
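A minimal sketch of the mixed placement described above, assuming a simple per-op rule (FPGA if an FPGA kernel exists, otherwise CPU) and the explicit copy operator suggested in the review. `PlaceOps` and the `"copy"` op name are illustrative; the actual multi-device design is still under discussion in #4031.

```cpp
#include <set>
#include <string>
#include <vector>

enum class Device { kCPU, kFPGA };

struct PlacedOp {
  std::string type;
  Device device;
};

// Hypothetical placement pass: run each op on the FPGA when it has an FPGA
// kernel, otherwise fall back to the CPU, and insert an explicit "copy" op
// wherever consecutive ops run on different devices.
std::vector<PlacedOp> PlaceOps(const std::vector<std::string> &ops,
                               const std::set<std::string> &fpga_supported) {
  std::vector<PlacedOp> placed;
  for (const auto &op : ops) {
    Device d = fpga_supported.count(op) ? Device::kFPGA : Device::kCPU;
    if (!placed.empty() && placed.back().device != d) {
      placed.push_back({"copy", d});  // moves data onto the next op's device
    }
    placed.push_back({op, d});
  }
  return placed;
}
```

For example, with `conv`, `relu`, and `fc` supported on the FPGA, the sequence `conv, relu, lookup_table, fc` becomes `conv, relu, copy, lookup_table, copy, fc`, with `lookup_table` falling back to the CPU.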


### UnitTest
Unit tests for FPGA-related code will be added to the modules that are modified. For example, once FPGAPlace is added, FPGA unit tests need to be added to `place_test.cc`.

> **Member:** Unit tests may be a problem, because CI probably does not support FPGA devices yet.

> **Contributor Author:** We will provide an FPGA cloud server on which to run the CI tests. Would that work?

> **Contributor:** That works. But is one machine enough?
>
> - TeamCity currently has three machines, and PRs still queue up when there are many of them.
> - Presumably every PR will have to pass the FPGA CI tests from now on.

> **Member:** Great! That works; TeamCity will probably need to be set up on it.


### Python API
Only the v2 API is considered for now. To use the FPGA, set the Place to an FPGA place; all other configuration stays the same.
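```python
import paddle.v2.framework.core as core

# FPGAPlace is the binding proposed by this design; it is not in core yet.
place = core.FPGAPlace()
```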

## References
1. [Baidu FPGA Cloud Server](https://cloud.baidu.com/product/fpga.html)
2. [Baidu Polaris Project](http://fpga.baidu.com/)
3. [How to Write a New Operator (in Chinese)](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_cn.md#%E5%AE%9A%E4%B9%89OpKernel%E7%B1%BB)
Binary file added doc/design/baidu_fpga/image/overview.png