[Hardware][OpenCL] Intelfocl support #9

zhanghaohit · 2020-06-18T10:59:18Z

This PR relates to this RFC.

Main changes are:

- OpenCL driver
- intelfocl implementation for VTA
- add ALU MUL and Load INT_8

This is a basic version intended as a POC, without much performance optimization. We are still optimizing some parts and will submit those changes in a later PR.

- add mul, load_int8 - some bugfix for bits width

Rename VTA_MEM_ID_ACC_8 to VTA_MEM_ID_ACC_8BIT

config/intelfocl_sample.json

tmoreau89 · 2020-06-19T04:24:01Z

CC-ing @vegaluisjose @huajsj @pasqoc

tmoreau89 · 2020-06-20T01:16:07Z

config/vta_cost.py

+    factor = 1000000.0
+    def alu_imm_cost(iter_out, iter_in, uop_bgn, uop_end):
+        x = (uop_end - uop_bgn) * iter_out * iter_in
+        cycles = x + 46


How are these costs generally derived? Are they FPGA device specific? If we change the compilation parameters in the OpenCL compiler, could it affect these cost models?

Yes. Those costs are FPGA and configuration specific, and we need to profile/measure the values from the actual hardware.

I believe the values are provided here as an example.

Actually, I think it would be good to define constants here, and add a comment stating that these values are set arbitrarily (e.g. 46).

I'm working with a colleague on making it easier to perform cycle-accurate co-simulation of VTA modules, which should help us derive these costs more accurately on a per-VTA-variant basis. In the meantime, can we use named constants, e.g. ALU_LATENCY_CYCLES = 46?
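
The named-constant suggestion could look roughly like the sketch below. The name ALU_LATENCY_CYCLES comes from the comment above, and 46 is the placeholder value from the diff, not a measured figure. Returning the raw cycle count is an assumption, since the excerpt does not show how `factor` or the return value are used.

```python
# Fixed per-instruction overhead in cycles. The value 46 is the placeholder
# from the diff excerpt; real values must be profiled on the target FPGA
# and VTA configuration.
ALU_LATENCY_CYCLES = 46


def alu_imm_cost(iter_out, iter_in, uop_bgn, uop_end):
    """Estimate the cycle count of an ALU immediate instruction.

    Assumption: the function returns the raw cycle estimate. The diff
    excerpt does not show the original return statement.
    """
    uop_count = uop_end - uop_bgn
    return uop_count * iter_out * iter_in + ALU_LATENCY_CYCLES


if __name__ == "__main__":
    # Example: 2 micro-ops, 4 outer iterations, 8 inner iterations.
    print(alu_imm_cost(iter_out=4, iter_in=8, uop_bgn=0, uop_end=2))
```

Keeping the overhead in one named constant means a per-device profile only has to override a single value rather than edit the formula.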

include/vta/hw_spec.h

tmoreau89 · 2020-06-20T01:32:48Z

Thanks @zhanghaohit for the PR! I left a couple questions, but will need to review in more depth. One question: have you tested it on the Pynq board to see if some changes may affect the existing design? Given that the ISA has changed slightly (some fields are now wider, like VTA_MEMOP_ID_BIT_WIDTH), it may affect some parameterizations of VTA.

zhanghaohit · 2020-06-22T06:25:49Z

Thanks @tmoreau89 for the comments. Yes, we have tested on the Ultra96 board (the only board I have on hand), and it works. The only change needed is to recompile the bitstream with the new ISA. I think we have to update the pre-compiled bitstreams at https://github.com/uwsampl/vta-distro/tree/master/bitstreams, right?

src/intelfocl/AOCLUtils/opencl.h

tmoreau89 · 2020-06-23T21:24:05Z

That is correct, we'll need to bump the bitstream version.

zhanghaohit · 2020-07-17T02:17:20Z

@tmoreau89 @liangfu any other comments? Thanks.

liangfu · 2020-07-17T04:15:30Z

src/intelfocl/intelfocl_device.cc

+
+int IntelFOCLDevice::init(size_t mem_size, std::string aocx_file)
+{
+    cl_int status;


Please ensure the source code passes the cpplint rules. Specifically, we use 2 spaces for indentation.

Sure. Thanks. We will update that.

liangfu · 2020-07-17T05:31:51Z

config/intelfocl_sample.json

@@ -0,0 +1,13 @@
+{
+  "TARGET" : "intelfocl",
+  "HW_VER" : "0.0.1",


We might need to update this version number, since the HW ISA has changed?

Agreed. Let's bump it up to 0.0.2

Please apply the 0.0.2 bump, thanks.
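
For reference, the bump agreed on here is a one-field change in the JSON config. The following is a minimal sketch that uses only the two fields visible in the diff excerpt; the helpers `bump_patch` and `bump_hw_ver` are hypothetical and not part of the VTA tooling.

```python
import json


def bump_patch(version):
    """Increment the last component of a 'major.minor.patch' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def bump_hw_ver(config_text):
    """Parse a JSON config string and return it as a dict with HW_VER bumped."""
    config = json.loads(config_text)
    config["HW_VER"] = bump_patch(config["HW_VER"])
    return config


if __name__ == "__main__":
    # Only the two fields shown in the diff excerpt are included here.
    sample = '{"TARGET": "intelfocl", "HW_VER": "0.0.1"}'
    print(json.dumps(bump_hw_ver(sample), indent=2))
```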

liangfu · 2020-07-17T05:39:55Z

hardware/intelfocl/src/vta.cl

@@ -0,0 +1,341 @@
+#pragma OPENCL EXTENSION cl_intel_channels: enable


Just curious: as I recall, this PR is also meant to work on the Ultra96. Does this extension work with the Xilinx toolchain when compiling the hardware design? I might have misunderstood something.

The code that works for the Ultra96 is still inside hardware/xilinx.

config/intelfocl_sample.json

src/intelfocl/intelfocl_device.cc

tmoreau89 · 2020-07-17T22:31:30Z

Thanks for the changes. Please apply the 0.0.2 and rename the vta target to something more specific, e.g. "arria10". Also there are some CI errors related to linting that could be addressed. Thanks!

remotego · 2020-07-18T02:44:16Z

Thank you very much! Sure, we will apply the 0.0.2 bump and address the linting errors.
However, I believe "arria10" is too restrictive here. The code should work for all devices supported by Intel OpenCL for FPGA, namely Intel Arria 10, Stratix V/10 and Cyclone V/10. So far we have tested it on both Arria 10 and Stratix 10 boards, and it worked on both.

zhanghaohit · 2020-07-18T16:00:02Z

Thanks. The linting error here is due to filetype checking. I think we have to add the opencl filetype to the lint script. I've created a PR here: apache/tvm#6092

liangfu · 2020-07-19T09:51:57Z

The code should work for all devices supported by Intel OpenCL for FPGA, namely Intel Arria 10, Stratix V/10 and Cyclone V/10.

Just my humble opinion: given that both "Intel OpenCL for FPGA" and "VTA" require a large amount of logic, many Cyclone V chips that support AOCL could not compile this design because of resource utilization limits.
As I understand @tmoreau89's comment, which I agree with, setting a proper target device in vta_config is what makes VTA, the Versatile Tensor Accelerator, "versatile". Specifically, we can take advantage of the properties in vta_config to define accelerators that scale from minimal designs up to data-center-scale ones.

As a side note, part of the reason we put effort into building open source projects is the hope that someone in the community can reproduce what we have done and easily start building something even better. With the target defined too broadly, a potential grad student could fail to reproduce the result: not every student can easily purchase a Stratix 10 board, and low-cost Cyclone V boards could not run this. In addition, even those who can afford a Stratix 10 board would be wasting a large amount of valuable logic resources. Therefore, we should specify a precise target device for the vta_config.

remotego · 2020-07-19T10:30:21Z

Thank you for your reply. However, could you explain further why the design would not work on Cyclone V FPGAs that support Intel OpenCL for FPGA?

As you mentioned, the VTA design is versatile: users can always change the settings (i.e. LOG_BLOCK and LOG_*_BUF_SIZE) to adjust resource usage to fit their own FPGA boards. A low-cost Cyclone V device certainly could not support 64x64 GEMV cores as large Stratix 10 FPGAs do, but users could try 16x16 or even 4x4 GEMV cores by lowering LOG_BLOCK.

With that in mind, we chose relatively small defaults for LOG_BLOCK (4) and the buffers (15, 15, 18, 17), so the design should fit into FPGAs comparable to the original Zynq/ZedBoard platforms.
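
To make the sizing argument concrete, here is a small sketch of how the log-scale settings translate into actual sizes. It assumes the GEMV core is 2^LOG_BLOCK by 2^LOG_BLOCK, which matches the pairing of 16x16 with LOG_BLOCK = 4 in this comment, and that each buffer setting is the base-2 logarithm of a size in bytes. The helper names are hypothetical, and the four buffer values are kept in the order given above without attributing them to specific buffers.

```python
def gemv_dim(log_block):
    """Side length of the GEMV core, assuming it is 2**LOG_BLOCK wide."""
    return 1 << log_block


def buffer_bytes(log_size):
    """Buffer size, assuming the setting is log2 of the size in bytes."""
    return 1 << log_size


# Defaults quoted in the comment above.
DEFAULT_LOG_BLOCK = 4
DEFAULT_LOG_BUFFER_SIZES = (15, 15, 18, 17)


if __name__ == "__main__":
    # The three core sizes mentioned in the discussion: 4x4, 16x16, 64x64.
    for log_block in (2, DEFAULT_LOG_BLOCK, 6):
        dim = gemv_dim(log_block)
        print(f"LOG_BLOCK={log_block}: {dim}x{dim} GEMV core, {dim * dim} multipliers")
    for log_size in DEFAULT_LOG_BUFFER_SIZES:
        print(f"log2 size {log_size}: {buffer_bytes(log_size) // 1024} KiB")
```

Under these assumptions, each step down in LOG_BLOCK cuts the multiplier count by a factor of four, which is the lever a smaller device would rely on.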

We must admit that we don't have any Cyclone V boards on hand, and the design has not been tested on those platforms. However, if there is any issue compiling the design for an AOCL-compatible Cyclone V board, we will be more than happy to investigate and work on a solution together.

In terms of accessibility, we know that high-end FPGA cards are very expensive. The good news is that many cloud service providers now offer high-end FPGA instances, which generally cost only a few bucks per hour of usage.

In addition, we are also working on porting the design over to Amazon EC2 F1 instances (Xilinx SDAccel). We will update again when we finish testing on the Amazon platforms.

remotego · 2020-09-27T14:35:04Z

Hi @tmoreau89,

We have completed the changes listed, and we have also included a README.rst file as a preliminary installation guide.

Please let us know if more changes are required.

Thank you very much!

tmoreau89

@remotego @zhanghaohit thank you for making the requested changes and adding the README.rst, it reads very well.

@liangfu please approve/request changes

liangfu

LGTM

liangfu · 2020-10-12T06:29:30Z

The CI test failure on tsim seems to be unrelated. @zhanghaohit, do you mind retriggering CI to see if it passes? (It was successful previously.)

zhanghaohit · 2020-10-26T16:04:34Z

@liangfu Thanks for the comments and sorry for my late reply. I tried to trigger the ci again, but it failed at the same place.

I tried to run python3 -m pytest -v ${TVM_PATH}/vta/tests/python/unittest locally. It passed. Any ideas?

tmoreau89 · 2020-12-04T18:50:33Z

@zhanghaohit sorry for the late reply. I think the CI issue is due to the changes in the ISA: we are essentially relying on tests that assume the older ISA, which breaks the unit tests here. The ISA rarely changes, so we didn't set up the tests to account for ISA changes.

tmoreau89 · 2020-12-04T19:00:54Z

@zhanghaohit what I'd like to suggest is that we temporarily disable the failing unit tests and re-enable them once we've updated TVM, since there's a bit of a circular dependency in making the tests work.

Please comment out the test in tests/scripts/docker_bash.sh here ./tests/scripts/task_python_vta_tsim.sh. I'll take care of enabling the TSIM test once we have that merged into TVM.

zhanghaohit · 2020-12-09T04:54:52Z

Thanks @tmoreau89 for the suggestion.

I cannot find the test ./tests/scripts/task_python_vta_tsim.sh in tests/scripts/docker_bash.sh, so I added a condition check to skip this test. I also tried commenting out the test in the Jenkinsfile, but that did not seem to work.

The CI has now passed. Could you help check? And thanks for offering to re-enable the tsim test after everything is merged.

tmoreau89 · 2020-12-11T02:13:32Z

Thank you @zhanghaohit, @remotego, @liangfu , the PR is merged!

zhanghaohit and others added 8 commits June 18, 2020 12:09

- static auto-tune sample config

3806c7d

- add mul, load_int8 - some bugfix for bits width

Extract hw_spec_const.h out of hw_spec.h

0931ec2

Rename VTA_MEM_ID_ACC_8 to VTA_MEM_ID_ACC_8BIT

Add OpenCL kernel sources for Intel OpenCL for FPGA devices

a521ac0

Add driver sources to support Intel OpenCL for FPGA devices

4d50027

intelfocl sample configuration for VTA added

00be17d

Workaround for Signedness bug in Intel OpenCL for FPGA compiler

e5a2151

remove some comments

75ed231

rename cpp to cc

ed466d7

zhanghaohit changed the title ~~Intelfocl support~~ [Hardware][OpenCL] Intelfocl support Jun 18, 2020

zhanghaohit mentioned this pull request Jun 18, 2020

[VTA][OpenCL] Cloud FPGA support apache/tvm#5842

Closed

liangfu reviewed Jun 19, 2020

config/intelfocl_sample.json

tmoreau89 reviewed Jun 20, 2020

include/vta/hw_spec.h (outdated)

change UOP src_idx size to max(inp, acc)

9498e6e

liangfu reviewed Jun 23, 2020

src/intelfocl/AOCLUtils/opencl.h (outdated)

Move AOCLUtils into 3rdpary directory on TVM

98860a2

liangfu reviewed Jul 17, 2020

zhanghaohit and others added 3 commits July 19, 2020 22:52

bump the intelfocl HW_VER to 0.0.2

bbf8b9b

Bump all the HW_VER to 0.0.2 as there is a ISA change

fb7d4cc

Address cpplint issues

28e9340

Li Jiashu added 3 commits September 22, 2020 17:06

Add comments for OCLFPGADevice member functions

31eda26

2-space indentation for .cl files

9fecdcb

Add README to hardware/intelfocl

3113b52

Update README.rst

0d967d5

remotego force-pushed the feature/opencl branch 2 times, most recently from 71d55c5 to 0d967d5 Compare September 28, 2020 05:54

Update README.rst

a096abe

tqchen changed the base branch from master to main October 11, 2020 17:42

zhanghaohit requested review from tmoreau89 and liangfu October 12, 2020 03:43

tmoreau89 approved these changes Oct 12, 2020

liangfu approved these changes Oct 12, 2020

update to trigger ci

bed595c

zhanghaohit added 2 commits December 7, 2020 10:27

disable tsim test: quick fix for test fails due to ISA changes

60f3bc4

TESTING

ad9719c

zhanghaohit changed the title ~~[Hardware][OpenCL] Intelfocl support~~ WIP: [Hardware][OpenCL] Intelfocl support Dec 9, 2020

zhanghaohit added 2 commits December 9, 2020 11:41

disable tsim test in docker_bash.sh

c8dd61b

cleanup code

48fb34e

zhanghaohit changed the title ~~WIP: [Hardware][OpenCL] Intelfocl support~~ [Hardware][OpenCL] Intelfocl support Dec 9, 2020

tmoreau89 merged commit 5bd9c6a into apache:main Dec 11, 2020

tmoreau89 mentioned this pull request Dec 11, 2020

[VTA][OpenCL] intelfocl apache/tvm#6126

Merged

remotego deleted the feature/opencl branch December 15, 2020 09:16

jinhongyii pushed a commit that referenced this pull request Sep 5, 2023

Update README.md (#9)

6e3d46d
