Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hardware][OpenCL] Intelfocl support #9

Merged
merged 39 commits into from
Dec 11, 2020

Conversation

zhanghaohit
Copy link
Contributor

This PR relates to this RFC.

Main changes are:

  • OpenCL driver
  • intelfocl implementation for VTA
  • add ALU MUL and Load INT_8

This is a basic version as a POC, without much performance optimization. We're still optimizing some parts, and will be submitted as a new PR later.

@zhanghaohit zhanghaohit changed the title Intelfocl support [Hardware][OpenCL] Intelfocl support Jun 18, 2020
@tmoreau89
Copy link
Contributor

CC-ing @vegaluisjose @huajsj @pasqoc

factor = 1000000.0
def alu_imm_cost(iter_out, iter_in, uop_bgn, uop_end):
x = (uop_end - uop_bgn) * iter_out * iter_in
cycles = x + 46
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these costs generally derived? Are they FPGA device specific? If we change the compilation parameters in the OpenCL compiler, could it affect these cost models?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Those costs are FPGA and configuration specific, and we need to profile/measure the values from the actual hardware.

I believe the values are provided here as an example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think it would be good to define constants here, and add a comment that states that these are set to arbitrarily (e.g. 46).

I'm working with a colleague on making easier to perform some of these cycle accurate co-simulation of VTA modules which should help us derive these costs more accurately on a per-VTA variant basis. In the meantime, can we use constant names, e.g. ALU_LATENCY_CYCLES = 46

include/vta/hw_spec.h Outdated Show resolved Hide resolved
@tmoreau89
Copy link
Contributor

Thanks @zhanghaohit for the PR! I left a couple questions, but will need to review in more depth. One question: have you tested it on the Pynq board to see if some changes may affect the existing design? Given that the ISA has changed slightly (some fields are now wider, like VTA_MEMOP_ID_BIT_WIDTH), it may affect some parameterizations of VTA.

@zhanghaohit
Copy link
Contributor Author

Thanks @zhanghaohit for the PR! I left a couple questions, but will need to review in more depth. One question: have you tested it on the Pynq board to see if some changes may affect the existing design? Given that the ISA has changed slightly (some fields are now wider, like VTA_MEMOP_ID_BIT_WIDTH), it may affect some parameterizations of VTA.

Thanks @tmoreau89 for the comments. Yes. We have tested on the ultra96 board (I only have ultra96 board on hand). It works. The only thing we have to change is to re-compile the bitstream with the new ISA. I think we have to update the pre-compiled bitstream here https://github.com/uwsampl/vta-distro/tree/master/bitstreams, right?

@tmoreau89
Copy link
Contributor

That is correct, we'll need to bump the versioning of the bitstream

@zhanghaohit
Copy link
Contributor Author

@tmoreau89 @liangfu any other comments? Thanks.


int IntelFOCLDevice::init(size_t mem_size, std::string aocx_file)
{
cl_int status;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure the source code pass cpplint rules. Specifically, we use 2 spaces for indentation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Thanks. We will update that.

@@ -0,0 +1,13 @@
{
"TARGET" : "intelfocl",
"HW_VER" : "0.0.1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to update this version number, since the HW ISA has changed?

Copy link

@remotego remotego Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's bump it up to 0.0.2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please apply the 0.0.2 bump thanks

@@ -0,0 +1,341 @@
#pragma OPENCL EXTENSION cl_intel_channels: enable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, as I remember this PR is going to work for ultra96, does this extension work for Xilinx toolchain to compile the hardware design? I might misunderstood somewhere.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code that work for ultra96 is still inside hardware/xilinx

config/intelfocl_sample.json Show resolved Hide resolved
src/intelfocl/intelfocl_device.cc Outdated Show resolved Hide resolved
@tmoreau89
Copy link
Contributor

Thanks for the changes. Please apply the 0.0.2 and rename the vta target to something more specific, e.g. "arria10". Also there are some CI errors related to linting that could be addressed. Thanks!

@remotego
Copy link

remotego commented Jul 18, 2020

Thanks for the changes. Please apply the 0.0.2 and rename the vta target to something more specific, e.g. "arria10". Also there are some CI errors related to linting that could be addressed. Thanks!

Thank you very much! Sure. We will apply the 0.0.2 and address the linting errors.
However. I believe "arria10" is too restrictive here. The code should work for all devices supported by Intel OpenCL for FPGA, namely Intel Arria 10, Stratix V/10 and Cyclone V/10. So far we have tested it on both Arria 10 and Stratix 10 boards, and it worked.

@zhanghaohit
Copy link
Contributor Author

Thanks for the changes. Please apply the 0.0.2 and rename the vta target to something more specific, e.g. "arria10". Also there are some CI errors related to linting that could be addressed. Thanks!

Thanks. The linting error here is due to filetype checking. I think we have to add the opencl filetype to the lint script. I've created a PR here: apache/tvm#6092

@liangfu
Copy link
Member

liangfu commented Jul 19, 2020

The code should work for all devices supported by Intel OpenCL for FPGA, namely Intel Arria 10, Stratix V/10 and Cyclone V/10.

Just my humble opinion, given that both "Intel OpenCL for FPGA" and "VTA" requires a large amount of logic utilization, many Cyclone V chips that supports AOCL couldn't get this compiled because of the hardware utilization issue.
In my understanding of @tmoreau89 's comment, which I would agree, setting a proper target device in vta_config gives the property of being "versatile" in VTA - the Versatile Tensor Accelerator. Specifically, we could take the advantage of the properties in vta_config to define an accelerator that could scale from minimal ones towards data center scale accelerators.

As a side note, the reason we are taking these efforts in building open source projects, in part, we are hoping someone in the community could reproduce what we have done, and could easily start to build something that is even better. With the target being defined too board, an potential grad student could fail to reproduce the result, since not all the student could easily purchase a board with Stratix 10, and low cost Cyclone V boards couldn't get this running. In addition, they're wasting large amount of valuable logic resources, even they could afford a board with Stratix 10. Therefore, we should specify a precise target device for the vta_config.

@remotego
Copy link

remotego commented Jul 19, 2020

The code should work for all devices supported by Intel OpenCL for FPGA, namely Intel Arria 10, Stratix V/10 and Cyclone V/10.

Just my humble opinion, given that both "Intel OpenCL for FPGA" and "VTA" requires a large amount of logic utilization, many Cyclone V chips that supports AOCL couldn't get this compiled because of the hardware utilization issue.
In my understanding of @tmoreau89 's comment, which I would agree, setting a proper target device in vta_config gives the property of being "versatile" in VTA - the Versatile Tensor Accelerator. Specifically, we could take the advantage of the properties in vta_config to define an accelerator that could scale from minimal ones towards data center scale accelerators.

As a side note, the reason we are taking these efforts in building open source projects, in part, we are hoping someone in the community could reproduce what we have done, and could easily start to build something that is even better. With the target being defined too board, an potential grad student could fail to reproduce the result, since not all the student could easily purchase a board with Stratix 10, and low cost Cyclone V boards couldn't get this running. In addition, they're wasting large amount of valuable logic resources, even they could afford a board with Stratix 10. Therefore, we should specify a precise target device for the vta_config.

Thank you for your reply. However, could you explain more on the reason why the design shall not work on Cyclone V FPGAs that supports Intel OpenCL for FPGA?

Precisely as you mentioned, the VTA design is versatile, the user could always change the settings (i.e. LOG_BLOCK and LOG_*_BUF_SIZE) to adjust the resource usage in order to fit their own FPGA boards. Surely a low cost Cyclone V device could not support 64x64 GEMV cores like large Stratix 10 FPGAs do. But the user could always try to use 16x16 or even 4x4 GEMV cores, by setting the LOG_BLOCK lower.

Considering that, we used a relatively small default setting for LOG_BLOCK (4) and Buffers(15, 15, 18, 17). Thus the design should be able to fit into FPGAs comparable to the original Zynq/Zedboard platforms.

We must admit that we don't have those Cyclone V boards on hand, nor has the design been tested on those platforms. However, if there is any issue on compiling the design for a AOCL-compatibale cyclone V board, we will be more than happy to investigate and try to solve the issue together.

In terms of accessibility, we know that high-end could FPGA cards are very expensive. The good news is that nowadays there are many Cloud Service Providers available offering high-end FPGA instances! Those FPGA instances generally only cost few bucks for hour's usage.

In addition, we are also working on porting the design over to Amazon EC2 F1 instances (Xilinx SDAccel). We will update again when we finish testing on the Amazon platforms.

@remotego
Copy link

Hi @tmoreau89,

We have completed the changes listed, and we have also included an README.rst file as a preliminary installation guide.

Please let us know if more changes are required.

Thank you very much!

@remotego remotego force-pushed the feature/opencl branch 2 times, most recently from 71d55c5 to 0d967d5 Compare September 28, 2020 05:54
@tqchen tqchen changed the base branch from master to main October 11, 2020 17:42
Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remotego @zhanghaohit thank you for making the requested changes and adding the README.rst, it reads very well.

@liangfu please approve/request changes

Copy link
Member

@liangfu liangfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM

@liangfu
Copy link
Member

liangfu commented Oct 12, 2020

ci tests failed on tsim seems to be unrelated, @zhanghaohit do you mind retrigger ci to see if it passes? (It was successful previously.)

@zhanghaohit
Copy link
Contributor Author

ci tests failed on tsim seems to be unrelated, @zhanghaohit do you mind retrigger ci to see if it passes? (It was successful previously.)

@liangfu Thanks for the comments and sorry for my late reply. I tried to trigger the ci again, but it failed at the same place.

I tried to run python3 -m pytest -v ${TVM_PATH}/vta/tests/python/unittest locally. It passed. Any ideas?

@tmoreau89
Copy link
Contributor

@zhanghaohit sorry for the late reply. I think that the CI issue is due to the changes in the ISA; we are relying on essentially tests that assume the older ISA, therefore breaking unit tests here. The ISA rarely changes, so we didn't set up the tests to account for changes in ISA.

@tmoreau89
Copy link
Contributor

@zhanghaohit what I'd like to suggest is that we temporarily disable the unit tests that are failing and re-enable them once we've update TVM since there's a bit of a circular dependence to make the tests work

Please comment out the test in tests/scripts/docker_bash.sh here ./tests/scripts/task_python_vta_tsim.sh. I'll take care of enabling the TSIM test once we have that merged into TVM.

@zhanghaohit zhanghaohit changed the title [Hardware][OpenCL] Intelfocl support WIP: [Hardware][OpenCL] Intelfocl support Dec 9, 2020
@zhanghaohit zhanghaohit changed the title WIP: [Hardware][OpenCL] Intelfocl support [Hardware][OpenCL] Intelfocl support Dec 9, 2020
@zhanghaohit
Copy link
Contributor Author

@zhanghaohit what I'd like to suggest is that we temporarily disable the unit tests that are failing and re-enable them once we've update TVM since there's a bit of a circular dependence to make the tests work

Please comment out the test in tests/scripts/docker_bash.sh here ./tests/scripts/task_python_vta_tsim.sh. I'll take care of enabling the TSIM test once we have that merged into TVM.

Thanks @tmoreau89 for the suggestion.

I cannot find the test ./tests/scripts/task_python_vta_tsim.sh in tests/scripts/docker_bash.sh. So I add a condition check to skip this test here:. I also tried to comment out the test in Jenkinsfile, but it seems not working.

Now the CI has passed. Could you help check? And thanks for the help to re-enable the tsim test after all are merged.

@tmoreau89 tmoreau89 merged commit 5bd9c6a into apache:main Dec 11, 2020
@tmoreau89
Copy link
Contributor

Thank you @zhanghaohit, @remotego, @liangfu , the PR is merged!

@remotego remotego deleted the feature/opencl branch December 15, 2020 09:16
jinhongyii pushed a commit that referenced this pull request Sep 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants