[VTA][OpenCL] Cloud FPGA support #5842
Conversation
- add load acc_int8 in simulation
- remove copy op
- add vta schedule
- add always 32-bits
Thanks @zhanghaohit. I can generate the new VTA design on Pynq to verify that it still passes on the new ISA. I will let you know if it passes. Installation documentation would be highly welcome.
@zhanghaohit good news, the end-to-end tests ran successfully on VTA hardware with your changes. Given that this PR brings modifications to the ISA spec, we need to bump the HW_VER.
In addition, were you able to put a guide together? Which Intel FPGA have you tested this on?
Thanks @tmoreau89 for the help with the tests.
Sure. What guide are you referring to? The guide on HW_VER? We're using an Intel Arria 10.
@zhanghaohit I was referring to an install guide like the one shown here: https://tvm.apache.org/docs/vta/install.html The source file is found here: https://github.com/apache/incubator-tvm/blob/master/docs/vta/install.rst
Note that I've also uploaded the Pynq bitstream given the ISA changes here: https://github.com/uwsampl/vta-distro/tree/master/bitstreams/pynq/0.0.2 It would be great @liangfu if you had the bandwidth to synthesize the DE10 bitstream as well and submit a pull request. Let me know if you are tight on time and need some help. Finally, I will add an Ultra96 variant in there as well. It would be great @zhanghaohit if you could also pre-generate the Arria10 bitstream and add it to the vta-distro repository.
Thanks @tmoreau89 for the help. As for the Arria10 bitstream: because the FPGA device we are currently using is from another vendor (not Intel), it is not easy for others to use. We're trying to adapt to the FPGA devices on AWS, which are more standard. After that, we can upload the aocx/bitstream.
@tmoreau89 @vegaluisjose @huajsj @pasqoc
Thanks for the update @zhanghaohit; please rebase upon the latest master branch and resolve the conflicts.
@zhanghaohit thanks for the follow-up. I echo @liangfu's recommendation to rebase so the PR can be in a mergeable state. In addition, if the target is a custom FPGA board, I think we can leave it as a generic target for now. Having it ported to AWS F1 would be a great addition. I understand that having setup instructions at this time would not make the most sense.
@@ -0,0 +1,555 @@
// Copyright (C) 2013-2018 Altera Corporation, San Jose, California, USA. All rights reserved.
@tqchen will this copyright work with Apache licensing rules? This will end up in 3rd party.
@@ -186,6 +189,16 @@ def __init__(self,
             timeout=10, n_parallel=None,
             number=4, repeat=3, min_repeat_ms=0, cooldown_interval=0.1,
             check_correctness=False):
static_tune = os.getenv("TVM_STATIC_TUNE_EXPERIMENTAL") |
Can you elaborate on what TVM_STATIC_TUNE_EXPERIMENTAL is being used for? I'm a little wary of inserting additional execution modes in AutoTVM dictated by environment variables. I suggest that:
(1) we either add additional flags to the AutoTVM config (see the sketch below), or
(2) we remove this experimental mode if it's not really necessary for the OpenCL variant of VTA.
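For illustration only, a minimal sketch of option (1), assuming a hypothetical `static_tune` keyword argument. The other parameter names are taken from the diff hunk above; `static_tune` itself is not part of the actual AutoTVM API:

```python
class Runner:
    """Hypothetical stand-in for the AutoTVM runner class touched in the diff."""

    def __init__(self,
                 timeout=10, n_parallel=None,
                 number=4, repeat=3, min_repeat_ms=0, cooldown_interval=0.1,
                 check_correctness=False,
                 static_tune=False):
        # Option (1): expose the experimental mode as an explicit, documented
        # constructor flag; the diff instead reads it implicitly via
        # os.getenv("TVM_STATIC_TUNE_EXPERIMENTAL").
        self.timeout = timeout
        self.n_parallel = n_parallel
        self.check_correctness = check_correctness
        self.static_tune = static_tune
```

This keeps the mode visible in the API surface (and in documentation) instead of depending on hidden process state.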
TVM_STATIC_TUNE_EXPERIMENTAL is used for static auto-tuning. As you advised, I think we can split the PR into separate PRs. The interface for static auto-tuning is not nicely done yet, so maybe we should remove this part first.
@@ -323,8 +331,10 @@ def compute_conv2d_transpose(attrs, inputs, out_dtype):
        out = topi_compute(
            inputs[0], inputs[1], strides, padding, out_dtype)
        output_padding = get_const_tuple(attrs.output_padding)
        out = topi.nn.pad(out, [0, 0, 0, 0],
                          [0, 0, output_padding[0], output_padding[1]])
        if output_padding[0] != 0 or output_padding[1] != 0:
Which operators was this breaking on previously?
This will cause errors when running DCGAN on VTA. There are two issues here (see the sketch after this list):
- if output_padding is all zeros, there is no need to pad;
- if out.shape does not have 4 dimensions, the pad call will cause problems.
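For reference, a minimal sketch of the guarded pad described above. The zero-padding guard is reconstructed from the diff hunk; the `len(out.shape)` check illustrates the second issue and is an assumption, not code from the PR (the import path also assumes a recent TVM layout):

```python
from tvm import topi

def pad_transpose_output(out, output_padding):
    """Apply conv2d_transpose output padding only when actually needed."""
    # Skip the pad entirely when output_padding is all zeros (issue 1).
    # Guard on a 4-D (NCHW) tensor, since the unconditional pad below
    # breaks when out.shape does not have exactly 4 dimensions (issue 2).
    if len(out.shape) == 4 and (output_padding[0] != 0 or output_padding[1] != 0):
        # Pad only the two spatial dimensions (H, W), at the high end.
        out = topi.nn.pad(out, [0, 0, 0, 0],
                          [0, 0, output_padding[0], output_padding[1]])
    return out
```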
After the rebase, these changes are no longer needed.
@zhanghaohit I took a second look at the changes in this PR, and I believe it would be best to break the PR down into separate, smaller PRs: (1) modifications to VTA for OpenCL support (runtime, operator support, etc.). Let me know what you think. This will make the PRs easier to review and merge.
Please consider breaking the PR down into smaller PRs.
I'm thinking that the
Thanks @tmoreau89 for the suggestion. I've split it into 3 smaller PRs here: and removed the unnecessary code that is not closely related to OpenCL support. Thanks.
This PR, coupled with this one on the tvm-vta repo, is the basic implementation of RFC #5840.
Some notes:
this is just a basic version without much performance optimization. We've done some optimization and achieved significant improvement, but that part of the code is not ready; it has to be organized and cleaned up further.
there are some experimental features, which are also included in this PR, including
This code is not well styled (it relies on some environment variables), and we have to think about how to format/implement it more nicely. But we think these features are useful, so we commit them as well. We can discuss whether we should include this code in this PR, or leave it for other PRs.