[Hardware][Verilator] Integrating and simulating hardware accelerators in TVM #6971
Conversation
Hi @vegaluisjose , this looks very interesting! I would suggest moving the RFC discussion to https://discuss.tvm.apache.org/. Meanwhile, let me double-check that I understood the design. So, the idea is: Two questions:
I think this is definitely a great idea. This is the missing part of the design that leverages BYOC to help HW designers from day one. I agree with having the hw-widgets in 3rdparty, since the repo could be a starting point for HW designers.
The implementation looks good.
Hi @giuseros , yes to all of your points (a, b, c). This is an extension of work we started exploring earlier this year. The main difference between the paper and this RFC is that we are allowing the hardware engineer to specify the level of detail of the simulation object. For example, if you have hardware code for simulating complex protocols, i.e., AXI, then you could implement those protocol transactions using the proposed device interface. On the other hand, if you don't have such models, or you want to bypass them to speed up simulation, then you can use hierarchical paths to access memories or registers in your design and control it the same way I am doing in the current demo. The benefit of this freedom is that the interoperability should be independent of any particular hardware protocol or interface, and such interoperability should happen in the most productive way possible.
In terms of the other questions: a) I haven't explored the TIR path yet. However, the benefit of doing so would be leveraging other utilities available in the framework, i.e., AutoTVM. In other words, BYOC could be considered a more shallow integration, but still useful. b) The reason why we have the
Finally, this approach does not stop at ML accelerators only; you can imagine integrating a bare-metal processor in a similar fashion. There are plenty of RISCV processors out there using Verilator for simulation, so that could happen as well. I am giving a small talk at TVM conf about this RFC.
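For illustration, a minimal sketch of what such a device interface could look like; the class and method names here are assumptions for the example, not the actual interface in this PR. The same methods can be backed either by cycle-accurate protocol transactions (e.g., AXI) or by direct hierarchical-path accesses into the simulated design.

// Hypothetical device interface sketch; names and signatures are illustrative only.
#include <cstdint>

class Device {
 public:
  virtual ~Device() = default;
  // Write a value to a register or memory location in the design. An
  // implementation can translate this into protocol transactions, or simply
  // poke the signal through a Verilator hierarchical path.
  virtual void Write(uint32_t addr, uint32_t value) = 0;
  // Read a value back from the design.
  virtual uint32_t Read(uint32_t addr) = 0;
  // Hold the design in reset for a number of clock cycles.
  virtual void Reset(int cycles) = 0;
  // Advance the simulation by a number of clock cycles.
  virtual void Run(int cycles) = 0;
};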
Hi @vegaluisjose , don't get me wrong, I am very supportive of this work; the questions were mostly to understand it better. I will attend your talk. And this LGTM!
Thanks for this interesting idea and implementation! Hopefully we can get this in. Meanwhile, we should let more people review this PR, especially since it introduces a new 3rdparty dependency (hw-widgets). I also agree with @giuseros that it would be good to have an RFC talking about how you expect users to make use of this feature, because I didn't get this point after reviewing the PR.
My confusion is about what role "hw-widgets" plays in this feature.
- According to its README, "These are examples of hardware designs for testing Verilator backend integration to TVM". Are these just examples? If so, then I don't think we really need an example to be a 3rd-party dependency. We could even bring the implementation into the TVM code base and only compile it when Verilator is enabled.
- What should I do if I have a new design in Verilator? Should I 1) compile my design to a library like you did for hw-widgets, and 2) modify the verilator_codegen and verilator_runtime to link that library? If this is the case, then this PR is more like just an example instead of a framework.
- The current implementation in this PR executes the graph op by op, but this seems uncommon for an accelerator. Is this just an example, or your intention?
"kernel", /* op_type_ */ | ||
inputs, 1 /* num_outputs_ */); | ||
SetCallNodeAttribute(node, call); | ||
return AddNode(node, GetRef<Expr>(cn)); |
This is too ad hoc, even if you only support AddNode at this moment. It would be better to check the node type and throw an error when you find other, unsupported nodes. This would also make it more flexible to add other nodes in the future.
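For illustration, a rough sketch of the kind of check being suggested, following the JSON-codegen style visible in this diff; the variable and helper names below are assumptions, not the exact ones in the PR.

// Sketch only: reject operators the Verilator backend cannot offload yet,
// instead of silently emitting a generic "kernel" node for them.
const auto* op = cn->op.as<OpNode>();
CHECK(op) << "Verilator codegen only supports operator calls";
if (op->name != "add") {
  LOG(FATAL) << "Operator not supported by the Verilator backend: " << op->name;
}
// ...then build the JSON node as in the existing code:
//   SetCallNodeAttribute(node, cn);
//   return AddNode(node, GetRef<Expr>(cn));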
Great point, I was thinking the same. The thing is that we would need to add the function call in the verilator-runtime as well, right? That leaves the freedom of marking the ops to be offloaded to the user (in Python). I was thinking of increasing this coverage in a follow-up PR; I did not want this one to be too many lines to review.
Hi @comaniac , thank you for your feedback -- answers inline
The
I don't think I referred to this as a framework. I think this should be viewed more as another backend for TVM. Similarly to Ethos-N or DNNL, you need to have a precompiled library in your system, but in this case the library is a cycle-accurate version of your hardware, as @giuseros pointed out. However, we need a separate ecosystem because, like other external codegens, Verilator requires certain things to make it work. For example, building floating-point hardware is far more complicated than integer or fixed-point hardware. Most of the backends in TVM use kernel libraries based on floating point because the hardware support is already free. But if you are building things from scratch, or if you are in a power-constrained environment, i.e., the edge, you are most likely to start with integer ops. Therefore, we need a mechanism to leverage not only full-blown, complex accelerators but also the tiniest versions of hardware, which would allow us to have a more vibrant hardware ecosystem/community compared to other frameworks. The other benefit of this is being able to build hardware "incrementally" and evaluate your design decisions with TVM over time.
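As a rough illustration of what such a precompiled library could expose, the sketch below shows plain C entry points that the Verilator runtime might link against; the symbol names and signatures are made up for this example and are not the ones used in the PR.

// Hypothetical C ABI for a precompiled, cycle-accurate hardware library.
// Names and signatures are illustrative only.
extern "C" {
  // Create and destroy a simulation instance of the design.
  void* VerilatorAlloc();
  void VerilatorDealloc(void* handle);
  // Hold the design in reset for a number of clock cycles.
  void VerilatorReset(void* handle, int cycles);
  // Element-wise integer add, driving the simulated hardware internally.
  void VerilatorAdd(void* handle, int* a, int* b, int* out, int length);
}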
This backend can mirror what we currently have for DNNL in terms of single and merged (composite) operators; there is no reason to stop here. The reason why we have a single one is that we have only one example, for this particular operator. Eventually, we are hoping for the community to join this effort and add examples that run more complicated use cases, i.e., CONV2D, etc.
@vegaluisjose thanks for the explanation. It makes much more sense to me now, as I was thinking this was a general backend for everyone. In terms of being an example, this PR looks good to me except for the 3rd-party dependency. To me, this is not the same case as VTA, because VTA is a framework whereas hw-widgets is just an example. In addition, IIUC, with the 3rd-party dependency, anyone who wants to contribute to this feature needs to first submit a PR to that repo; otherwise it seems impossible to enable more ops such as Conv2D.
btw, we may need a test case for this backend even if it's just an example.
@comaniac thank you for the feedback. The decision to add the Verilog sources as a 3rd-party submodule came from TVM industry contributors' concerns about keeping hardware sources out of the main repo for IP-protection reasons. We are simply making sure all hardware sources are cleaved off from the main TVM repo.
I agree on adding a test case - @vegaluisjose do you mind adding one to make sure this example doesn't bitrot?
@vegaluisjose thanks for the effort. I think this is a quite interesting implementation and yet another use case of BYOC. It is more like the json runtime example we had before, but in the Verilator scenario. I think we need a test case. In addition, as @comaniac mentioned, this would increase maintenance (mostly from your side though) because we need to update both repos if there is a change like adding operator support. Do you think it makes sense to contribute the kernels to TVM as well?
ahh, just saw @tmoreau89's comment on the separation.
Regarding the test, I do have one located here, which I hope we can figure out how to eventually add to CI.
Alright guys, sounds like we have a plan. I will move the Python test that I have in
Does this sound reasonable? Thank you all for your valuable input @giuseros @comaniac @liangfu @tmoreau89 @zhiics
@vegaluisjose sounds good to me. Thanks.
@vegaluisjose , it is ok for me too, although I would suggest moving the RFC to the discuss forum to get more visibility and comments.
I totally understand this concern and it makes sense to me. I completely agree with you that we should keep the hardware implementation out of the main repo. I was just concerned about the maintenance of this example. IMHO, it would also be fine to bring the hardware example in and mention in the tutorial/blog post/comments that users are encouraged to have a separate repo for this part. Anyway, I won't go against the decision if everyone else is fine with it.
Sounds like a plan. Thanks.
As @giuseros pointed out, an RFC about this work would be nice for longer-term conversations. I'm hoping this PR can be merged soon for the TVM conf demo, but we should have a longer-lasting discussion on extensions/applications of this work.
Re the hw-related component, how about we put it into the https://github.com/apache/tvm-vta repo? Given that infrastructure is already set up for testing in the VTA component, the code in that repo can also be tested against the latest version of tvm. We could make it as part of
I agree that having an RFC about the proposal would be really nice.
@tqchen if we go towards putting hardware sources under VTA, we may want to rename that repo to something like tvm-hardware, under which apps and VTA sources would sit. And hopefully we'll see other designs in there as well. That's something we can do after an RFC discussion.
Sorry but I'll need to gate this PR until we move the submodule contents into the VTA repo, and then update the submodule in this PR.
apache/tvm-vta#18 has now been merged as requested by @tqchen . Please update the submodule and paths to point to tvm-vta, and bump the submodule version.
Thank you for updating the PR @vegaluisjose
Thanks to @vegaluisjose and to @comaniac @liangfu @zhiics @giuseros who reviewed the PR; it's now merged.
Integrating and simulating hardware accelerators in TVM
This RFC proposes support for integrating hardware designs in TVM, specifically before they are manufactured in silicon or prototyped on an FPGA. Regardless of the hardware target, ML accelerators today are built using the waterfall approach described in the following figure:
Starting from an accelerator idea, hardware is built incrementally for several weeks or months until it is prototyped. Once this happens, software teams start working on integrating this new piece of hardware into a particular ML framework of choice, e.g., TVM. One challenge of this approach is that hardware-design decisions are evaluated too late, which turns out to be counterproductive. Additionally, there is a higher probability that the target ML model gets updated along the way, which makes the efficiency of this process critical.
Motivated by this reality, we believe that ML accelerators should be integrated, and more importantly evaluated, as soon as the hardware design process begins. Notably, this approach gives engineers design feedback and hardware-software continuous integration from day one, as shown in the following figure:
We achieve this integration by leveraging the fact that most ML accelerators are funneled down to a common hardware description language, i.e., Verilog, regardless of the source language. Moreover, we now have efficient open-source Verilog-to-C++ compilers, i.e., Verilator, that can be used to interoperate with Verilog via a simple C interface (CFFI) from TVM.
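As a sketch of that interoperation: Verilator compiles a Verilog module into a C++ class that can be hidden behind a plain C interface callable from TVM. The example below assumes a hypothetical combinational top module named adder, with input ports a and b and output port y, compiled with verilator --cc adder.v; it is illustrative only and not the code shipped with this RFC.

// Illustrative only: wraps a Verilator-generated model behind a C interface.
#include "Vadder.h"  // header generated by Verilator for the `adder` module

extern "C" void* adder_alloc() {
  return new Vadder();  // instantiate the simulated design
}

extern "C" int adder_run(void* handle, int a, int b) {
  Vadder* top = static_cast<Vadder*>(handle);
  top->a = a;     // drive the input ports
  top->b = b;
  top->eval();    // evaluate the combinational logic
  return top->y;  // read the result back
}

extern "C" void adder_dealloc(void* handle) {
  delete static_cast<Vadder*>(handle);
}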
Concretely, this RFC consists of a simple codegen using BYOC, together with an opaque kernel library, runtime, and device interface that can be implemented by engineers to efficiently integrate ML accelerators with TVM regardless of their design complexity. We also provide a small example (demo) that shows how to offload an add instruction from a Relay program to a simple hardware implementation (a scalar adder) written in Verilog. This example is hosted on a third-party repository called hw-widgets that we are planning to use for future hardware designs.
@tmoreau89 @liangfu @jroesch