[Hardware][Verilator] Integrating and simulating hardware accelerators in TVM #6971
Conversation
Hi @vegaluisjose , this looks very interesting! I would suggest moving the RFC discussion to https://discuss.tvm.apache.org/. Meanwhile, let me double-check that I understood the design. So, the idea is: Two questions:
I think this is definitely a great idea. This is the missing part of the design that leverages BYOC to help HW designers from day one. I agree with having the hw-widgets in 3rdparty, since the repo could be a starting point for HW designers.
The implementation looks good.
Hi @giuseros , yes to all of your points (a, b, c). This is an extension of work we started exploring earlier this year. The main difference between the paper and this RFC is that we are allowing the hardware engineer to specify the level of detail of the simulation object. For example, if you have hardware code for simulating complex protocols, i.e., AXI, then you could implement those protocol transactions using the proposed device interface. On the other hand, if you don't have such models, or you want to bypass them to speed up simulation, then you can use hierarchical paths to access memories or registers in your design and control it the same way I am doing in the current demo. The benefit of this freedom is that the interoperability should be independent of any particular hardware protocol or interface, and such interoperability should happen in the most productive way possible.
In terms of the other questions: a) I haven't explored the TIR path yet. However, the benefit of doing so would be leveraging other utilities available in the framework, i.e., AutoTVM. In other words, BYOC could be considered a more shallow integration, but still useful. b) The reason why we have the
Finally, this approach does not stop at ML accelerators only; you can imagine integrating a bare-metal processor in a similar fashion. There are plenty of RISCV processors out there using Verilator for simulation, so that could happen as well. I am giving a small talk at TVM conf about this RFC.
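For illustration, a minimal sketch of what such a device interface could look like; the class and method names here are assumptions for the example, not the actual interface in this PR. The same methods can be backed either by cycle-accurate protocol transactions (e.g., AXI) or by direct hierarchical-path accesses into the simulated design.

// Hypothetical device interface sketch; names and signatures are illustrative only.
#include <cstdint>

class Device {
 public:
  virtual ~Device() = default;
  // Write a value to a register or memory location in the design. An
  // implementation can translate this into protocol transactions, or simply
  // poke the signal through a Verilator hierarchical path.
  virtual void Write(uint32_t addr, uint32_t value) = 0;
  // Read a value back from the design.
  virtual uint32_t Read(uint32_t addr) = 0;
  // Hold the design in reset for a number of clock cycles.
  virtual void Reset(int cycles) = 0;
  // Advance the simulation by a number of clock cycles.
  virtual void Run(int cycles) = 0;
};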
Hi @vegaluisjose , don't get me wrong, I am very supportive of this work; the questions were mostly to understand it better. I will attend your talk. And this LGTM!
Thanks for this interesting idea and implementation! Hopefully we can get this in. Meanwhile, we should let more people review this PR, especially since it introduces a new 3rdparty dependency (hw-widgets). I also agree with @giuseros that it would be good to have an RFC talking about how you expect users to make use of this feature, because I didn't get this point after reviewing the PR.
My confusion is about what role "hw-widgets" plays in this feature.
- According to its README, "These are examples of hardware designs for testing Verilator backend integration to TVM". Are these just examples? If so, then I don't think we really need an example to be a 3rd-party dependency. We could even bring the implementation into the TVM code base and only compile it when Verilator is enabled.
- What should I do if I have a new design in Verilator? Should I 1) compile my design to a library like you did for hw-widgets, and 2) modify the verilator_codegen and verilator_runtime to link that library? If this is the case, then this PR is more like just an example instead of a framework.
- The current implementation in this PR executes the graph op by op, but this seems uncommon for an accelerator. Is this just an example, or your intention?
"kernel", /* op_type_ */ | ||
inputs, 1 /* num_outputs_ */); | ||
SetCallNodeAttribute(node, call); | ||
return AddNode(node, GetRef<Expr>(cn)); |
This is too ad hoc, even if you only support AddNode at this moment. It would be better to check the node type and throw an error when you find other, unsupported nodes. This would also make it more flexible to add other nodes in the future.
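For illustration, a rough sketch of the kind of check being suggested, following the JSON-codegen style visible in this diff; the variable and helper names below are assumptions, not the exact ones in the PR.

// Sketch only: reject operators the Verilator backend cannot offload yet,
// instead of silently emitting a generic "kernel" node for them.
const auto* op = cn->op.as<OpNode>();
CHECK(op) << "Verilator codegen only supports operator calls";
if (op->name != "add") {
  LOG(FATAL) << "Operator not supported by the Verilator backend: " << op->name;
}
// ...then build the JSON node as in the existing code:
//   SetCallNodeAttribute(node, cn);
//   return AddNode(node, GetRef<Expr>(cn));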
Great point, I was thinking the same. The thing is that we would need to add the function call in the verilator-runtime as well, right? That leaves the freedom of marking the ops to be offloaded to the user (in Python). I was thinking of increasing this coverage in a follow-up PR; I did not want this one to be too many lines to review.
Hi @comaniac , thank you for your feedback -- answers inline
The
I don't think I referred to this as a framework. I think this should be viewed more as another backend for TVM. Similarly to Ethos-N or DNNL, you need to have a precompiled library in your system, but in this case the library is a cycle-accurate version of your hardware, as @giuseros pointed out. However, we need a separate ecosystem because, like other external codegens, Verilator requires certain things to make it work. For example, building floating-point hardware is far more complicated than integer or fixed-point hardware. Most of the backends in TVM use kernel libraries based on floating point because the hardware support is already free. But if you are building things from scratch, or if you are in a power-constrained environment, i.e., the edge, you are most likely to start with integer ops. Therefore, we need a mechanism to leverage not only full-blown, complex accelerators but also the tiniest versions of hardware, which would allow us to have a more vibrant hardware ecosystem/community compared to other frameworks. The other benefit of this is being able to build hardware "incrementally" and evaluate your design decisions with TVM over time.
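As a rough illustration of what such a precompiled library could expose, the sketch below shows plain C entry points that the Verilator runtime might link against; the symbol names and signatures are made up for this example and are not the ones used in the PR.

// Hypothetical C ABI for a precompiled, cycle-accurate hardware library.
// Names and signatures are illustrative only.
extern "C" {
  // Create and destroy a simulation instance of the design.
  void* VerilatorAlloc();
  void VerilatorDealloc(void* handle);
  // Hold the design in reset for a number of clock cycles.
  void VerilatorReset(void* handle, int cycles);
  // Element-wise integer add, driving the simulated hardware internally.
  void VerilatorAdd(void* handle, int* a, int* b, int* out, int length);
}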
This backend can mirror what we currently have for DNNL in terms of single and merged (composite) operators; there is no reason to stop here. The reason why we have a single one is that we have only one example, for this particular operator. Eventually, we are hoping for the community to join this effort and add examples that run more complicated use cases, i.e., CONV2D, etc.
@vegaluisjose thanks for the explanation. It makes much more sense to me now, as I was thinking this was a general backend for everyone. In terms of being an example, this PR looks good to me except for the 3rd-party dependency. To me, this is not the same case as VTA, because VTA is a framework whereas hw-widgets is just an example. In addition, IIUC, with the 3rd-party dependency, anyone who wants to contribute to this feature needs to first submit a PR to that repo; otherwise it seems impossible to enable more ops such as Conv2D.
btw, we may need a test case for this backend even if it's just an example.
@comaniac thank you for the feedback. The decision to add the Verilog sources as a 3rd-party submodule came from TVM industry contributors' concerns about keeping hardware sources out of the main repo for IP-protection reasons. We are simply making sure all hardware sources are cleaved off from the main TVM repo.
I agree on adding a test case - @vegaluisjose do you mind adding one to make sure this example doesn't bitrot?
@vegaluisjose thanks for the effort. I think this is a quite interesting implementation and yet another use case of BYOC. It is more like the json runtime example we had before, but in the Verilator scenario. I think we need a test case. In addition, as @comaniac mentioned, this would increase maintenance (mostly from your side though) because we need to update both repos if there is a change like adding operator support. Do you think it makes sense to contribute the kernels to TVM as well?
ahh, just saw @tmoreau89's comment on the separation.
Regarding the test, I do have one located here, which I hope we can figure out how to eventually add to CI.
Alright guys, sounds like we have a plan. I will move the Python test that I have in
Does this sound reasonable? Thank you all for your valuable input @giuseros @comaniac @liangfu @tmoreau89 @zhiics
@vegaluisjose sounds good to me. Thanks.
@vegaluisjose , it is ok for me too, although I would suggest moving the RFC to the discuss forum to get more visibility and comments.
I totally understand this concern and it makes sense to me. I completely agree with you that we should keep the hardware implementation out of the main repo. I was just concerned about the maintenance of this example. IMHO, it would also be fine to bring the hardware example in and mention in the tutorial/blog post/comments that users are encouraged to have a separate repo for this part. Anyway, I won't go against the decision if everyone else is fine with it.
Sounds like a plan. Thanks.
As @giuseros pointed out, an RFC about this work would be nice for longer-term conversations. I'm hoping this PR can be merged soon for the TVM conf demo, but we should have a longer-lasting discussion on extensions/applications of this work.
Re the hw-related component, how about we put it into the https://github.com/apache/tvm-vta repo? Given that infrastructure is already set up for testing in the VTA component, the code in that repo can also be tested against the latest version of tvm. We could make it as part of
I agree that having an RFC about the proposal would be really nice.
@tqchen if we go towards putting hardware sources under VTA, we may want to rename that repo to something like tvm-hardware, under which apps and VTA sources would sit. And hopefully we'll see other designs in there as well. That's something we can do after an RFC discussion.
Sorry but I'll need to gate this PR until we move the submodule contents into the VTA repo, and then update the submodule in this PR.
apache/tvm-vta#18 has now been merged as requested by @tqchen . Please update the submodule and paths to point to tvm-vta, and bump the submodule version.
Thank you for updating the PR @vegaluisjose
Thanks to @vegaluisjose and to @comaniac @liangfu @zhiics @giuseros who reviewed the PR; it's now merged.
Integrating and simulating hardware accelerators in TVM
This RFC proposes support for integrating hardware designs in TVM, specifically before they are manufactured in silicon or prototyped on an FPGA. Regardless of the hardware target, ML accelerators today are built using the waterfall approach described in the following figure:
Starting from an accelerator idea, hardware is built incrementally for several weeks or months until it is prototyped. Once this happens, software teams start working on integrating this new piece of hardware into a particular ML framework of choice, e.g., TVM. One challenge of this approach is that hardware-design decisions are evaluated too late, which turns out to be counterproductive. Additionally, there is a higher probability that the target ML model gets updated along the way, which makes the efficiency of this process critical.
Motivated by this reality, we believe that ML accelerators should be integrated, and more importantly evaluated, as soon as the hardware design process begins. Notably, this approach gives engineers design feedback and hardware-software continuous integration from day one, as shown in the following figure:
We achieve this integration by leveraging the fact that most ML accelerators are funneled down to a common hardware description language, i.e., Verilog, regardless of the source language. Moreover, we now have efficient open-source Verilog-to-C++ compilers, i.e., Verilator, that can be used to interoperate with Verilog via a simple C interface (CFFI) from TVM.
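As a sketch of that interoperation: Verilator compiles a Verilog module into a C++ class that can be hidden behind a plain C interface callable from TVM. The example below assumes a hypothetical combinational top module named adder, with input ports a and b and output port y, compiled with verilator --cc adder.v; it is illustrative only and not the code shipped with this RFC.

// Illustrative only: wraps a Verilator-generated model behind a C interface.
#include "Vadder.h"  // header generated by Verilator for the `adder` module

extern "C" void* adder_alloc() {
  return new Vadder();  // instantiate the simulated design
}

extern "C" int adder_run(void* handle, int a, int b) {
  Vadder* top = static_cast<Vadder*>(handle);
  top->a = a;     // drive the input ports
  top->b = b;
  top->eval();    // evaluate the combinational logic
  return top->y;  // read the result back
}

extern "C" void adder_dealloc(void* handle) {
  delete static_cast<Vadder*>(handle);
}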
Concretely, this RFC consists of a simple codegen using BYOC, together with an opaque kernel library, runtime, and device interface that can be implemented by engineers to efficiently integrate ML accelerators with TVM regardless of their design complexity. We also provide a small example (demo) that shows how to offload an add instruction from a Relay program to a simple hardware implementation (a scalar adder) written in Verilog. This example is hosted on a third-party repository called hw-widgets that we are planning to use for future hardware designs.
@tmoreau89 @liangfu @jroesch