[RUNTIME][SDACCEL] Add support for multiple kernels #1424
Conversation
@tqchen @tmoreau89 @comaniac Can you review this PR?
I think this PR is reasonable, although it might have room for further improvement. One problem I can imagine causing poor performance in the current implementation is the overhead of reconfiguration and data transfer, since this PR only supports code generation for multiple kernels.
For example, in your test case those two kernels should be merged, because there is no host computation in between, unless you cannot put both of them on the FPGA at the same time due to resource limitations. Lacking this kind of optimization will introduce non-negligible overhead and erode the end-to-end performance improvement. My suggestion would be to generate multiple kernels only when you really cannot fit all of them into the same bitstream, and this shouldn't happen frequently.
(I'm just offering my thoughts for future improvement; they don't need to be addressed in this PR.)
python/tvm/contrib/sdaccel.py
Outdated
kernel : str
    The kernel to compile or link.
kernels : str
To me "funcnames" used in codegen_vhls.cc describes this variable better.
The explanation of this variable can also be more specific, such as "Array of the kernel function names..."
python/tvm/contrib/sdaccel.py
Outdated
if returncode != 0:
    raise RuntimeError("Compile error")
tmp_xo_files = []
for code, kernel in zip(codes, kernels):
You used two arrays to store code and funcname respectively in codegen_vhls.cc, but zip them here again. It seems unnecessary to me. Could you use a tuple to bind the code with its corresponding funcname instead?
I'm not sure how to pass an Array of tuples to a PackedFunc. What I came up with was creating an Array of Arrays, as follows. Is there a better way?
diff --git a/src/codegen/codegen_vhls.cc b/src/codegen/codegen_vhls.cc
index 4b291d6..5b402d8 100644
--- a/src/codegen/codegen_vhls.cc
+++ b/src/codegen/codegen_vhls.cc
@@ -80,7 +80,7 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
std::string whole_code = cg.Finish();
// Generate source code for compilation.
- Array<Expr> codes, funcnames;
+ Array<Array<Expr>> params;
for (LoweredFunc f : funcs) {
CodeGenVivadoHLS cg;
cg.Init(output_ssa);
@@ -89,13 +89,15 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
if (const auto* f = runtime::Registry::Get("tvm_callback_vhls_postproc")) {
code = (*f)(code).operator std::string();
}
- codes.push_back(code);
- funcnames.push_back(f->name);
+ Array<Expr> param;
+ param.push_back(code);
+ param.push_back(f->name);
+ params.push_back(param);
}
std::string xclbin;
if (const auto* f = Registry::Get("tvm_callback_sdaccel_compile")) {
- xclbin = (*f)(codes, funcnames).operator std::string();
+ xclbin = (*f)(params).operator std::string();
} else {
LOG(FATAL) << "Cannot compile Vivado HLS code.";
}
I'm not quite sure either, but I felt comfortable with this alternative.
@comaniac Thanks for your comments. This PR generates one bitstream containing multiple kernels, so there is no reconfiguration overhead. However, as you said, there is plenty of room for performance improvement, such as fusing computation into one kernel as far as possible, or transferring data between multiple kernels more efficiently. I plan to address these in future PRs. :)
Sorry, I missed looking at the …
@kazum A question regarding multiple kernel support: does the code example of a vector-add pipeline imply that your code generation backend generates multiple OpenCL kernels? Would you care to paste the corresponding generated OpenCL code for your example? I'm curious to see whether you can choose between piping data between those kernels via an SRAM buffer or sending it back to DRAM or host memory, and how you would control this via TVM. Finally, since we have multiple kernels in this example, is the …
@tmoreau89 These are the two generated kernels (Vivado HLS):
#include <ap_int.h>
extern "C" void myadd_kernel0( float* C, float* A, float* B) {
#pragma HLS INTERFACE m_axi port=C offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=B offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=B bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
C[i0_inner] = (A[i0_inner] + B[i0_inner]);
}
}
#include <ap_int.h>
extern "C" void myadd_kernel1( float* D, float* A, float* C) {
#pragma HLS INTERFACE m_axi port=D offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=D bundle=control
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=C offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
D[i0_inner] = (A[i0_inner] + C[i0_inner]);
}
}

The result of myadd_kernel0 (float* C) is the same buffer as the third argument of myadd_kernel1, and it resides in global memory (an OpenCL term). Basically, DRAM is used for global memory, but it's up to the OpenCL toolchain implementation, IIUC. With regard to SDAccel, I've found the explanation below:
It would be better, e.g., if we could create a BRAM-based FIFO and use it to pipe data from myadd_kernel0 to myadd_kernel1, but supporting that in TVM would require more effort and discussion, I think.
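Purely as an illustration of that idea (not something this PR implements), a fused kernel using an on-chip hls::stream FIFO might look roughly like the sketch below; the function name fused, the FIFO depths, and the forwarding of A through a second stream are all assumptions:

#include <hls_stream.h>

// Hypothetical fusion of myadd_kernel0 and myadd_kernel1 that pipes the
// intermediate C through an on-chip FIFO instead of global memory.
extern "C" void fused(const float* A, const float* B, float* D) {
#pragma HLS DATAFLOW
  hls::stream<float> a_fifo;  // forwards A to the second stage
  hls::stream<float> c_fifo;  // carries the intermediate C
#pragma HLS STREAM variable=a_fifo depth=64
#pragma HLS STREAM variable=c_fifo depth=64
  // Stage 1: body of myadd_kernel0; A is read only here, so each buffer
  // keeps the single producer/consumer form that DATAFLOW requires.
  for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE II=1
    float a = A[i];
    a_fifo.write(a);
    c_fifo.write(a + B[i]);
  }
  // Stage 2: body of myadd_kernel1, reading A and C from the FIFOs.
  for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE II=1
    D[i] = a_fifo.read() + c_fifo.read();
  }
}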
Sounds good to add the pragma now, thanks!
@kazum I think this is a good start; thank you for sharing the code, this really helps in understanding how TVM currently generates OpenCL kernels for SDAccel. Eventually we'll want to insert more pipeline pragmas to improve the performance of the generated kernels. It might also be nice to have control in TVM over how data gets piped from one kernel to another. It seems that currently we are piping data to DRAM between the kernels? As @comaniac noted, this may be inefficient. I'd suggest using a special scope for buffers to explicitly tell the compiler whether we want to buffer them in SRAM or in DRAM. That being said, this is a good start, and we want to build this back-end incrementally, so I'm happy with the current PR. Unless @comaniac or @tqchen has other requests, I approve this change.
Looks good to me; I don't have further requests. The following discussion is just my brainstorming and personal opinions on the issues raised by @tmoreau89. Feel free to ignore.

===
To pipeline the j-loop, we simply add the pragma inside it, as in the sketch below.
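A minimal sketch, assuming a made-up matrix-vector loop nest as a running example:

void pipeline_j(const float a[64][64], const float b[64], float acc[64]) {
  for (int i = 0; i < 64; ++i) {
    for (int j = 0; j < 64; ++j) {
#pragma HLS PIPELINE II=1
      // Only the j-loop is pipelined; i-loop iterations remain sequential.
      acc[i] += a[i][j] * b[j];
    }
  }
}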
However, if we add the pragma to the i-loop, it effectively becomes the following (the unroll pragma will be added by Vivado HLS automatically, whether you want it or not).
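A sketch of the effective result on the same assumed loop nest:

void pipeline_i(const float a[64][64], const float b[64], float acc[64]) {
  for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
    // Pipelining the outer loop forces full unrolling of the inner loop;
    // Vivado HLS inserts the UNROLL pragma automatically.
    for (int j = 0; j < 64; ++j) {
#pragma HLS UNROLL
      acc[i] += a[i][j] * b[j];
    }
  }
}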
So this is still a fine-grained pipeline. In many cases we don't want the j-loop to be fully unrolled due to resource constraints. This is one alternative solution, sketched below.
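A sketch of that workaround, with a hypothetical helper compute_row:

// "INLINE off" makes the helper a standalone module, so Vivado HLS
// pipelines the j-loop inside it without unrolling it.
static void compute_row(const float a_row[64], const float b[64], float* acc) {
#pragma HLS INLINE off
  for (int j = 0; j < 64; ++j) {
#pragma HLS PIPELINE II=1
    *acc += a_row[j] * b[j];
  }
}

void coarse_pipeline(const float a[64][64], const float b[64], float acc[64]) {
  for (int i = 0; i < 64; ++i) {
    compute_row(a[i], b, &acc[i]);
  }
}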
By manually wrapping the j-loop in a non-inline function (which becomes a standalone module in the design), Vivado HLS won't automatically unroll the j-loop, and this results in the coarse-grained pipeline we desire. The same concept applies to loop parallelism as well.
@comaniac Thank you for your expertise on HLS; these are all really good points. I agree with your point on the DATAFLOW pragma; I have had limited success using it. W.r.t. where to insert PIPELINE pragmas, it can indeed be tricky when multiple loops are involved. Learning these rules as the tools evolve can be tricky, especially since SDAccel's handling of pragmas might differ from Intel/Altera's OpenCL compiler. It might be interesting in the long term to have a learning-based tuning approach that uses the achieved II/resources from the HLS report as an objective function to insert the pragmas correctly. This is more researchy than practical, of course, but it could help mitigate these special cases. This is just me thinking out loud!
Thanks a lot for your input about HLS. I'd be happy to address pipeline improvements based on those points in future PRs. :) About the DATAFLOW pragma, I plan to check how it works with TVM first, and will send a PR later to add support for it if necessary. @comaniac I've addressed your comment about the arguments of compile_vhls(). Thanks.
src/codegen/codegen_vhls.cc
Outdated
std::string whole_code = cg.Finish();

// Generate source code for compilation.
Array<Array<Expr>> kernel_info;
Expr>> -> Expr> > (add a space) to accommodate the parsers of old MSVC.
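For context, pre-C++11 compilers (old MSVC included) lex consecutive closing angle brackets as the right-shift operator, so nested template argument lists need the space. Using the types from this diff:

Array<Array<Expr> > kernel_info;   // parses under pre-C++11 compilers
// Array<Array<Expr>> kernel_info; // ">>" lexes as right-shift before C++11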
@tqchen Addressed, thanks.
This PR adds support for multiple kernels in the SDAccel runtime.