
[RUNTIME][SDACCEL] Add support for multiple kernels #1424

Merged: 3 commits merged into apache:master on Jul 14, 2018

Conversation


@kazum kazum commented Jul 12, 2018

This PR adds support for multiple kernels in the SDAccel runtime.


kazum commented Jul 12, 2018

@tqchen @tmoreau89 @comaniac Can you review this PR?


@comaniac comaniac left a comment


I think this PR is reasonable, although there is room for further improvement. One way I could imagine the current implementation hurting performance is the overhead of reconfiguration and data transfer, since this PR only covers code generation for multiple kernels.

For example, in your test case, those two kernels should be merged because there is no host computation between them, unless resource limits prevent you from putting both of them on the FPGA at the same time. Lacking this kind of optimization will introduce non-negligible overhead and erode the end-to-end performance improvement. My suggestion would be to generate multiple kernels only when you really cannot fit all of them into the same bitstream, which shouldn't happen frequently.

(Just providing my thoughts on future improvements; they don't need to be addressed in this PR.)


kernel : str
The kernel to compile or link.
kernels : str

To me "funcnames" used in codegen_vhls.cc describes this variable better.
The explanation of this variable can also be more specific, such as "Array of the kernel function names..."

if returncode != 0:
raise RuntimeError("Compile error")
tmp_xo_files = []
for code, kernel in zip(codes, kernels):

You used two arrays in codegen_vhls.cc to store the codes and funcnames respectively, but zip them back together here, which seems unnecessary to me. Could you use a tuple to bind each code with its corresponding funcname instead?


@kazum kazum Jul 12, 2018


I'm not quite sure how to pass an Array of tuples to a PackedFunc. What I came up with was creating an Array of Arrays, as follows. Is there a better way?

diff --git a/src/codegen/codegen_vhls.cc b/src/codegen/codegen_vhls.cc
index 4b291d6..5b402d8 100644
--- a/src/codegen/codegen_vhls.cc
+++ b/src/codegen/codegen_vhls.cc
@@ -80,7 +80,7 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
   std::string whole_code = cg.Finish();
 
   // Generate source code for compilation.
-  Array<Expr> codes, funcnames;
+  Array<Array<Expr>> params;
   for (LoweredFunc f : funcs) {
     CodeGenVivadoHLS cg;
     cg.Init(output_ssa);
@@ -89,13 +89,15 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
     if (const auto* f = runtime::Registry::Get("tvm_callback_vhls_postproc")) {
       code = (*f)(code).operator std::string();
     }
-    codes.push_back(code);
-    funcnames.push_back(f->name);
+    Array<Expr> param;
+    param.push_back(code);
+    param.push_back(f->name);
+    params.push_back(param);
   }
 
   std::string xclbin;
   if (const auto* f = Registry::Get("tvm_callback_sdaccel_compile")) {
-    xclbin = (*f)(codes, funcnames).operator std::string();
+    xclbin = (*f)(params).operator std::string();
   } else {
     LOG(FATAL) << "Cannot compile Vivado HLS code.";
   }


I'm not quite sure either, but I felt comfortable with this alternative.


kazum commented Jul 12, 2018

@comaniac Thanks for your comments. This PR generates one bitstream containing multiple kernels, so there is no reconfiguration overhead. However, as you said, there is plenty of room for performance improvement, e.g., fusing as much computation as possible into one kernel, or transferring data between multiple kernels efficiently. I'm thinking of addressing these in future PRs. :)

@comaniac

Sorry, I missed the append after you got the .xo for each kernel. It does generate one bitstream for all kernels.

@tmoreau89

@kazum Question regarding multiple kernel support: does the code example of a vector add pipeline imply that your code generation backend generates multiple OpenCL kernels? Would you care to paste the corresponding generated OpenCL code for your example?

I'm curious whether you have control over piping data between those kernels via an SRAM buffer versus sending it back to DRAM or host memory, and how you would control this via TVM.

Finally, since we have multiple kernels in this example, is the #pragma HLS DATAFLOW directive enabled by default to allow for task level parallelism?


kazum commented Jul 13, 2018

@tmoreau89 Here are the two generated kernels (Vivado HLS).

  • myadd_kernel0
#include <ap_int.h>

extern "C" void myadd_kernel0( float* C,  float* A,  float* B) {
#pragma HLS INTERFACE m_axi port=C  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=B  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=B bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    C[i0_inner] = (A[i0_inner] + B[i0_inner]);
  }
}
  • myadd_kernel1
#include <ap_int.h>

extern "C" void myadd_kernel1( float* D,  float* A,  float* C) {
#pragma HLS INTERFACE m_axi port=D  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=D bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=C  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    D[i0_inner] = (A[i0_inner] + C[i0_inner]);
  }
}

The result of myadd_kernel0 (float* C) is the same buffer as the third argument of myadd_kernel1, and it resides in global memory (an OpenCL term). Basically, DRAM is used for global memory, but that is up to the OpenCL toolchain implementation, IIUC. With regard to SDAccel, I found the explanation below:
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_2/ug1207-sdaccel-optimization-guide.pdf#page=23

Global and constant memories are any memory that is connected to the FPGA
device. These are usually memory chips (e.g. SDRAM) that are physically
connected to the FPGA device, but might also include distributed memories
(e.g. BlockRAM) within the FPGA fabric. The host processor has access to these
memory banks through infrastructure provided by the FPGA platform.

It would be better if, e.g., we could create a BRAM-based FIFO and use it to pipe data from myadd_kernel0 to myadd_kernel1, but supporting that in TVM would require more effort and discussion, I think.
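
As an illustration only (not part of this PR), here is a minimal sketch of what a fused, FIFO-connected version of the two kernels could look like in Vivado HLS, assuming hls::stream is used for the BRAM-backed FIFOs; the fused function name myadd_fused is hypothetical, and the interface pragmas are omitted for brevity:

#include <hls_stream.h>

extern "C" void myadd_fused(float* A, float* B, float* D) {
#pragma HLS DATAFLOW
  // On-chip FIFOs (typically BRAM or shift registers) replace the
  // global-memory round trip through the intermediate buffer C.
  hls::stream<float> a_fifo;
  hls::stream<float> c_fifo;
#pragma HLS STREAM variable=a_fifo depth=64
#pragma HLS STREAM variable=c_fifo depth=64

  // Stage 1: C = A + B. A is forwarded through its own FIFO so that each
  // array has a single reader, as the DATAFLOW canonical form requires.
  for (int i = 0; i < 1024; ++i) {
    float a = A[i];
    a_fifo.write(a);
    c_fifo.write(a + B[i]);
  }

  // Stage 2: D = A + C, consuming from the on-chip FIFOs instead of DRAM.
  for (int i = 0; i < 1024; ++i) {
    D[i] = a_fifo.read() + c_fifo.read();
  }
}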

> Finally, since we have multiple kernels in this example, is the #pragma HLS DATAFLOW directive enabled by default to allow for task level parallelism?

Sounds good to add the pragma now, thanks!

@tmoreau89

@kazum I think this is a good start. Thank you for sharing the code; it really helps in understanding how TVM currently generates OpenCL kernels for SDAccel.

Eventually we'll want to insert more pipeline pragmas to improve the performance of the generated kernels.
We may need some TVM pragmas at first to easily convey the directives down to the code generator - this should be pretty easy.

It might also be nice to have control in TVM over how data gets piped from one kernel to another. It seems like we are currently piping data through DRAM between the kernels? As @comaniac noted, this may be inefficient. I'd suggest using a special scope for buffers to explicitly tell the compiler whether we want to buffer data in SRAM or in DRAM.

That being said, this is a good start, and we want to incrementally build this back-end, so I'm happy with this current PR. Unless @comaniac or @tqchen has other requests, I approve this change.

@comaniac

Looks good to me.
One warning about the DATAFLOW pragma: this feature in Vivado HLS is relatively unstable. Sometimes it gets stuck due to imperfect FIFO behavior, although it keeps improving over time. One promising alternative is double buffering (sketched below), but it costs more BRAM and is not easy to construct from the IR.

I don't have further requests. The following discussion is just my brainstorming and personal opinion on the issues raised by @tmoreau89. Feel free to ignore it.

===

  • For inserting pragmas, it can indeed be done easily for fine-grained optimizations such as loop unrolling and fine-grained pipelining (pipelining the innermost loop). On the other hand, for coarse-grained loops (a loop with at least one sub-loop that isn't fully unrolled), simply inserting a pragma won't work. For example:
for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 64; ++j) {
        ...
    }
}

To pipeline the j-loop, we simply add the pragma:

for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 64; ++j) {
#pragma HLS pipeline
        ...
    }
}

However, if we add the pragma to the i-loop, it actually becomes the following (the unroll pragma is added by Vivado HLS automatically, whether you want it or not):

for (int i = 0; i < 32; ++i) {
#pragma HLS pipeline
    for (int j = 0; j < 64; ++j) {
#pragma HLS unroll
        ...
    }
}

So this is still a fine-grained pipeline. In many cases, we don't want the j-loop to be fully unrolled due to resource constraints. One alternative solution is:

void func(...) {
#pragma HLS inline off
    for (int j = 0; j < 64; ++j) {
        ...
    }
}

for (int i = 0; i < 32; ++i) {
#pragma HLS pipeline
    func(...);
}

By manually wrapping the j-loop in a non-inlined function (which becomes a standalone module in the design), Vivado HLS won't automatically unroll the j-loop, and we get the coarse-grained pipeline we desire.

The same concept applies to loop parallelism as well.

  • For the use of block RAM (or SRAM), we usually declare a local array and explicitly use memcpy to enable memory bursts from DRAM to BRAM; a sketch of this idiom follows below. Maybe we can figure out a proper programming model for users to specify the data chunk size, so that we can easily create the BRAM buffer and enable memory bursts.
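
A minimal sketch of the local-array + memcpy idiom (the kernel scale_kernel and its parameters are hypothetical, not from this PR, and n is assumed to be a multiple of CHUNK):

#include <string.h>

#define CHUNK 256

extern "C" void scale_kernel(const float* in, float* out, int n, float k) {
  float buf[CHUNK];  // local array, mapped to BRAM by the tool
  for (int base = 0; base < n; base += CHUNK) {
    memcpy(buf, in + base, CHUNK * sizeof(float));   // DRAM -> BRAM burst
    for (int i = 0; i < CHUNK; ++i) {
#pragma HLS pipeline
      buf[i] *= k;
    }
    memcpy(out + base, buf, CHUNK * sizeof(float));  // BRAM -> DRAM burst
  }
}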

@tmoreau89

@comaniac thank you for sharing your expertise on HLS; these are all really good points. I agree with your point on the DATAFLOW pragma; I have had limited success using it.

W.r.t. where to insert PIPELINE pragmas, it can indeed be tricky when multiple loops are involved. Keeping up with these rules as the tools evolve can also be tricky, especially since SDAccel's handling of pragmas might differ from Intel/Altera's OpenCL compiler. It might be interesting in the long term to have a learning-based tuning approach that uses the achieved II/resources from the HLS report as the objective function for inserting pragmas correctly. This is more researchy than practical, of course, but it could help mitigate these special cases. This is just me thinking out loud!


kazum commented Jul 13, 2018

Thanks a lot for your input on HLS. I'd be happy to address pipeline improvements based on it in future PRs. :) About the DATAFLOW pragma, I'll check how it works with TVM first, and then send a PR later to add support for it if necessary.

@comaniac I've addressed your comment about the arguments of compile_vhls().

Thanks.

std::string whole_code = cg.Finish();

// Generate source code for compilation.
Array<Array<Expr>> kernel_info;

Expr>> -> Expr> > (add a space) to accommodate the parser in old MSVC, which treats >> inside nested templates as the shift operator.


kazum commented Jul 14, 2018

@tqchen Addressed, thanks.

@tqchen tqchen merged commit 1af1287 into apache:master Jul 14, 2018
tqchen pushed a commit to tqchen/tvm that referenced this pull request Aug 4, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018
@kazum kazum deleted the sdaccel-multi-kernels branch August 23, 2018 02:28