
[RUNTIME][SDACCEL] Add support for multiple kernels #1424

Merged: 3 commits merged into apache:master on Jul 14, 2018

Conversation


@kazum kazum commented Jul 12, 2018

This PR adds support for multiple kernels in the SDAccel runtime.


kazum commented Jul 12, 2018

@tqchen @tmoreau89 @comaniac Can you review this PR?


@comaniac comaniac left a comment


I think this PR is reasonable, although there is room for further improvement. One way I could imagine the current implementation hurting performance is the overhead of reconfiguration and data transfer, since this PR only covers code generation for multiple kernels.

For example, in your test case, those two kernels should be merged because there is no host computation between them, unless resource limits prevent you from putting both of them on the FPGA at the same time. Lacking this kind of optimization will introduce non-negligible overhead and erode the end-to-end performance improvement. My suggestion would be to generate multiple kernels only when you really cannot fit all of them into the same bitstream, which shouldn't happen frequently.

(Just providing my thoughts on future improvements; they don't need to be addressed in this PR.)


kernel : str
The kernel to compile or link.
kernels : str

To me "funcnames" used in codegen_vhls.cc describes this variable better.
The explanation of this variable can also be more specific, such as "Array of the kernel function names..."

if returncode != 0:
raise RuntimeError("Compile error")
tmp_xo_files = []
for code, kernel in zip(codes, kernels):

You used two arrays in codegen_vhls.cc to store the codes and funcnames respectively, but zip them back together here, which seems unnecessary to me. Could you use a tuple to bind each code with its corresponding funcname instead?


@kazum kazum Jul 12, 2018


I'm not quite sure how to pass an Array of tuples to a PackedFunc. What I came up with was creating an Array of Arrays, as follows. Is there a better way?

diff --git a/src/codegen/codegen_vhls.cc b/src/codegen/codegen_vhls.cc
index 4b291d6..5b402d8 100644
--- a/src/codegen/codegen_vhls.cc
+++ b/src/codegen/codegen_vhls.cc
@@ -80,7 +80,7 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
   std::string whole_code = cg.Finish();
 
   // Generate source code for compilation.
-  Array<Expr> codes, funcnames;
+  Array<Array<Expr>> params;
   for (LoweredFunc f : funcs) {
     CodeGenVivadoHLS cg;
     cg.Init(output_ssa);
@@ -89,13 +89,15 @@ runtime::Module BuildSDAccel(Array<LoweredFunc> funcs) {
     if (const auto* f = runtime::Registry::Get("tvm_callback_vhls_postproc")) {
       code = (*f)(code).operator std::string();
     }
-    codes.push_back(code);
-    funcnames.push_back(f->name);
+    Array<Expr> param;
+    param.push_back(code);
+    param.push_back(f->name);
+    params.push_back(param);
   }
 
   std::string xclbin;
   if (const auto* f = Registry::Get("tvm_callback_sdaccel_compile")) {
-    xclbin = (*f)(codes, funcnames).operator std::string();
+    xclbin = (*f)(params).operator std::string();
   } else {
     LOG(FATAL) << "Cannot compile Vivado HLS code.";
   }


I'm not quite sure either, but I felt comfortable with this alternative.


kazum commented Jul 12, 2018

@comaniac Thanks for your comments. This PR generates one bitstream containing multiple kernels, so there is no reconfiguration overhead. However, as you said, there is plenty of room for performance improvement, e.g., fusing as much computation as possible into one kernel, or transferring data between multiple kernels efficiently. I'm thinking of addressing these in future PRs. :)

@comaniac

Sorry, I missed the append after you got the .xo for each kernel. It does generate one bitstream for all kernels.

@tmoreau89

@kazum Question regarding multiple kernel support: does the code example of a vector add pipeline imply that your code generation backend generates multiple OpenCL kernels? Would you care to paste the corresponding generated OpenCL code for your example?

I'm curious whether you have control over piping data between those kernels via an SRAM buffer versus sending it back to DRAM or host memory, and how you would control this via TVM.

Finally, since we have multiple kernels in this example, is the #pragma HLS DATAFLOW directive enabled by default to allow for task level parallelism?


kazum commented Jul 13, 2018

@tmoreau89 Here are the two generated kernels (Vivado HLS).

  • myadd_kernel0
#include <ap_int.h>

extern "C" void myadd_kernel0( float* C,  float* A,  float* B) {
#pragma HLS INTERFACE m_axi port=C  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=B  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=B bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    C[i0_inner] = (A[i0_inner] + B[i0_inner]);
  }
}
  • myadd_kernel1
#include <ap_int.h>

extern "C" void myadd_kernel1( float* D,  float* A,  float* C) {
#pragma HLS INTERFACE m_axi port=D  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=D bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE m_axi port=C  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=C bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    D[i0_inner] = (A[i0_inner] + C[i0_inner]);
  }
}

The result of myadd_kernel0 (float* C) is the same buffer as the third argument of myadd_kernel1, and it resides in global memory (an OpenCL term). Basically, DRAM is used for global memory, but that is up to the OpenCL toolchain implementation, IIUC. With regard to SDAccel, I found the explanation below:
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_2/ug1207-sdaccel-optimization-guide.pdf#page=23

Global and constant memories are any memory that is connected to the FPGA
device. These are usually memory chips (e.g. SDRAM) that are physically
connected to the FPGA device, but might also include distributed memories
(e.g. BlockRAM) within the FPGA fabric. The host processor has access to these
memory banks through infrastructure provided by the FPGA platform.

It would be better if, e.g., we could create a BRAM-based FIFO and use it to pipe data from myadd_kernel0 to myadd_kernel1, but supporting that in TVM would require more effort and discussion, I think.
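
As an illustration only (not part of this PR), here is a minimal sketch of what a fused, FIFO-connected version of the two kernels could look like in Vivado HLS, assuming hls::stream is used for the BRAM-backed FIFOs; the fused function name myadd_fused is hypothetical, and the interface pragmas are omitted for brevity:

#include <hls_stream.h>

extern "C" void myadd_fused(float* A, float* B, float* D) {
#pragma HLS DATAFLOW
  // On-chip FIFOs (typically BRAM or shift registers) replace the
  // global-memory round trip through the intermediate buffer C.
  hls::stream<float> a_fifo;
  hls::stream<float> c_fifo;
#pragma HLS STREAM variable=a_fifo depth=64
#pragma HLS STREAM variable=c_fifo depth=64

  // Stage 1: C = A + B. A is forwarded through its own FIFO so that each
  // array has a single reader, as the DATAFLOW canonical form requires.
  for (int i = 0; i < 1024; ++i) {
    float a = A[i];
    a_fifo.write(a);
    c_fifo.write(a + B[i]);
  }

  // Stage 2: D = A + C, consuming from the on-chip FIFOs instead of DRAM.
  for (int i = 0; i < 1024; ++i) {
    D[i] = a_fifo.read() + c_fifo.read();
  }
}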

> Finally, since we have multiple kernels in this example, is the #pragma HLS DATAFLOW directive enabled by default to allow for task level parallelism?

Sounds good to add the pragma now, thanks!

@tmoreau89

@kazum I think this is a good start. Thank you for sharing the code; it really helps in understanding how TVM currently generates OpenCL kernels for SDAccel.

Eventually we'll want to insert more pipeline pragmas to improve the performance of the generated kernels.
We may need some TVM pragmas at first to easily convey the directives down to the code generator - this should be pretty easy.

It might also be nice to have control in TVM over how data gets piped from one kernel to another. It seems like we are currently piping data through DRAM between the kernels? As @comaniac noted, this may be inefficient. I'd suggest using a special scope for buffers to explicitly tell the compiler whether we want to buffer data in SRAM or in DRAM.

That being said, this is a good start, and we want to incrementally build this back-end, so I'm happy with this current PR. Unless @comaniac or @tqchen has other requests, I approve this change.

@comaniac

Looks good to me.
One warning about the DATAFLOW pragma: this feature in Vivado HLS is relatively unstable. Sometimes it gets stuck due to imperfect FIFO behavior, although it keeps improving over time. One promising alternative is double buffering (sketched below), but it costs more BRAM and is not easy to construct from the IR.

I don't have further requests. The following discussion is just my brainstorming and personal opinion on the issues raised by @tmoreau89. Feel free to ignore it.

===

  • For inserting pragmas, it can indeed be done easily for fine-grained optimizations such as loop unrolling and fine-grained pipelining (pipelining the innermost loop). On the other hand, for coarse-grained loops (a loop with at least one sub-loop that isn't fully unrolled), simply inserting a pragma won't work. For example:
for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 64; ++j) {
        ...
    }
}

To pipeline the j-loop, we simply add the pragma:

for (int i = 0; i < 32; ++i) {
    for (int j = 0; j < 64; ++j) {
#pragma HLS pipeline
        ...
    }
}

However, if we add the pragma to the i-loop, it actually becomes the following (the unroll pragma is added by Vivado HLS automatically, whether you want it or not):

for (int i = 0; i < 32; ++i) {
#pragma HLS pipeline
    for (int j = 0; j < 64; ++j) {
#pragma HLS unroll
        ...
    }
}

So this is still a fine-grained pipeline. In many cases, we don't want the j-loop to be fully unrolled due to resource constraints. One alternative solution is:

void func(...) {
#pragma HLS inline off
    for (int j = 0; j < 64; ++j) {
        ...
    }
}

for (int i = 0; i < 32; ++i) {
#pragma HLS pipeline
    func(...);
}

By manually wrapping the j-loop in a non-inlined function (which becomes a standalone module in the design), Vivado HLS won't automatically unroll the j-loop, and we get the coarse-grained pipeline we desire.

The same concept applies to loop parallelism as well.

  • For the use of block RAM (or SRAM), we usually declare a local array and explicitly use memcpy to enable memory bursts from DRAM to BRAM; a sketch of this idiom follows below. Maybe we can figure out a proper programming model for users to specify the data chunk size, so that we can easily create the BRAM buffer and enable memory bursts.
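
A minimal sketch of the local-array + memcpy idiom (the kernel scale_kernel and its parameters are hypothetical, not from this PR, and n is assumed to be a multiple of CHUNK):

#include <string.h>

#define CHUNK 256

extern "C" void scale_kernel(const float* in, float* out, int n, float k) {
  float buf[CHUNK];  // local array, mapped to BRAM by the tool
  for (int base = 0; base < n; base += CHUNK) {
    memcpy(buf, in + base, CHUNK * sizeof(float));   // DRAM -> BRAM burst
    for (int i = 0; i < CHUNK; ++i) {
#pragma HLS pipeline
      buf[i] *= k;
    }
    memcpy(out + base, buf, CHUNK * sizeof(float));  // BRAM -> DRAM burst
  }
}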

@tmoreau89

@comaniac thank you for sharing your expertise on HLS; these are all really good points. I agree with your point on the DATAFLOW pragma; I have had limited success using it.

W.r.t. where to insert PIPELINE pragmas, it can indeed be tricky when multiple loops are involved. Keeping up with these rules as the tools evolve can also be tricky, especially since SDAccel's handling of pragmas might differ from Intel/Altera's OpenCL compiler. It might be interesting in the long term to have a learning-based tuning approach that uses the achieved II/resources from the HLS report as the objective function for inserting pragmas correctly. This is more researchy than practical, of course, but it could help mitigate these special cases. This is just me thinking out loud!


kazum commented Jul 13, 2018

Thanks a lot for your input on HLS. I'd be happy to address pipeline improvements based on it in future PRs. :) About the DATAFLOW pragma, I'll check how it works with TVM first, and then send a PR later to add support for it if necessary.

@comaniac I've addressed your comment about the arguments of compile_vhls().

Thanks.

std::string whole_code = cg.Finish();

// Generate source code for compilation.
Array<Array<Expr>> kernel_info;

Expr>> -> Expr> > (add a space) to accommodate the parser in old MSVC, which treats >> inside nested templates as the shift operator.


kazum commented Jul 14, 2018

@tqchen Addressed, thanks.

@tqchen tqchen merged commit 1af1287 into apache:master Jul 14, 2018
tqchen pushed a commit to tqchen/tvm that referenced this pull request Aug 4, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018
@kazum kazum deleted the sdaccel-multi-kernels branch August 23, 2018 02:28