[SYCL][XilinxFPGA] Known Issues #40

Open
agozillon opened this issue Jun 11, 2019 · 5 comments
Labels
bug Something isn't working enhancement New feature or request question Further information is requested

Comments

@agozillon
Contributor

agozillon commented Jun 11, 2019

This is a non-exhaustive list of some of the larger problems with Xilinx FPGA compilation and runtime execution (some with more information than others) that need some long-term thought:


Problem: If you create buffers and then use SYCL functionality that relies on underlying OpenCL calls to modify those buffers as cl_mem objects before any kernel has been invoked (single_task/parallel_for etc.), you'll incur XRT runtime errors, e.g.:

[XRT] ERROR: Internal error. cl_mem doesn't map to buffer object

You should be able to see this in action if you comment out the "noop" SYCL kernel in the accessor_copy.cpp test.

Reason: XRT will not consider a device "active" and usable until you've loaded a binary, as it can't query most of the information it needs before then. One unfortunate side effect is that cl_mem buffers are not assigned to a device, so whenever you try to use something like a handler copy with a SYCL accessor/buffer, the underlying XRT OpenCL call cannot find the buffer in relation to a device (it queries devices for buffers, and if they're not found, XRT is not pleased).

Possible fix Ideas:

  • Eager binary program loading as we know we'll only use pre-compiled binaries with our FPGAs
  • Lazy OpenCL buffer creation/operations, only do this after we know XRT will be happy i.e. binary loaded and good to go

Workaround: Force-start the device with a noop kernel. This is not ideal, and while it works around the issue on hw/sw emu, I'm not too sure how real hardware will appreciate it.
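The "lazy buffer operations" idea above could look roughly like the sketch below. It is plain C++ with no SYCL or XRT calls, so it only illustrates the ordering: operations submitted before the binary is loaded are queued and replayed afterwards. `DeferredDeviceOps`, `on_binary_loaded` and `deferred_ops_demo` are hypothetical names made up for this illustration, not part of the runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime layer; not part of SYCL or XRT.
class DeferredDeviceOps {
public:
  // Run `op` immediately if the device binary is loaded, otherwise queue it.
  void submit(std::function<void()> op) {
    if (binary_loaded_) {
      op();
    } else {
      pending_.push_back(std::move(op));
    }
  }

  // Call once the binary has been loaded; replays queued operations in order.
  void on_binary_loaded() {
    binary_loaded_ = true;
    // Move the queue out first so that ops calling submit() cannot
    // invalidate the container we are iterating over.
    std::vector<std::function<void()>> ops = std::move(pending_);
    pending_.clear();
    for (auto &op : ops) op();
  }

  std::size_t pending() const { return pending_.size(); }

private:
  bool binary_loaded_ = false;
  std::vector<std::function<void()>> pending_;
};

// Usage: an early "copy" only runs once the binary is reported as loaded.
inline std::vector<int> deferred_ops_demo() {
  std::vector<int> log;
  DeferredDeviceOps ops;
  ops.submit([&] { log.push_back(1); });  // queued, binary not loaded yet
  log.push_back(0);                       // host work continues meanwhile
  ops.on_binary_loaded();                 // replays the queued op
  ops.submit([&] { log.push_back(2); });  // runs immediately from now on
  return log;
}
```

Whether the real runtime can defer cl_mem operations this cleanly is an open question; the sketch ignores error handling and any synchronisation the host might expect.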


Problem: Only one queue to a Xilinx FPGA device can exist at a time. If you accidentally create more than one, XRT will not be happy.
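One way user code might avoid accidentally creating a second queue is to hand out a single shared instance from one function. The sketch below uses a hypothetical `FpgaQueue` struct as a stand-in so that it compiles without SYCL; presumably the same pattern would apply to a real SYCL queue, but that is an assumption and has not been tried against XRT.

```cpp
#include <cassert>

// Hypothetical stand-in for a SYCL queue bound to the FPGA device.
struct FpgaQueue {
  int submitted = 0;
  void submit() { ++submitted; }
};

// Every caller gets the same queue instance, constructed on first use.
inline FpgaQueue &shared_fpga_queue() {
  static FpgaQueue q;
  return q;
}

// Usage: two unrelated call sites end up submitting to the same queue.
inline int submit_from_two_places() {
  const int before = shared_fpga_queue().submitted;
  shared_fpga_queue().submit();  // e.g. from one part of the program
  shared_fpga_queue().submit();  // e.g. from another part
  return shared_fpga_queue().submitted - before;
}
```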


Problem: All kernels require at least one accessor. A kernel with zero accessors causes a compile error in xocc about no argument being bound to AXI_GMEM. I'm not too sure how many use cases there are for a kernel with no accessors, but perhaps it shouldn't emit an error like this.


Problem: You cannot compile for Xilinx FPGA with -g. This prevents users from using debug mode on the SYCL runtime, not just on the kernel.

Reason: I believe this is because we're generating a kernel file with a lot of debug code and trying to pipe that through xocc, which doesn't really know how to handle all of the debug information.

Possible fix Ideas:

  • A long shot, but a quick fix if it works: add -g to the xocc compilation components of the sycl-xocc script. Perhaps xocc will then realize that the kernel may come with debug information attached and handle it better. I find this unlikely to work, but it's low-hanging fruit if it does...
  • More surefire fix: Do not compile the device component of SYCL with -g. Only compile the host component with -g, and stop -g from being pushed onto the device compilation. This will create a much simpler SPIR-df kernel to pass to xocc. Instead, -g should be applied to the xocc compilation and linker commands to get the kernel debug information. This should circumvent any issues with debug information breaking xocc kernel compilation whilst still giving both kernel and SYCL runtime debug information. It shouldn't be too hard; it just requires some driver tweaks.

Problem: Related to issue #32: mixing structures inside of kernels can cause ICEs in one of xocc's passes, aggressive dead code elimination (AGDCE). The minimal triggering SYCL test case is issue_related/agdce_ice.cpp.

Reason: This seems to be a problem with address space casting from a structure that is defined outside of a kernel (ergo no address space). When you index an accessor containing several passed-in instances of the class/struct, the compiler tries to address space cast and the AGDCE pass blows up.

Status: I am led to believe it's a bug in xocc/Vivado HLS, so it appears to be outside of our jurisdiction. I have forwarded the issue to someone on their team, but it's low priority and may take a while to get fixed without some follow-up.


Problem: Boost Hana's times.with_index in conjunction with its overloaded + operator will kill hw_emu, and very likely hw, as it is not completely optimized and inlined with -O3 as you would expect (and as it is in a non-SYCL -O3 pass). This leaves some external declarations of and calls to functions but no definitions of those functions, i.e. only partial optimization/inlining. That is a little odd, as the definitions exist prior to the -O3 pass and other Boost Hana functionality is inlined appropriately.

Reason: The current best guess is that the index argument passed into the lambda that with_index invokes is the cause. It seems like it could be another address space cast related issue. The minimal test case for this issue is boost_hana_functor_arg.cpp inside the issue_related directory.

  cgh.single_task<class array_add>([=]() {
    boost::hana::int_<5>::times.with_index([&](const auto i) {
      a_rw[i+1] = 6;
    });
  });

To highlight the issue: in the example, a variable is passed into the lambda from an externally defined function and is then used with the + operator. This + operator is overloaded inside Boost Hana to support compile-time usage. This snippet should be unrolled and inlined, removing all of the Boost Hana related functionality. That doesn't happen, and it seems to be because adding the value 1 to the argument passed into the function triggers an address space cast, which I think prevents the required optimizations.

This is an assumption though, based on the fact that the code below works fine:

  cgh.single_task<class array_add>([=]() {
    int i = 0;
    boost::hana::int_<N>::times([&] {
      a_rw[i+0] = 6;
      ++i;
    });
  });

The index variable and the arbitrary constant now exist in the same address space, and all seems to be fine as far as the compiler is concerned.
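For reference, the effect a fully inlined with_index loop is expected to have can be written with the standard library alone. The sketch below is not Boost Hana's implementation and has not been run through xocc; `times_with_index` is a hypothetical helper, and the `std::array` stands in for the `a_rw` accessor. It expands into five direct calls at compile time, each receiving its index as a `std::integral_constant`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Calls f(std::integral_constant<std::size_t, I>{}) for each I in the
// pack; the fold expression expands at compile time, so no loop remains.
template <typename F, std::size_t... I>
constexpr void times_with_index_impl(F &&f, std::index_sequence<I...>) {
  (f(std::integral_constant<std::size_t, I>{}), ...);
}

// Hypothetical helper: invoke f with the indices 0 .. N-1.
template <std::size_t N, typename F>
constexpr void times_with_index(F &&f) {
  times_with_index_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Usage mirroring the failing kernel body: writes 6 to elements 1..5.
inline std::array<int, 6> fill_demo() {
  std::array<int, 6> a{};  // stands in for the a_rw accessor
  times_with_index<5>([&](auto i) { a[i + 1] = 6; });
  return a;
}
```

Here `i + 1` goes through the implicit conversion of `std::integral_constant` to `std::size_t` rather than an overloaded operator, which is one difference from the Hana version.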


Some of these "problems" are peculiarities in our FPGA compilation pipeline and non-standard OpenCL implementation and may not necessarily be "problems".

@agozillon agozillon added bug Something isn't working enhancement New feature or request question Further information is requested labels Jun 11, 2019
@j-stephan
Member

j-stephan commented Nov 20, 2019

Related to issue #32: mixing structures inside of kernels can cause ICEs in one of xocc's passes, aggressive dead code elimination (AGDCE). The minimal triggering SYCL test case is issue_related/agdce_ice.cpp.

This also affects SYCL's own structures which are used without an accessor, in particular the partitioned array extension. Example code, error log:

[INFO] Could not find v++ executable in /opt/xilinx/SDx/2019.1/bin
[INFO] Try with xocc...
warning: Linking two modules of different data layouts: '/opt/xilinx/SDx/2019.1/bin/../lnx64/lib/libspir64-39-hls.bc' is 'e-m:e-i64:64-i128:128-i256:256-i512:512-i1024:1024-i2048:2048-i4096:4096-n8:16:32:64-S128-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas 'llvm-link' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024'

warning: Linking two modules of different target triples: /opt/xilinx/SDx/2019.1/bin/../lnx64/lib/libspir64-39-hls.bc' is 'fpga64-xilinx-none' whereas 'llvm-link' is 'spir64'

Invoking Kernel Compilation with /opt/xilinx/SDx/2019.1/bin/xocc
--target: hw_emu
--platform: xilinx_u200_xdma_201830_2
Compiling kernel: xSYCL15417455371595987851
Outputting file to: /tmp/xSYCL15417455371595987851.xo
Input file is: /tmp/main_kernels-linked.xpirbc
Option Map File Used: '/opt/xilinx/SDx/2019.1/data/sdx/xocc/optMap.xml'

****** xocc v2019.1 (64-bit)
  **** SW Build 2552052 on Fri May 24 14:47:09 MDT 2019
    ** Copyright 1986-2019 Xilinx, Inc. All Rights Reserved.

Attempting to get a license: ap_opencl
Feature available: ap_opencl
INFO: [XOCC 60-1306] Additional information associated with this xocc compile can be found at:
	Reports: /home/jan/workspace/sycl-box/_x/reports/xSYCL15417455371595987851
	Log files: /home/jan/workspace/sycl-box/_x/logs/xSYCL15417455371595987851
INFO: [XOCC 60-585] Compiling for hardware emulation target
INFO: [XOCC 60-1316] Initiating connection to rulecheck server, at Wed Nov 20 15:58:25 2019
Running Rule Check Server on port:39317
INFO: [XOCC 60-1315] Creating rulecheck session with output '/home/jan/workspace/sycl-box/_x/reports/xSYCL15417455371595987851/xocc_compile_xSYCL15417455371595987851_guidance.html', at Wed Nov 20 15:58:26 2019
INFO: [XOCC 60-895]   Target platform: /opt/xilinx/platforms/xilinx_u200_xdma_201830_2/xilinx_u200_xdma_201830_2.xpfm
INFO: [XOCC 60-423]   Target device: xilinx_u200_xdma_201830_2
INFO: [XOCC 60-242] Creating kernel: 'xSYCL15417455371595987851'

===>The following messages were generated while  performing high-level synthesis for kernel: xSYCL15417455371595987851 Log file: /home/jan/workspace/sycl-box/_x/xSYCL15417455371595987851/xSYCL15417455371595987851/vivado_hls.log :
INFO: [XOCC 204-61] Option 'relax_ii_for_timing' is enabled, will increase II to preserve clock frequency constraints.
INFO: [XOCC 204-61] Pipelining loop 'XCL_WG_DIM_Y_XCL_WG_DIM_X'.
INFO: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 77.
INFO: [XOCC 60-594] Finished kernel compilation
INFO: [XOCC 60-244] Generating system estimate report...
INFO: [XOCC 60-1092] Generated system estimate report: /home/jan/workspace/sycl-box/_x/reports/xSYCL15417455371595987851/system_estimate_xSYCL15417455371595987851.xtxt
INFO: [XOCC 60-586] Created /tmp/xSYCL15417455371595987851.xo
INFO: [XOCC 60-791] Total elapsed time: 0h 0m 34s
Invoking Kernel Compilation with /opt/xilinx/SDx/2019.1/bin/xocc
--target: hw_emu
--platform: xilinx_u200_xdma_201830_2
Compiling kernel: xSYCL18068824734715316925
Outputting file to: /tmp/xSYCL18068824734715316925.xo
Input file is: /tmp/main_kernels-linked.xpirbc
Option Map File Used: '/opt/xilinx/SDx/2019.1/data/sdx/xocc/optMap.xml'

****** xocc v2019.1 (64-bit)
  **** SW Build 2552052 on Fri May 24 14:47:09 MDT 2019
    ** Copyright 1986-2019 Xilinx, Inc. All Rights Reserved.

Attempting to get a license: ap_opencl
Feature available: ap_opencl
INFO: [XOCC 60-1306] Additional information associated with this xocc compile can be found at:
	Reports: /home/jan/workspace/sycl-box/_x/reports/xSYCL18068824734715316925
	Log files: /home/jan/workspace/sycl-box/_x/logs/xSYCL18068824734715316925
INFO: [XOCC 60-585] Compiling for hardware emulation target
INFO: [XOCC 60-1316] Initiating connection to rulecheck server, at Wed Nov 20 15:59:01 2019
Running Rule Check Server on port:38567
INFO: [XOCC 60-1315] Creating rulecheck session with output '/home/jan/workspace/sycl-box/_x/reports/xSYCL18068824734715316925/xocc_compile_xSYCL18068824734715316925_guidance.html', at Wed Nov 20 15:59:02 2019
INFO: [XOCC 60-895]   Target platform: /opt/xilinx/platforms/xilinx_u200_xdma_201830_2/xilinx_u200_xdma_201830_2.xpfm
INFO: [XOCC 60-423]   Target device: xilinx_u200_xdma_201830_2
INFO: [XOCC 60-242] Creating kernel: 'xSYCL18068824734715316925'
ERROR: [XOCC 17-1309] Gcc: #24 0x00007faf6be98b6b __libc_start_main /build/glibc-KRRWSm/glibc-2.29/csu/../csu/libc-start.c:308:16
ERROR: [XOCC 60-398] clang failed
ERROR: [XOCC 60-599] Kernel compilation failed to complete
ERROR: [XOCC 60-592] Failed to finish compilation
Option Map File Used: '/opt/xilinx/SDx/2019.1/data/sdx/xocc/optMap.xml'

****** xocc v2019.1 (64-bit)
  **** SW Build 2552052 on Fri May 24 14:47:09 MDT 2019
    ** Copyright 1986-2019 Xilinx, Inc. All Rights Reserved.

ERROR: [XOCC 60-602] Source file does not exist: /tmp/xSYCL18068824734715316925.xo
ERROR: [XOCC 60-623] Unsupported input file type specified.
rm: cannot remove '/tmp/xSYCL18068824734715316925.xo': No such file or directory

Ralender pushed a commit to Ralender/sycl that referenced this issue Jul 1, 2020
Ralender pushed a commit to Ralender/sycl that referenced this issue Jul 1, 2020
  CONFLICT (content): Merge conflict in clang/lib/CodeGen/CGLoopInfo.h
  CONFLICT (content): Merge conflict in clang/lib/CodeGen/CGLoopInfo.cpp
@Ralender
Contributor

Ralender commented Dec 2, 2020

Some bugs on this list have been fixed since it was written.

This is a non-exhaustive list of some of the larger problems with Xilinx FPGA compilation and runtime execution (some with more information than others) that need some long-term thought:

Problem: If you create buffers and then use SYCL functionality that relies on underlying OpenCL calls to modify those buffers as cl_mem objects before any kernel has been invoked (single_task/parallel_for etc.), you'll incur XRT runtime errors, e.g.:

[XRT] ERROR: Internal error. cl_mem doesn't map to buffer object

You should be able to see this in action if you comment out the "noop" SYCL kernel in the accessor_copy.cpp test.

Reason: XRT will not consider a device "active" and usable until you've loaded a binary, as it can't query most of the information it needs before then. One unfortunate side effect is that cl_mem buffers are not assigned to a device, so whenever you try to use something like a handler copy with a SYCL accessor/buffer, the underlying XRT OpenCL call cannot find the buffer in relation to a device (it queries devices for buffers, and if they're not found, XRT is not pleased).

Possible fix Ideas:

* Eager binary program loading as we know we'll only use pre-compiled binaries with our FPGAs

* Lazy OpenCL buffer creation/operations, only do this after we know XRT will be happy i.e. binary loaded and good to go

Workaround: Force-start the device with a noop kernel. This is not ideal, and while it works around the issue on hw/sw emu, I'm not too sure how real hardware will appreciate it.

This is fixed in sw_emu and hw_emu, but still not in hw.

Problem: Only one queue to a Xilinx FPGA device can exist at a time. If you accidentally create more than one, XRT will not be happy.

Problem: All kernels require at least one accessor. A kernel with zero accessors causes a compile error in xocc about no argument being bound to AXI_GMEM. I'm not too sure how many use cases there are for a kernel with no accessors, but perhaps it shouldn't emit an error like this.

Both issues are still present.

Problem: You cannot compile for Xilinx FPGA with -g. This prevents users from using debug mode on the SYCL runtime, not just on the kernel.

Reason: I believe this is because we're generating a kernel file with a lot of debug code and trying to pipe that through xocc which doesn't really know how to handle all of the debug information.

Possible fix Ideas:

* A long shot, but a quick fix if it works: add -g to the xocc compilation components of the sycl-xocc script. Perhaps xocc will then realize that the kernel may come with debug information attached and handle it better. I find this unlikely to work, but it's low-hanging fruit if it does...

* More surefire fix: Do not compile the device component of SYCL with -g. Only compile the host component with -g, and stop -g from being pushed onto the device compilation. This will create a much simpler SPIR-df kernel to pass to xocc. Instead, -g should be applied to the xocc compilation and linker commands to get the kernel debug information. This should circumvent any issues with debug information breaking xocc kernel compilation whilst still giving both kernel and SYCL runtime debug information. It shouldn't be too hard; it just requires some driver tweaks.

It is now possible to use -g when compiling in SYCL mode, but this only affects host code.
Device code is always optimized, because optimizations naturally remove many instances of IR constructs that v++ can't deal with, such as double pointers.

Problem: Related to issue #32: mixing structures inside of kernels can cause ICEs in one of xocc's passes, aggressive dead code elimination (AGDCE). The minimal triggering SYCL test case is issue_related/agdce_ice.cpp.

Reason: This seems to be a problem with address space casting from a structure that is defined outside of a kernel (ergo no address space). When you index an accessor containing several passed-in instances of the class/struct, the compiler tries to address space cast and the AGDCE pass blows up.

Status: I am led to believe it's a bug in xocc/Vivado HLS, so it appears to be outside of our jurisdiction. I have forwarded the issue to someone on their team, but it's low priority and may take a while to get fixed without some follow-up.

This issue no longer occurs.

Problem: Boost Hana's times.with_index in conjunction with its overloaded + operator will kill hw_emu, and very likely hw, as it is not completely optimized and inlined with -O3 as you would expect (and as it is in a non-SYCL -O3 pass). This leaves some external declarations of and calls to functions but no definitions of those functions, i.e. only partial optimization/inlining. That is a little odd, as the definitions exist prior to the -O3 pass and other Boost Hana functionality is inlined appropriately.

Reason: The current best guess is that the index argument passed into the lambda that with_index invokes is the cause. It seems like it could be another address space cast related issue. The minimal test case for this issue is boost_hana_functor_arg.cpp inside the issue_related directory.

  cgh.single_task<class array_add>([=]() {
    boost::hana::int_<5>::times.with_index([&](const auto i) {
      a_rw[i+1] = 6;
    });
  });

To highlight the issue: in the example, a variable is passed into the lambda from an externally defined function and is then used with the + operator. This + operator is overloaded inside Boost Hana to support compile-time usage. This snippet should be unrolled and inlined, removing all of the Boost Hana related functionality. That doesn't happen, and it seems to be because adding the value 1 to the argument passed into the function triggers an address space cast, which I think prevents the required optimizations.

This is an assumption though, based on the fact that the code below works fine:

  cgh.single_task<class array_add>([=]() {
    int i = 0;
    boost::hana::int_<N>::times([&] {
      a_rw[i+0] = 6;
      ++i;
    });
  });

The index variable and the arbitrary constant now exist in the same address space, and all seems to be fine as far as the compiler is concerned.

This issue no longer happens with the provided code.

Some of these "problems" are peculiarities in our FPGA compilation pipeline and non-standard OpenCL implementation and may not necessarily be "problems".

@agozillon
Contributor Author

agozillon commented Dec 2, 2020

I think this is the failing test case you're missing: https://github.com/agozillon/sycl/blob/sycl/unified/next-applied-fixes/sycl/test/xocc_tests/issue_related/agdce_ice.cpp. Hopefully it works now! Great work.

@Ralender
Contributor

Ralender commented Dec 3, 2020

I think this is the failing test case you're missing: https://github.com/agozillon/sycl/blob/sycl/unified/next-applied-fixes/sycl/test/xocc_tests/issue_related/agdce_ice.cpp. Hopefully it works now! Great work.

Thanks, I updated the comment.

@j-stephan
Member

j-stephan commented Apr 12, 2021

Workaround: Force-start the device with a noop kernel. This is not ideal, and while it works around the issue on hw/sw emu, I'm not too sure how real hardware will appreciate it.

It doesn't seem to cause any issues on real hardware. My Alveo U200 executes the other kernels in the program just fine.

3 participants