[fix] Problem with FPGA execution for multiple tasks and the default scheduler #416

stratika · 2024-05-10T16:34:25Z

Description

This PR provides a fix for the issue described in #401.

Note: This PR was tested in Intel FPGA emulation mode. I do not have access to a Xilinx FPGA to test it.

Problem description

There are two identified problems:

1. In the OCLCodeCache class, a method checks whether forced compilation has been triggered, and the Intel FPGA compilers are invoked only if that check is true. This appears to be a leftover from the time we had the lookupbuffer kernel, when we waited until all LAUNCH bytecodes corresponding to all task indices (all tasks within a TaskGraph) had been issued before triggering the forceCompilation() method from the TornadoVM class. See here.
2. The executor.withDefaultScheduler() configuration in the ExecutionPlan seems to break the execution and results in an OpenCL error (CL_INVALID_WORK_GROUP_SIZE) when the clEnqueueNDRangeKernel function is invoked.

To fix the first problem, I removed the shouldCompile check that existed in OCLCodeCache. To my understanding, this is an old check that is no longer required since we deprecated the lookupbuffer kernel.

To fix the second problem, I did a small refactoring of OCLKernelScheduler (an abstract class) and OCLFPGAScheduler (which extends it), so that the default local work group for FPGAs is set when executor.withDefaultScheduler() is enabled in a TornadoExecutionPlan.
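The shape of this refactoring can be sketched roughly as follows. This is a minimal illustration only: the class and method names are hypothetical stand-ins for the roles of OCLKernelScheduler and OCLFPGAScheduler, not the actual TornadoVM sources, and the value [64, 1, 1] is taken from the warning printed in the test output of this PR.

```java
// Hypothetical sketch: these classes only mirror the roles described in this PR
// (an abstract scheduler and an FPGA-specific subclass). They are NOT TornadoVM code.
abstract class SchedulerSketch {
    /** Device-specific default local work group, or null to leave the choice to the driver. */
    abstract long[] defaultLocalWorkGroup();

    /** Local work size used when no grid is supplied (the default-scheduler path). */
    final long[] localWorkForDefaultScheduler() {
        long[] local = defaultLocalWorkGroup();
        // Return a copy so callers cannot modify the scheduler's default.
        return local == null ? null : local.clone();
    }
}

final class FpgaSchedulerSketch extends SchedulerSketch {
    @Override
    long[] defaultLocalWorkGroup() {
        // Assumed value, copied from the "[TornadoVM OCL] Warning" line in the logs.
        return new long[] {64, 1, 1};
    }
}

final class GenericSchedulerSketch extends SchedulerSketch {
    @Override
    long[] defaultLocalWorkGroup() {
        return null; // no device-specific default in this sketch
    }
}

final class SchedulerSketchDemo {
    public static void main(String[] args) {
        long[] fpga = new FpgaSchedulerSketch().localWorkForDefaultScheduler();
        System.out.println(java.util.Arrays.toString(fpga)); // prints [64, 1, 1]
        System.out.println(new GenericSchedulerSketch().localWorkForDefaultScheduler() == null); // prints true
    }
}
```

Keeping the lookup in the abstract class means the default-scheduler path does not need to know which device it is targeting; only the FPGA subclass has to supply a value.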

This change also prompted me to run the BlurFilter example with a WorkerGrid, and I applied a small update to OCLGridInfo to check the default FPGA local work group.

Backend/s tested

Mark the backends affected by this PR.

OpenCL
PTX
SPIRV

OS tested

Mark the OS where this PR is tested.

Linux
OSx
Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

Yes
No

How to test the new patch?

First, you need an image named image.jpg in the /tmp directory. You can use this one:

Then run the following, as also described in issue #401:

rm -rf fpga-source-comp
tornado --debug --threadInfo --jvm="-Dblur.red.device=0:3 -Dblur.green.device=0:3 -Dblur.blue.device=0:3 -Dtornado.recover.bailout=False" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.BlurFilter

Output:

WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time: 
	ns = 2340610196
	seconds = 2.340610196
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [448, 640]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time: 
	ns = 156405899
	seconds = 0.15640589900000001
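
The work-group counts in the output above follow directly from the global and local sizes. The sketch below reproduces that arithmetic and adds a divisibility check; a local size that does not evenly divide the global size is one common cause of CL_INVALID_WORK_GROUP_SIZE, although this PR does not state that it was the exact cause here. The class and method names are hypothetical and not part of TornadoVM.

```java
// Hypothetical helper; names are illustrative and not TornadoVM APIs.
final class WorkGroupCheck {
    // Default FPGA local work group as reported in the log output above.
    static final long[] FPGA_DEFAULT_LOCAL = {64, 1, 1};

    /** Returns true when every global dimension is a positive multiple of its local dimension. */
    static boolean isDivisible(long[] global, long[] local) {
        for (int i = 0; i < global.length; i++) {
            if (local[i] <= 0 || global[i] % local[i] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Number of work groups per dimension: global[i] / local[i]. */
    static long[] numWorkGroups(long[] global, long[] local) {
        long[] groups = new long[global.length];
        for (int i = 0; i < global.length; i++) {
            groups[i] = global[i] / local[i];
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] global = {448, 640}; // global work size from the log
        System.out.println(isDivisible(global, FPGA_DEFAULT_LOCAL)); // prints true
        System.out.println(java.util.Arrays.toString(numWorkGroups(global, FPGA_DEFAULT_LOCAL))); // prints [7, 640]
    }
}
```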

To test the GridScheduler on FPGAs, I have this patch, which uses a grid with a local work group of (128, 1, 1), different from the default one.

Download the patch, apply it, and build TornadoVM:

git apply fpga_gridscheduler.patch
make

and then run the same example:

rm -rf fpga-source-comp
tornado --debug --threadInfo --jvm="-Dblur.red.device=0:3 -Dblur.green.device=0:3 -Dblur.blue.device=0:3 -Dtornado.recover.bailout=False" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.BlurFilter

Output:

WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Parallel Total time: 
	ns = 2347304227
	seconds = 2.347304227
[DEBUG] JIT compilation for the FPGA
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [448, 640, 1]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : null

Parallel Total time: 
	ns = 161529220
	seconds = 0.16152922
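
In the output above, the requested local work group (128, 1, 1) was replaced by [64, 1, 1], as the warning indicates. The sketch below models one way such a check could behave. It is an assumption for illustration only: the class and method names are hypothetical, and the actual logic in OCLGridInfo may differ.

```java
// Hypothetical sketch of a grid check for FPGAs; NOT the OCLGridInfo implementation.
final class FpgaGridCheckSketch {
    // Assumed FPGA default, copied from the warning in the log output.
    static final long[] FPGA_DEFAULT_LOCAL = {64, 1, 1};

    /**
     * Returns the local size to launch with. A user-defined size that differs
     * from the FPGA default is replaced by the default, with a warning.
     */
    static long[] checkLocalWork(long[] userLocal) {
        if (!java.util.Arrays.equals(userLocal, FPGA_DEFAULT_LOCAL)) {
            System.out.println("Warning: changed the user-defined local size to: "
                    + java.util.Arrays.toString(FPGA_DEFAULT_LOCAL));
            return FPGA_DEFAULT_LOCAL.clone();
        }
        return userLocal;
    }

    public static void main(String[] args) {
        long[] requested = {128, 1, 1}; // local work group set by the test patch
        long[] used = checkLocalWork(requested);
        System.out.println(java.util.Arrays.toString(used)); // prints [64, 1, 1]
    }
}
```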

To test with multiple tasks running on the FPGA, I have this patch, which updates the MultipleTasks example so that it runs two kernels on the FPGA.

Download the patch, apply it, and build TornadoVM:

git apply fpga_multiple_tasks.patch
make

and then run the MultipleTasks example:

rm -rf fpga-source-comp
tornado --threadInfo --jvm="-Dexample.foo.device=0:3 -Dexample.bar.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.MultipleTasks

jjfumero · 2024-05-13T08:27:54Z

I could reproduce the fix with my configuration. Thanks @stratika.

mairooni

LGTM

Improvements
~~~~~~~~~~~~~~~~~~

- beehive-lab#402: Support for TornadoNativeArrays from FFI buffers.
- beehive-lab#403: Clean-up and refactoring for the code analysis of the loop-interchange.
- beehive-lab#405: Disable Loop-Interchange for CPU offloading.
- beehive-lab#407: Debugging OpenCL Kernels builds improved.
- beehive-lab#410: CPU block scheduler disabled by default and option to switch between different thread-schedulers added.
- beehive-lab#418: TornadoOptions and TornadoLogger improved.
- beehive-lab#423: MxM using ns instead of ms to report performance.
- beehive-lab#425: Vector types for ``Float<Width>`` and ``Int<Width>`` supported.
- beehive-lab#429: Documentation of the installation process updated and improved.
- beehive-lab#432: Support for SPIR-V code generation and dispatcher using the TornadoVM OpenCL runtime.

Compatibility
~~~~~~~~~~~~~~~~~~

- beehive-lab#409: Guidelines to build the documentation.
- beehive-lab#411: Windows installer improved.
- beehive-lab#412: Python installer improved to check download all Python dependencies before the main installer.
- beehive-lab#413: Improved documentation for installing all configurations of backends and OS.
- beehive-lab#424: Use Generic GPU Scheduler for some older NVIDIA Drivers for the OpenCL runtime.
- beehive-lab#430: Improved the installer by checking that the TornadoVM environment is loaded upfront.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- beehive-lab#400: Fix batch computation when the global thread indexes are used to compute the outputs.
- beehive-lab#414: Recover Test-Field unit-tests using Panama types.
- beehive-lab#415: Check style errors fixed.
- beehive-lab#416: FPGA execution with multiple tasks in a task-graph fixed.
- beehive-lab#417: Lazy-copy out fixed for Java fields.
- beehive-lab#420: Fix Mandelbrot example.
- beehive-lab#421: OpenCL 2D thread-scheduler fixed for NVIDIA GPUs.
- beehive-lab#422: Compilation for NVIDIA Jetson Nano fixed.
- beehive-lab#426: Fix Logger for all backends.
- beehive-lab#428: Math cos/sin operations supported for vector types.
- beehive-lab#431: Jenkins files fixed.

stratika added 5 commits May 1, 2024 14:15

[wip] Resolve issue with multiple tasks for FPGA path in OpenCL

e254c18

[feat] Add FPGA configuration file for Intel oneAPI HLS compiler

c5e3341

[fix] Set local work group for OCLFPGAScheduler and minor refactoring…

adf4402

… to obtain the default local work group from the abstract class

[feat] Update checkGripDimensions from GridScheduler for FPGAs

ba0a280

Merge branch 'develop' into fix/401/fpga-multiple-tasks

88bcbda

stratika requested review from jjfumero and mairooni May 10, 2024 16:34

stratika self-assigned this May 10, 2024

stratika added the bug and fpga labels May 10, 2024

jjfumero approved these changes May 13, 2024

mairooni approved these changes May 13, 2024

jjfumero merged commit 61a126a into beehive-lab:develop May 13, 2024
2 checks passed

stratika deleted the fix/401/fpga-multiple-tasks branch May 13, 2024 14:36

jjfumero mentioned this pull request May 28, 2024

[release] TornadoVM 1.0.5 #433

Merged
