Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ComputeAorta/OCK thread scheduler for the OpenCL backend #532

Merged
merged 3 commits into from
Aug 28, 2024

Conversation

jjfumero
Copy link
Member

Description

Fix the OpenCL CPU thread scheduler when the Compute Aorta from the oneAPI Construction Kit is used. This scheduler gives us access a RISC-V Device executed using the integrated SPIKE simulator.

Problem description

The problem was that, because it runs using a simulator, it was using the wrong number of threads. This PR fixes that by invoking the generated Thread GPU Scheduler.

Backend/s tested

Mark the backends affected by this PR.

  • OpenCL using the OCK as a runtime.
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • OSx
  • Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

Using the OCK for RISC-V: https://github.com/codeplaysoftware/oneapi-construction-kit?tab=readme-ov-file#compiling-oneapi-construction-kit-for-risc-v

$ tornado --devices

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
	OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce GTX 1050
		Global Memory Size: 3.9 GB
		Local Memory Size: 48.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 64]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
	OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) HD Graphics 630
		Global Memory Size: 28.9 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [256]
		Max WorkGroup Configuration: [256, 256, 256]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
	OPENCL --  [ComputeAorta] -- ComputeAorta x86_64
		Global Memory Size: 7.8 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 1024]
		Device OpenCL C version: OpenCL C 1.2 Clang 18.1.8

  Tornado device=0:3
	OPENCL --  [ComputeAorta] -- RefSi G1 RV64
		Global Memory Size: 2.0 GB
		Local Memory Size: 256.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 1024]
		Device OpenCL C version: OpenCL C 1.2 Clang 18.1.8


$ tornado --printKernel --jvm="-Ds0.t0.device=0:3" --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt 


#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void add(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *a, __global uchar *b, __global uchar *c)
{
  int i_10, i_12, i_14, i_15, i_3, i_4, i_5, i_6; 
  long l_7, l_8; 
  ulong ul_0, ul_1, ul_2, ul_11, ul_13, ul_9; 

  // BLOCK 0
  ul_0  =  (ulong) a;
  ul_1  =  (ulong) b;
  ul_2  =  (ulong) c;
  i_3  =  get_global_size(0);
  i_4  =  get_global_id(0);
  // BLOCK 1 MERGES [0 2 ]
  i_5  =  i_4;
  for(;i_5 < 8;)
  {
    // BLOCK 2
    i_6  =  i_5 + 6;
    l_7  =  (long) i_6;
    l_8  =  l_7 << 2;
    ul_9  =  ul_0 + l_8;
    i_10  =  *((__global int *) ul_9);
    ul_11  =  ul_1 + l_8;
    i_12  =  *((__global int *) ul_11);
    ul_13  =  ul_2 + l_8;
    i_14  =  i_10 + i_12;
    *((__global int *) ul_13)  =  i_14;
    i_15  =  i_3 + i_5;
    i_5  =  i_15;
  }  // B2
  // BLOCK 3
  return;
}  //  kernel

Task info: s0.t0
	Backend           : OPENCL
	Device            : RefSi G1 RV64 CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 1
	Global work offset: [0]
	Global work size  : [8]
	Local  work size  : [8, 1, 1]
	Number of workgroups  : [1]

@stratika
Copy link
Collaborator

how can we test this PR? did you cross-compile the OCK in your test?
After that, we should be able to see the RISC-V device?

@jjfumero
Copy link
Member Author

Hi @stratika , it is automatic. If you have installed OCK, you will see a new device (Compute Aorta Device).

@jjfumero
Copy link
Member Author

jjfumero commented Aug 26, 2024

Here the instructions to get OCK for Intel CPUs:

mkdir ock
cd ock
baseDIR=$PWD 

## Clone LLVM
git clone git@github.com:llvm/llvm-project.git
llvmDIR=$PWD 
cd llvm

## Go to branch 18.X
git checkout release/18.x

LLVMINSTALL=$llvmDIR/build-x86_64/install

## Configure and Install LLVM
cmake llvm -GNinja   \
    -Bbuild-x86_64   \
    -DCMAKE_BUILD_TYPE=Release   \
    -DCMAKE_INSTALL_PREFIX=$LLVMINSTALL \
    -DLLVM_ENABLE_PROJECTS=clang;lld  \
    -DLLVM_TARGETS_TO_BUILD=X86 

ninja -C build-x86_64 install 

cd $baseDIR

## Configure and install oneAPI Construction Kit
git clone https://github.com/codeplaysoftware/oneapi-construction-kit
cd $baseDIR/oneapi-construction-kit 

cmake . -GNinja  \
    -Bbuild-x86_64 \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=$PWD/build-x86_64/install \
   -DCA_ENABLE_API=cl \
   -DCA_ENABLE_DOCUMENTATION=OFF \
   -DCA_LLVM_INSTALL_DIR=$LLVMINSTALL

ninja -C build-x86_64 install

@jjfumero
Copy link
Member Author

Ready for review - the CPU scheduler is correct, even for RISC-V cores

Copy link
Collaborator

@stratika stratika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, LGTM! I tried it also with the RISC-V. Just a note for the test example, the printing does not print the values so that to ensure that the result of the addition is correct. I tested it and they are correct.

There is an issue with the SPIRV OCL for the RISC-V device. But I think @jjfumero is already looking at it for a future PR.

@jjfumero
Copy link
Member Author

thanks, LGTM! I tried it also with the RISC-V. Just a note for the test example, the printing does not print the values so that to ensure that the result of the addition is correct. I tested it and they are correct.

There is an issue with the SPIRV OCL for the RISC-V device. But I think @jjfumero is already looking at it for a future PR.

Yes, I am still working on this

Copy link
Member

@mikepapadim mikepapadim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jjfumero jjfumero merged commit 32975df into beehive-lab:develop Aug 28, 2024
2 checks passed
@jjfumero jjfumero deleted the fix/ock/scheduler branch August 28, 2024 10:20
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Aug 30, 2024
Improvements
~~~~~~~~~~~~~~~~~~

- beehive-lab#468: Cleanup Abstract Metadata Class.
- beehive-lab#473: Add maven plugin to build TornadoVM source for the releases.
- beehive-lab#474: Refactor `<X>TornadoDevice` to place common methods in the `TornadoXPUInterface`.
- beehive-lab#482: Help messages improve when an out-of-memory exception is raised.
- beehive-lab#484: Double-type for the trigonometric functions added in the `TornadoMath` class.
- beehive-lab#487: Prebuilt API simplified.
- beehive-lab#494: Add test to trigger unsupported features related to direct use of Memory Segments.
- beehive-lab#509: Add a quick pass configuration to skip the heavy tests during active development.
- beehive-lab#532: Improve thread scheduler to support RISC-V Accelerators from Codeplay.
- beehive-lab#533: Support for scalar values to be passed via lambda expressions as tasks.
- beehive-lab#538: `README` file updated.
- beehive-lab#539: Refactor core classes and add new API methods to pass compilation flags to the low-level driver compilers (OpenCL, PTX and Level Zero).
- beehive-lab#542: Tagged LevelZero JNI and Beehive Toolkit dependencies added in the build and installer.

Compatibility
~~~~~~~~~~~~~~~~~~

- beehive-lab#465: Support for JDK 22 and GraalVM 24.0.2.
- beehive-lab#486: Temurin for Windows added in the list of supported JDKs.
- beehive-lab#525: Revert usage of String Templates in preparation for JDK 23.
- beehive-lab#527: SPIR-V version parameter added. TornadoVM may run previous SPIR-V versions (e.g., ComputeAorta from Codeplay).
- beehive-lab#513: LevelZero JNI Library updated to v0.1.4.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- beehive-lab#470: README documentation fixed.
- beehive-lab#478: Fix the test names that are present in the white list.
- beehive-lab#488: FP64 Kind for radian operations and the PTX backend fixed.
- beehive-lab#493: Tests Whitelist for PTX backend fixed.
- beehive-lab#502: Fix barrier type in the documentation regarding programmability of reductions.
- beehive-lab#514: Installer script fixed.
- beehive-lab#540: Fix  issue with clean-up execution IDs function.
- beehive-lab#541: Fix Data Accessors for the prebuilt API.
- beehive-lab#543: Fix checkstyle condition and FP16 error message improved.
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Aug 30, 2024
Improvements
~~~~~~~~~~~~~~~~~~

- beehive-lab#468: Cleanup Abstract Metadata Class.
- beehive-lab#473: Add maven plugin to build TornadoVM source for the releases.
- beehive-lab#474: Refactor `<X>TornadoDevice` to place common methods in the `TornadoXPUInterface`.
- beehive-lab#482: Help messages improved when an out-of-memory exception is raised.
- beehive-lab#484: Double-type for the trigonometric functions added in the `TornadoMath` class.
- beehive-lab#487: Prebuilt API simplified.
- beehive-lab#494: Add test to trigger unsupported features related to direct use of Memory Segments.
- beehive-lab#509: Add a quick pass configuration to skip the heavy tests during active development.
- beehive-lab#532: Improve thread scheduler to support RISC-V Accelerators from Codeplay.
- beehive-lab#533: Support for scalar values to be passed via lambda expressions as tasks.
- beehive-lab#538: `README` file updated.
- beehive-lab#539: Refactor core classes and add new API methods to pass compilation flags to the low-level driver compilers (OpenCL, PTX and Level Zero).
- beehive-lab#542: Tagged LevelZero JNI and Beehive Toolkit dependencies added in the build and installer.

Compatibility
~~~~~~~~~~~~~~~~~~

- beehive-lab#465: Support for JDK 22 and GraalVM 24.0.2.
- beehive-lab#486: Temurin for Windows added in the list of supported JDKs.
- beehive-lab#525: Revert usage of String Templates in preparation for JDK 23.
- beehive-lab#527: SPIR-V version parameter added. TornadoVM may run previous SPIR-V versions (e.g., ComputeAorta from Codeplay).
- beehive-lab#513: LevelZero JNI Library updated to v0.1.4.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- beehive-lab#470: README documentation fixed.
- beehive-lab#478: Fix the test names that are present in the white list.
- beehive-lab#488: FP64 Kind for radian operations and the PTX backend fixed.
- beehive-lab#493: Tests Whitelist for PTX backend fixed.
- beehive-lab#502: Fix barrier type in the documentation regarding programmability of reductions.
- beehive-lab#514: Installer script fixed.
- beehive-lab#540: Fix  issue with clean-up execution IDs function.
- beehive-lab#541: Fix Data Accessors for the prebuilt API.
- beehive-lab#543: Fix checkstyle condition and FP16 error message improved.

minor change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

3 participants