[SYCL] [MATRIX] Enable joint_matrix_load, joint_matrix_store, and joint_matrix_mad for AMX #3503

yubingex007-a11y · 2021-04-07T13:29:45Z

We provide new interfaces for matrix multiply in this patch:

A new class called joint_matrix is introduced, and the user needs to
specify the type of the elements, sizes, and the memory layout.
joint_matrix_load is used for loading data from main memory to tiles of
AMX or kernel's local memory.
joint_matrix_store is used for storing data from tiles of AMX or kernel's
local memory to main memory.
joint_matrix_mad is used for the matrix multiply and add function.
It performs the multiply operation on the matrices A and B, accumulates the
result with C and returns the result.

With this patch, the following operation can be realized:
C = A*B+C

All cases where A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
All cases where A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)
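
For readers who want a scalar reference for what C = A*B+C means in the int8 case, here is a minimal sketch in plain C++. It is not the AMX implementation from this patch and it ignores the packed_b layout; the function name `matmul_add_ref` and the use of `std::vector` are assumptions made for illustration. It only shows why C is int32: products of two int8 values do not fit in int8, so they are widened before accumulation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for C = A*B + C with row-major A (M x K, int8),
// row-major B (K x N, int8) and row-major C (M x N, int32).
// Hypothetical helper for illustration; not part of the patch.
void matmul_add_ref(const std::vector<std::int8_t> &A,
                    const std::vector<std::int8_t> &B,
                    std::vector<std::int32_t> &C, std::size_t M,
                    std::size_t N, std::size_t K) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      std::int32_t acc = C[m * N + n]; // start from the existing C value
      for (std::size_t k = 0; k < K; ++k) {
        // Widen both operands to int32 before multiplying.
        acc += static_cast<std::int32_t>(A[m * K + k]) *
               static_cast<std::int32_t>(B[k * N + n]);
      }
      C[m * N + n] = acc;
    }
  }
}
```

A reference like this can be used in a test to check the result produced by the tiled path.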

bader

LGTM in general, but there are a bunch of comments regarding the code style and organization we apply to the project.

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

bader · 2021-04-07T14:51:34Z

sycl/test/intel_matrix/matrix-amx-bf16-test.cpp

@@ -0,0 +1,184 @@
+// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s


Suggested change

// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s

// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s -o %t

I think we should execute this test, but I assume it requires HW support for AMX ISA. Right?
This might require additional LIT configuration:

Add new feature to the config, detect if HW supports it and enable it for LIT tests.

Test must require AMX feature to run.

According to https://github.com/intel/llvm/blob/sycl/CONTRIBUTING.md#tests-development, we should also move this test to https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/ (https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/extensions in particular) directory.

The same is applicable for the second test as well.

Thanks for the comments, Alexey. BTW, if the testcase can't run for now, could we still move it to sycl/test/on-device/extensions?

If you are going to use it in "does it compile?" mode, then there is no need to move it.

The on-device directory is intended for the tests requiring special HW for execution and testing features under active development. When feature API is finalized and it's ready for end users, we move such tests to llvm-test-suite repository.

yubingex007-a11y · 2021-04-08T05:40:03Z

BTW, after I address the comments, should I create another commit in the PR or use "git commit --amend"?

bader · 2021-04-08T11:05:26Z

BTW, after I address the comments, should I create another commit in the PR or use "git commit --amend"?

We recommend addressing comments in separate commits. This makes it possible to track whether and how comments are addressed.
A force-push removes the old version of the patch from the pull request, and GitHub can't correlate comments with the new version of the patch.

bader · 2021-04-08T13:04:39Z

sycl/include/CL/sycl/INTEL/intel_matrix/matrix-amx.hpp

+// ===--------------------------------------------------------------------=== //
+
+#pragma once
+


Please add the following include, which defines __SYCL_INLINE_NAMESPACE and __SYCL_ALWAYS_INLINE.

Suggested change

#include <CL/sycl/detail/defines_elementary.hpp>

yubingex007-a11y · 2021-04-08T16:12:33Z

@bader If we have multiple commits in PR, could we fuse them into one commit?
@dkhaldi I guess we prefer one commit for this feature?

bader · 2021-04-08T16:35:01Z

@bader If we have multiple commits in PR, could we fuse them into one commit?

Please do not fuse the commits within one pull request. All commits will be squashed when the PR is merged.

yubingex007-a11y · 2021-04-09T03:34:01Z

Hi, @bader. It seems the failing testcase SYCL :: Reduction/reduction_nd_N_vars.cpp (http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3350/pipeline/) fails because Jenkins is using an old version of the testcase. It fails on CPU.
The latest code (https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Reduction/reduction_nd_N_vars.cpp) marks CPU as unsupported:
// TODO: The test irregularly reports incorrect results on CPU.
// UNSUPPORTED: cpu

So can we ignore this failure?

bader · 2021-04-09T06:46:13Z

@tfzhu, how can we retest this PR with the latest sources?

yubingex007-a11y · 2021-04-09T08:34:35Z

@tfzhu, how can we retest this PR with the latest sources?

I've got support from @DoyleLi and I've just rerun Jenkins/Precommit. Jenkins will fetch the latest source. We can also observe this under the "Check out from version control" label at http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3381/pipeline

bader · 2021-04-09T09:49:10Z

@againull, @intel/llvm-reviewers-runtime, ping.

steffenlarsen

Good stuff! Is there an extension document for this available somewhere?

steffenlarsen · 2021-04-09T14:06:58Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

+          matrix_layout Layout>
+__SYCL_ALWAYS_INLINE static typename std::enable_if<
+    (NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type
+submatrix_load(detail::submatrix<T> &sub_m,


Should this be using the submatrix class defined above rather than the one in detail? Same question for functions like submatrix_mad and submatrix_store.
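
The `std::enable_if` condition in the snippet above selects an overload at compile time depending on whether the matrix fits in one tile. The sketch below isolates that dispatch in plain C++. The value `tile_size = 16`, the enum `load_path`, and the helper names are assumptions made for illustration; the value actually used by the patch is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Assumption for illustration only: a tile holds at most 16 rows and
// 16 four-byte columns.
constexpr std::size_t tile_size = 16;

enum class load_path { single_tile, multi_tile };

// Same predicate as in the reviewed snippet: too many rows, or too many
// four-byte columns once the element size is taken into account.
template <typename T, std::size_t NumRows, std::size_t NumCols>
constexpr bool exceeds_one_tile() {
  return (NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size);
}

// Chosen when the matrix has to be split across several tiles.
template <typename T, std::size_t NumRows, std::size_t NumCols>
constexpr typename std::enable_if<exceeds_one_tile<T, NumRows, NumCols>(),
                                  load_path>::type
select_load_path() {
  return load_path::multi_tile;
}

// Chosen when the whole matrix fits in a single tile.
template <typename T, std::size_t NumRows, std::size_t NumCols>
constexpr typename std::enable_if<!exceeds_one_tile<T, NumRows, NumCols>(),
                                  load_path>::type
select_load_path() {
  return load_path::single_tile;
}

static_assert(select_load_path<std::int8_t, 16, 64>() ==
                  load_path::single_tile,
              "a 16 x 64 int8 matrix fits in one tile under this assumption");
```

Because exactly one of the two overloads survives substitution for any given shape, the caller never has to pick a path by hand.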

steffenlarsen · 2021-04-09T14:10:50Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix.hpp

@@ -0,0 +1,16 @@
+//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==//


Suggested change

//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==//

//==------------------ matrix.hpp - SYCL matrix ----------------*- C++ -*---==//

LuoYuanke · 2021-04-10T01:12:27Z

sycl/test/on-device/extensions/matrix-amx-bf16-test.cpp

@@ -0,0 +1,186 @@
+// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out


How about using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?

LuoYuanke · 2021-04-10T01:12:54Z

sycl/test/on-device/extensions/matrix-amx-int8-test.cpp

@@ -0,0 +1,171 @@
+// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out


How about using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?

yubingex007-a11y · 2021-04-12T05:30:48Z

Good stuff! Is there an extension document for this available somewhere?

I think @dkhaldi will provide it this week. Could we merge this patch first?

dkhaldi · 2021-04-12T15:08:07Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix.hpp

+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// ===--------------------------------------------------------------------=== //
+/// -DAMX will enable joint_matrix feature for AMX


We don't need -DAMX anymore. Please update.
Also, what compilation command line should the user use to trigger this extension? Is it this one?
clang++ -fsycl -march=sapphirerapids
Shouldn't we also add the AOT option here:
-fsycl-targets="spir64_x86_64-unknown-linux-sycldevice"
to avoid the user generating JIT code that does not work on the GPU or other devices?

dkhaldi · 2021-04-12T15:10:01Z

sycl/test/on-device/extensions/matrix-amx-bf16-test.cpp

+#include <assert.h>
+#include <cstdint>
+#include <cstdio>
+#include <immintrin.h>


Does the user need to add all these includes for such simple code? Please double check.
Can we move #include <immintrin.h> to matrix-amx.hpp so that the user does not have to add it here?

dkhaldi · 2021-04-12T15:19:13Z

Good stuff! Is there an extension document for this available somewhere?

I think @dkhaldi will provide it this week. Could we merge this patch first?

The spec doc should make it to intel/llvm this week. It is currently under internal review

dkhaldi · 2021-04-12T16:04:37Z

sycl/test/on-device/extensions/matrix-amx-bf16-test.cpp

+
+           ONEAPI::sub_group sg = spmd_item.get_sub_group();
+           joint_matrix<ONEAPI::sub_group, unsigned short, TM, TK> sub_a(sg);
+           joint_matrix<ONEAPI::sub_group, unsigned short, TK / 2, TN * 2, matrix_layout::packed_b> sub_b(sg); // ???? hide in new interface


Remove the comment.
Instead, add a detailed comment that for AMX, users need to explicitly use this packed_b layout along with the VNNI sizes for B matrix.
By default, the layout is row_major and size is (TK, TN).
I am adding this comment to the interface document as well.
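
To make the packed_b shape concrete, the sketch below repacks a row-major (K, N) matrix into a (K / Factor, N * Factor) matrix so that Factor consecutive elements of one column sit next to each other. With Factor = 2 this gives the (TK / 2, TN * 2) shape used for B in the bf16 test. The function name `pack_b` and the `Factor` parameter are hypothetical, and the exact element order expected by the AMX path is an assumption based on the usual VNNI layout; the interface document is the authority.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack a row-major K x N matrix into (K / Factor) x (N * Factor).
// Element (k, n) of the input lands at row k / Factor,
// column n * Factor + k % Factor of the output.
// Hypothetical helper for illustration; not part of the patch.
template <std::size_t Factor, typename T>
std::vector<T> pack_b(const std::vector<T> &b, std::size_t K, std::size_t N) {
  assert(K % Factor == 0 && b.size() == K * N);
  std::vector<T> packed(b.size());
  for (std::size_t k = 0; k < K; ++k) {
    for (std::size_t n = 0; n < N; ++n) {
      packed[(k / Factor) * (N * Factor) + n * Factor + (k % Factor)] =
          b[k * N + n];
    }
  }
  return packed;
}
```

For example, a 4 x 2 input with rows {0,1}, {2,3}, {4,5}, {6,7} and Factor = 2 becomes the 2 x 4 matrix with rows {0,2,1,3} and {4,6,5,7}.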

keryell

I have the feeling it is possible to adopt a terser API using a C++ coding style instead of a C API.

keryell · 2021-04-12T17:19:43Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

+    typename std::enable_if<(NumRows > matrix::tile_size) ||
+                                (NumCols * sizeof(T) / 4 > matrix::tile_size),
+                            void>::type
+    submatrix_load(detail::submatrix<T> &sub_m,


Just curious: why not submatrix::load instead?
And while it can be a static member, could it be a member function?

Hi, @keryell. Thanks for your comments. I agree we should change these into member functions, but we are approaching a release deadline. Could we merge this patch and address your comments in our next PR?

Whatever is good. Just need the tersest SYCL syntax at the end. :-)

keryell · 2021-04-12T17:20:45Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

+
+// This handles cases where T1 is int8, T2 is int32.
+inline __SYCL_ALWAYS_INLINE static void
+submatrix_mad(detail::submatrix<int8_t> &sub_ma,


submatrix::mad?

keryell · 2021-04-12T17:21:23Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

+          matrix_layout Layout, access::address_space Space>
+inline __SYCL_ALWAYS_INLINE typename std::enable_if<
+    (NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type
+joint_matrix_load(Group sg,


joint_matrix::load?

@dkhaldi Do you agree to change it to member function, too?

@keryell, For joint_matrix_load/store/mad, we are following the current existing group algorithms of SYCL 2020 like joint_reduce. So these should be free functions. Having said that, this is being approved as an experimental interface. We can revise these details once we take the extension to the SYCL group.

Yes it can be free functions too but it looks like the syntax is terser in that case with classes and members. I do not know how generic your joint_matrix is, but it looks really like a coherent set of operations on some operands.
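
The two API shapes discussed here can coexist, which may help when comparing them. The sketch below uses a toy type, `toy_tile`, that is entirely hypothetical and unrelated to the real joint_matrix class. It shows a member function and a free function that forwards to it, so the call sites can be compared side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a matrix fragment; not the joint_matrix class.
struct toy_tile {
  std::size_t rows;
  std::size_t cols;
  std::vector<int> data;

  toy_tile(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

  // Member-function style: tile.load(src, stride).
  void load(const int *src, std::size_t stride) {
    assert(stride >= cols);
    for (std::size_t i = 0; i < rows; ++i)
      for (std::size_t j = 0; j < cols; ++j)
        data[i * cols + j] = src[i * stride + j];
  }
};

// Free-function style, mirroring the shape of the group algorithms:
// toy_tile_load(tile, src, stride).
inline void toy_tile_load(toy_tile &t, const int *src, std::size_t stride) {
  t.load(src, stride);
}
```

Both calls do the same work; the difference is only in how the call site reads.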

keryell · 2021-04-12T17:22:24Z

sycl/include/CL/sycl/ONEAPI/intel_matrix/matrix-amx.hpp

+  T *mem = src.get();
+  // memcpy from mem to jm.raw_storage
+  for (int i = 0; i < NumRows; ++i) {
+    char *srcptr = (char *)mem + i * stride * sizeof(T);


Use C++ casts instead?

Yeah, I will change it. BTW, is there any advantage of reinterpret_cast compared with a C-style cast? Is it that it is more eye-catching?
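
On the question above: besides being easier to search for, `reinterpret_cast` is narrower than a C-style cast. A C-style cast may silently combine several conversions, including dropping `const`, while `reinterpret_cast` refuses to cast away constness. The sketch below rewrites the row-copy loop from the snippet in that style. The function name `copy_rows` and its signature are assumptions for illustration, not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy num_rows rows of row_elems elements each from a strided source
// (stride counted in elements) into contiguous storage.
// Hypothetical helper for illustration; not part of the patch.
template <typename T>
void copy_rows(const T *src, std::size_t stride, std::size_t num_rows,
               std::size_t row_elems, std::vector<T> &dst) {
  assert(stride >= row_elems);
  dst.resize(num_rows * row_elems);
  for (std::size_t i = 0; i < num_rows; ++i) {
    // reinterpret_cast states the intent (view the source as raw bytes)
    // and keeps the const qualifier, which a C-style (char *) would drop.
    const char *srcptr =
        reinterpret_cast<const char *>(src) + i * stride * sizeof(T);
    std::memcpy(dst.data() + i * row_elems, srcptr, row_elems * sizeof(T));
  }
}
```

Writing `reinterpret_cast<char *>(src)` with a `const T *` source would fail to compile, which is the kind of mistake the C-style cast hides.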

keryell · 2021-04-12T17:25:16Z

sycl/test/on-device/extensions/matrix-amx-bf16-test.cpp

+           joint_matrix_load(sg, sub_c,
+                             accC.get_pointer() + (sg_startx * TM) * N +
+                                 sg_starty * TN,
+                             N, matrix_layout::row_major);


With a member function you would have

Suggested change

joint_matrix_load(sg, sub_c,
                  accC.get_pointer() + (sg_startx * TM) * N +
                      sg_starty * TN,
                  N, matrix_layout::row_major);

sub_c.jointload(sg, accC.get_pointer() + (sg_startx * TM) * N +
                        sg_starty * TN,
                N, matrix_layout::row_major);

bader · 2021-04-13T13:53:25Z

@intel/llvm-reviewers-runtime, ping.

againull

LGTM in general, considering that some of the comments are going to be addressed in a follow-up PR.

yubingex007-a11y requested review from bader, dkhaldi, LuoYuanke and vzakhari April 7, 2021 13:29

yubingex007-a11y requested a review from a team as a code owner April 7, 2021 13:29

yubingex007-a11y requested a review from againull April 7, 2021 13:29

bader reviewed Apr 7, 2021

yubingex007-a11y force-pushed the jm branch from 0c8cbf3 to 2c7ffd2 April 8, 2021 05:38

yubingex007-a11y force-pushed the jm branch 2 times, most recently from df3c9d3 to 5b351f5 April 8, 2021 08:52

bader reviewed Apr 8, 2021

yubingex007-a11y force-pushed the jm branch from 5b351f5 to 155992f April 8, 2021 16:28

steffenlarsen reviewed Apr 9, 2021

LuoYuanke reviewed Apr 10, 2021

yubingex007-a11y added 2 commits April 12, 2021 10:07

Address comments shown above:

0881ad5

1. Remove useless "class submatrix" defined in matrix namespace; 2. Move submatrix_load, submatrix_store and submatrix_mad into detail namespace; 3. Use -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi";

Just fix some comments

b3ffc7b

yubingex007-a11y requested a review from bader April 12, 2021 05:20

dkhaldi reviewed Apr 12, 2021

keryell reviewed Apr 12, 2021

1. Use c++ casts instead of c-style casts

c5a81b9

2. Modify some comments

dkhaldi approved these changes Apr 13, 2021

againull approved these changes Apr 13, 2021

againull merged commit 35db973 into intel:sycl Apr 13, 2021
