-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL] [MATRIX] Enable joint_matrix_load, joint_matrix_store, and joint_matrix_mad for AMX #3503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general, but there are bunch of comments regrading code style and organization we apply to the project.
@@ -0,0 +1,184 @@ | |||
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s | |
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s -o %t |
I think we should execute this test, but I assume it requires HW support for AMX ISA. Right?
This might require additional LIT configuration:
- Add new feature to the config, detect if HW supports it and enable it for LIT tests.
- Test must require AMX feature to run.
According to https://github.com/intel/llvm/blob/sycl/CONTRIBUTING.md#tests-development, we should also move this test to https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/ (https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/extensions in particular) directory.
The same is applicable for the second test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments, Alexey. Eh, BTW, If the testcase can't run for now, could we still move it to sycl/test/on-device/extensions for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to use it for testing "does it compile?" mode, than there is no need to move.
The on-device
directory is intended for the tests requiring special HW for execution and testing features under active development. When feature API is finalized and it's ready for end users, we move such tests to llvm-test-suite repository.
BTW, after I address the comments, should I create a another commit in PR or "git commit --amend"? |
df3c9d3
to
5b351f5
Compare
We recommend address comments in a separate commits. It allows to track if/how comments are addressed. |
// ===--------------------------------------------------------------------=== // | ||
|
||
#pragma once | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add following include defining __SYCL_INLINE_NAMESPACE
and __SYCL_ALWAYS_INLINE
.
#include <CL/sycl/detail/defines_elementary.hpp> |
@bader If we have multiple commits in PR, could we fuse them into one commit? |
joint_matrix_mad for AMX We provide new interfaces for matrix muliply in this patch: 1. A new class called joint_matrix is introduced, and the user needs to specify the type of the elements, sizes, and the memory layout. 2. joint_matrix_load is used for loading data from main memory to tiles of AMX or kernel's local memory. 3. joint_matrix_store is used for storing data tiles of AMX or kernel's local memory to main memory. 4. joint_matrix_mad is used for the matrix multiply and add function. It performs the multiply operation on the matrices A and B, accumulates the result with C and returns the result. With this patch, the following operation can be realized: C = A*B+C 1. All cases where A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major) 2. All cases where A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)
Please, do not fuse the commits within one pull request. All commits will be squashed when PR is merged. |
Hi, @bader. It seems the testcase fail SYCL :: Reduction/reduction_nd_N_vars.cpp(http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3350/pipeline/) is because the jenkin is using the old testcase. it fails on CPU. So can we ignore the case fail? |
@tfzhu, how can we retest this PR with the latest sources? |
I've got support from @DoyleLi and I've just rerun the Jenkins/Precommit . jenkin will fetch the latest source. Besides, We can observe it in the label "Check out from version control" in http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3381/pipeline |
@againull, @intel/llvm-reviewers-runtime, ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff! Is there an extension document for this available somewhere?
matrix_layout Layout> | ||
__SYCL_ALWAYS_INLINE static typename std::enable_if< | ||
(NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type | ||
submatrix_load(detail::submatrix<T> &sub_m, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be using the submatrix
class defined above rather than the one in detail
? Same question for functions like submatrix_mad
and submatrix_store
.
@@ -0,0 +1,16 @@ | |||
//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==// | |
//==------------------ matrix.hpp - SYCL matrix ----------------*- C++ -*---==// |
@@ -0,0 +1,186 @@ | |||
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?
@@ -0,0 +1,171 @@ | |||
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?
1. Remove useless "class submatrix" defined in matrix namespace; 2. Move submatrix_load, submatrix_store and submatrix_mad into detail namespace; 3. Use -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi";
I think @dkhaldi will provide it this week. Could we merge this patch first? |
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
// ===--------------------------------------------------------------------=== // | ||
/// -DAMX will enable joint_matrix feature for AMX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need -DAMX anymore. Please update.
Also, what should be the compilation command line the user has to use to trigger this extension:
clang++ -fsycl -march=sapphirerapids
Is it this one? shouldn't we add the AOT line too here
-fsycl-targets="spir64_x86_64-uknown-linux-sycldevice"
to avoid the user generating JIT code that does not work on the GPU or other devices?
#include <assert.h> | ||
#include <cstdint> | ||
#include <cstdio> | ||
#include <immintrin.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user need to add all these include for such simple code, please double check?
can we move #include <immintrin.h> to matrix-amx.hpp to avoid the user adding it here?
The spec doc should make it to intel/llvm this week. It is currently under internal review |
|
||
ONEAPI::sub_group sg = spmd_item.get_sub_group(); | ||
joint_matrix<ONEAPI::sub_group, unsigned short, TM, TK> sub_a(sg); | ||
joint_matrix<ONEAPI::sub_group, unsigned short, TK / 2, TN * 2, matrix_layout::packed_b> sub_b(sg); // ???? hide in new interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment.
Instead, add a detailed comment that for AMX, users need to explicitly use this packed_b layout along with the VNNI sizes for B matrix.
By default, the layout is row_major and size is (TK, TN).
I am adding this comment to the interface document as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling it is possible to adopt a terser API using a C++ coding style instead of a C API.
typename std::enable_if<(NumRows > matrix::tile_size) || | ||
(NumCols * sizeof(T) / 4 > matrix::tile_size), | ||
void>::type | ||
submatrix_load(detail::submatrix<T> &sub_m, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: why not submatrix::load
instead?
And while it can be a static member, could it be a member function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @keryell. Thanks for your comments. I really agree we should change into member function but we are reaching a deadline of release. So, could we merge this patch and Address your comments in our next PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever is good. Just need the tersest SYCL syntax at the end. :-)
|
||
// This handles cases where T1 is int8, T2 is int32. | ||
inline __SYCL_ALWAYS_INLINE static void | ||
submatrix_mad(detail::submatrix<int8_t> &sub_ma, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submatrix::mad
?
matrix_layout Layout, access::address_space Space> | ||
inline __SYCL_ALWAYS_INLINE typename std::enable_if< | ||
(NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type | ||
joint_matrix_load(Group sg, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
joint_matrix::load
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dkhaldi Do you agree to change it to member function, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keryell, For joint_matrix_load/store/mad, we are following the current existing group algorithms of SYCL 2020 like joint_reduce. So these should be free functions. Having said that, this is being approved as an experimental interface. We can revise these details once we take the extension to the SYCL group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it can be free functions too but it looks like the syntax is terser in that case with classes and members. I do not know how generic your joint_matrix is, but it looks really like a coherent set of operations on some operands.
T *mem = src.get(); | ||
// memcpy from mem to jm.raw_storage | ||
for (int i = 0; i < NumRows; ++i) { | ||
char *srcptr = (char *)mem + i * stride * sizeof(T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use C++ casts instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I will change it. BTW, is there any advantage of reinterpret_cast, compared with c-style cast? Is that because it is more eye-catching?
joint_matrix_load(sg, sub_c, | ||
accC.get_pointer() + (sg_startx * TM) * N + | ||
sg_starty * TN, | ||
N, matrix_layout::row_major); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a member function you would have
joint_matrix_load(sg, sub_c, | |
accC.get_pointer() + (sg_startx * TM) * N + | |
sg_starty * TN, | |
N, matrix_layout::row_major); | |
sub_c.jointload(sg, accC.get_pointer() + (sg_startx * TM) * N + | |
sg_starty * TN, | |
N, matrix_layout::row_major); |
2. Modify some comments
@intel/llvm-reviewers-runtime, ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general considering that some of the comments going to be addressed with follow up PR.
* upstream/sycl: (39 commits) [CI] Switch to default clang-format version. (intel#3540) [Driver][NFC] Cleanup some option setting for SYCL offload (intel#3542) [GitHub Actions] Update main branch sync schedule [SYCL][NFC] Fix potential namespace conflicts with PSTL in tuple.hpp (intel#3541) [SYCL] Bump sycl library minor version (intel#3538) [SYCL][CUDA] Implemented cuda_piextUSMEnqueueMemAdvise (intel#3365) [SYCL][FPGA] Add mutual diagnostic of max_concurrency attribute in conjunction of disable_loop_pipelining attribute (intel#3512) [SYCL] [MATRIX] Enable joint_matrix_load, joint_matrix_store, and joint_matrix_mad for AMX (intel#3503) [ESIMD] Skip rewriting functions used through function pointers (intel#3527) [SYCL] Fix address space for spec constants buffer (intel#3521) [SYCL] Correct the tablegen for checking mutually exclusive stmt attrs (intel#3519) [SYCL][PI][L0][NFC] Refactor setting of LastCommandEvent (intel#3528) [SYCL] Fix group local memory sharing issue (intel#3489) [SYCL][NFC] Fix post-commit failure (intel#3532) [SYCL][Doc] Remove extension mechanism (intel#3526) [SYCL] Move sycl.hpp in install directory and adjust driver to match (intel#3523) [SYCL][ESIMD] Update ESIMD docs to address recent user comments: (intel#3516) [NFCI][SYCL] Correct -fdeclare-spirv-builtins to use marshalling (intel#3515) [SYCL] Rework MarkDevice and children (intel#3475) [SYCL] Fix StringLiteral Ctor issue from intel#3504. (intel#3520) ...
We provide new interfaces for matrix muliply in this patch:
A new class called joint_matrix is introduced, and The user needs to
specify the type of the elements, sizes, and the memory layout.
joint_matrix_load is used for loading data from main momory to tiles of
AMX or kernel's local memory.
joint_matrix_store is used for storing data tiles of AMX or kernel's
local memory to main memory.
joint_matrix_mad is used for the matrix multiply and add function.
It performs the multiply operation on the matrices A and B, accumulates the
result with C and return the result.
With this patch, the following operation can be realized:
C = A*B+C