Skip to content

[SYCL] [MATRIX] Enable joint_matrix_load, joint_matrix_store, and joint_matrix_mad for AMX #3503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 13, 2021

Conversation

yubingex007-a11y
Copy link
Contributor

We provide new interfaces for matrix muliply in this patch:

  1. A new class called joint_matrix is introduced, and The user needs to
    specify the type of the elements, sizes, and the memory layout.

  2. joint_matrix_load is used for loading data from main momory to tiles of
    AMX or kernel's local memory.

  3. joint_matrix_store is used for storing data tiles of AMX or kernel's
    local memory to main memory.

  4. joint_matrix_mad is used for the matrix multiply and add function.
    It performs the multiply operation on the matrices A and B, accumulates the
    result with C and return the result.

With this patch, the following operation can be realized:
C = A*B+C

  1. All cases where A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
  2. All cases where A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)

Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, but there are bunch of comments regrading code style and organization we apply to the project.

@@ -0,0 +1,184 @@
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -DAMX -O2 %s -o %t

I think we should execute this test, but I assume it requires HW support for AMX ISA. Right?
This might require additional LIT configuration:

  1. Add new feature to the config, detect if HW supports it and enable it for LIT tests.
  2. Test must require AMX feature to run.

According to https://github.com/intel/llvm/blob/sycl/CONTRIBUTING.md#tests-development, we should also move this test to https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/ (https://github.com/intel/llvm/tree/sycl/sycl/test/on-device/extensions in particular) directory.

The same is applicable for the second test as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments, Alexey. Eh, BTW, If the testcase can't run for now, could we still move it to sycl/test/on-device/extensions for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to use it for testing "does it compile?" mode, than there is no need to move.

The on-device directory is intended for the tests requiring special HW for execution and testing features under active development. When feature API is finalized and it's ready for end users, we move such tests to llvm-test-suite repository.

@yubingex007-a11y
Copy link
Contributor Author

BTW, after I address the comments, should I create a another commit in PR or "git commit --amend"?

@yubingex007-a11y yubingex007-a11y force-pushed the jm branch 2 times, most recently from df3c9d3 to 5b351f5 Compare April 8, 2021 08:52
@bader
Copy link
Contributor

bader commented Apr 8, 2021

BTW, after I address the comments, should I create a another commit in PR or "git commit --amend"?

We recommend address comments in a separate commits. It allows to track if/how comments are addressed.
Force-push removes old version of the patch from the pull request and GitHub can't correlate comments with the new version of the patch.

// ===--------------------------------------------------------------------=== //

#pragma once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add following include defining __SYCL_INLINE_NAMESPACE and __SYCL_ALWAYS_INLINE.

Suggested change
#include <CL/sycl/detail/defines_elementary.hpp>

@yubingex007-a11y
Copy link
Contributor Author

BTW, after I address the comments, should I create a another commit in PR or "git commit --amend"?

We recommend address comments in a separate commits. It allows to track if/how comments are addressed.
Force-push removes old version of the patch from the pull request and GitHub can't correlate comments with the new version of the patch.

@bader If we have multiple commits in PR, could we fuse them into one commit?
@dkhaldi I guess we prefer one commit for this feature?

joint_matrix_mad for AMX

We provide new interfaces for matrix muliply in this patch:
1. A new class called joint_matrix is introduced, and the user needs to
specify the type of the elements, sizes, and the memory layout.

2. joint_matrix_load is used for loading data from main memory to tiles of
AMX or kernel's local memory.

3. joint_matrix_store is used for storing data tiles of AMX or kernel's
local memory to main memory.

4. joint_matrix_mad is used for the matrix multiply and add function.
It performs the multiply operation on the matrices A and B, accumulates the
result with C and returns the result.

With this patch, the following operation can be realized:
  C = A*B+C
1. All cases where A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
2. All cases where A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)
@bader
Copy link
Contributor

bader commented Apr 8, 2021

BTW, after I address the comments, should I create a another commit in PR or "git commit --amend"?

We recommend address comments in a separate commits. It allows to track if/how comments are addressed.
Force-push removes old version of the patch from the pull request and GitHub can't correlate comments with the new version of the patch.

@bader If we have multiple commits in PR, could we fuse them into one commit?

Please, do not fuse the commits within one pull request. All commits will be squashed when PR is merged.

@yubingex007-a11y
Copy link
Contributor Author

Hi, @bader. It seems the testcase fail SYCL :: Reduction/reduction_nd_N_vars.cpp(http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3350/pipeline/) is because the jenkin is using the old testcase. it fails on CPU.
the latest code(https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Reduction/reduction_nd_N_vars.cpp) shows CPU is unsupported:
// TODO: The test irregularly reports incorrect results on CPU. // UNSUPPORTED: cpu

So can we ignore the case fail?

@bader
Copy link
Contributor

bader commented Apr 9, 2021

Hi, @bader. It seems the testcase fail SYCL :: Reduction/reduction_nd_N_vars.cpp(http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3350/pipeline/) is because the jenkin is using the old testcase. it fails on CPU.
the latest code(https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Reduction/reduction_nd_N_vars.cpp) shows CPU is unsupported:
// TODO: The test irregularly reports incorrect results on CPU. // UNSUPPORTED: cpu

So can we ignore the case fail?

@tfzhu, how can we retest this PR with the latest sources?

@yubingex007-a11y
Copy link
Contributor Author

Hi, @bader. It seems the testcase fail SYCL :: Reduction/reduction_nd_N_vars.cpp(http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3350/pipeline/) is because the jenkin is using the old testcase. it fails on CPU.
the latest code(https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Reduction/reduction_nd_N_vars.cpp) shows CPU is unsupported:
// TODO: The test irregularly reports incorrect results on CPU. // UNSUPPORTED: cpu
So can we ignore the case fail?

@tfzhu, how can we retest this PR with the latest sources?

I've got support from @DoyleLi and I've just rerun the Jenkins/Precommit . jenkin will fetch the latest source. Besides, We can observe it in the label "Check out from version control" in http://icl-jenkins.sc.intel.com:8080/blue/organizations/jenkins/SYCL_CI%2Fintel%2FLin%2FLLVM_Test_Suite/detail/LLVM_Test_Suite/3381/pipeline

@bader
Copy link
Contributor

bader commented Apr 9, 2021

@againull, @intel/llvm-reviewers-runtime, ping.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! Is there an extension document for this available somewhere?

matrix_layout Layout>
__SYCL_ALWAYS_INLINE static typename std::enable_if<
(NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type
submatrix_load(detail::submatrix<T> &sub_m,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be using the submatrix class defined above rather than the one in detail? Same question for functions like submatrix_mad and submatrix_store.

@@ -0,0 +1,16 @@
//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//==---------------- submatrix.hpp - SYCL matrix ---------------*- C++ -*---==//
//==------------------ matrix.hpp - SYCL matrix ----------------*- C++ -*---==//

@@ -0,0 +1,186 @@
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?

@@ -0,0 +1,171 @@
// RUN: %clangxx -mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi -fsycl -O2 %s -o %t.out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi"?

1. Remove useless "class submatrix" defined in matrix namespace;
2. Move submatrix_load, submatrix_store and submatrix_mad into detail namespace;
3. Use -march=sapphirerapids instead of "-mamx-bf16 -mamx-int8 -mavx512bw -mavx512vbmi";
@yubingex007-a11y yubingex007-a11y requested a review from bader April 12, 2021 05:20
@yubingex007-a11y
Copy link
Contributor Author

Good stuff! Is there an extension document for this available somewhere?

I think @dkhaldi will provide it this week. Could we merge this patch first?

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// ===--------------------------------------------------------------------=== //
/// -DAMX will enable joint_matrix feature for AMX
Copy link
Contributor

@dkhaldi dkhaldi Apr 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need -DAMX anymore. Please update.
Also, what should be the compilation command line the user has to use to trigger this extension:
clang++ -fsycl -march=sapphirerapids
Is it this one? shouldn't we add the AOT line too here
-fsycl-targets="spir64_x86_64-uknown-linux-sycldevice"
to avoid the user generating JIT code that does not work on the GPU or other devices?

#include <assert.h>
#include <cstdint>
#include <cstdio>
#include <immintrin.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the user need to add all these include for such simple code, please double check?
can we move #include <immintrin.h> to matrix-amx.hpp to avoid the user adding it here?

@dkhaldi
Copy link
Contributor

dkhaldi commented Apr 12, 2021

Good stuff! Is there an extension document for this available somewhere?

I think @dkhaldi will provide it this week. Could we merge this patch first?

The spec doc should make it to intel/llvm this week. It is currently under internal review


ONEAPI::sub_group sg = spmd_item.get_sub_group();
joint_matrix<ONEAPI::sub_group, unsigned short, TM, TK> sub_a(sg);
joint_matrix<ONEAPI::sub_group, unsigned short, TK / 2, TN * 2, matrix_layout::packed_b> sub_b(sg); // ???? hide in new interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment.
Instead, add a detailed comment that for AMX, users need to explicitly use this packed_b layout along with the VNNI sizes for B matrix.
By default, the layout is row_major and size is (TK, TN).
I am adding this comment to the interface document as well.

Copy link
Contributor

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling it is possible to adopt a terser API using a C++ coding style instead of a C API.

typename std::enable_if<(NumRows > matrix::tile_size) ||
(NumCols * sizeof(T) / 4 > matrix::tile_size),
void>::type
submatrix_load(detail::submatrix<T> &sub_m,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: why not submatrix::load instead?
And while it can be a static member, could it be a member function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @keryell. Thanks for your comments. I really agree we should change into member function but we are reaching a deadline of release. So, could we merge this patch and Address your comments in our next PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever is good. Just need the tersest SYCL syntax at the end. :-)


// This handles cases where T1 is int8, T2 is int32.
inline __SYCL_ALWAYS_INLINE static void
submatrix_mad(detail::submatrix<int8_t> &sub_ma,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submatrix::mad?

matrix_layout Layout, access::address_space Space>
inline __SYCL_ALWAYS_INLINE typename std::enable_if<
(NumRows > tile_size) || (NumCols * sizeof(T) / 4 > tile_size), void>::type
joint_matrix_load(Group sg,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joint_matrix::load?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkhaldi Do you agree to change it to member function, too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keryell, For joint_matrix_load/store/mad, we are following the current existing group algorithms of SYCL 2020 like joint_reduce. So these should be free functions. Having said that, this is being approved as an experimental interface. We can revise these details once we take the extension to the SYCL group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it can be free functions too but it looks like the syntax is terser in that case with classes and members. I do not know how generic your joint_matrix is, but it looks really like a coherent set of operations on some operands.

T *mem = src.get();
// memcpy from mem to jm.raw_storage
for (int i = 0; i < NumRows; ++i) {
char *srcptr = (char *)mem + i * stride * sizeof(T);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use C++ casts instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I will change it. BTW, is there any advantage of reinterpret_cast, compared with c-style cast? Is that because it is more eye-catching?

Comment on lines +74 to +77
joint_matrix_load(sg, sub_c,
accC.get_pointer() + (sg_startx * TM) * N +
sg_starty * TN,
N, matrix_layout::row_major);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a member function you would have

Suggested change
joint_matrix_load(sg, sub_c,
accC.get_pointer() + (sg_startx * TM) * N +
sg_starty * TN,
N, matrix_layout::row_major);
sub_c.jointload(sg, accC.get_pointer() + (sg_startx * TM) * N +
sg_starty * TN,
N, matrix_layout::row_major);

@bader
Copy link
Contributor

bader commented Apr 13, 2021

@intel/llvm-reviewers-runtime, ping.

Copy link
Contributor

@againull againull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general considering that some of the comments going to be addressed with follow up PR.

@againull againull merged commit 35db973 into intel:sycl Apr 13, 2021
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Apr 14, 2021
* upstream/sycl: (39 commits)
  [CI] Switch to default clang-format version. (intel#3540)
  [Driver][NFC] Cleanup some option setting for SYCL offload (intel#3542)
  [GitHub Actions] Update main branch sync schedule
  [SYCL][NFC] Fix potential namespace conflicts with PSTL in tuple.hpp (intel#3541)
  [SYCL] Bump sycl library minor version (intel#3538)
  [SYCL][CUDA] Implemented cuda_piextUSMEnqueueMemAdvise (intel#3365)
  [SYCL][FPGA] Add mutual diagnostic of max_concurrency attribute in conjunction of disable_loop_pipelining attribute (intel#3512)
  [SYCL] [MATRIX] Enable joint_matrix_load, joint_matrix_store, and joint_matrix_mad for AMX (intel#3503)
  [ESIMD] Skip rewriting functions used through function pointers (intel#3527)
  [SYCL] Fix address space for spec constants buffer (intel#3521)
  [SYCL] Correct the tablegen for checking mutually exclusive stmt attrs (intel#3519)
  [SYCL][PI][L0][NFC] Refactor setting of LastCommandEvent (intel#3528)
  [SYCL] Fix group local memory sharing issue (intel#3489)
  [SYCL][NFC] Fix post-commit failure (intel#3532)
  [SYCL][Doc] Remove extension mechanism (intel#3526)
  [SYCL] Move sycl.hpp in install directory and adjust driver to match (intel#3523)
  [SYCL][ESIMD] Update ESIMD docs to address recent user comments: (intel#3516)
  [NFCI][SYCL] Correct -fdeclare-spirv-builtins to use marshalling (intel#3515)
  [SYCL] Rework MarkDevice and children (intel#3475)
  [SYCL] Fix StringLiteral Ctor issue from intel#3504. (intel#3520)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants