
[SYCL] Add ballot_group support to algorithms #8784

Merged: 20 commits merged into intel:sycl, Apr 21, 2023

Conversation

Pennycook
Contributor

Enables the following functions to be used with ballot_group arguments:

  • group_barrier
  • group_broadcast
  • any_of_group
  • all_of_group
  • none_of_group
  • reduce_over_group
  • exclusive_scan_over_group
  • inclusive_scan_over_group

Signed-off-by: John Pennycook <john.pennycook@intel.com>

To avoid duplicating logic and introducing even more overloads of the group
algorithms, it is desirable to move some of the implementation details into
the detail::spirv namespace.

This commit makes a few changes to enable that to happen:

- spirv:: functions with a Group template now take a group object, to enable
  run-time information (e.g. group membership) to pass through.

- ControlBarrier and the OpGroup* instruction used to implement reduce/scan
  now forward to spirv::, similar to other group functions and algorithms.

- The calc helper used to map functors to SPIR-V instructions is updated to
  use the new spirv:: functions, instead of calling __spirv intrinsics.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Nested detail namespaces cause problems for name lookup.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook added the spec extension label (All issues/PRs related to extensions specifications) Mar 27, 2023
@Pennycook Pennycook requested a review from a team as a code owner March 27, 2023 18:23
@Pennycook
Contributor Author

A few quick notes to reviewers:

  1. I split this out into three commits to try to make the review a little easier. However, I reconstructed them from the true history, so I'm not 100% sure things will build if you check out one of the earlier commits. 😄

  2. Tests are in [SYCL] Add tests for ballot_group algorithms llvm-test-suite#1698.

  3. This may look like a lot of refactoring for the sake of one group type, but there are a few more coming. The idea of having something like spirv::GroupBroadcast resolve either to __spirv_GroupBroadcast or a sequence of __spirv_GroupNonUniformBroadcast calls depending on the type of group is to facilitate simpler implementation of the remaining non-uniform groups.

@aelovikov-intel
Contributor

Hi @Pennycook , with intel/llvm-test-suite moved in-tree you need to add the tests from intel/llvm-test-suite#1698 into this PR now.

Tests the ability to create an instance of each new group type,
and the correctness of the core member functions.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
This commit adds tests for using ballot_group and the following algorithms:
- group_barrier
- group_broadcast
- any_of_group
- all_of_group
- none_of_group
- reduce_over_group
- exclusive_scan_over_group
- inclusive_scan_over_group

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook
Contributor Author

> Hi @Pennycook , with intel/llvm-test-suite moved in-tree you need to add the tests from intel/llvm-test-suite#1698 into this PR now.

Thanks, @aelovikov-intel. I've copied over the tests from intel/llvm-test-suite#1698, and also the related tests from intel/llvm-test-suite#1574, which didn't get merged before the move.

Comment on lines +23 to +30
sycl::buffer<bool, 1> BarrierBuf{sycl::range{32}};
sycl::buffer<bool, 1> BroadcastBuf{sycl::range{32}};
sycl::buffer<bool, 1> AnyBuf{sycl::range{32}};
sycl::buffer<bool, 1> AllBuf{sycl::range{32}};
sycl::buffer<bool, 1> NoneBuf{sycl::range{32}};
sycl::buffer<bool, 1> ReduceBuf{sycl::range{32}};
sycl::buffer<bool, 1> ExScanBuf{sycl::range{32}};
sycl::buffer<bool, 1> IncScanBuf{sycl::range{32}};
Contributor

In my debug experiments I'm using something like

constexpr int N_RESULTS = 32;
sycl::buffer<bool, 1> results{sycl::range{32 * N_RESULTS}};
...
sycl::accessor res_acc{results, cgh};
// kernel
auto *res = res_acc.get_pointer() + WI * N_RESULTS;
...
*res++ = res1;
*res++ = res2;
...
// could be outlined to a helper and shared between different places.
sycl::host_accessor res_acc{results};
bool success = std::all_of(res_acc.begin(), res_acc.end(), [](bool r) { return r; });
if (!success) {
  for (int j = 0; j < N_RESULTS; ++j) {
    for (int i = 0; i < res_acc.size() / N_RESULTS; ++i) {
      if (i % 8 == 0)
        std::cout << " |";
      std::cout << " " << res_acc[i * N_RESULTS + j];
    }
    std::cout << std::endl;
  }
  assert(false);
}

I think it might be suitable here as well, but up to you.

uint32_t ReduceResult =
    sycl::reduce_over_group(BallotGroup, 1, sycl::plus<>());
ReduceAcc[WI] =
    (ReduceResult == BallotGroup.get_local_linear_range());
Contributor

Strictly speaking, the previous test only verified get_local_range() and not get_local_linear_range(), but we can leave this to the CTS.

Comment on lines +1 to +14
// RUN: %clangxx -fsycl -fsyntax-only -fsycl-targets=%sycl_triple %s -o %t.out

#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

static_assert(
    syclex::is_user_constructed_group_v<syclex::ballot_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<1, sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<2, sycl::sub_group>>);
static_assert(
    syclex::is_user_constructed_group_v<syclex::tangle_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<syclex::opportunistic_group>);
Contributor

I'd slightly prefer to have this fused with the previous one but won't insist on that.

Comment on lines +42 to +47
Match &= (OpportunisticGroup.get_group_id() == 0);
Match &= (OpportunisticGroup.get_local_id() <
          OpportunisticGroup.get_local_range());
Match &= (OpportunisticGroup.get_group_range() == 1);
Match &= (OpportunisticGroup.get_local_linear_range() <=
          SG.get_local_linear_range());
Contributor

I'd suggest writing all the ranges/WIs and verifying their sum/existence on the host.

Contributor Author

In some cases, though, that requires a lot more data to be sent back to the host. Doing the test inside the kernel means we have access to all the values of the nd_item, sub_group and opportunistic_group. Getting all those values on the host would require a bunch of additional accessors.

@Pennycook
Contributor Author

All the tests passed, so I'll work on making some of these formatting changes tomorrow.

One thing to possibly look at on the CI side: this got stuck waiting on "Stop AWS" for a really long time -- several hours after all the test suite runs had completed. I'm not sure what that action does, but it struck me as odd that it didn't get scheduled sooner.

@aelovikov-intel
Contributor

> this got stuck waiting on "Stop AWS" for a really long time

That seems to be an issue with GitHub's public runners, which we use for the ultra-lightweight tasks. We've been hitting it elsewhere recently too.

@JackAKirk
Contributor

Hi @Pennycook

I'm in the middle of implementing CUDA support for these algorithms on top of your implementation, and I'm at the point where it would be good to ask for your feedback on a few small implementation issues. I've also implemented CUDA support for cluster_group and opportunistic_group on top of your implementations in the commit I reference here.
Note that here:
c546762#diff-23a47145f432e0af1de65698d91f80a6d0c61c2dcfa6db8a8d97ef84b19dbf59R124
I've added a new friend function to ballot_group, GetMask, and I've correspondingly removed the friend status from sycl::detail::IdToMaskPosition, since it can simply call GetMask. This function is required for at least ballot_group and opportunistic_group, which already have a SubGroupMask member that is required in the NVPTX backend for the different supported group algorithms. This leads me to the questions:

The NVPTX backend could implement cluster_group without introducing a SubGroupMask Mask member, but at the moment I have added the Mask only for NVPTX here: c546762#diff-9f01b860115d8a8dab22444aefe9db812b7d11ef489d35a6e2470340436f098bR115
The advantage of storing the mask is that it would otherwise need to be recreated, via cluster_group methods, for every non-uniform group algorithm call taking a cluster_group in the NVPTX backend. Do you prefer the mask-storing approach or the mask-recreation approach?
If the mask-storing approach is preferable, then GetMask would obviously also need to support the cluster_group case in the NVPTX backend. Currently I notice that your implementation of sycl::detail::IdToMaskPosition supports all non-uniform groups, rather than just ballot_group and opportunistic_group, but I guessed that this was an oversight? This observation made me want to check: in the mask-storing approach, would you prefer to wrap GetMask for cluster_group and the SubGroupMask member of cluster_group in #if defined(NVPTX), or support them for Intel backends too but leave them unused?

Other than this, I don't think there are any issues. I have implemented GroupAny, GroupAll, GroupBarrier, and GroupBroadcast for ballot_group, as you can see, and all these algorithms should call identical code for all NVPTX non-uniform groups that store a mask. I've been testing them with the tests that you already wrote. Reduce and the inclusive/exclusive scans are a bit more complex, but I do not believe they introduce any other implementation questions.
Let me know if you have any other thoughts/questions on the NVPTX implementation so far.

Thanks

@Pennycook
Contributor Author

> I'm in the middle of implementing cuda support for these algorithms on top of your implementation and I'm at the point where it would be good to ask for your feedback on a few small implementation issues.

Sorry for the delayed response, @JackAKirk.

> I've also implemented cuda support for cluster_group and opportunistic_group on top of your implementations in the commit I reference here. Note that here: c546762#diff-23a47145f432e0af1de65698d91f80a6d0c61c2dcfa6db8a8d97ef84b19dbf59R124 I've added a new friend function to ballot_group GetMask, and I've correspondingly removed the friend status from sycl::detail::IdToMaskPosition since it can simply call GetMask. This function is required for at least ballot_group and opportunistic_group which already have a SubGroupMask member that is required in the NVPTX backend for the different supported group algorithms.

This is a good idea.

> The NVPTX backend could implement cluster_group without introducing a SubGroupMask Mask member, but at the moment I have added the Mask only for NVPTX here: c546762#diff-9f01b860115d8a8dab22444aefe9db812b7d11ef489d35a6e2470340436f098bR115 The advantage to storing the mask is that it would otherwise need to be recreated for every non_uniform group algorithm call taking a cluster_group in the NVPTX backend via cluster_group methods. Do you prefer the mask storing approach or the mask recreation approach?

Hm. I don't think I have a strong preference, because it's not immediately obvious to me which the compiler is going to be better at optimizing. My gut says that storing the mask might be slightly easier to optimize: it's unlikely that somebody would create a cluster_group and not use it--so we don't have to worry about the compiler's ability to optimize away the mask logic--and I trust the compiler to avoid spilling/filling the calculated mask as best it can. I'm less confident that the compiler could reliably optimize away repeated mask calculations.

> Currently I notice that your implementation of sycl::detail::IdToMaskPosition supports all Non-Uniform groups, rather than just ballot_group and opportunistic_group, but I guessed that this was an oversight? This observation made me want to check that in the mask storing approach: would you prefer to wrap GetMask for cluster_group and the SubGroupMask member of cluster_group in #if defined(NVPTX), or just support them for Intel backends too but have them unused?

Honestly, I'm not sure. I think you're right that it would make sense for IdToMaskPosition to only support the non-uniform groups that actually store masks. Perhaps this should be something more general, and not mention masks? The other groups may need to perform similar operations (e.g. in the case of a cluster_group, we probably need to do some modular arithmetic somewhere).

> Other than this I don't think there are any issues. I have implemented GroupAny, GroupAll, GroupBarrier, and GroupBroadcast for ballot_group as you can see, and all these algorithms should call identical code for all NVPTX non uniform groups that store a mask. I've been testing them with the tests that you already wrote. Reduce/(inc/exc)Scan are a bit more complex but I do not believe they introduce any other implementation questions. Let me know if you have any other thoughts/questions on the NVPTX implemention so far.

The only thing I'm curious about is this:

> res[0] = __nvvm_vote_ballot_sync(threads, predicate); // couldnt call this within intel impl because undefined behaviour if not all reach it?

I understand the comment, I think: threads is from __clc_membermask(), which always returns a mask representing the entire warp. But it's not obvious to me where this is being used by the new groups. Have I missed it?


Thank you again for working on this, I really appreciate it. Now that I'm back from vacation, I'll renew my efforts to get this fixed and merged in, along with the other group types.

@steffenlarsen steffenlarsen merged commit 70a35de into intel:sycl Apr 21, 2023
@againull
Contributor

Hi @Pennycook, could you please fix the post-commit issues for this PR? They are mostly -Werror problems:
70a35de

@aelovikov-intel
Contributor

I also see this https://github.com/intel/llvm/actions/runs/4767700304/jobs/8476298480#step:7:3089 :

D:\github\_work\llvm\llvm\build\include\sycl/ext/oneapi/experimental/non_uniform_groups.hpp(57): error C3861: '__builtin_unreachable': identifier not found
D:\github\_work\llvm\llvm\build\include\sycl/ext/oneapi/experimental/non_uniform_groups.hpp(57): note: '__builtin_unreachable': function declaration must be available as none of the arguments depend on a template parameter

I'm not 100% sure it's caused by this PR, but it looks like it.

@JackAKirk
Contributor

JackAKirk commented Apr 24, 2023

> I understand the comment, I think, -- threads is from __clc_membermask(), which always returns a mask representing the entire warp. But it's not obvious to me where this is being used by the new groups. Have I missed it?
>
> Thank you again for working on this, I really appreciate it. Now that I'm back from vacation, I'll renew my efforts to get this fixed and merged in, along with the other group types.

Sorry for the confusion caused by the comment. It was really a note to myself from when I was considering whether I could make a CUDA implementation that reused the same SPIR-V functions used by the Intel implementation, which do not take a mask (I can't, even if it were desirable). The relevance is simply that you can use __nvvm_vote_ballot_sync(FULLMASK, predicate); to return the mask for the ballot_group.

All sounds good. I also think it is best to store a mask for fixed_size_group in the CUDA backend, which I have done in #9182 (see that PR for more details on the CUDA backend implementation status). I opened that draft PR mainly for your convenience in seeing the requirements from the CUDA backend implementation. I think I've now implemented enough that there should be no more surprises with CUDA backend requirements. Really the only missing thing is the scan implementations, which will follow a very similar pattern to the reduce_over_group implementation, except simpler because there is no special sm_80 version.
I will wait to see how you implement the opportunistic_group algorithms before finalizing the design. In the meantime I mainly just need to implement this case (https://github.com/intel/llvm/pull/9182/files#diff-185200e89425bd69d2e35ca3ba250d40934230436b718d4f2eecbab9325cc972R170), which is quite involved.
