
[SYCL] Add ballot_group support to algorithms #8784

Merged: 20 commits merged into intel:sycl, Apr 21, 2023

Conversation

Pennycook
Contributor

Enables the following functions to be used with ballot_group arguments:

  • group_barrier
  • group_broadcast
  • any_of_group
  • all_of_group
  • none_of_group
  • reduce_over_group
  • exclusive_scan_over_group
  • inclusive_scan_over_group

Signed-off-by: John Pennycook <john.pennycook@intel.com>

To avoid duplicating logic and introducing even more overloads of the group
algorithms, it is desirable to move some of the implementation details into
the detail::spirv namespace.

This commit makes a few changes to enable that to happen:

- spirv:: functions with a Group template now take a group object, to enable
  run-time information (e.g. group membership) to pass through.

- ControlBarrier and the OpGroup* instruction used to implement reduce/scan
  now forward to spirv::, similar to other group functions and algorithms.

- The calc helper used to map functors to SPIR-V instructions is updated to
  use the new spirv:: functions, instead of calling __spirv intrinsics.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Nested detail namespaces cause problems for name lookup.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook added the spec extension label (All issues/PRs related to extensions specifications) Mar 27, 2023
@Pennycook Pennycook requested a review from a team as a code owner March 27, 2023 18:23
@Pennycook
Contributor Author

A few quick notes to reviewers:

  1. I split this out into three commits to try to make the review a little easier. However, I reconstructed them from the true history, so I'm not 100% sure things will build if you check out one of the earlier commits. 😄

  2. Tests are in [SYCL] Add tests for ballot_group algorithms llvm-test-suite#1698.

  3. This may look like a lot of refactoring for the sake of one group type, but there are a few more coming. The idea of having something like spirv::GroupBroadcast resolve either to __spirv_GroupBroadcast or a sequence of __spirv_GroupNonUniformBroadcast calls depending on the type of group is to facilitate simpler implementation of the remaining non-uniform groups.

@aelovikov-intel
Contributor

Hi @Pennycook , with intel/llvm-test-suite moved in-tree you need to add the tests from intel/llvm-test-suite#1698 into this PR now.

Tests the ability to create an instance of each new group type,
and the correctness of the core member functions.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
This commit adds tests for using ballot_group and the following algorithms:
- group_barrier
- group_broadcast
- any_of_group
- all_of_group
- none_of_group
- reduce_over_group
- exclusive_scan_over_group
- inclusive_scan_over_group

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook
Contributor Author

> Hi @Pennycook , with intel/llvm-test-suite moved in-tree you need to add the tests from intel/llvm-test-suite#1698 into this PR now.

Thanks, @aelovikov-intel. I've copied over the tests from intel/llvm-test-suite#1698, and also the related tests from intel/llvm-test-suite#1574, which didn't get merged before the move.

Comment on lines +23 to +30
sycl::buffer<bool, 1> BarrierBuf{sycl::range{32}};
sycl::buffer<bool, 1> BroadcastBuf{sycl::range{32}};
sycl::buffer<bool, 1> AnyBuf{sycl::range{32}};
sycl::buffer<bool, 1> AllBuf{sycl::range{32}};
sycl::buffer<bool, 1> NoneBuf{sycl::range{32}};
sycl::buffer<bool, 1> ReduceBuf{sycl::range{32}};
sycl::buffer<bool, 1> ExScanBuf{sycl::range{32}};
sycl::buffer<bool, 1> IncScanBuf{sycl::range{32}};
Contributor

In my debug experiments I'm using something like

constexpr int N_RESULTS = 32;
sycl::buffer<bool, 1> results{sycl::range{32 * N_RESULTS}};
...
sycl::accessor res_acc{results, cgh};
// kernel
auto *res = res_acc.get_pointer() + WI * N_RESULTS;
...
*res++ = res1;
*res++ = res2;
...
// could be outlined to a helper and shared between different places.
sycl::host_accessor res_acc{results};
bool success = std::all_of(res_acc.begin(), res_acc.end(), [](bool r) { return r; });
if (!success) {
  for (int j = 0; j < N_RESULTS; ++j) {
    for (int i = 0; i < res_acc.size() / N_RESULTS; ++i) {
      if (i % 8 == 0)
        std::cout << " |";
      std::cout << " " << res_acc[i * N_RESULTS + j];
    }
    std::cout << std::endl;
  }
  assert(false);
}

I think it might be suitable here as well, but up to you.

uint32_t ReduceResult =
    sycl::reduce_over_group(BallotGroup, 1, sycl::plus<>());
ReduceAcc[WI] =
    (ReduceResult == BallotGroup.get_local_linear_range());
Contributor

Strictly speaking, the previous test only verified get_local_range() and not get_local_linear_range(), but we can leave this to the CTS.

Comment on lines +1 to +14
// RUN: %clangxx -fsycl -fsyntax-only -fsycl-targets=%sycl_triple %s -o %t.out

#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

static_assert(
    syclex::is_user_constructed_group_v<syclex::ballot_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<1, sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<2, sycl::sub_group>>);
static_assert(
    syclex::is_user_constructed_group_v<syclex::tangle_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<syclex::opportunistic_group>);
Contributor

I'd slightly prefer to have this fused with the previous one but won't insist on that.

Comment on lines +42 to +47
Match &= (OpportunisticGroup.get_group_id() == 0);
Match &= (OpportunisticGroup.get_local_id() <
          OpportunisticGroup.get_local_range());
Match &= (OpportunisticGroup.get_group_range() == 1);
Match &= (OpportunisticGroup.get_local_linear_range() <=
          SG.get_local_linear_range());
Contributor

I'd suggest writing all the ranges/WIs and verifying their sum/existence on the host.

Contributor Author

In some cases, though, that requires a lot more data to be sent back to the host. Doing the test inside the kernel means we have access to all the values of the nd_item, sub_group and opportunistic_group. Getting all those values on the host would require a bunch of additional accessors.

@Pennycook
Contributor Author

All the tests passed, so I'll work on making some of these formatting changes tomorrow.

One thing to possibly look at on the CI side: this got stuck waiting on "Stop AWS" for a really long time -- several hours after all the test suite runs had completed. I'm not sure what that action does, but it struck me as odd that it didn't get scheduled sooner.

@aelovikov-intel
Contributor

> this got stuck waiting on "Stop AWS" for a really long time

That seems to be an issue with GitHub's public runners, which we use for the ultra-lightweight tasks. We've been hitting it elsewhere recently too.

@JackAKirk
Contributor

Hi @Pennycook

I'm in the middle of implementing CUDA support for these algorithms on top of your implementation, and I'm at the point where it would be good to ask for your feedback on a few small implementation issues. I've also implemented CUDA support for cluster_group and opportunistic_group on top of your implementations in the commit I reference here.
Note that here:
c546762#diff-23a47145f432e0af1de65698d91f80a6d0c61c2dcfa6db8a8d97ef84b19dbf59R124
I've added a new friend function to ballot_group, GetMask, and I've correspondingly removed the friend status from sycl::detail::IdToMaskPosition, since it can simply call GetMask. This function is required for at least ballot_group and opportunistic_group, which already have a SubGroupMask member that is required in the NVPTX backend for the different supported group algorithms. This leads me to the questions:

The NVPTX backend could implement cluster_group without introducing a SubGroupMask Mask member, but at the moment I have added the Mask only for NVPTX here: c546762#diff-9f01b860115d8a8dab22444aefe9db812b7d11ef489d35a6e2470340436f098bR115
The advantage of storing the mask is that it would otherwise need to be recreated, via cluster_group methods, for every non-uniform group algorithm call taking a cluster_group in the NVPTX backend. Do you prefer the mask-storing approach or the mask-recreation approach?
If the mask-storing approach is preferable, then GetMask would obviously also need to support the cluster_group case in the NVPTX backend. Currently I notice that your implementation of sycl::detail::IdToMaskPosition supports all non-uniform groups, rather than just ballot_group and opportunistic_group, but I guessed that this was an oversight? This observation made me want to check: in the mask-storing approach, would you prefer to wrap GetMask for cluster_group and the SubGroupMask member of cluster_group in #if defined(NVPTX), or support them for Intel backends too but leave them unused?

Other than this, I don't think there are any issues. I have implemented GroupAny, GroupAll, GroupBarrier, and GroupBroadcast for ballot_group, as you can see, and all these algorithms should call identical code for all NVPTX non-uniform groups that store a mask. I've been testing them with the tests that you already wrote. Reduce and the inclusive/exclusive scans are a bit more complex, but I do not believe they introduce any other implementation questions.
Let me know if you have any other thoughts/questions on the NVPTX implementation so far.

Thanks

@Pennycook
Contributor Author

> I'm in the middle of implementing cuda support for these algorithms on top of your implementation and I'm at the point where it would be good to ask for your feedback on a few small implementation issues.

Sorry for the delayed response, @JackAKirk.

> I've also implemented cuda support for cluster_group and opportunistic_group on top of your implementations in the commit I reference here. Note that here: c546762#diff-23a47145f432e0af1de65698d91f80a6d0c61c2dcfa6db8a8d97ef84b19dbf59R124 I've added a new friend function to ballot_group GetMask, and I've correspondingly removed the friend status from sycl::detail::IdToMaskPosition since it can simply call GetMask. This function is required for at least ballot_group and opportunistic_group which already have a SubGroupMask member that is required in the NVPTX backend for the different supported group algorithms.

This is a good idea.

> The NVPTX backend could implement cluster_group without introducing a SubGroupMask Mask member, but at the moment I have added the Mask only for NVPTX here: c546762#diff-9f01b860115d8a8dab22444aefe9db812b7d11ef489d35a6e2470340436f098bR115 The advantage to storing the mask is that it would otherwise need to be recreated for every non_uniform group algorithm call taking a cluster_group in the NVPTX backend via cluster_group methods. Do you prefer the mask storing approach or the mask recreation approach?

Hm. I don't think I have a strong preference, because it's not immediately obvious to me which the compiler is going to be better at optimizing. My gut says that storing the mask might be slightly easier to optimize: it's unlikely that somebody would create a cluster_group and not use it--so we don't have to worry about the compiler's ability to optimize away the mask logic--and I trust the compiler to avoid spilling/filling the calculated mask as best it can. I'm less confident that the compiler could reliably optimize away repeated mask calculations.

> Currently I notice that your implementation of sycl::detail::IdToMaskPosition supports all Non-Uniform groups, rather than just ballot_group and opportunistic_group, but I guessed that this was an oversight? This observation made me want to check that in the mask storing approach: would you prefer to wrap GetMask for cluster_group and the SubGroupMask member of cluster_group in #if defined(NVPTX), or just support them for Intel backends too but have them unused?

Honestly, I'm not sure. I think you're right that it would make sense for IdToMaskPosition to only support the non-uniform groups that actually store masks. Perhaps this should be something more general, and not mention masks? The other groups may need to perform similar operations (e.g. in the case of a cluster_group, we probably need to do some modular arithmetic somewhere).

> Other than this I don't think there are any issues. I have implemented GroupAny, GroupAll, GroupBarrier, and GroupBroadcast for ballot_group as you can see, and all these algorithms should call identical code for all NVPTX non uniform groups that store a mask. I've been testing them with the tests that you already wrote. Reduce/(inc/exc)Scan are a bit more complex but I do not believe they introduce any other implementation questions. Let me know if you have any other thoughts/questions on the NVPTX implemention so far.

The only thing I'm curious about is this:

> res[0] = __nvvm_vote_ballot_sync(threads, predicate); // couldnt call this within intel impl because undefined behaviour if not all reach it?

I understand the comment, I think: threads is from __clc_membermask(), which always returns a mask representing the entire warp. But it's not obvious to me where this is being used by the new groups. Have I missed it?


Thank you again for working on this, I really appreciate it. Now that I'm back from vacation, I'll renew my efforts to get this fixed and merged in, along with the other group types.

@steffenlarsen steffenlarsen merged commit 70a35de into intel:sycl Apr 21, 2023
@againull
Contributor

Hi @Pennycook, could you please fix the post-commit issues for this PR? They are mostly -Werror problems:
70a35de

@aelovikov-intel
Contributor

I also see this https://github.com/intel/llvm/actions/runs/4767700304/jobs/8476298480#step:7:3089 :

D:\github\_work\llvm\llvm\build\include\sycl/ext/oneapi/experimental/non_uniform_groups.hpp(57): error C3861: '__builtin_unreachable': identifier not found
D:\github\_work\llvm\llvm\build\include\sycl/ext/oneapi/experimental/non_uniform_groups.hpp(57): note: '__builtin_unreachable': function declaration must be available as none of the arguments depend on a template parameter

I'm not 100% sure it's caused by this PR, but it looks like it.

@JackAKirk
Contributor

JackAKirk commented Apr 24, 2023

> I understand the comment, I think, -- threads is from __clc_membermask(), which always returns a mask representing the entire warp. But it's not obvious to me where this is being used by the new groups. Have I missed it?
>
> Thank you again for working on this, I really appreciate it. Now that I'm back from vacation, I'll renew my efforts to get this fixed and merged in, along with the other group types.

Sorry for the confusion caused by the comment. It was really a note to myself from when I was considering whether I could make a CUDA implementation that reused the same SPIR-V functions used by the Intel implementation, which do not take a mask (I can't, even if it were desirable). The relevance is simply that you can use __nvvm_vote_ballot_sync(FULLMASK, predicate); to return the mask for the ballot_group.

All sounds good. I also think it is best to store a mask for fixed_size_group in the CUDA backend, which I have done in #9182 (see that PR for more details on the CUDA backend implementation status). I opened that draft PR mainly for your convenience in seeing the requirements from the CUDA backend implementation. I think I've now implemented enough that there should be no more surprises with CUDA backend requirements. Really the only missing thing is the scan implementations, which will follow a very similar pattern to the reduce_over_group implementation, except simpler because there is no special sm_80 version.
I will wait to see how you implement the opportunistic_group algorithms before finalizing the design. In the meantime I mainly just need to implement this case (https://github.com/intel/llvm/pull/9182/files#diff-185200e89425bd69d2e35ca3ba250d40934230436b718d4f2eecbab9325cc972R170), which is quite involved.
