[SYCL] Implement basic reduction for parallel_for() accepting nd_range #1585

v-klochkov · 2020-04-24T23:55:16Z

This patch adds the algorithm that implements 1 reduction in parallel_for().
It handles all types and operations, including user's custom ones.
The more efficient variants are on the way.

What is NOT supported by this patch:

- parallel_for(range, ...) // i.e. simple range without work-group sizes
- parallel_for(nd_range, reduction1, reduction1, ...) // i.e. more than 1 reduction in parallel_for
- USM
- vector reductions (dims > 1 & #elements > 1)
- HOST. The implementation used in this patch uses barrier(), which is not supported on HOST yet.

v-klochkov · 2020-04-25T07:00:40Z

Please see the current proposal for reduction feature here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md

The first patch adding reduction.hpp file and reduction classes is here: #1585

Sorry for the big patch, but the majority of the newly added lines are the 5 new LIT tests.

AlexeySachkov

If I understand correctly, I was added as code owner for function_pointers.hpp and I have no objections against trivial change proposed to that file.

A few minor comments for the rest of PR

AlexeySachkov · 2020-04-27T08:14:41Z

sycl/include/CL/sycl/detail/cg.hpp

@@ -46,7 +46,7 @@ class interop_handler {

 public:
  using QueueImplPtr = std::shared_ptr<detail::queue_impl>;
-  using ReqToMem = std::pair<detail::Requirement*, pi_mem>;
+  using ReqToMem = std::pair<detail::Requirement *, pi_mem>;


This seems to be an unrelated change

It was done by clang-format

sycl/include/CL/sycl/handler.hpp

AlexeySachkov · 2020-04-27T08:29:31Z

sycl/test/reduction/reduction_nd_conditional.cpp

+template <typename T, int Dim, class BinaryOperation>
+class Unknown;
+
+template <typename T>


You could create a header file for definition of this class, since it is used in several tests

Agreed. I'd be tempted to rename it something like CustomType or CustomVec as well, to highlight why you're defining a new class instead of using sycl::vec.

Ok, I created a new reduction_utils.hpp header and renamed the type. Thank you.
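
For illustration, a shared test header along the lines discussed above might look roughly like the sketch below. The names `CustomVec`, `CustomVecPlus` and `reduceAll`, and their members, are assumptions made for this example and are not taken from reduction_utils.hpp. The point is only that the reduction is handed an opaque user type together with a user-supplied binary operation and identity, which is what distinguishes it from sycl::vec.

```cpp
#include <cassert>
#include <vector>

// Hypothetical user-defined type: two lanes that are reduced together.
template <typename T> struct CustomVec {
  T X;
  T Y;
  CustomVec() : X(0), Y(0) {}
  CustomVec(T X, T Y) : X(X), Y(Y) {}
  bool operator==(const CustomVec &V) const { return X == V.X && Y == V.Y; }
};

// Hypothetical user-defined binary operation for CustomVec (lane-wise sum).
template <typename T> struct CustomVecPlus {
  CustomVec<T> operator()(const CustomVec<T> &A, const CustomVec<T> &B) const {
    return CustomVec<T>(A.X + B.X, A.Y + B.Y);
  }
};

// Serial reference reduction of the kind a test could compare a device
// result against. Identity must be the identity value of BOp.
template <typename T, typename BinaryOperation>
T reduceAll(const std::vector<T> &In, T Identity, BinaryOperation BOp) {
  T Result = Identity;
  for (const T &V : In)
    Result = BOp(Result, V);
  return Result;
}
```

A test would then call something like `reduceAll(Data, CustomVec<int>(), CustomVecPlus<int>())` on the host and compare the result with the value produced by the reduction kernel.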

sycl/include/CL/sycl/handler.hpp

andreyfe1 · 2020-04-27T09:07:37Z

sycl/include/CL/sycl/handler.hpp

+        size_t GID = NDIt.get_global_linear_id();
+        // Copy the element to local memory to prepare it for tree-reduction
+        LocalReds[LID] = (GID < NWorkItems) ? In[GID] : ReduIdentity;
+        LocalReds[WGSize] = ReduIdentity;


Would it be better to allow only one work-item to write to LocalReds[WGSize], so that there are not multiple write operations from different items?

Excuse me, I do not understand what you suggest here. Please give more details.

I mean that you have WGSize work-items: 0, 1, ..., WGSize-1. Each work-item updates LocalReds[WGSize]. Would it be better to have something like if (LID == 0) { LocalReds[WGSize] = ...; }, so that only one work-item (not all of them) updates the value?

If you don't mind, I'll try your suggestion in the next patch.
It is a tiny fix that should not hold the commit, because it will not change performance in any way.

Having an additional conditional in the code is also bad for performance. (Hmm, both the conditional and the duplicated write could be eliminated by a vectorizer that combines ops across work-items.)
I'll check the device code and decide whether to make the additional fix in the next patch, which is currently blocked by this PR.

OK. Thanks for the response
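
To make the role of the extra element LocalReds[WGSize] concrete, here is a serial sketch of a tree reduction over one work-group whose size need not be a power of two. It is a host-side simulation written for this discussion, not the code from handler.hpp: the function name `simulateWorkGroupReduce` is hypothetical, the inner loop over LID stands in for the work-items, and the end of each outer iteration stands in for a barrier. It assumes the operation is associative and commutative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Serial stand-in for one work-group: In holds one element per work-item.
// Assumes BOp is associative and commutative and Identity is its identity.
template <typename T, typename BinaryOperation>
T simulateWorkGroupReduce(const std::vector<T> &In, T Identity,
                          BinaryOperation BOp) {
  const std::size_t WGSize = In.size();
  if (WGSize == 0)
    return Identity;

  // Local memory: WGSize elements plus one extra slot, initialized once.
  std::vector<T> LocalReds(In);
  LocalReds.push_back(Identity);

  std::size_t PrevStep = WGSize;
  for (std::size_t CurStep = PrevStep >> 1; CurStep > 0; CurStep >>= 1) {
    // Work-items with LID < CurStep combine pairs of elements.
    for (std::size_t LID = 0; LID < CurStep; ++LID)
      LocalReds[LID] = BOp(LocalReds[LID], LocalReds[LID + CurStep]);
    // If the previous range was odd, its last element has no partner;
    // the extra slot catches it so that it is not lost.
    if (PrevStep & 1)
      LocalReds[WGSize] = BOp(LocalReds[WGSize], LocalReds[PrevStep - 1]);
    PrevStep = CurStep; // a barrier would separate the steps on a device
  }
  return BOp(LocalReds[0], LocalReds[WGSize]);
}
```

In this serial form the extra slot is written exactly once before the loop, which corresponds to the `if (LID == 0)` variant suggested above. Having every work-item store the same identity value yields the same result, so the choice between the two is about cost rather than correctness.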

sycl/include/CL/sycl/handler.hpp

Pennycook

This looks really good to me. I spotted a few minor typos and have a few suggestions for refactoring, but nothing major.

Two other things that weren't tied to any particular line of code:

Could you please add a few tests using transparent functors as in https://github.com/intel/llvm/blob/sycl/sycl/test/group-algorithm/reduce.cpp#L75? I don't think this needs to be exhaustive, so just adding variants to the "Check with various operations" tests would be good.
Do you think it's clear what an "Aux" kernel is? It's clear to me, but may not be clear to readers less familiar with reductions. I don't want this point to block the merge, but thought I'd bring it up in case you had any ideas for alternative names.

sycl/include/CL/sycl/handler.hpp

Pennycook · 2020-04-27T14:24:41Z

sycl/include/CL/sycl/handler.hpp

+    // size may be not power of those. Those two cases considered inefficient
+    // as they require additional code and checks in the kernel.
+    bool IsUnderLoaded = NWorkGroups * WGSize != NWorkItems;
+    size_t InefficientCase = (IsUnderLoaded || (WGSize & (WGSize - 1))) ? 1 : 0;


It took me a while to work out why this is declared as a size_t.

Do you think it would be clearer if InefficientCase was a bool? You could shift the logic of whether you need to add 1 or 0 to line 945.

Yes, I did that. Thank you.
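
A small sketch of the refactoring discussed here: keep the flag as a bool and decide at the point of use whether one extra local element is needed. The helper names `isPowerOfTwo` and `numLocalElements` are hypothetical, and the sketch assumes the number of work-groups is the work-item count rounded up to a multiple of WGSize.

```cpp
#include <cassert>
#include <cstddef>

// True when X is a power of two. The bit trick X & (X - 1) clears the lowest
// set bit, so the result is zero only if X has exactly one bit set.
inline bool isPowerOfTwo(std::size_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Hypothetical helper: number of local-memory elements for one work-group.
// Requires WGSize > 0.
inline std::size_t numLocalElements(std::size_t NWorkItems,
                                    std::size_t WGSize) {
  assert(WGSize > 0);
  const std::size_t NWorkGroups = (NWorkItems + WGSize - 1) / WGSize;
  // Under-loaded: the last work-group has fewer real elements than WGSize.
  const bool IsUnderLoaded = NWorkGroups * WGSize != NWorkItems;
  const bool IsEfficientCase = !IsUnderLoaded && isPowerOfTwo(WGSize);
  // The bool-to-count conversion happens here, where the size is computed.
  return WGSize + (IsEfficientCase ? 0 : 1);
}
```

With this shape, `numLocalElements(1024, 256)` needs no extra element, while `numLocalElements(1000, 256)` (under-loaded) and `numLocalElements(30, 10)` (work-group size not a power of two) each need one more.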

sycl/include/CL/sycl/handler.hpp

Pennycook · 2020-04-27T14:30:47Z

sycl/include/CL/sycl/handler.hpp

+      handler AuxHandler(QueueCopy, MIsHost);
+      AuxHandler.saveCodeLoc(MCodeLoc);
+
+      // The last kernel DOES write to reductions's accessor.


Suggested change

// The last kernel DOES write to reductions's accessor.

// The last kernel DOES write to reduction's accessor.

sycl/source/handler.cpp

Pennycook · 2020-04-27T14:34:23Z

sycl/test/reduction/reduction_nd_s0_dw.cpp

+// RUN: %CPU_RUN_PLACEHOLDER %t.out
+// RUN: %GPU_RUN_PLACEHOLDER %t.out
+// RUN: %ACC_RUN_PLACEHOLDER %t.out
+//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==//


Suggested change

//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==//

//==----------------reduction_nd_s0_dw.cpp - SYCL reduction basic test ---------==//

Although I think @bader said before that we can drop these licenses from tests, if we want to.

That is something new (I see that the new sycl/test/abi/* tests don't have it). I dropped the licenses.

Pennycook · 2020-04-27T14:36:03Z

sycl/test/reduction/reduction_nd_conditional.cpp

+template <typename T, int Dim, class BinaryOperation>
+class Unknown;
+
+template <typename T>


Agreed. I'd be tempted to rename it something like CustomType or CustomVec as well, to highlight why you're defining a new class instead of using sycl::vec.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

This patch adds the algorithm that implements 1 reduction in parallel_for(). It handles all types and operations, including user's custom ones. The more efficient variants are on the way.

What is NOT supported by this patch:
- parallel_for(range, ...) // i.e. simple range without work-group sizes
- parallel_for(nd_range, reduction1, reduction1, ...) // i.e. more than 1 reduction in parallel_for
- USM
- vector reductions (dims > 1 & #elements > 1)
- HOST. The implementation used in this patch uses barrier(), which is not supported on HOST yet.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

The fix also removes the field handler::MReductionsStorage and re-uses the existing MSharedPtrStorage to keep reductions buffers alive until the execution on device/host code using those buffers finishes. Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

…IT tests Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

v-klochkov · 2020-04-27T20:15:17Z

Thank you for the quick response/review.
I added the additional changes as a separate (5th) commit of this PR.
Please re-review.

v-klochkov · 2020-04-27T20:18:24Z

@Pennycook
Oops, I missed one more comment: this one. It will take some time for me to add another test case.

Could you please add a few tests using transparent functors as in https://github.com/intel/llvm/blob/sycl/sycl/test/group-algorithm/reduce.cpp#L75? I don't think this needs to be exhaustive, so just adding variants to the "Check with various operations" tests would be good.

Do you think it's clear what an "Aux" kernel is? It's clear to me, but may not be clear to readers less familiar with reductions. I don't want this point to block the merge, but thought I'd bring it up in case you had any ideas for alternative names.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

v-klochkov · 2020-04-27T21:45:45Z

@Pennycook
Oops, I missed one more comment.

This time I added a test to check reductions using transparent operators. Please see the 6th commit.
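
As a reminder of what "transparent" means here: a transparent functor does not fix its argument type as a template parameter but deduces it at each call. The sketch below uses the standard C++14 functors `std::plus<int>` and `std::plus<>` as stand-ins for the SYCL ones, and `foldWith` is a hypothetical helper written only for this example.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Folds In with BOp, starting from Init. BOp may be a typed functor such as
// std::plus<int> or a transparent one such as std::plus<>.
template <typename T, typename BinaryOperation>
T foldWith(const std::vector<T> &In, T Init, BinaryOperation BOp) {
  return std::accumulate(In.begin(), In.end(), Init, BOp);
}
```

The same `std::plus<>` object can be passed for `std::vector<int>` and `std::vector<double>` inputs, whereas `std::plus<int>` only matches one element type. A test for transparent operators therefore mainly checks that the reduction machinery can still work out the element type and identity when the functor itself does not name them.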

Regarding 'Aux' and naming: I don't see very good alternatives right now. Perhaps the comments I added before reduAuxCGFunc() and inside the parallel_for() lowering help to explain what they do.

tovinkere

Does queue_impl apply to ordered_queue as well? If so, these changes look good. It has been a while since I have looked at queues, so apologies for requesting clarification.

tovinkere · 2020-04-27T22:46:21Z

sycl/include/CL/sycl/handler.hpp

  /// \return a SYCL event object representing the command group
-  event finalize(const cl::sycl::detail::code_location &Payload = {});
+  event finalize();


It appears like you have refactored the code location parameter that was coming into finalize() - unfortunately, ordered_queue uses the previous convention. Since the signature of handler.finalize() has changed, this may break ordered_queue.

Can you please reflect the same changes to ordered_queue as well?

sycl/include/CL/sycl/handler.hpp

tovinkere · 2020-04-27T22:49:40Z

sycl/source/detail/queue_impl.hpp

@@ -362,8 +367,9 @@ class queue_impl {
                    shared_ptr_class<queue_impl> Self,
                    const detail::code_location &Loc) {
    handler Handler(std::move(Self), MHostQueue);
+    Handler.saveCodeLoc(Loc);


Apply comments from handler.hpp:247 - ordered_queue will require the same changes.

tovinkere · 2020-04-27T23:04:55Z

sycl/source/handler.cpp

+  Queue->addEvent(std::move(Event));
+}
+
+event handler::finalize() {


Ordered_queue sends in the code location information through finalize as a parameter. This change will affect ordered_queue. Please ensure that ordered_queue is correct as well.

Would you please show the code that calls the finalize() method and passes code_location to it?
I grepped all files in the SYCL folder and did not find any calls that would not be fixed.
The only finalize(code_loc) call was in queue_impl::submit_impl(), which is used by ordered_queue.

But I fixed submit_impl, so no additional changes are required.

@v-klochkov I looked at it too and they share the queue_impl, so everything looks good.

romanovvlad · 2020-04-28T14:16:38Z

sycl/include/CL/sycl/handler.hpp

+    size_t NWorkGroups = Range.get_group_range().size();
+
+    bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;
+    bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0);


Please add comments explaining why these bool variables are computed the way they are: why you consider one case efficient and the other not.

romanovvlad · 2020-04-28T14:17:09Z

sycl/include/CL/sycl/handler.hpp

+    bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;
+    bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0);
+
+    bool IsUpdateOfUserAcc =


Suggested change

bool IsUpdateOfUserAcc =

const bool IsUpdateOfUserAcc =

Please, apply to the whole patch.

Please explain what it changes.

The IsUpdateOfUserAcc check includes 'NWorkGroups == 1', which is not known statically. If so, how is this case different from all the other temporary variables, such as IsEfficientCase?
Are you asking to use 'const TYPE Var = ;' whenever 'Var' is initialized once and not changed afterwards?

romanovvlad · 2020-04-28T14:19:11Z

sycl/include/CL/sycl/handler.hpp

+    // The additional last element is used to catch elements that could
+    // otherwise be lost in the tree-reduction algorithm.
+    size_t NumLocalElements = WGSize + (IsEfficientCase ? 0 : 1);
+    auto LocalReds = Redu.getReadWriteLocalAcc(NumLocalElements, *this);


Please, do not use auto in cases where the type is not clear. Please, apply to the whole patch.

That is one of the reasons why I have that function in the reduction_impl class: buffer.hpp and accessor.hpp cannot be included in handler.hpp.
I'll change it if you really insist, but I would prefer to keep this 'auto'.

romanovvlad · 2020-04-28T14:22:33Z

sycl/include/CL/sycl/handler.hpp

+    size_t WGSize = Range.get_local_range().size();
+    size_t NWorkGroups = Range.get_group_range().size();
+
+    bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;


I guess the more common name for what is computed here is something like "HasNonUniformWG".

romanovvlad · 2020-04-28T15:38:42Z

sycl/include/CL/sycl/handler.hpp

+    // to user's accessor, then detach user's accessor from this kernel
+    // to make the dependencies between accessors and kernels more clean and
+    // correct.
+    if (NWorkGroups > 1)


How is it connected to the "last one kernel" from the comment above?

If mainCGFunc is the last kernel, i.e. if nd_range was (N, N), then mainCGFunc and the kernel in it will write to the user's accessor.
Otherwise (if NWorkGroups > 1, i.e. the number of partial sums > 1), another kernel will follow the main kernel, so the main kernel does not write to the user's reduction variable and the dependency on the user's variable is redundant.

romanovvlad · 2020-04-28T15:39:42Z

sycl/include/CL/sycl/handler.hpp

+    // 1. Call the kernel that includes user's lambda function.
+    // If this kernel is going to be now last one, i.e. it does not write
+    // to user's accessor, then detach user's accessor from this kernel
+    // to make the dependencies between accessors and kernels more clean and


What is the problem if we do not detach accessor? What exactly will be incorrect?

There will be redundant dependencies. Probably nothing incorrect.
Let's suppose there are:

(1) initialization of user's reduction accessor
(2) main CGFunc (with user's lambda and 1 iteration to reduce elements)
(3) Aux kernel 1
(4) Aux kernel 2
(5) Final Aux kernel 3

If we do not detach the reduction accessor:

(1)
(2) depends on (1)
(3) depends on (2)
(4) depends on (3)
(5) depends on (4) and (1)

If we detach the reduction accessor:

(2) can be started even before (1)
(3) depends on (2)
(4) depends on (3)
(1) may init reduction's buffer here
(5) depends on (4) and (1)

So, do you think I need just to remove dissociateWithHandler?

romanovvlad · 2020-04-28T15:52:36Z

sycl/include/CL/sycl/handler.hpp

+    while (NWorkItems > 1) {
+      // Before creating another kernel, add the event from the previous kernel
+      // to queue.
+      addEventToQueue(QueueCopy, MLastEvent);


This is done to mimic the dependency chain that is created for normal SYCL code without reductions, i.e.:
Q.submit({ user's CGFunc and lambda func })
while (NWorkGroups > 1)
  Q.submit({ Aux function to reduce elements; })
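
The chain of submissions sketched above can be simulated on the host to see how the main kernel and the auxiliary kernels hand partial results to each other. This is only a serial model under stated assumptions: `reducePass` and `reduceInPasses` are hypothetical names, each call to `reducePass` stands for one kernel launch, and every work-group is reduced by a plain loop rather than a tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One "kernel launch": each work-group of up to WGSize inputs yields one
// partial result.
template <typename T, typename BinaryOperation>
std::vector<T> reducePass(const std::vector<T> &In, std::size_t WGSize,
                          T Identity, BinaryOperation BOp) {
  std::vector<T> Partial;
  for (std::size_t Begin = 0; Begin < In.size(); Begin += WGSize) {
    const std::size_t End = std::min(Begin + WGSize, In.size());
    T Acc = Identity;
    for (std::size_t I = Begin; I < End; ++I)
      Acc = BOp(Acc, In[I]);
    Partial.push_back(Acc);
  }
  return Partial;
}

// Main pass followed by auxiliary passes until a single value remains.
// Each pass consumes the output of the previous one, which is the
// dependency chain the queue has to respect.
template <typename T, typename BinaryOperation>
T reduceInPasses(std::vector<T> Data, std::size_t WGSize, T Identity,
                 BinaryOperation BOp, std::size_t *NumAuxPasses = nullptr) {
  assert(WGSize > 1); // WGSize <= 1 would never shrink the data
  std::size_t AuxPasses = 0;
  Data = reducePass(Data, WGSize, Identity, BOp); // main kernel
  while (Data.size() > 1) {
    Data = reducePass(Data, WGSize, Identity, BOp); // aux kernel
    ++AuxPasses;
  }
  if (NumAuxPasses)
    *NumAuxPasses = AuxPasses;
  return Data.empty() ? Identity : Data.front();
}
```

For 100 elements and WGSize = 4, the main pass leaves 25 partial results and three auxiliary passes (25 to 7, 7 to 2, 2 to 1) finish the reduction.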

romanovvlad · 2020-04-28T15:59:54Z

sycl/include/CL/sycl/handler.hpp

+  ///
+  /// Briefly: user's lambda, tree-reduction, CUSTOM types/ops.
+  template <typename KernelName, typename KernelType, int Dims, class Reduction>
+  void reduCGFunc(KernelType KernelFunc, const nd_range<Dims> &Range,


I suggest moving these functions to a separate header. It seems strange for the handler class to handle reductions at this level.

I'll try that.

I'll move them to sycl/intel/reduction.hpp and make them separate routines. I don't see any value in adding them as static methods of the 'reducer' or 'reduction_impl' class.

romanovvlad · 2020-04-28T16:03:53Z

sycl/include/CL/sycl/handler.hpp

+      addEventToQueue(QueueCopy, MLastEvent);
+
+      // TODO: here the work-group size is not limited by user's needs,
+      // the better strategy here is to make the work-group-size as big


Why is this the better strategy?

Because the user may specify an nd_range with a small work-group for good reasons, such as requirements of the user's lambda; for example nd_range=(1M, 4). The main kernel must obey that and use WGSize=4, but it is obviously more efficient to use a bigger WGSize for the aux kernels, which only reduce elements and do that much faster when WGSize is big (a bigger WGSize also requires fewer aux kernel calls).

I'll update the comment.
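
The effect of the work-group size on the number of auxiliary launches can be checked with a short calculation. The sketch below uses a hypothetical helper, `countAuxLaunches`, and assumes that each launch turns every group of WGSize partial results into one. Taking "1M" as 1,000,000 work-items with WGSize = 4, the main kernel leaves 250,000 partial results to be reduced.

```cpp
#include <cassert>
#include <cstddef>

// Number of launches needed to shrink NElements partial results to one,
// when each launch reduces groups of WGSize elements (rounding up).
inline std::size_t countAuxLaunches(std::size_t NElements, std::size_t WGSize) {
  assert(WGSize > 1); // otherwise the element count would never shrink
  std::size_t Launches = 0;
  while (NElements > 1) {
    NElements = (NElements + WGSize - 1) / WGSize;
    ++Launches;
  }
  return Launches;
}
```

Under these assumptions, `countAuxLaunches(250000, 4)` is 9, while `countAuxLaunches(250000, 256)` is 3 (250,000 to 977, then 4, then 1), which is the "fewer aux kernel calls" argument in numbers.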

romanovvlad · 2020-04-28T16:30:13Z

sycl/include/CL/sycl/handler.hpp

+    // correct.
+    if (NWorkGroups > 1)
+      dissociateWithHandler(Redu.MAcc);
+


I suggest creating a separate handler even to run the user's lambda, and setting the "command type" of this handler to something like NOP or Aggregator, so that finalize does nothing for such a type.

What is the reasoning for that?
That approach seems more error-prone. The user may create many accessors, which get associated with 'this' handler and not with the new handler. The new handler will not have any knowledge of those objects.
The user may also make additional calls on 'this' handler, which will have no effect on the new handler running the user's code.

AlexeySachkov

As code owner of function_pointers.hpp: the changes there still look good to me.

romanovvlad

I'm OK if my comments are resolved in a separate PR.

v-klochkov requested a review from Pennycook April 24, 2020 23:55

v-klochkov requested review from AlexeySachkov and a team as code owners April 24, 2020 23:55

v-klochkov requested a review from sergey-semenov April 24, 2020 23:55

v-klochkov assigned Pennycook Apr 24, 2020

v-klochkov requested review from romanovvlad and removed request for sergey-semenov April 24, 2020 23:56

v-klochkov requested a review from againull April 25, 2020 07:26

AlexeySachkov reviewed Apr 27, 2020

andreyfe1 reviewed Apr 27, 2020

bader requested a review from tovinkere April 27, 2020 09:27

Pennycook requested changes Apr 27, 2020

v-klochkov added 5 commits April 27, 2020 13:11

[SYCL] Fix potential errors caused by new sycl::intel::detail namespace

29cdc5e

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

[SYCL][LIT] Add a new LIT test for reduction + conditional statement

f974f85

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

[SYCL] Do additional changes per reviewer's comments, fix regressed L…

a409752

…IT tests Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

v-klochkov force-pushed the public_vklochkov_reduction_p2 branch from 315fb8b to a409752 Compare April 27, 2020 20:12

v-klochkov requested review from Pennycook and AlexeySachkov April 27, 2020 20:15

[SYCL] Add a test to check reductions using transparent operators

a6350ba

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>

Pennycook approved these changes Apr 27, 2020

tovinkere requested changes Apr 27, 2020

v-klochkov requested a review from tovinkere April 27, 2020 23:42

tovinkere approved these changes Apr 28, 2020

romanovvlad requested changes Apr 28, 2020

AlexeySachkov approved these changes Apr 28, 2020

romanovvlad approved these changes Apr 28, 2020

v-klochkov merged commit bb73d92 into intel:sycl Apr 28, 2020

This was referenced Apr 29, 2020

[SYCL] Do additional mostly NFC changes for reduction patch(1585) #1602

Merged

[SYCL][CUDA] Reduction ext unsupported #1641

Merged

v-klochkov deleted the public_vklochkov_reduction_p2 branch May 8, 2020 06:35

masterleinad mentioned this pull request Oct 22, 2020

SYCL feature level 5 kokkos/kokkos#3480

Merged
