Skip to content

[SYCL] Implement basic reduction for parallel_for() accepting nd_range #1585

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 28, 2020

Conversation

v-klochkov
Copy link
Contributor

@v-klochkov v-klochkov commented Apr 24, 2020

This patch adds the algorithm that implements 1 reduction in parallel_for().
It handles all types and operations, including user's custom ones.
The more efficient variants are on the way.

What is NOT supported by this patch:

  • parallel_for(range, ...) // i.e. simple range without work-group sizes
  • parallel_for(nd_range, reduction1, reduction1, ...) // i.e. more than
    1 reductions in paralell_for
  • USM
  • vector reductions (dims > 1 & #elements > 1)
  • HOST. The implementation used in this patch uses barrier(), which
    is not supported on HOST yet.

@v-klochkov v-klochkov requested a review from Pennycook April 24, 2020 23:55
@v-klochkov v-klochkov requested review from AlexeySachkov and a team as code owners April 24, 2020 23:55
@v-klochkov v-klochkov requested review from romanovvlad and removed request for sergey-semenov April 24, 2020 23:56
@v-klochkov
Copy link
Contributor Author

v-klochkov commented Apr 25, 2020

Please see the current proposal for reduction feature here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md

The first patch adding reduction.hpp file and reduction classes is here: #1585

Sorry, for the big patch, but the majority of those newly added lines are the 5 new LIT tests.

@v-klochkov v-klochkov requested a review from againull April 25, 2020 07:26
Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, I was added as code owner for function_pointers.hpp and I have no objections against trivial change proposed to that file.

A few minor comments for the rest of PR

@@ -46,7 +46,7 @@ class interop_handler {

public:
using QueueImplPtr = std::shared_ptr<detail::queue_impl>;
using ReqToMem = std::pair<detail::Requirement*, pi_mem>;
using ReqToMem = std::pair<detail::Requirement *, pi_mem>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an unrelated change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was done by clang-format

template <typename T, int Dim, class BinaryOperation>
class Unknown;

template <typename T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could create a header file for definition of this class, since it is used in several tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'd be tempted to rename it something like CustomType or CustomVec as well, to highlight why you're defining a new class instead of using sycl::vec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I created a new reduction_utils.hpp header and renamed the type. Thank you.

size_t GID = NDIt.get_global_linear_id();
// Copy the element to local memory to prepare it for tree-reduction
LocalReds[LID] = (GID < NWorkItems) ? In[GID] : ReduIdentity;
LocalReds[WGSize] = ReduIdentity;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be better to allow only one NDIt write to WGSize to not have multiple write operations from different items?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excuse me, I do not understand what you suggest here. Please give more details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that you have WGSize work items: 0,1,..., WGSize-1. Each of work-items updates LocalReds[WGSize]. Would it better to have something like that if(LID==0){LocalReds[WGSize] = ...}. So, only one work item (not all) updates the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't mind I'll try your suggestion in the next patch.
It is a tiny fix that should not hold the commit, because it will not change performance in any way.

Having additional conditional in the code is also bad for performance. (Hmm, both the conditional and that duplicated write could be eliminated by vectorizer that combines ops across work-items),
I'll check the device code and do or not do the additional fix with the next patch which is currently blocked by this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Thanks for the response

@bader bader requested a review from tovinkere April 27, 2020 09:27
Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good to me. I spotted a few minor typos and have a few suggestions for refactoring, but nothing major.

Two other things that weren't tied to any particular line of code:

  1. Could you please add a few tests using transparent functors as in https://github.com/intel/llvm/blob/sycl/sycl/test/group-algorithm/reduce.cpp#L75? I don't think this needs to be exhaustive, so just adding variants to the "Check with various operations" tests would be good.

  2. Do you think it's clear what an "Aux" kernel is? It's clear to me, but may not be clear to readers less familiar with reductions. I don't want this point to block the merge, but thought I'd bring it up in case you had any ideas for alternative names.

// size may be not power of those. Those two cases considered inefficient
// as they require additional code and checks in the kernel.
bool IsUnderLoaded = NWorkGroups * WGSize != NWorkItems;
size_t InefficientCase = (IsUnderLoaded || (WGSize & (WGSize - 1))) ? 1 : 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to work out why this is declared as a size_t.

Do you think it would be clearer if InefficientCase was a bool? You could shift the logic of whether you need to add 1 or 0 to line 945.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did that. Thank you.

handler AuxHandler(QueueCopy, MIsHost);
AuxHandler.saveCodeLoc(MCodeLoc);

// The last kernel DOES write to reductions's accessor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The last kernel DOES write to reductions's accessor.
// The last kernel DOES write to reduction's accessor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
// RUN: %ACC_RUN_PLACEHOLDER %t.out
//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==//
//==----------------reduction_nd_s0_dw.cpp - SYCL reduction basic test ---------==//

Although I think @bader said before that we can drop these licenses from tests, if we want to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is something new (I see that new sycl/test/abi/* tests don't have it. I dropped licenses.

template <typename T, int Dim, class BinaryOperation>
class Unknown;

template <typename T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'd be tempted to rename it something like CustomType or CustomVec as well, to highlight why you're defining a new class instead of using sycl::vec.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
This patch adds the algorithm that implements 1 reduction in parallel_for().
It handles all types and operations, including user's custom ones.
The more efficient variants are on the way.

What is NOT supported by this patch:
- parallel_for(range, ...) // i.e. simple range without work-group sizes
- parallel_for(nd_range, reduction1, reduction1, ...) // i.e. more than
  1 reductions in paralell_for
- USM
- vector reductions (dims > 1 & #elements > 1)
- HOST. The implmentation used in this patch uses barrier(), which
  is not supported on HOST yet.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
The fix also removes the field handler::MReductionsStorage and re-uses
the existing MSharedPtrStorage to keep reductions buffers alive
until the execution on device/host code using those buffers finishes.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
…IT tests

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_p2 branch from 315fb8b to a409752 Compare April 27, 2020 20:12
@v-klochkov
Copy link
Contributor Author

Thank you for quick response/review.
I added the additional changes as a separate (5th) commit of this PR.
Please re-review.

@v-klochkov
Copy link
Contributor Author

@Pennycook
Ops, I missed one more comment. This one. It will take some time for me to add another test case.

Could you please add a few tests using transparent functors as in https://github.com/intel/llvm/blob/sycl/sycl/test/group-algorithm/reduce.cpp#L75? I don't think this needs to be exhaustive, so just adding variants to the "Check with various operations" tests would be good.

Do you think it's clear what an "Aux" kernel is? It's clear to me, but may not be clear to readers less familiar with reductions. I don't want this point to block the merge, but thought I'd bring it up in case you had any ideas for alternative names.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
@v-klochkov
Copy link
Contributor Author

@Pennycook
Ops, I missed one more comment.

This time I added a test to check reductions using transparent operators. Please see the 6th commit.

Regarding 'Aux' and naming. I don't see very good alternatives right now. Perhaps the comments, I added before reduAuxCGFunc() and inside inside parallel_for() lowering, help understand what they do.

Copy link
Contributor

@tovinkere tovinkere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does queue_impl apply to ordered_queue as well? If so, these changes look good. It has been a while since I have looked at queues, so apologies for requesting clarification.

/// \return a SYCL event object representing the command group
event finalize(const cl::sycl::detail::code_location &Payload = {});
event finalize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears like you have refactored the code location parameter that was coming into finalize() - unfortunately, ordered_queue uses the previous convention. Since the signature of handler.finalize() has changed, this may break ordered_queue.

Can you please reflect the same changes to ordered_queue as well?

@@ -362,8 +367,9 @@ class queue_impl {
shared_ptr_class<queue_impl> Self,
const detail::code_location &Loc) {
handler Handler(std::move(Self), MHostQueue);
Handler.saveCodeLoc(Loc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply comments from handler.hpp:247 - ordered_queue will require the same changes.

Queue->addEvent(std::move(Event));
}

event handler::finalize() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordered_queue sends in the code location information through finalize as a parameter. This change will affect ordered_queue. Please ensure that ordered_queue is correct as well.

Copy link
Contributor Author

@v-klochkov v-klochkov Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please show the code that calls finalize() method and passes code_location to it?
I grepped all files in SYCL folder and did not find any calls that would not be fixed.
The only finalize(code_loc) call was in queue_impl::submit_impl(), which is used by ordered_queue.

But I fixed submit_impl, so no additional changes required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v-klochkov I looked at it too and the share the queue_impl, so everything looks good.

@v-klochkov v-klochkov requested a review from tovinkere April 27, 2020 23:42
size_t NWorkGroups = Range.get_group_range().size();

bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;
bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments why these bool vars are computed in the way how they computed: Why you think one case is efficient, another - not.

bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;
bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0);

bool IsUpdateOfUserAcc =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool IsUpdateOfUserAcc =
const bool IsUpdateOfUserAcc =

Please, apply to the whole patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what it changes.

IsUpdateOfUserAcc check includes 'NWorkGroups == 1', which is not known statically, and if so,then how this case is different from all other temp variables such as IsEfficientCase, etc.
Do you ask using 'const TYPE Var = ;' when 'Var' is initialized once and not changed after?

// The additional last element is used to catch elements that could
// otherwise be lost in the tree-reduction algorithm.
size_t NumLocalElements = WGSize + (IsEfficientCase ? 0 : 1);
auto LocalReds = Redu.getReadWriteLocalAcc(NumLocalElements, *this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, do not use auto in cases where the type is not clear. Please, apply to the whole patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is one of reasons why I have that func in reduction_impl class.
Because the buffer.hpp and accessor.hpp cannot be included in handler.hpp.
I'll change if you really insist/repeat the request, I would prefer this 'auto'.

size_t WGSize = Range.get_local_range().size();
size_t NWorkGroups = Range.get_group_range().size();

bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the more common name for what is computed here is something like "HasNonUnfiromWG"

// to user's accessor, then detach user's accessor from this kernel
// to make the dependencies between accessors and kernels more clean and
// correct.
if (NWorkGroups > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it connected to the "last one kernel" from the comment above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mainCGFunc is the last kernel, i.e. if nd_range was (N, N), then mainCGFunc and the kernel in it will write to user's accessor.
Otherwise (if NWorkGroups > 1, i.e. number of partial sums > 1), there will be another kernel following to the main kernel. and thus main kernel does not write to user's reduction variable and the dependency to user's variable is redundant.

// 1. Call the kernel that includes user's lambda function.
// If this kernel is going to be now last one, i.e. it does not write
// to user's accessor, then detach user's accessor from this kernel
// to make the dependencies between accessors and kernels more clean and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the problem if we do not detach accessor? What exactly will be incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be redundant dependencies. Probably, nothing incorrect.
Let suppose there are

  1. initialization of user's reduction accessor
  2. main CGFunc (with user's lambda and 1 iteration to reduce elements)
  3. Aux kernel 1
  4. Aux kernel 2
  5. Final Aux kernel 3

If do not detach reduction accessor: If detach reduction accessor:

(1)                                                                (2) can be started even before (1)
(2) depends on (1)                                       (3) depends on (2)
(3) depends on (2)                                       (4) depends on (3)
(4) depends on (3)                                       (1) may init reduction's buffer here 
(5) depends on (4) and (1)                           (5) depends on (5) and (1)

So, do you think I need just to remove dissociateWithHandler?

while (NWorkItems > 1) {
// Before creating another kernel, add the event from the previous kernel
// to queue.
addEventToQueue(QueueCopy, MLastEvent);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is done to mimic the dependencies chain that is created for normal SYCL code without reductions. I.e.:
Q.submit( { user's CGFunc and lambda func})
while (NWorkGroups > 1)
Q.submit( { Aux function to reduce elements; })

///
/// Briefly: user's lambda, tree-reduction, CUSTOM types/ops.
template <typename KernelName, typename KernelType, int Dims, class Reduction>
void reduCGFunc(KernelType KernelFunc, const nd_range<Dims> &Range,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest moving these functions to some separate header. It seems strange that handler class handling reductions on this level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move that to sycl/intel/reduction.hpp and make them separate routines. I don't see any value in adding them as static method to 'reducer' or 'reduction_impl' class.

addEventToQueue(QueueCopy, MLastEvent);

// TODO: here the work-group size is not limited by user's needs,
// the better strategy here is to make the work-group-size as big
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is the better strategy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because user may specify nd_range with small work-group because of some good reasons requirements for user's lambda, for example nd_range=(1M, 4), So, the main kernel must obey that and use WGSize=4, but it is obviously more efficient to use bigger WGSize for aux kernels that only reduce elements and do that much faster when WGSize is big (it also will require less calls of aux kernel, when the WGSize is bigger).

I'll update comment.

// correct.
if (NWorkGroups > 1)
dissociateWithHandler(Redu.MAcc);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest creating separate handler even to run user's lambda. And set "command type" of this handler to something like a NOP or Aggregator so finalize does nothing for such a type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning for that?
That approach seems more error prone. User may create many accessors, which get associated with 'this' handler and not with the new handler. New handler will not have any knowledge about those objects
User may do some additional calls for 'this' handler which will not have effect to new handler running user's code.

Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for code owner of function_pointers.hpp, changes in there still look good to me

Copy link
Contributor

@romanovvlad romanovvlad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK if my comments are resolved in a separate PR.

@v-klochkov v-klochkov merged commit bb73d92 into intel:sycl Apr 28, 2020
@v-klochkov v-klochkov deleted the public_vklochkov_reduction_p2 branch May 8, 2020 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants