-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Implement basic reduction for parallel_for() accepting nd_range #1585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Implement basic reduction for parallel_for() accepting nd_range #1585
Conversation
Please see the current proposal for reduction feature here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md The first patch adding reduction.hpp file and reduction classes is here: #1585 Sorry, for the big patch, but the majority of those newly added lines are the 5 new LIT tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, I was added as code owner for function_pointers.hpp
and I have no objections against trivial change proposed to that file.
A few minor comments for the rest of PR
@@ -46,7 +46,7 @@ class interop_handler { | |||
|
|||
public: | |||
using QueueImplPtr = std::shared_ptr<detail::queue_impl>; | |||
using ReqToMem = std::pair<detail::Requirement*, pi_mem>; | |||
using ReqToMem = std::pair<detail::Requirement *, pi_mem>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be an unrelated change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was done by clang-format
template <typename T, int Dim, class BinaryOperation> | ||
class Unknown; | ||
|
||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could create a header file for definition of this class, since it is used in several tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I'd be tempted to rename it something like CustomType
or CustomVec
as well, to highlight why you're defining a new class instead of using sycl::vec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I created a new reduction_utils.hpp header and renamed the type. Thank you.
size_t GID = NDIt.get_global_linear_id(); | ||
// Copy the element to local memory to prepare it for tree-reduction | ||
LocalReds[LID] = (GID < NWorkItems) ? In[GID] : ReduIdentity; | ||
LocalReds[WGSize] = ReduIdentity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be better to allow only one NDIt
write to WGSize
to not have multiple write operations from different items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excuse me, I do not understand what you suggest here. Please give more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that you have WGSize
work items: 0,1,..., WGSize-1
. Each of work-items updates LocalReds[WGSize]
. Would it better to have something like that if(LID==0){LocalReds[WGSize] = ...}
. So, only one work item (not all) updates the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't mind I'll try your suggestion in the next patch.
It is a tiny fix that should not hold the commit, because it will not change performance in any way.
Having additional conditional in the code is also bad for performance. (Hmm, both the conditional and that duplicated write could be eliminated by vectorizer that combines ops across work-items),
I'll check the device code and do or not do the additional fix with the next patch which is currently blocked by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Thanks for the response
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good to me. I spotted a few minor typos and have a few suggestions for refactoring, but nothing major.
Two other things that weren't tied to any particular line of code:
-
Could you please add a few tests using transparent functors as in https://github.com/intel/llvm/blob/sycl/sycl/test/group-algorithm/reduce.cpp#L75? I don't think this needs to be exhaustive, so just adding variants to the "Check with various operations" tests would be good.
-
Do you think it's clear what an "Aux" kernel is? It's clear to me, but may not be clear to readers less familiar with reductions. I don't want this point to block the merge, but thought I'd bring it up in case you had any ideas for alternative names.
sycl/include/CL/sycl/handler.hpp
Outdated
// size may be not power of those. Those two cases considered inefficient | ||
// as they require additional code and checks in the kernel. | ||
bool IsUnderLoaded = NWorkGroups * WGSize != NWorkItems; | ||
size_t InefficientCase = (IsUnderLoaded || (WGSize & (WGSize - 1))) ? 1 : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to work out why this is declared as a size_t
.
Do you think it would be clearer if InefficientCase
was a bool
? You could shift the logic of whether you need to add 1 or 0 to line 945.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did that. Thank you.
sycl/include/CL/sycl/handler.hpp
Outdated
handler AuxHandler(QueueCopy, MIsHost); | ||
AuxHandler.saveCodeLoc(MCodeLoc); | ||
|
||
// The last kernel DOES write to reductions's accessor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The last kernel DOES write to reductions's accessor. | |
// The last kernel DOES write to reduction's accessor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// RUN: %CPU_RUN_PLACEHOLDER %t.out | ||
// RUN: %GPU_RUN_PLACEHOLDER %t.out | ||
// RUN: %ACC_RUN_PLACEHOLDER %t.out | ||
//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//==----------------reduction_ctor.cpp - SYCL reduction basic test ---------==// | |
//==----------------reduction_nd_s0_dw.cpp - SYCL reduction basic test ---------==// |
Although I think @bader said before that we can drop these licenses from tests, if we want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is something new (I see that new sycl/test/abi/* tests don't have it. I dropped licenses.
template <typename T, int Dim, class BinaryOperation> | ||
class Unknown; | ||
|
||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I'd be tempted to rename it something like CustomType
or CustomVec
as well, to highlight why you're defining a new class instead of using sycl::vec
.
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
This patch adds the algorithm that implements 1 reduction in parallel_for(). It handles all types and operations, including user's custom ones. The more efficient variants are on the way. What is NOT supported by this patch: - parallel_for(range, ...) // i.e. simple range without work-group sizes - parallel_for(nd_range, reduction1, reduction1, ...) // i.e. more than 1 reductions in paralell_for - USM - vector reductions (dims > 1 & #elements > 1) - HOST. The implmentation used in this patch uses barrier(), which is not supported on HOST yet. Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
The fix also removes the field handler::MReductionsStorage and re-uses the existing MSharedPtrStorage to keep reductions buffers alive until the execution on device/host code using those buffers finishes. Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
…IT tests Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
315fb8b
to
a409752
Compare
Thank you for quick response/review. |
@Pennycook
|
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
This time I added a test to check reductions using transparent operators. Please see the 6th commit. Regarding 'Aux' and naming. I don't see very good alternatives right now. Perhaps the comments, I added before reduAuxCGFunc() and inside inside parallel_for() lowering, help understand what they do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does queue_impl apply to ordered_queue as well? If so, these changes look good. It has been a while since I have looked at queues, so apologies for requesting clarification.
/// \return a SYCL event object representing the command group | ||
event finalize(const cl::sycl::detail::code_location &Payload = {}); | ||
event finalize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears like you have refactored the code location parameter that was coming into finalize() - unfortunately, ordered_queue uses the previous convention. Since the signature of handler.finalize() has changed, this may break ordered_queue.
Can you please reflect the same changes to ordered_queue as well?
@@ -362,8 +367,9 @@ class queue_impl { | |||
shared_ptr_class<queue_impl> Self, | |||
const detail::code_location &Loc) { | |||
handler Handler(std::move(Self), MHostQueue); | |||
Handler.saveCodeLoc(Loc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apply comments from handler.hpp:247 - ordered_queue will require the same changes.
Queue->addEvent(std::move(Event)); | ||
} | ||
|
||
event handler::finalize() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordered_queue sends in the code location information through finalize as a parameter. This change will affect ordered_queue. Please ensure that ordered_queue is correct as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please show the code that calls finalize() method and passes code_location to it?
I grepped all files in SYCL folder and did not find any calls that would not be fixed.
The only finalize(code_loc) call was in queue_impl::submit_impl(), which is used by ordered_queue.
But I fixed submit_impl, so no additional changes required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v-klochkov I looked at it too and the share the queue_impl, so everything looks good.
size_t NWorkGroups = Range.get_group_range().size(); | ||
|
||
bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0; | ||
bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments why these bool vars are computed in the way how they computed: Why you think one case is efficient, another - not.
bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0; | ||
bool IsEfficientCase = !IsUnderLoaded && ((WGSize & (WGSize - 1)) == 0); | ||
|
||
bool IsUpdateOfUserAcc = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool IsUpdateOfUserAcc = | |
const bool IsUpdateOfUserAcc = |
Please, apply to the whole patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain what it changes.
IsUpdateOfUserAcc check includes 'NWorkGroups == 1', which is not known statically, and if so,then how this case is different from all other temp variables such as IsEfficientCase, etc.
Do you ask using 'const TYPE Var = ;' when 'Var' is initialized once and not changed after?
// The additional last element is used to catch elements that could | ||
// otherwise be lost in the tree-reduction algorithm. | ||
size_t NumLocalElements = WGSize + (IsEfficientCase ? 0 : 1); | ||
auto LocalReds = Redu.getReadWriteLocalAcc(NumLocalElements, *this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, do not use auto in cases where the type is not clear. Please, apply to the whole patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is one of reasons why I have that func in reduction_impl class.
Because the buffer.hpp and accessor.hpp cannot be included in handler.hpp.
I'll change if you really insist/repeat the request, I would prefer this 'auto'.
size_t WGSize = Range.get_local_range().size(); | ||
size_t NWorkGroups = Range.get_group_range().size(); | ||
|
||
bool IsUnderLoaded = (NWorkGroups * WGSize - NWorkItems) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the more common name for what is computed here is something like "HasNonUnfiromWG"
// to user's accessor, then detach user's accessor from this kernel | ||
// to make the dependencies between accessors and kernels more clean and | ||
// correct. | ||
if (NWorkGroups > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it connected to the "last one kernel" from the comment above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the mainCGFunc is the last kernel, i.e. if nd_range was (N, N), then mainCGFunc and the kernel in it will write to user's accessor.
Otherwise (if NWorkGroups > 1, i.e. number of partial sums > 1), there will be another kernel following to the main kernel. and thus main kernel does not write to user's reduction variable and the dependency to user's variable is redundant.
// 1. Call the kernel that includes user's lambda function. | ||
// If this kernel is going to be now last one, i.e. it does not write | ||
// to user's accessor, then detach user's accessor from this kernel | ||
// to make the dependencies between accessors and kernels more clean and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the problem if we do not detach accessor? What exactly will be incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be redundant dependencies. Probably, nothing incorrect.
Let suppose there are
- initialization of user's reduction accessor
- main CGFunc (with user's lambda and 1 iteration to reduce elements)
- Aux kernel 1
- Aux kernel 2
- Final Aux kernel 3
If do not detach reduction accessor: If detach reduction accessor:
(1) (2) can be started even before (1)
(2) depends on (1) (3) depends on (2)
(3) depends on (2) (4) depends on (3)
(4) depends on (3) (1) may init reduction's buffer here
(5) depends on (4) and (1) (5) depends on (5) and (1)
So, do you think I need just to remove dissociateWithHandler?
while (NWorkItems > 1) { | ||
// Before creating another kernel, add the event from the previous kernel | ||
// to queue. | ||
addEventToQueue(QueueCopy, MLastEvent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is done to mimic the dependencies chain that is created for normal SYCL code without reductions. I.e.:
Q.submit( { user's CGFunc and lambda func})
while (NWorkGroups > 1)
Q.submit( { Aux function to reduce elements; })
/// | ||
/// Briefly: user's lambda, tree-reduction, CUSTOM types/ops. | ||
template <typename KernelName, typename KernelType, int Dims, class Reduction> | ||
void reduCGFunc(KernelType KernelFunc, const nd_range<Dims> &Range, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest moving these functions to some separate header. It seems strange that handler class handling reductions on this level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move that to sycl/intel/reduction.hpp and make them separate routines. I don't see any value in adding them as static method to 'reducer' or 'reduction_impl' class.
addEventToQueue(QueueCopy, MLastEvent); | ||
|
||
// TODO: here the work-group size is not limited by user's needs, | ||
// the better strategy here is to make the work-group-size as big |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is the better strategy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because user may specify nd_range with small work-group because of some good reasons requirements for user's lambda, for example nd_range=(1M, 4), So, the main kernel must obey that and use WGSize=4, but it is obviously more efficient to use bigger WGSize for aux kernels that only reduce elements and do that much faster when WGSize is big (it also will require less calls of aux kernel, when the WGSize is bigger).
I'll update comment.
// correct. | ||
if (NWorkGroups > 1) | ||
dissociateWithHandler(Redu.MAcc); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest creating separate handler even to run user's lambda. And set "command type" of this handler to something like a NOP or Aggregator so finalize does nothing for such a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning for that?
That approach seems more error prone. User may create many accessors, which get associated with 'this' handler and not with the new handler. New handler will not have any knowledge about those objects
User may do some additional calls for 'this' handler which will not have effect to new handler running user's code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for code owner of function_pointers.hpp
, changes in there still look good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK if my comments are resolved in a separate PR.
This patch adds the algorithm that implements 1 reduction in parallel_for().
It handles all types and operations, including user's custom ones.
The more efficient variants are on the way.
What is NOT supported by this patch:
1 reductions in paralell_for
is not supported on HOST yet.