-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Implement hierarchical parallelism API. #221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// is to better support handling of situations where there must be a loop nest | ||
// over a multi-dimensional space - it allows to avoid generating unnecessary | ||
// outer loops like 'for (int z=0; z<1; z++)' in case of 1D and 2D iteration | ||
// spaces or writing specializations of the algorighms for 1D, 2D and 3D cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly relevant as it's understandable either way but "algorighms" -> "algorithms"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch. Will change
sycl/include/CL/sycl/handler.hpp
Outdated
template <typename KernelType, int Dims> | ||
void parallel_for(nd_range<Dims> ExecutionRange, KernelType KernelFunc) { | ||
parallel_for<KernelType, KernelType, Dims>(ExecutionRange, KernelFunc); | ||
template <typename KernelName = std::nullptr_t, typename KernelType, int Dims> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly a reason behind it that I'm not seeing at the moment, but is there a reason the dimension is named "Dims" here and "dimensions" in kernel_parallel_for_work_group and in the other parallel_for_work_group interface (https://github.com/intel/llvm/pull/221/files#diff-234ddc8d342a3b24bb61d130262e9624R665). Perhaps worth renaming the other two that use "dimensions" to "Dims" to keep consistency. Maybe just me being pedantic though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason. Will make it "Dims" - I agree, should be uniform.
|
||
#include "CL/__spirv/spirv_vars.hpp" | ||
|
||
#define DEFINE_INIT_SIZES(POSTFIX) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a silly request, but would it be possible to call this something other than spirv_glue.hpp, as at the moment we currently use these functions as a sort of top level API that offloads to SPIRV/SPIR intrinsics based on what we compile for. e.g: https://github.com/triSYCL/sycl/blob/sycl/unified/next/sycl/include/CL/__spir/spir_vars.hpp I'm aware the names are SPIRV oriented though.
Maybe this isn't the place for that though or the best implementation on our end (I did have a slightly more complex/generic implementation in a prior commit but it was a bit overly complex for our current needs), but it would be nice to have a sort of top level id/size/group etc. API that calls the underlying implementation be it OpenCL SPIR calls, SPIRV calls or something else entirely. Hopefully this is a somewhat reasonable sounding idea/request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We should not be tied to SPIRV, as there can be other models or even runtimes on the device side underneath. Is it correct interpretation of your concern? How about device_rt_intrin.hpp or device_rt_builtins.hpp? For now there is __spirv* variant only, later implementation for more native RTs can be added. @bader, please also comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is indeed a correct interpretation of my concern! device_rt_builtins.hpp sounds fine by me but if you want to call it device_rt_intrin.hpp that's also all good in my book. Thanks for taking my concern into consideration!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally agree with @agozillon that spirv_glue
is unfortunate name - even w/o applying his "generalization idea".
I think current implementation of these functions can reside in spirv_vars
header.
I'm all in for using built-ins to get id/size information.
@agozillon, feel free to propose any changes simplifying triSYCL development. I think we can re-use https://github.com/triSYCL/sycl/blob/sycl/unified/next/sycl/include/CL/__spir/spir_vars.hpp and have two different implementations of __spir_ocl_get_*
built-ins:
#if __COMPILE_TO_SPIRV__
typedef size_t size_t_vec __attribute__((ext_vector_type(3)));
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalSize;
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalInvocationId;
extern "C" const __constant size_t_vec __spirv_BuiltInWorkgroupSize;
extern "C" const __constant size_t_vec __spirv_BuiltInLocalInvocationId;
extern "C" const __constant size_t_vec __spirv_BuiltInWorkgroupId;
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalOffset;
// __spir_ocl_get_* implementations can be generated by macro function
constexpr __spir_ocl_get_global_id(int dim) {
switch (dim) {
case 0:
return __spirv_BuiltInGlobalInvocationId.x;
case 1:
return __spirv_BuiltInGlobalInvocationId.y;
case 2:
return __spirv_BuiltInGlobalInvocationId.z;
}
}
...
#elif __XILINX_SPECIFIC_IMPLEMENTATION__
// some other implementation of __spir_ocl_get_*
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bader, I already made the change to device_rt_intrin.hpp, which seems to address @agozillon's concerns. Are you OK if we live with this name for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bader, OK, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something along those lines would be useful I think, I don't necessarily mind what form it takes and the top level function doesn't need to be named _spir_ocl_get* it could be _sycl_get* if you wish, in our case that's just my cheap/cheat way of working around conflicts with user functions with the same mangling as OpenCL
SPIR
built-ins like get_global_id
etc.
If you guys are happy with an abstraction like that (similar to our current approach) it's good with us, but perhaps someone has a more generic approach? As for the most part it would be nice to have an API for these functions that allows anyone to add there own underlying implementation, not necessarily just SPIR
/SPIRV
built-ins!
I had a slightly more generic and perhaps less sane implementation in a previous commit: triSYCL/sycl@489f98d#diff-0e3fe0cefdf466dc994182d2e123ba01 that may be more useful? I reverted it as we don't really have need for other built-ins just yet, so I can't really justify forcing someone else to maintain it... and a separate file approach allowed me to avoid frequent merge conflicts.
In either case, I don't want to be a bother and hold up these commits for too long, it can perhaps be a separate merge/issue discussion. For now I'm happy with either, the renaming of the header to something less SPIRV
centric or the movement of the functions to reside in spirv_vars
. Whatever works for you guys!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing _sycl_get* (or __sycl_get*) C APIs (C mangling assumed, right?) sounds like a good idea. Those would be generic SYCL "intrinsics" which must be resolved by actual SYCL device Back-end, e.g. by linking with _device.bc which will provide implementations of __sycl_get* for . So far we have __spirv* serving in this role, with a nice side effect that we don't need to provide _device.bc, because SPIRV translator knows how to lower them. But the problem with current __spirv* is that the interface is based on variables rather than functions, which is less robust.
This sounds like a separate important task which needs little more discussion, so I suggest that for the purpose of hierarchical parallelism I follow what @bader suggested - move init functions in the spirv_vars.hpp for now removing the device_rt_intrin.hpp, then we can refactor along the lines @agozillon proposed.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds like a separate task. I am happy with the change of removing the new header and moving the functions into spirv_vars.hpp for now, that seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments for the first two patches.
|
||
#include "CL/__spirv/spirv_vars.hpp" | ||
|
||
#define DEFINE_INIT_SIZES(POSTFIX) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally agree with @agozillon that spirv_glue
is unfortunate name - even w/o applying his "generalization idea".
I think current implementation of these functions can reside in spirv_vars
header.
I'm all in for using built-ins to get id/size information.
@agozillon, feel free to propose any changes simplifying triSYCL development. I think we can re-use https://github.com/triSYCL/sycl/blob/sycl/unified/next/sycl/include/CL/__spir/spir_vars.hpp and have two different implementations of __spir_ocl_get_*
built-ins:
#if __COMPILE_TO_SPIRV__
typedef size_t size_t_vec __attribute__((ext_vector_type(3)));
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalSize;
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalInvocationId;
extern "C" const __constant size_t_vec __spirv_BuiltInWorkgroupSize;
extern "C" const __constant size_t_vec __spirv_BuiltInLocalInvocationId;
extern "C" const __constant size_t_vec __spirv_BuiltInWorkgroupId;
extern "C" const __constant size_t_vec __spirv_BuiltInGlobalOffset;
// __spir_ocl_get_* implementations can be generated by macro function
constexpr __spir_ocl_get_global_id(int dim) {
switch (dim) {
case 0:
return __spirv_BuiltInGlobalInvocationId.x;
case 1:
return __spirv_BuiltInGlobalInvocationId.y;
case 2:
return __spirv_BuiltInGlobalInvocationId.z;
}
}
...
#elif __XILINX_SPECIFIC_IMPLEMENTATION__
// some other implementation of __spir_ocl_get_*
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Just a few comments...
// Terminal part of the implementation. | ||
template <int DIM, template <int> class LoopBoundTy, typename FuncTy, | ||
template <int> class LoopIndexTy = LoopBoundTy> | ||
static ALWAYS_INLINE detail::enable_if_t<DIM == 0> iterate_impl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear to me while you need std::enable_if
and cannot rely on normal template standardization.
Something similar to https://github.com/triSYCL/triSYCL/blob/master/include/CL/sycl/parallelism/detail/parallelism.hpp#L119 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this would be more elegant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interface remained the same, implementation will now use struct specialization for DIM==0 instead of enable_if_t to terminate recursion
for (int I = Dims_; I < 3; ++I) { | ||
GlobalSize[I] = 1; | ||
LocalSize[I] = LocalSize[0] ? 1 : 0; | ||
GlobalOffset[I] = 0; | ||
NumWorkGroups[I] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Design question: is there any reason to have hard-coded 3D explicit descriptors.
It looks to me it is a hard-coded SPIR-V/OpenCL constraint that leaks out at a higher level and pollute the abstraction.
Would it make more sense to have only the required number of dimensions everywhere and just in the back-end (the PI stuff?) have some conversion to higher-level dimensions if required by enqueue-ing some kernel or whatever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds interesting. Maybe it is worth to open an issue and discuss the redesign there? @romanovvlad - please chime in.
This patch just follows the existing design here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use 3 dimensions to get rid of template parameters, so we can easily pass it to the sycl library. We could probably create 3 different types for 3 dimensions, all of these types would be inherited from common interface, but I think it's too complex for this simple structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your point.
On the long term we can do some optimization if we realize that it means more resources on the device to store this useless information and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields are stored on host side only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, not that bad, then.
/// Number of workgroups, used to record the number of workgroups from the | ||
/// simplest form of parallel_for_work_group. If set, all other fields must be | ||
/// zero | ||
sycl::range<3> NumWorkGroups; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magical 3... It could be Dims
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as @romanovvlad wrote:
We always use 3 dimensions to get rid of template parameters
@@ -20,6 +20,8 @@ struct Builder; | |||
} | |||
template <int dimensions> struct id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably everything is a class
instead of a struct
now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id is still a struct
@@ -86,6 +88,7 @@ template <int dimensions = 1, bool with_offset = true> struct item { | |||
// For call constructor inside conversion operator | |||
friend struct item<dimensions, false>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably everything along is a class
instead of a struct
now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item is still a struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is always good to have some feedback from the implementors.
If you find some discrepancies in the spec, feel free to file an issue or even better a PR :-) on https://github.com/KhronosGroup/SYCL-Docs
Note that the published PDF might be out-dated compared to the source files.
// - h_item::get_global_id | ||
// The global size is not known, so kernel code needs to adapt | ||
// dynamically, hence the complex loop bound and index computation. | ||
std::memset(ptr, 0, range_length * sizeof(ptr[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least it is useful to have some tests using C-style and pointers to be sure SYCL can work with C programs... :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Let's say the test aims to be closer to what usual SYCL developer would write or port :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are depressing me ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, but there are a few thing we can improve.
As suggested by @bader, I split this PR into two: The last one depends on the previous one, which is not yet in intel/llvm, so I could not create it in intel/llvm. I'm was not sure how to best handle this, so I added all reviewers as collaborators into my repo, so that you guys can review there. I hope I addressed all outstanding comments from this PR. |
Please review the last two commits (not sure why other commits got into this PR):
|
This one was fixed by others, so please review only 1 commit - [SYCL] Implement hierarchical parallelism API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good.
A few minor issues.
Please, also finish resolving issues from my previous review - there a couple of them left.
5829664
to
49455d0
Compare
eb6444d
to
90b703d
Compare
Had to revert template <... template <int> typename ...> back to template <... template <int> class ...>, because the former leads to warning that it is C++17 feature and one of test failures. |
@kbobrovs, could you resolve the conflict with recently committed patches, please? |
This is the first part of SYCL hierarchical parallelism implementation. It implements main related APIs: - h_item class - group::parallel_for_work_item functions - handler::parallel_for_work_group functions It is able to run workloads which use these APIs but do not contain data or code with group-visible side effects between the work group and work item scopes. Signed-off-by: Konstantin S Bobrovsky <konstantin.s.bobrovsky@intel.com>
ok, done |
This is the first part of SYCL hierarchical parallelism implementation. It
implements all related APIs:
It is able to run workloads which use these APIs but do not contain data
or code with group-visible side effects between the work group and work
item scopes. Upcoming patches will implement the remaining part: private_data, data and code work-group level "scoping".
Signed-off-by: Konstantin Bobrovsky konstantin.s.bobrovsky@intel.com