[SYCL] WG-shared global variables must have external linkage #1279
againull
commented
Mar 10, 2020
LowerWGScope pass is an llvm pass that performs SYCL specific transformations in LLVM IR right after frontend. LLVM passes are supposed to be in llvm project and not in clang project. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Currently, hierarchical parallelism semantics are handled by SYCL-specific code generation and the LowerWGScope pass. WG-shared global variables are created by CG for automatic variables in PFWG scope, and WG-shared shadow variables are created by the LowerWGScope pass to broadcast a private value from the leader work item to the other work items. Currently these global variables are created with internal linkage, which is not correct. As a result, wrong transformations happen in the LLVM middle end. For example:

    ...
    if (Leader work item)
      store %PrivateValue to @SharedGlobal    -> leader shares the value
    memory_barrier()
    load %PrivateValue from @SharedGlobal     -> all WIs load the shared value
    ...

The generated load/store operations are not supposed to be moved across the memory barrier, but barrier intrinsics like @llvm.nvvm.barrier0() are treated as regular functions in the LLVM middle end. As soon as a global has internal linkage, it is considered non-escaping, and alias analysis concludes that @llvm.nvvm.barrier0() cannot modify the global variable and only reads it. As a result, the following transformation is performed by GVN:

    ...
    crit_edge:
      load %PrivateValue from @SharedGlobal   -> all WIs load the shared value
    if (Leader work item)
      store %PrivateValue to @SharedGlobal    -> leader shares the value
    memory_barrier()
    ...

That is why all WG-shared variables should have external linkage.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
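To make the effect of the reordering concrete, here is a minimal sequential sketch in C++. It is only a model: `WorkGroup`, `runIntended`, and `runHoisted` are hypothetical names that do not appear in the patch, the barrier is represented by phase ordering rather than real synchronization, and it assumes the leader path forwards its own stored value after the hoist.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one work-group; work-item 0 plays the leader.
// `shared` stands in for @SharedGlobal and starts with a stale value.
struct WorkGroup {
    int shared = -1;            // stale contents of the WG-shared global
    std::vector<int> observed;  // value each work-item ends up with
};

// Intended order: leader stores, barrier, then every work-item loads.
inline WorkGroup runIntended(std::size_t numItems, int leaderValue) {
    WorkGroup wg;
    wg.observed.assign(numItems, 0);
    wg.shared = leaderValue;             // phase 1: only the leader stores
    // barrier: phase 2 starts only after phase 1 has finished
    for (std::size_t i = 0; i < numItems; ++i)
        wg.observed[i] = wg.shared;      // phase 2: all work-items load
    return wg;
}

// Order after the hoist: the non-leader load runs before the store and barrier.
inline WorkGroup runHoisted(std::size_t numItems, int leaderValue) {
    WorkGroup wg;
    wg.observed.assign(numItems, 0);
    for (std::size_t i = 1; i < numItems; ++i)
        wg.observed[i] = wg.shared;      // hoisted load sees the stale value
    wg.shared = leaderValue;             // leader stores afterwards
    // barrier: too late to help, the loads have already happened
    if (numItems > 0)
        wg.observed[0] = leaderValue;    // assumption: leader forwards its own value
    return wg;
}
```

Calling `runIntended(4, 42)` gives every work-item the leader's value, while `runHoisted(4, 42)` leaves the non-leader work-items with the stale contents of the shared variable.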
ret void
}

!0 = !{}
What is !0 for?
// RUN: %clang_cc1 -triple spir64-unknown-unknown-sycldevice -fsycl-is-device -disable-llvm-passes -I %S/Inputs -emit-llvm %s -o - | FileCheck %s

// Checked that local variables declared by the user in PWFG scope are turned into globals in the local address space.
// CHECK: @{{.*myLocal.*}} = addrspace(3) global i32 0
FYI: I think the size of int depends on the host ABI, which is not set by the test, so on some platforms this check might fail due to sizeof(int) != i32.
It's probably better to set aux-target-triple.
It is not clear why re-ordering of loads/stores has anything to do with linkage type. Unless you reference a variable in other modules, it should not have external linkage.
"alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable and only reads it."
Why does AA treat a global with external linkage differently?
Local memory is referenced in other work-items/threads, so it must be external.
Not sure what I'm missing here, but these things seem to be totally unrelated.
+1
Let me provide the real IR before and after the transformation. This is a transformation performed by GVN based on Globals AA. Full modules are provided here: #1258
Before:
After:
In the LLVM middle end, intrinsics like @llvm.nvvm.barrier0() are treated as regular external functions; they are recognized only by the PTX backend, and this is correct. As soon as a global has internal linkage, it is considered non-escaping from the module, and alias analysis answers that @llvm.nvvm.barrier0() cannot modify the global variable and only reads it. In my view there is no bug here (in the LLVM middle end).

Since there is no bug in the LLVM transformations, we need to do something about the code currently generated by the LowerWGScope pass. One option is to make these global variables external. According to testing this causes linkage errors, which is expected, so I plan to add a pass that internalizes these globals again after the middle-end transformations. We can do this because the whole pipeline for the SYCL compiler is currently created in clang/lib/CodeGen/BackendUtil.cpp.

The other option is to change the design somehow. I am not sure if it is possible, but if we know the size of local memory needed for hierarchical parallelism, could we pass this size through the integration header to the SYCL RT and make the RT pass an additional local buffer of this size to the kernel?
In other words, |
Re-ordering of loads/stores depends on linkage type because alias analysis depends on linkage type. GVN uses memory dependency analysis (which is an interface to alias analysis) to prove that the barrier function doesn't modify the global. Based on that result (read, not modify), it performs the transformation that executes the load speculatively.
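The chain of reasoning described here can be written down as a toy decision model in C++. This is not LLVM's API: `Linkage`, `CallEffect`, `Global`, and the three functions are hypothetical names, and the rules encode only what this thread states (an internal, non-escaping global is treated as read-only across an opaque call; anything else is treated conservatively).

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins; none of these are LLVM types.
enum class Linkage { Internal, External };
enum class CallEffect { ReadOnly, MayReadWrite };

struct Global {
    Linkage linkage;
    bool addressEscapes;  // address leaves the module some other way
};

// A global is "non-escaping" when nothing outside the module can name it.
inline bool isNonEscaping(const Global& g) {
    return g.linkage == Linkage::Internal && !g.addressEscapes;
}

// Effect assumed for a call to a function whose body is not visible,
// such as a barrier intrinsic seen as a plain external declaration.
inline CallEffect effectOfOpaqueCall(const Global& g) {
    return isNonEscaping(g) ? CallEffect::ReadOnly
                            : CallEffect::MayReadWrite;
}

// A load may be moved above the call only if the call cannot write the global.
inline bool mayHoistLoadAcrossCall(const Global& g) {
    return effectOfOpaqueCall(g) == CallEffect::ReadOnly;
}
```

Under this model, `mayHoistLoadAcrossCall({Linkage::Internal, false})` is true and `mayHoistLoadAcrossCall({Linkage::External, false})` is false, which is the behavioral difference the patch relies on.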
I think that this is the main problem: the intrinsic is supposed to have the semantics of a memory fence, and the compiler should not re-order loads/stores over it. If it doesn't have these semantics and it is "just some function", then making global variables external looks like a workaround.
Yes. I think this is how the LLVM community is trying to make "classic text-book" optimizations applicable to a SIMT program optimizer: by re-using existing tools like linkage type. I'd like to note that the current patch is the "canonical" way to solve this problem in an OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW. @asavonic, @kbobrovs, if you have a better solution for this problem, please approach the LLVM community first. We need OpenCL implementations to align on the LLVM IR representation to make a compatible SYCL front-end. @againull, please take a look at the regressions.
I'm not sure which "canonical" solution you are referring to.
I don't think it works that way. You should not ask reviewers to find a proper solution. It is the submitter's responsibility to propose a correct implementation and justification. So far, the implementation and justification sound like a workaround at best, and no references to LLVM community code or documentation were given to prove that it is a valid and supported way of expressing barrier semantics.
Indeed. I missed that. Then I think the right direction is to investigate why the convergent attribute doesn't help (see #1257 (comment)), as it clearly works for OpenCL.
@bader thank you for your example of opencl program, it helped to figure out what is going on.
The difference is the following:
In case #1, Globals AA sees a call to the external function @_Z7barrierj(i32 1) and conservatively decides that nothing can be said about this call: it can read/modify any globals, so the load is not moved. So the problem is either in LLVM, which cannot handle @llvm.nvvm.barrier0(), or in the libclc library implementation, where __syncthreads is used in the OpenCL kernel as the implementation of the SPIR-V barrier for the PTX backend:
Sounds like a bug in LLVM passes. What if we replace +@Naghasan, have you looked at this? |
The barrier implementation is not doing the memfence (membar, I think, in PTX) that comes with the barrier. This may be part of the problem. I had another look at the generated IR in issue #1258. I'm not too familiar with the LLVM alias analysis, but it seems to be missing some MD on some instructions, and as there is a ptrtoint cast, this may also confuse the compiler into thinking it is fine to reorder.
@Naghasan could you please take a look at results of the investigation I provided above.
Even if we try membar out of curiosity then still illegal reordering is performed: https://godbolt.org/z/jXSs_6
Summary: The problem is not reproduced for CUDA only because the store and load instructions take an addrspacecast constant expression as an argument and not the global itself. GlobalsAA is simply not taught to deal with this addrspacecast: https://godbolt.org/z/aVGixD. The problem should be fixed in the llvm project: https://github.com/llvm/llvm-project/tree/master/llvm and, as far as I understand, @Naghasan is going to work on this. Preparing and committing a proper fix to https://github.com/llvm/llvm-project/tree/master/llvm can take some time, so I suggest this workaround as a temporary solution to enable hierarchical parallelism tests on the PTX backend:
Fixing the problem in LowerWGScope by generating external globals (this PR) or volatile stores/loads (#1257) is just a way to work around the LLVM problem that I described. But it is not good to work around this problem in this pass.