[SYCL] WG-shared global variables must have external linkage #1279

againull · 2020-03-10T01:28:57Z

Currently hierarchical parallelism semantics is handled by SYCL specific
code generation and LowerWGScope pass. WG-shared global variables are
created for automatic variables in PFWG scope by CG and WG-shared shadow
variables are created by LowerWGScope pass to broadcast private value
from leader work item to other work items.

Currently these global variables are created with internal linkage which
is not correct. As a result wrong transformations are happening in the
LLVM middle end.

For example,
...
if (Leader work item)
    store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal -> all WIs load the shared value
...

Generated load/store operations are not supposed to be moved across
memory barrier but barrier intrinsics like @llvm.nvvm.barrier0() are
considered as regular functions in the LLVM middle end. As soon as
global has an interanl linkage it is considered as non-escaping and alias
analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it. As a result the following transformation is performed
by GVN:

...
crit_edge:
  load %PrivateValue from @SharedGlobal -> all WIs load the shared value

if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
...

That is why all WG-shared variables should have external linkage.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

LowerWGScope pass is an llvm pass that performs SYCL specific transformations in LLVM IR right after frontend. LLVM passes are supposed to be in llvm project and not in clang project. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

Currently hierarchical parallelism semantics is handled by SYCL specific code generation and LowerWGScope pass. WG-shared global variables are created for automatic variables in PFWG scope by CG and WG-shared shadow variables are created by LowerWGScope pass to broadcast private value from leader work item to other work items. Currently these global variables are created with internal linkage which is not correct. As a result wrong transformations are happening in the LLVM middle end. For example, ... if (Leader work item) store %PrivateValue to @SharedGlobal -> leader shares the value memory_barrier() load %PrivateValue from @SharedGlobal -> all WIs load the shared value ... Generated load/store operations are not supposed to be moved across memory barrier but barrier intrinsics like @llvm.nvvm.barrier0() are considered as regular functions in the LLVM middle end. As soon as global has an interanl linkage it is considered as non-escaping and alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable and only reads it. As a result the following transformation is performed by GVN: ... crit_edge: load %PrivateValue from @SharedGlobal -> all WIs load the shared value if (Leader work item) store %PrivateValue to @SharedGlobal -> leader shares the value memory_barrier() ... That is why all WG-shared variables should have external linkage. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

bader · 2020-03-10T07:37:58Z

llvm/test/SYCLLowerIR/byval_arg.ll

+  ret void
+}
+
+!0 = !{}


What is !0 for?

bader · 2020-03-10T07:41:21Z

clang/test/CodeGenSYCL/wg_scope_var.cpp

+// RUN: %clang_cc1 -triple spir64-unknown-unknown-sycldevice -fsycl-is-device -disable-llvm-passes -I %S/Inputs -emit-llvm %s -o - | FileCheck %s
+
+// Checked that local variables declared by the user in PWFG scope are turned into globals in the local address space.
+// CHECK: @{{.*myLocal.*}} = addrspace(3) global i32 0


FYI: I think the size of int depends on the host ABI, which is not set by the test, so on some platforms this check might fail due to sizeof(int) != i32.
It's probably better to set aux-target-triple.

asavonic

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

bader · 2020-03-10T12:29:58Z

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Local memory is referenced in other work-items/threads, so it's must be external.

asavonic · 2020-03-10T12:34:37Z

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Local memory is referenced in other work-items/threads, so it's must be external.

Not sure what I'm missing here, but these things seem to be totally unrelated.
It must be global, yes, but why it must have external linkage and not internal?

kbobrovs · 2020-03-10T21:25:15Z

Not sure what I'm missing here, but these things seem to be totally unrelated.
It must be global, yes, but why it must have external linkage and not internal?

+1 external vs internal control symbol visibility between translation units, not accessibility by threads. Code in the same translation unit can spawn multiple threads that access shared variable with internal linkage.

againull · 2020-03-10T22:06:23Z

Let me provide real IR before and after transformation. This is a transformation performed by GVN based on Globals AA. Full modules are provided here: #1258

Before:

 tail call void @llvm.nvvm.barrier0() #4
  br i1 %30, label %31, label %37

31:                                               ; preds = %4
  %32 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4, !range !29
  %33 = getelementptr inbounds %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon", %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5, i64 0, i32 0
  %34 = ptrtoint %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5 to i64
  %35 = mul nuw nsw i32 %32, 3
  %36 = zext i32 %35 to i64
  store i64 %36, i64* addrspacecast (i64 addrspace(3)* @"_ZZZZ4mainENK3$_0clERN2cl4sycl7handlerEENKUlNS1_5groupILi1EEEE_clES5_E9wg_offset" to i64*), align 8, !tbaa !15
  store %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* %33, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  store i64 %34, i64 addrspace(3)* bitcast (%"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* addrspace(3)* @WGCopy to i64 addrspace(3)*), align 8
  br label %37

37:                                               ; preds = %4, %31
  call void @llvm.nvvm.barrier0() #4
  %38 = load %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"*, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8

After:

  tail call void @llvm.nvvm.barrier0() #4
  br i1 %30, label %31, label %._crit_edge

._crit_edge:                                      ; preds = %4
  %.pre = load %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"*, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  br label %37

31:                                               ; preds = %4
  %32 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4, !range !29
  %33 = getelementptr inbounds %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon", %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5, i64 0, i32 0
  %34 = ptrtoint %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5 to i64
  %35 = mul nuw nsw i32 %32, 3
  %36 = zext i32 %35 to i64
  store i64 %36, i64* addrspacecast (i64 addrspace(3)* @"_ZZZZ4mainENK3$_0clERN2cl4sycl7handlerEENKUlNS1_5groupILi1EEEE_clES5_E9wg_offset" to i64*), align 8, !tbaa !15
  store %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* %33, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  store i64 %34, i64 addrspace(3)* bitcast (%"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* addrspace(3)* @WGCopy to i64 addrspace(3)*), align 8
  br label %37

37:                                               ; preds = %._crit_edge, %31
  %38 = phi %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* [ %.pre, %._crit_edge ], [ %33, %31 ]
  call void @llvm.nvvm.barrier0() #4

In the LLVM middle end intrinsics like @llvm.nvvm.barrier0() in the module are considered as regular external functions, they are recognized only by PTX backend and this is correct. As soon as global has an internal linkage it is considered as non-escaping from the module and alias analysis gives an answer that @llvm.nvvm.barrier0() cannot modify global variable and only reads it. As for me there is no any bug here (in the LLVM middle-end).

Since there is no any bug in LLVM transformations we need to do something with code currently generated code by LowerWGScope pass. One option is to make these global variables as external. According to the testing this causes linkage errors, it looks expected ->I plan to add pass that internalizes these globals back after middle end transformations -> we can do this because all pipeline for SYCL compiler is currently created in clang/lib/CodeGen/BackendUtil.cpp.

Other option is to change design somehow. I am not sure if it is possible, but if we know the size of local memory which is needed for hier par needs, then can we pass this info about size through integration header to SYCL RT and make RT to pass additional local buffer of this size to the kernel?

againull · 2020-03-10T22:36:44Z

In other words,
Internal linkage for these globals look like right solution if we know that they are in local memory and there is a memory barrier between memory accesses.
But information about address space and about memory barrier is target-specific, middle-end transformations doesn't know anything about this. For llvm middle-end this is just some global and barrier intrinsic is just some function. If internal linkage results in legal llvm tranformation and breaks some target specific behavior then we must not generate these globals with internal linkage or must not generate these globals at all.

againull · 2020-03-10T22:43:29Z

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Re-order of load/stores depends on linkage type, because alias analysis depends on linkage type. GVN uses memory dependency analysis (which is interface to alias analysis) to prove that barrier function doesn't modify global. And based on result (read, not modify) it performs transformation to execute load speculatively.

asavonic · 2020-03-11T07:19:58Z

For llvm middle-end this is just some global and barrier intrinsic is just some function.

I think that this is the main problem: the intrinsic is supposed to have semantic of a memory fence, and compiler should not re-order loads/stores over it. If it doesn't have this semantic and it is "just some function", then making global variables external look like a workaround.

bader · 2020-03-11T09:25:50Z

For llvm middle-end this is just some global and barrier intrinsic is just some function.

I think that this is the main problem: the intrinsic is supposed to have semantic of a memory fence, and compiler should not re-order loads/stores over it. If it doesn't have this semantic and it is "just some function", then making global variables external look like a workaround.

Yes. I think this how LLVM community is trying to make "classic text-book" optimizations applicable for SIMT program optimizer - re-use existing tools like linkage type.
This might be not the only way to solve this problem - first version of the patch also solves the problem, but this not how it's solved in the community.

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

@asavonic, @kbobrovs, if you have better solution for this problem, please, approach LLVM community first. We need OpenCL implementations to align on LLVM IR representation to make compatible SYCL front-end.

@againull, please, take a look at the regressions.

asavonic · 2020-03-11T09:44:19Z

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

I'm not sure what the "canonical" solution you refer to.
The global variable in the link clearly has internal linkage, and not external as proposed by this patch.

@asavonic, @kbobrovs, if you have better solution for this problem, please, approach LLVM community first. We need OpenCL implementations to align on LLVM IR representation to make compatible SYCL front-end.

I don't think it works that way. You should not ask reviewers to find a proper solution. It a submitter's responsibility to propose a correct implementation and justification. So far, the implementation and justification sound like a workaround at best, and no references to LLVM community code or documentation was given to prove that it is a valid and supported way of expressing a barrier semantic.

bader · 2020-03-11T10:34:29Z

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

I'm not sure what the "canonical" solution you refer to.
The global variable in the link clearly has internal linkage, and not external as proposed by this patch.

Indeed. I missed that. Then I think the right direction is to investigate why convergent attribute doesn't help (see #1257 (comment)) as it clearly works for OpenCL.

againull · 2020-03-12T17:44:56Z

@bader thank you for your example of opencl program, it helped to figure out what is going on.

Here is your example: https://godbolt.org/z/YB8kVW. Load is not moved.
Here is my modified example: https://godbolt.org/z/6hJxTP. I have just replaced

barrier(CLK_LOCAL_MEM_FENCE);
with
__syncthreads();

Difference is the following:

barrier(CLK_LOCAL_MEM_FENCE); is lowered to function call tail call void @_Z7barrierj(i32 1)
__syncthreads(); is lowered to llvm intrinsic tail call void @llvm.nvvm.barrier0()

In the case #1 Globals AA sees call to external function @_Z7barrierj(i32 1) and conservatively decides that we cannot say anything about this call, it can read/modify any globals => load is not moved.
In the case #2 Globals AA can prove that llvm.nvvm.barrier0() can only read internal global => as a result load is moved.

So problem is in llvm which cannot handle @llvm.nvvm.barrier0() or in libclc library implementation where __syncthreads is used in opencl kernel which is an implementation of spirv barrier for ptx backend:

./libclc/ptx-nvidiacl/libspirv/synchronization/barrier.cl:
_CLC_DEF void _Z22__spirv_ControlBarrierN5__spv5ScopeES0_j(enum Scope scope, enum Scope memory, unsigned int semantics) {
  __syncthreads();
}

bader · 2020-03-12T19:49:48Z

Sounds like a bug in LLVM passes.

What if we replace __sycnthreads with OpenCL C built-in? Will NV back-end compiler be able to handle it?

+@Naghasan, have you looked at this?

Naghasan · 2020-03-16T14:31:46Z

The barrier implementation is not doing the memfence (membar I think in PTX) that comes with the barrier. This may be part of the problem.

I had another look at the generated IR in the issue #1258. I'm not too familiar with the LLVM alias analysis, but it seems to be missing some MD on some instructions and as there is a ptrtoint cast, this may also confuse the compiler in thinking it is fine to reorder.

againull · 2020-03-16T23:11:59Z

The barrier implementation is not doing the memfence (membar I think in PTX) that comes with the barrier. This may be part of the problem.
Accroding to description here https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-barrier

I had another look at the generated IR in the issue #1258. I'm not too familiar with the LLVM alias analysis, but it seems to be missing some MD on some instructions and as there is a ptrtoint cast, this may also confuse the compiler in thinking it is fine to reorder.

@Naghasan could you please take a look at results of the investigation I provided above.

According to description and example here https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions __syncthreads() is supposed to be used for this scenario:

Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)).

Even if we try membar out of curiosity then still illegal reordering is performed: https://godbolt.org/z/jXSs_6
So this is not a problem.

Globals AA doesn't deal with metadata so this is also not a problem.
ptrtoint cast also not a problem here, please see example without ptrtoint cast I provided above: https://godbolt.org/z/6hJxTP

Summary:
Problem is that PTX barrier is implemented as an LLVM intrinsic llvm.nvvm.barrier0 and Globals AA doesn’t handle it in a specific way but as a regular LLVM intrinsic. Globals AA can prove that llvm.nvvm.barrier0() can only read internal globals. As a result, illegal reordering of memory accesses is performed by transformations like GVN. Problem is not specific for SYCL and bug is not in the LowerWGScope pass that generates IR to perform hierarchical parallelism semantics.
As a prove let me provide an example in OpenCL where illegal transformation happens: https://godbolt.org/z/6hJxTP

Problem is not reproduced for CUDA just because store and load instructions have addrspacecast constant expression as an argument and not global itself. GlobalsAA is just not taught to deal with this addrspacecast: https://godbolt.org/z/aVGixD
If I remove these addrspacecast instructions manually then illegal transformation is also performed:
https://godbolt.org/z/cg2g6h

Problem should be fixed in llvm project: https://github.com/llvm/llvm-project/tree/master/llvm and as far as I undestand @Naghasan is going to work on this.

Preparing and committing proper fix to https://github.com/llvm/llvm-project/tree/master/llvm can take some time, so I suggest this workaround as a temporary solution to enable hierarchical parallelism tests on PTX backend:
#1334

againull · 2020-03-19T17:38:37Z

Fixing the problem in LowerWGScope by generating external globals (this PR) or volatile store/loads - #1257 is just the way to workaround the problem in llvm that I described. But it is not good to workaround this problem in this pass.
@Naghasan, Could you please take a look at proposed workaround in llvm #1334 and confirm that permanent fix will be done in llvm.
I am closing this PR, it is not going to be committed.

againull force-pushed the internal_to_external branch from 807ea48 to b064467 Compare March 10, 2020 01:30

againull requested review from kbobrovs and bader March 10, 2020 01:31

againull force-pushed the internal_to_external branch from b064467 to c9ae9b2 Compare March 10, 2020 01:37

againull mentioned this pull request Mar 10, 2020

[SYCL] Temporarily disable hier_par test #1253

Closed

againull force-pushed the internal_to_external branch from c9ae9b2 to 35fef1a Compare March 10, 2020 01:41

againull added 2 commits March 10, 2020 00:11

[SYCL] Add lit tests for SYCL specific CodeGen and LowerWGScope pass

4436fa2

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

againull force-pushed the internal_to_external branch from 35fef1a to 838163b Compare March 10, 2020 07:12

bader approved these changes Mar 10, 2020

View reviewed changes

asavonic suggested changes Mar 10, 2020

View reviewed changes

againull mentioned this pull request Mar 16, 2020

[SYCL] Mark calls to barrier and work-item functions as convergent #1333

Merged

againull closed this Mar 19, 2020

againull deleted the internal_to_external branch December 3, 2022 00:02

[SYCL] WG-shared global variables must have external linkage #1279

[SYCL] WG-shared global variables must have external linkage #1279

Uh oh!

Conversation

againull commented Mar 10, 2020

Uh oh!

bader Mar 10, 2020

Choose a reason for hiding this comment

Uh oh!

bader Mar 10, 2020

Choose a reason for hiding this comment

Uh oh!

asavonic left a comment

Choose a reason for hiding this comment

Uh oh!

bader commented Mar 10, 2020

Uh oh!

asavonic commented Mar 10, 2020

Uh oh!

kbobrovs commented Mar 10, 2020

Uh oh!

againull commented Mar 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

againull commented Mar 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

againull commented Mar 10, 2020

Uh oh!

asavonic commented Mar 11, 2020

Uh oh!

bader commented Mar 11, 2020

Uh oh!

asavonic commented Mar 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bader commented Mar 11, 2020

Uh oh!

againull commented Mar 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bader commented Mar 12, 2020

Uh oh!

Naghasan commented Mar 16, 2020

Uh oh!

againull commented Mar 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

againull commented Mar 19, 2020

Uh oh!

Uh oh!

againull commented Mar 10, 2020 •

edited

Loading

againull commented Mar 10, 2020 •

edited

Loading

asavonic commented Mar 11, 2020 •

edited

Loading

againull commented Mar 12, 2020 •

edited

Loading

againull commented Mar 16, 2020 •

edited

Loading