Skip to content

[SYCL] WG-shared global variables must have external linkage #1279

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

againull
Copy link
Contributor

Currently hierarchical parallelism semantics is handled by SYCL specific
code generation and LowerWGScope pass. WG-shared global variables are
created for automatic variables in PFWG scope by CG and WG-shared shadow
variables are created by LowerWGScope pass to broadcast private value
from leader work item to other work items.

Currently these global variables are created with internal linkage which
is not correct. As a result wrong transformations are happening in the
LLVM middle end.

For example,
...
if (Leader work item)
    store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal -> all WIs load the shared value
...

Generated load/store operations are not supposed to be moved across
memory barrier but barrier intrinsics like @llvm.nvvm.barrier0() are
considered as regular functions in the LLVM middle end. As soon as
global has an interanl linkage it is considered as non-escaping and alias
analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it. As a result the following transformation is performed
by GVN:

...
crit_edge:
  load %PrivateValue from @SharedGlobal -> all WIs load the shared value

if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
...

That is why all WG-shared variables should have external linkage.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>

LowerWGScope pass is an llvm pass that performs SYCL specific
transformations in LLVM IR right after frontend. LLVM passes are
supposed to be in llvm project and not in clang project.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@againull againull force-pushed the internal_to_external branch from c9ae9b2 to 35fef1a Compare March 10, 2020 01:41
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Currently hierarchical parallelism semantics is handled by SYCL specific
code generation and LowerWGScope pass. WG-shared global variables are
created for automatic variables in PFWG scope by CG and WG-shared shadow
variables are created by LowerWGScope pass to broadcast private value
from leader work item to other work items.

Currently these global variables are created with internal linkage which
is not correct. As a result wrong transformations are happening in the
LLVM middle end.

For example,

...
if (Leader work item)
    store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal -> all WIs load the shared value
...

Generated load/store operations are not supposed to be moved across
memory barrier but barrier intrinsics like @llvm.nvvm.barrier0() are
considered as regular functions in the LLVM middle end. As soon as
global has an interanl linkage it is considered as non-escaping and alias
analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it. As a result the following transformation is performed
by GVN:

...
crit_edge:
  load %PrivateValue from @SharedGlobal -> all WIs load the shared value

if (Leader work item)
   store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
...

That is why all WG-shared variables should have external linkage.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@againull againull force-pushed the internal_to_external branch from 35fef1a to 838163b Compare March 10, 2020 07:12
ret void
}

!0 = !{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is !0 for?

// RUN: %clang_cc1 -triple spir64-unknown-unknown-sycldevice -fsycl-is-device -disable-llvm-passes -I %S/Inputs -emit-llvm %s -o - | FileCheck %s

// Checked that local variables declared by the user in PWFG scope are turned into globals in the local address space.
// CHECK: @{{.*myLocal.*}} = addrspace(3) global i32 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: I think the size of int depends on the host ABI, which is not set by the test, so on some platforms this check might fail due to sizeof(int) != i32.
It's probably better to set aux-target-triple.

Copy link
Contributor

@asavonic asavonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

@bader
Copy link
Contributor

bader commented Mar 10, 2020

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Local memory is referenced in other work-items/threads, so it's must be external.

@asavonic
Copy link
Contributor

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Local memory is referenced in other work-items/threads, so it's must be external.

Not sure what I'm missing here, but these things seem to be totally unrelated.
It must be global, yes, but why it must have external linkage and not internal?

@kbobrovs
Copy link
Contributor

Not sure what I'm missing here, but these things seem to be totally unrelated.
It must be global, yes, but why it must have external linkage and not internal?

+1 external vs internal control symbol visibility between translation units, not accessibility by threads. Code in the same translation unit can spawn multiple threads that access shared variable with internal linkage.

@againull
Copy link
Contributor Author

againull commented Mar 10, 2020

Let me provide real IR before and after transformation. This is a transformation performed by GVN based on Globals AA. Full modules are provided here: #1258

Before:

 tail call void @llvm.nvvm.barrier0() #4
  br i1 %30, label %31, label %37

31:                                               ; preds = %4
  %32 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4, !range !29
  %33 = getelementptr inbounds %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon", %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5, i64 0, i32 0
  %34 = ptrtoint %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5 to i64
  %35 = mul nuw nsw i32 %32, 3
  %36 = zext i32 %35 to i64
  store i64 %36, i64* addrspacecast (i64 addrspace(3)* @"_ZZZZ4mainENK3$_0clERN2cl4sycl7handlerEENKUlNS1_5groupILi1EEEE_clES5_E9wg_offset" to i64*), align 8, !tbaa !15
  store %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* %33, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  store i64 %34, i64 addrspace(3)* bitcast (%"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* addrspace(3)* @WGCopy to i64 addrspace(3)*), align 8
  br label %37

37:                                               ; preds = %4, %31
  call void @llvm.nvvm.barrier0() #4
  %38 = load %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"*, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8

After:

  tail call void @llvm.nvvm.barrier0() #4
  br i1 %30, label %31, label %._crit_edge

._crit_edge:                                      ; preds = %4
  %.pre = load %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"*, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  br label %37

31:                                               ; preds = %4
  %32 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4, !range !29
  %33 = getelementptr inbounds %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon", %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5, i64 0, i32 0
  %34 = ptrtoint %"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* %5 to i64
  %35 = mul nuw nsw i32 %32, 3
  %36 = zext i32 %35 to i64
  store i64 %36, i64* addrspacecast (i64 addrspace(3)* @"_ZZZZ4mainENK3$_0clERN2cl4sycl7handlerEENKUlNS1_5groupILi1EEEE_clES5_E9wg_offset" to i64*), align 8, !tbaa !15
  store %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* %33, %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* addrspace(3)* @WGCopy1.0, align 8
  store i64 %34, i64 addrspace(3)* bitcast (%"class._ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEEUlNS1_5groupILi1EEEE_.anon"* addrspace(3)* @WGCopy to i64 addrspace(3)*), align 8
  br label %37

37:                                               ; preds = %._crit_edge, %31
  %38 = phi %"class._ZTSN2cl4sycl8accessorIiLi1ELNS0_6access4modeE1026ELNS2_6targetE2014ELNS2_11placeholderE0EEE.cl::sycl::accessor"* [ %.pre, %._crit_edge ], [ %33, %31 ]
  call void @llvm.nvvm.barrier0() #4

In the LLVM middle end intrinsics like @llvm.nvvm.barrier0() in the module are considered as regular external functions, they are recognized only by PTX backend and this is correct. As soon as global has an internal linkage it is considered as non-escaping from the module and alias analysis gives an answer that @llvm.nvvm.barrier0() cannot modify global variable and only reads it. As for me there is no any bug here (in the LLVM middle-end).

Since there is no any bug in LLVM transformations we need to do something with code currently generated code by LowerWGScope pass. One option is to make these global variables as external. According to the testing this causes linkage errors, it looks expected ->I plan to add pass that internalizes these globals back after middle end transformations -> we can do this because all pipeline for SYCL compiler is currently created in clang/lib/CodeGen/BackendUtil.cpp.

Other option is to change design somehow. I am not sure if it is possible, but if we know the size of local memory which is needed for hier par needs, then can we pass this info about size through integration header to SYCL RT and make RT to pass additional local buffer of this size to the kernel?

@againull
Copy link
Contributor Author

againull commented Mar 10, 2020

In other words,
Internal linkage for these globals look like right solution if we know that they are in local memory and there is a memory barrier between memory accesses.
But information about address space and about memory barrier is target-specific, middle-end transformations doesn't know anything about this. For llvm middle-end this is just some global and barrier intrinsic is just some function. If internal linkage results in legal llvm tranformation and breaks some target specific behavior then we must not generate these globals with internal linkage or must not generate these globals at all.

@againull
Copy link
Contributor Author

It is not clear why re-order of load/stores has anything to do with a linkage type. Unless you reference a variable in other modules, it should not have external linkage.

alias analysis thinks that @llvm.nvvm.barrier0() cannot modify global variable
and only reads it.

Why AA treats a global with external linkage differently?

Re-order of load/stores depends on linkage type, because alias analysis depends on linkage type. GVN uses memory dependency analysis (which is interface to alias analysis) to prove that barrier function doesn't modify global. And based on result (read, not modify) it performs transformation to execute load speculatively.

@asavonic
Copy link
Contributor

For llvm middle-end this is just some global and barrier intrinsic is just some function.

I think that this is the main problem: the intrinsic is supposed to have semantic of a memory fence, and compiler should not re-order loads/stores over it. If it doesn't have this semantic and it is "just some function", then making global variables external look like a workaround.

@bader
Copy link
Contributor

bader commented Mar 11, 2020

For llvm middle-end this is just some global and barrier intrinsic is just some function.

I think that this is the main problem: the intrinsic is supposed to have semantic of a memory fence, and compiler should not re-order loads/stores over it. If it doesn't have this semantic and it is "just some function", then making global variables external look like a workaround.

Yes. I think this how LLVM community is trying to make "classic text-book" optimizations applicable for SIMT program optimizer - re-use existing tools like linkage type.
This might be not the only way to solve this problem - first version of the patch also solves the problem, but this not how it's solved in the community.

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

@asavonic, @kbobrovs, if you have better solution for this problem, please, approach LLVM community first. We need OpenCL implementations to align on LLVM IR representation to make compatible SYCL front-end.

@againull, please, take a look at the regressions.

@asavonic
Copy link
Contributor

asavonic commented Mar 11, 2020

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

I'm not sure what the "canonical" solution you refer to.
The global variable in the link clearly has internal linkage, and not external as proposed by this patch.

@asavonic, @kbobrovs, if you have better solution for this problem, please, approach LLVM community first. We need OpenCL implementations to align on LLVM IR representation to make compatible SYCL front-end.

I don't think it works that way. You should not ask reviewers to find a proper solution. It a submitter's responsibility to propose a correct implementation and justification. So far, the implementation and justification sound like a workaround at best, and no references to LLVM community code or documentation was given to prove that it is a valid and supported way of expressing a barrier semantic.

@bader
Copy link
Contributor

bader commented Mar 11, 2020

I'd like note that current patch is the "canonical" way to solve this problem in OpenCL environment by LLVM: https://godbolt.org/z/YB8kVW.
I suggest following this approach to keep it compatible with OpenCL toolchains (including SPIR-V translator).

I'm not sure what the "canonical" solution you refer to.
The global variable in the link clearly has internal linkage, and not external as proposed by this patch.

Indeed. I missed that. Then I think the right direction is to investigate why convergent attribute doesn't help (see #1257 (comment)) as it clearly works for OpenCL.

@againull
Copy link
Contributor Author

againull commented Mar 12, 2020

@bader thank you for your example of opencl program, it helped to figure out what is going on.

barrier(CLK_LOCAL_MEM_FENCE);
with
__syncthreads();

Difference is the following:

  1. barrier(CLK_LOCAL_MEM_FENCE); is lowered to function call tail call void @_Z7barrierj(i32 1)
  2. __syncthreads(); is lowered to llvm intrinsic tail call void @llvm.nvvm.barrier0()

In the case #1 Globals AA sees call to external function @_Z7barrierj(i32 1) and conservatively decides that we cannot say anything about this call, it can read/modify any globals => load is not moved.
In the case #2 Globals AA can prove that llvm.nvvm.barrier0() can only read internal global => as a result load is moved.

So problem is in llvm which cannot handle @llvm.nvvm.barrier0() or in libclc library implementation where __syncthreads is used in opencl kernel which is an implementation of spirv barrier for ptx backend:

./libclc/ptx-nvidiacl/libspirv/synchronization/barrier.cl:
_CLC_DEF void _Z22__spirv_ControlBarrierN5__spv5ScopeES0_j(enum Scope scope, enum Scope memory, unsigned int semantics) {
  __syncthreads();
}

@bader
Copy link
Contributor

bader commented Mar 12, 2020

Sounds like a bug in LLVM passes.

What if we replace __sycnthreads with OpenCL C built-in? Will NV back-end compiler be able to handle it?

+@Naghasan, have you looked at this?

@Naghasan
Copy link
Contributor

The barrier implementation is not doing the memfence (membar I think in PTX) that comes with the barrier. This may be part of the problem.

I had another look at the generated IR in the issue #1258. I'm not too familiar with the LLVM alias analysis, but it seems to be missing some MD on some instructions and as there is a ptrtoint cast, this may also confuse the compiler in thinking it is fine to reorder.

@againull
Copy link
Contributor Author

againull commented Mar 16, 2020

The barrier implementation is not doing the memfence (membar I think in PTX) that comes with the barrier. This may be part of the problem.
Accroding to description here https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-barrier

I had another look at the generated IR in the issue #1258. I'm not too familiar with the LLVM alias analysis, but it seems to be missing some MD on some instructions and as there is a ptrtoint cast, this may also confuse the compiler in thinking it is fine to reorder.

@Naghasan could you please take a look at results of the investigation I provided above.

Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)).

Even if we try membar out of curiosity then still illegal reordering is performed: https://godbolt.org/z/jXSs_6
So this is not a problem.

  • Globals AA doesn't deal with metadata so this is also not a problem.

  • ptrtoint cast also not a problem here, please see example without ptrtoint cast I provided above: https://godbolt.org/z/6hJxTP

Summary:
Problem is that PTX barrier is implemented as an LLVM intrinsic llvm.nvvm.barrier0 and Globals AA doesn’t handle it in a specific way but as a regular LLVM intrinsic. Globals AA can prove that llvm.nvvm.barrier0() can only read internal globals. As a result, illegal reordering of memory accesses is performed by transformations like GVN. Problem is not specific for SYCL and bug is not in the LowerWGScope pass that generates IR to perform hierarchical parallelism semantics.
As a prove let me provide an example in OpenCL where illegal transformation happens: https://godbolt.org/z/6hJxTP

Problem is not reproduced for CUDA just because store and load instructions have addrspacecast constant expression as an argument and not global itself. GlobalsAA is just not taught to deal with this addrspacecast: https://godbolt.org/z/aVGixD
If I remove these addrspacecast instructions manually then illegal transformation is also performed:
https://godbolt.org/z/cg2g6h

Problem should be fixed in llvm project: https://github.com/llvm/llvm-project/tree/master/llvm and as far as I undestand @Naghasan is going to work on this.

Preparing and committing proper fix to https://github.com/llvm/llvm-project/tree/master/llvm can take some time, so I suggest this workaround as a temporary solution to enable hierarchical parallelism tests on PTX backend:
#1334

@againull
Copy link
Contributor Author

Fixing the problem in LowerWGScope by generating external globals (this PR) or volatile store/loads - #1257 is just the way to workaround the problem in llvm that I described. But it is not good to workaround this problem in this pass.
@Naghasan, Could you please take a look at proposed workaround in llvm #1334 and confirm that permanent fix will be done in llvm.
I am closing this PR, it is not going to be committed.

@againull againull closed this Mar 19, 2020
@againull againull deleted the internal_to_external branch December 3, 2022 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants