Fix llvm not generating proper address space loads and stores in PTX #3428

Merged 11 commits on Jul 3, 2020
9 changes: 8 additions & 1 deletion gen/dcompute/target.cpp
@@ -19,9 +19,16 @@
#include "gen/dcompute/target.h"
#include "gen/llvmhelpers.h"
#include "gen/runtime.h"
#include <string>
#include "ir/irtypestruct.h"


void DComputeTarget::doCodeGen(Module *m) {
// Reset any generated type info for dcompute types.
// The ll types get generated when the host code gets
// gen'd which means the address space info is not
// properly set.
IrTypeStruct::resetDComputeTypes();

// process module members
for (unsigned k = 0; k < m->members->length; k++) {
Dsymbol *dsym = (*m->members)[k];
22 changes: 21 additions & 1 deletion ir/irtypestruct.cpp
@@ -30,6 +30,21 @@ IrTypeStruct::IrTypeStruct(StructDeclaration *sd)

//////////////////////////////////////////////////////////////////////////////

std::vector<IrTypeStruct*> IrTypeStruct::dcomputeTypes;

/// Resets special DCompute structs so they get re-created
/// with the proper address space when generating device code.
void IrTypeStruct::resetDComputeTypes() {
  for(auto&& irTypeStruct : dcomputeTypes) {
    delete irTypeStruct->dtype->ctype;
    irTypeStruct->dtype->ctype = nullptr;
  }

  dcomputeTypes.clear();
}

//////////////////////////////////////////////////////////////////////////////

IrTypeStruct *IrTypeStruct::get(StructDeclaration *sd) {
auto t = new IrTypeStruct(sd);
sd->type->ctype = t;
@@ -45,9 +60,14 @@ IrTypeStruct *IrTypeStruct::get(StructDeclaration *sd) {

t->packed = isPacked(sd);

if(isFromLDC_DCompute(sd)) {
dcomputeTypes.push_back(t);
}

// For ldc.dcomputetypes.Pointer!(uint n,T),
// emit { T addrspace(gIR->dcomputetarget->mapping[n])* }
llvm::Optional<DcomputePointer> p;

if (gIR->dcomputetarget && (p = toDcomputePointer(sd))) {
Member

You've already computed p here.

Member

Actually, adding it to the vector should happen here, preventing the toDcomputePointer() call for non-dcompute compiles.

Contributor Author

Yup will fix.

Contributor Author

In reply to your second comment (I missed it the first time): we explicitly need to add it to the vector when dcompute is not the IR generator, because the normal passes set the ctype without the address space data that dcompute requires.

So perhaps, instead of calling toDcomputePointer there and getting the pointer, it should call isFromLDC_DCompute to push to the vector, and then get the pointer type only when dcompute is the IR generator.
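
The stale-cache problem discussed in this thread can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not LDC code: `FrontendType`, `DeviceTarget`, and `TypeCache` are hypothetical stand-ins for a front-end type with its cached `ctype`, the dcompute target, and the `dcomputeTypes` vector, and every lowered type is simplified to a `float` pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a front-end type that caches its lowered form.
struct FrontendType {
  int virtualAddrSpace;                 // virtual (target-independent) address space
  std::unique_ptr<std::string> lowered; // cached lowering, null until built
};

// Hypothetical stand-in for the active device target (null during host codegen).
struct DeviceTarget {
  std::vector<int> mapping; // virtual address space -> real address space
};

class TypeCache {
public:
  // Lowers the type on first use and caches the result on the type itself.
  // The type is tracked whether or not a device target is active, because a
  // host-side lowering is exactly the one that must be discarded later.
  const std::string &get(FrontendType &t, const DeviceTarget *target) {
    if (!t.lowered) {
      std::string s = "float";
      if (target) {
        assert(t.virtualAddrSpace >= 0 &&
               static_cast<std::size_t>(t.virtualAddrSpace) <
                   target->mapping.size());
        s += " addrspace(" +
             std::to_string(target->mapping[t.virtualAddrSpace]) + ")";
      }
      s += "*";
      t.lowered = std::make_unique<std::string>(s);
      tracked.push_back(&t);
    }
    return *t.lowered;
  }

  // Drops every cached lowering so the next get() rebuilds it.
  void reset() {
    for (FrontendType *t : tracked)
      t->lowered.reset();
    tracked.clear();
  }

private:
  std::vector<FrontendType *> tracked;
};
```

With an illustrative mapping of `{5, 1, 3, 4, 0}` (an assumption for this sketch), a type with virtual address space 2 lowers to `float*` during the host pass. Asking again with the device target still returns the stale `float*` until `reset()` is called, after which the same type lowers to `float addrspace(3)*`.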


// Translate the virtual dcompute address space into the real one for
6 changes: 6 additions & 0 deletions ir/irtypestruct.h
@@ -23,10 +23,16 @@ class IrTypeStruct : public IrTypeAggr {
///
IrTypeStruct *isStruct() override { return this; }

///
static void resetDComputeTypes();

protected:
///
explicit IrTypeStruct(StructDeclaration *sd);

///
static std::vector<IrTypeStruct*> dcomputeTypes;

/// StructDeclaration this type represents.
StructDeclaration *sd = nullptr;

57 changes: 55 additions & 2 deletions tests/codegen/dcompute_host_and_device.d
@@ -1,8 +1,12 @@
// Check that we can generate code for both the host and device in one compiler invocation
// REQUIRES: target_NVPTX
// RUN: %ldc -mdcompute-targets=cuda-350 -mdcompute-file-prefix=host_and_device -Iinputs %s %S/inputs/kernel.d
// RUN: %ldc -c -mdcompute-targets=cuda-350 -m64 -output-ll -mdcompute-file-prefix=host_and_device -Iinputs -output-o %s %S/inputs/kernel.d
// RUN: FileCheck %s --check-prefix=PTX < host_and_device_cuda350_64.ptx
// RUN: FileCheck %s --check-prefix=LL < dcompute_host_and_device.ll

import inputs.kernel : foo;
import inputs.kernel : k_foo;

import ldc.dcompute;

int tlGlobal;
__gshared int gGlobal;
@@ -12,4 +16,53 @@ void main(string[] args)
tlGlobal = 0;
gGlobal = 0;
string s = foo.mangleof;
string k_s = k_foo.mangleof;

GlobalPointer!float global_x;
foo(global_x);
}

void foo(GlobalPointer!float x_in) {
    // LL-LABEL: foo
    SharedPointer!float shared_x;
    PrivatePointer!float private_x;
    ConstantPointer!float const_x;

    // LL: [[s_load_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[s_addr_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[s_store_reg:%[0-9]*]] = load float, float* [[s_addr_reg]]
    // LL: store float [[s_store_reg]], float* [[s_load_reg]]
    *shared_x = *x_in;

    // LL: [[p_load_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[p_addr_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[p_store_reg:%[0-9]*]] = load float, float* [[p_addr_reg]]
    // LL: store float [[p_store_reg]], float* [[p_load_reg]]
    *private_x = *x_in;

    // LL: [[c_load_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[c_addr_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[c_store_reg:%[0-9]*]] = load float, float* [[c_addr_reg]]
    // LL: store float [[c_store_reg]], float* [[c_load_reg]]
    *x_in = *const_x;

    // LL: [[g1_load_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[g1_addr_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[g1_store_reg:%[0-9]*]] = load float, float* [[g1_addr_reg]]
    // LL: store float [[g1_store_reg]], float* [[g1_load_reg]]
    *x_in = *shared_x;

    // LL: [[g2_load_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[g2_addr_reg:%[0-9]*]] = load float*, float** {{%[0-9]*}}
    // LL: [[g2_store_reg:%[0-9]*]] = load float, float* [[g2_addr_reg]]
    // LL: store float [[g2_store_reg]], float* [[g2_load_reg]]
    *x_in = *private_x;
}

// PTX-LABEL: k_foo
// PTX: ld.global.f32
Member

Can you add more checks here? (Also for host code)

  • add test for loading from global_x and the others in host code as well
  • add CHECK-LABEL: to make sure the loads happen in the correct functions.

Member

New additions look good.
I guess I'm still confused as to why the host code checking is so much more verbose than the PTX checking.

Contributor Author

In both cases I’m testing every place where address space info shows up. The same lines in the device IR carry address space info, so I figured this was the most complete way to ensure that none of it leaks into the host code. By the time the PTX is generated, there are far fewer places where address space info shows up, which is why the checks look a little lopsided.

If you think it prudent, I can add device-side IR checks alongside the PTX checks to ensure that both the IR and the PTX have the proper address spaces, but I had (perhaps incorrectly) assumed that if the PTX had the address spaces, the IR would too.
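
As a rough illustration of why the PTX checks are so compact: on the device side, the address space of a load or store ends up as a single state-space qualifier in the instruction name. The sketch below uses the address space numbers documented in the LLVM NVPTX usage guide. `stateSpaceSuffix` and `ptxMnemonic` are hypothetical helper names invented for this example and are not part of LDC or LLVM.

```cpp
#include <string>

// NVPTX address space numbers, as listed in the LLVM NVPTX usage guide.
enum class NvptxAddrSpace : unsigned {
  Generic = 0,
  Global = 1,
  Shared = 3,
  Const = 4,
  Local = 5
};

// Hypothetical helper: the PTX state-space qualifier for an address space.
// A generic pointer carries no qualifier.
std::string stateSpaceSuffix(NvptxAddrSpace as) {
  switch (as) {
  case NvptxAddrSpace::Global:
    return ".global";
  case NvptxAddrSpace::Shared:
    return ".shared";
  case NvptxAddrSpace::Const:
    return ".const";
  case NvptxAddrSpace::Local:
    return ".local";
  case NvptxAddrSpace::Generic:
    return "";
  }
  return "";
}

// Hypothetical helper: the instruction name a PTX check line would match.
std::string ptxMnemonic(bool isStore, NvptxAddrSpace as,
                        const std::string &type) {
  return std::string(isStore ? "st" : "ld") + stateSpaceSuffix(as) + "." + type;
}
```

Under these assumptions, a load through a global pointer corresponds to `ld.global.f32`, while a pointer that lost its address space and fell back to generic would only produce `ld.f32`, which the `PTX:` check lines in this test would not match.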

Member

I don't know anything about PTX, so I trust you on this one ;)

// PTX: st.shared.f32
// PTX: st.local.f32
// PTX: ld.const.f32
// PTX: ld.shared.f32
// PTX: ld.local.f32
13 changes: 12 additions & 1 deletion tests/codegen/inputs/kernel.d
@@ -2,4 +2,15 @@
module inputs.kernel;

import ldc.dcompute;
@kernel void foo() {}
@kernel void k_foo(GlobalPointer!float x_in)
{
    SharedPointer!float shared_x;
    PrivatePointer!float private_x;
    ConstantPointer!float const_x;
    *shared_x = *x_in;
    *private_x = *x_in;
    *x_in = *const_x;

    *x_in = *shared_x;
    *x_in = *private_x;
}