
enable Gpu address spaces #10884

Merged
merged 2 commits into from
Feb 21, 2022

Conversation

gwenzek
Contributor

@gwenzek gwenzek commented Feb 13, 2022

Follow up on #10189, this enables GPU-specific address spaces, leveraging previous work from @Snektron in #9649.

Right now these map directly to the LLVM address spaces, but Snektron and I didn't find a good reason to deviate from that, and we believe it should be compatible across different architectures.

Note: this is still an experimental feature because Zig doesn't have any officially supported GPU target.
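For illustration, a minimal sketch of how these address spaces surface in Zig pointer types. This is experimental; the enum members follow the ones discussed in this PR, and the snippet assumes a GPU target such as nvptx64:

```zig
// Sketch only: GPU address spaces are experimental and only meaningful
// on a GPU target (e.g. nvptx64). Pointers can be qualified explicitly:
const GlobalPtr = *addrspace(.global) u32; // device-wide memory
const SharedPtr = *addrspace(.shared) u32; // on-chip, per-block memory
const ConstPtr = *addrspace(.constant) const u32; // read-only, cached
```

An unqualified pointer stays in the generic address space, which every qualified pointer can coerce into.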

@gwenzek gwenzek changed the title introduces Gpu address spaces enable Gpu address spaces Feb 13, 2022
@gwenzek
Contributor Author

gwenzek commented Feb 14, 2022

Tagging @Vexu, who reviewed the previous PR. (I'm not able to assign reviewers; not sure if that's intended or not.)

Collaborator

@Snektron Snektron left a comment


Looks good to me

@Snektron
Collaborator

Snektron commented Feb 14, 2022

One thing you might also want to implement is the logic for obtaining the default address space in a given context for this architecture:

zig/src/target.zig

Lines 598 to 614 in f506810

pub fn defaultAddressSpace(
    target: std.Target,
    context: enum {
        /// Query the default address space for global constant values.
        global_constant,
        /// Query the default address space for global mutable values.
        global_mutable,
        /// Query the default address space for function-local values.
        local,
        /// Query the default address space for functions themselves.
        function,
    },
) std.builtin.AddressSpace {
    _ = target;
    _ = context;
    return .generic;
}

Currently this just returns the generic address space, but since that is not allowed for SPIR-V, the idea was to assign a default address space for each context. This would, for example, set const x = y; at global scope to constant, and function-local variables to local. Do you happen to know if there's a big difference between the performance of different address spaces? From the LLVM NVPTX Docs it looks like the default address space is generic, and I assume this is what CUDA also uses, but I'm not very sure about that. If CUDA instead makes function-locals local (address space 5), it might be best to replicate that in Zig.

@gwenzek
Contributor Author

gwenzek commented Feb 14, 2022

Do you happen to know if there's a big difference between the performance of different address spaces?

So when you dereference a generic pointer, the GPU runtime first needs to identify what kind of pointer it is before dereferencing it.
I don't know how costly that is, but it's certainly better to use qualified pointers as much as possible.

The slowest memories are local and global; constant memory is a bit faster, shared memory is significantly faster, and registers are the fastest.

  • shared memory is on-chip and competes for space with the L1 cache (IIRC, need to find a source); it generally acts as a user-controlled cache.
  • the constant address space can be the same as global on some Nvidia GPUs, but is otherwise faster because it has dedicated caching.
  • local memory is generally only used for kernels that require too many registers.
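As a hedged sketch of the pattern those bullets describe, staging data into .shared memory as a user-managed cache might look like this (assumes an nvptx target; the kernel name and shapes are illustrative, and the declaration form is untested here):

```zig
// Hypothetical: a per-block tile in on-chip .shared memory, used as a
// user-controlled cache. .shared values cannot carry an initializer,
// so the tile is declared undefined and filled by the kernel at runtime.
var tile: [256]f32 addrspace(.shared) = undefined;

export fn scale(in: *addrspace(.global) const f32, out: *addrspace(.global) f32) callconv(.PtxKernel) void {
    // A real kernel would copy its slice of `in` into `tile`, synchronize
    // the block, and then operate on the fast on-chip copy.
    _ = in;
    _ = out;
}
```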

you might also want to implement is the logic for obtaining the default address space

I've tried it based on your recommendation, but I think it will require more work. It tends to generate compile errors when
generating the builtin.zig for nvptx64:

error: expected *const target.Target.Cpu.Model, found *addrspace(.constant) const target.Target.Cpu.Model

I've added the commit for reference, but will probably roll it back since it seems mostly useless for now:

zig/src/target.zig

Lines 604 to 614 in aa7acf0

.nvptx, .nvptx64 => switch (context) {
    // TODO: this should be .constant, but it generates an error when generating the builtin.zig for nvptx64
    // error: expected *const target.Target.Cpu.Model, found *addrspace(.constant) const target.Target.Cpu.Model
    .global_constant => .generic,
    .global_mutable => .global,
    // Note: local memory is similar in terms of performance to global memory;
    // we generally don't want to use it. The device will decide to use it if
    // needed, when loading the assembly.
    .local => .generic,
    .function => .generic,
},

What would be useful would be to mark kernel parameters that are pointers as pointers to global memory, but I haven't found a way to do so.

I'm also not sure if we should expose the .local address space. For Cuda at least, it's not something we should use. It's only used when the device doesn't have enough registers to compute a given kernel, and it's quite slow.

@Snektron
Collaborator

you might also want to implement is the logic for obtaining the default address space

I've tried it based on your recommendation, but I think it will require more work. It tends to generate compile errors when generating the builtin.zig for nvptx64:

error: expected *const target.Target.Cpu.Model, found *addrspace(.constant) const target.Target.Cpu.Model

I've added the commit for reference, but will probably roll it back since it seems mostly useless for now:

Hm, maybe we need to experiment a bit more with that. Let's address it another time.

What would be useful would be to mark kernel parameters that are pointers as pointers to global memory, but I haven't found a way to do so.

Do you mean automatically? Perhaps it could be inferred from the function calling convention. I am having second thoughts about the whole "automatically inferring address spaces" thing anyway; perhaps things should remain more explicit?

In the explicit (and current) case it would look like:

export fn kernel(ptr: *addrspace(.global) u32) callconv(.PtxKernel) void {
  ...
}

I'm also not sure if we should expose the .local address space. For Cuda at least, it's not something we should use. It's only used when the device doesn't have enough registers to compute a given kernel, and it's quite slow.

Hm, I see. The AMDGPU backend also exposes a local address space, and it seems that any alloca instruction automatically gains this address space (so in that sense it's the correct mental model for local variables). How does this work for PTX? Do we need to avoid alloca altogether?

@gwenzek gwenzek force-pushed the nvptx_addr branch 2 times, most recently from 06941f6 to 632e147 Compare February 17, 2022 21:54
@gwenzek
Contributor Author

gwenzek commented Feb 17, 2022

The AMDGPU backend also exposes a local address space, and it seems that any alloca instruction automatically gains this address space. How does this work for PTX?

PTX also has alloca, but it's considered an experimental feature: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#stack-manipulation-instructions-alloca
The returned pointer is effectively in .local space.
The code I've seen generated is always fully inlined, with no function calls and no alloca. I don't know if that's because my kernels are too simple, or if it's just common practice for the PTX format.

@gwenzek
Contributor Author

gwenzek commented Feb 20, 2022

I'm also not sure if global should be valid for anything that's not a pointer. It should in theory be possible to declare a local and shared global value, but I'm not sure whether it can be initialized at all.

So PTX can initialize .global and .constant. It also seems to be possible with AMDGPU, but it's a bit harder.
Maybe it won't be possible on all platforms, but let's be permissive for now.
.local and .shared can't be preinitialized, but I haven't found a way to prevent this in the Zig frontend, so the errors are triggered in the backend.
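A sketch of the permissive case described above: .global and .constant values carrying initializers (assumes an nvptx target; the names are illustrative, and the syntax follows Zig's addrspace clause on variable declarations):

```zig
// .global and .constant data can be preinitialized in PTX;
// .local and .shared cannot, and those errors currently surface
// in the backend rather than the Zig frontend.
var counter: u32 addrspace(.global) = 0;
const lut: [4]u8 addrspace(.constant) = .{ 1, 2, 4, 8 };
```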

Relevant PTX documentation:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#state-spaces__properties-state-spaces

@andrewrk andrewrk merged commit 628e9e6 into ziglang:master Feb 21, 2022
@gwenzek
Contributor Author

gwenzek commented Feb 21, 2022

Thanks for the insightful exchanges, Robin,
and thanks for the reviews!

Time to go back to making more examples of GPU programming in Zig :-)
