
Add Stage2 support for Nvptx target #10189

Merged (2 commits) on Feb 5, 2022

Conversation

@gwenzek (Contributor) commented Nov 20, 2021

NVPTX (NVIDIA Parallel Thread Execution) is a high-level assembly language for NVIDIA GPUs.
This PR aims to allow Stage 2 to generate this format, and therefore to enable GPU programming in Zig.
This is a follow-up to issue #10064 and contains work presented at the last Zig Rush.

Overview

To generate the .ptx file we leverage the Nvptx LLVM backend.
The main thing we need to do is ask LLVM to produce assembly instead of a binary.
That's why we have a custom Linker that passes most function calls straight to LLVM.
We only intercept flushModule and modify the Compilation object to ask for assembly instead of a binary.

In NVPTX, functions that can be launched from the host (CPU) are called kernels and need to be declared as such.
For this, LLVM expects the function to use the PTX_Kernel calling convention.
Therefore I added the .PtxKernel calling convention to Zig (both Stage1 and Stage2).

The main trick here is that kernels need to be called from CPU code, so you want to be able to access the signature of the kernel from the CPU code. That's why I convert the calling convention .PtxKernel into .Fast when compiling for another target.
The function itself will still probably be impossible to compile, but at least the CPU code can import it.
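
For illustration, device code using this calling convention looks roughly like the sketch below (the function and its body are made up for this example; the target triple matches the sample command later in this thread):

// Hypothetical kernel source, e.g. kernel.zig; built with something like:
//   zig build-obj kernel.zig -target nvptx64-cuda -O ReleaseSafe
export fn add(a: [*]const f32, b: [*]const f32, out: [*]f32, n: u32) callconv(.PtxKernel) void {
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        out[i] = a[i] + b[i];
    }
}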

Help needed

I was only able to implement the CC conversion for stage1, using get_llvm_cc.
But in Stage2, implementing the equivalent in toLlvmCallConv doesn't work as expected because
it doesn't change the .cc attribute of the function. Is this possible? Is there a better place to do it?
Right now I need to compile device code with Stage2 and host code with Stage1.

Demo

You can find sample code that leverages this PR here:

Kernel (compiled with stage2 for Nvptx target): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure_kernel.zig#L5

Caller code (compiled with stage1 for the host): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure.zig#L44

Note that PTX has a lot of intrinsics, but for now I didn't need to add them to the Zig language, because I can use Zig inline assembly to generate the corresponding PTX code directly.
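
For instance, a thread-index helper can read the PTX special register %tid.x through inline assembly. This is only a rough sketch of the idea, not code from this PR (the exact template and escaping of the special register may differ between Zig versions):

pub fn threadIdX() u32 {
    // Move the x component of the thread index into the result register.
    return asm volatile ("mov.u32 %[ret], %tid.x;"
        : [ret] "=r" (-> u32)
    );
}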

Tagging @Snektron, who helped me previously on this topic.

@Snektron (Collaborator) commented Nov 22, 2021

The main trick here is that kernels need to be called from CPU code, so you want to be able to access the signature of the kernel from the CPU code. That's why I convert the calling convention .PtxKernel into .Fast when compiling for another target.
The function itself will still probably be impossible to compile, but at least the CPU code can import it.

I don't really think that this is the right approach - a calling convention valid for only one target should not be silently translated into another calling convention for a different target. It seems to me from your examples that this is not actually required for any nvptx functionality, but is rather used for generating an interface and for calling a debug-version of the kernel compiled to CPU code.

Would this also be resolved by implementing something as follows?

const target = @import("builtin").target;
const CallingConvention = @import("std").builtin.CallingConvention;

const kernel_cc: CallingConvention = if (target.cpu.arch == .nvptx) .PtxKernel else .Unspecified;

pub export fn rgba_to_greyscale(rgbaImage: []u8, greyImage: []u8) callconv(kernel_cc) void {
  ...
}

This still leaves the problem where you want to access the kernel prototype without emitting the cpu-kernel in the binary, or actually having the compiler analyze it at all. One solution might be to make the compiler not analyze a function's body if only the prototype is required (and the prototype does not depend on the function's body, such as with generics or inferred error sets). Kernels can then be conditionally exported using @export, depending on the target. I do not know how that currently works though.
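
For illustration only (a hedged sketch, not code from this PR; the function name and signature are made up), such a conditional export could look roughly like:

const builtin = @import("builtin");

fn rgbaToGreyscale(rgba: [*]const u8, grey: [*]u8, n: u32) callconv(.PtxKernel) void {
    // Kernel body omitted in this sketch.
    _ = rgba;
    _ = grey;
    _ = n;
}

comptime {
    if (builtin.target.cpu.arch == .nvptx64) {
        @export(rgbaToGreyscale, .{ .name = "rgba_to_greyscale" });
    }
}

The prototype remains importable from host code, while the export only happens when compiling for the device.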

Resolved review threads: src/Compilation.zig, src/stage1/analyze.cpp, src/link/NvPtx.zig
@gwenzek (Contributor, Author) commented Nov 22, 2021

I don't really think that this is the right approach. [...] It seems to me from your examples that this is not actually required for any nvptx functionality, but is rather used for generating an interface and for calling a debug-version of the kernel compiled to CPU code.

Actually, one of my intermediate versions was using const kernel_cc: CallingConvention = if (target.cpu.arch == .nvptx) .PtxKernel else .Unspecified; on the client side.
But the conversion became mandatory because you'll need some CPU code to start the kernel,
and for that you need to know the kernel signature, so you need to import it.
And since compiling a callconv(.PtxKernel) function for any target besides Nvptx doesn't make sense, I thought it would be better to import it cleanly instead of yielding a compile error.

So do you think this could be a valid use case for having a calling convention that has a different meaning on different targets? (.Vectorcall seems to be a precedent.)

Otherwise, I may not even need a calling convention at all. I just want to mark some of the functions as "entry_point" in the .ptx.
LLVM generates this only for exported functions, but it forces you to choose a calling convention. So I had to create a new one to still be able to pass Zig objects. If there is another way to annotate a function as an "entry_point" in Zig, I may not have to specify the CC or the export.

@gwenzek (Contributor, Author) commented Dec 1, 2021

I probably wasn't very clear in my last post, and I've had more time to think about this.

First, what do we mean by a "Ptx" target?

  1. we want to generate a .ptx file that can be called by any language using the cuda C API
  2. we want to generate an adhoc CPU program that can interact with the GPU

For instance, Clang went with option 2. The generation of the .ptx is only an intermediate step in compiling a larger CPU program with GPU acceleration. The .ptx is embedded in the binary and isn't visible to the user.
This approach has the benefit of not having to think about the ABI: you only need the GPU code and the CPU code to agree on memory layout. The downside is that there is more work for the compiler; Clang actually runs two compilation passes. Some of the code is compiled both for the CPU and the GPU. This allows checking that the GPU code is launched with the correct parameter types.

My original approach was more like option 1: generate a .ptx file as a target, and use @embedFile to include the .ptx in another compilation unit. And I think this is the approach you're favoring, @Snektron, IIUC.

But since I want to have a type-safe way of launching the kernels, I also want to @import the Zig code targeted for the device. And I also want to be able to pass Zig structs between CPU and GPU, so my goal is to provide a user experience similar to Clang's (option 2).
The good thing is that my implementation only needs minimal changes to the Zig compiler (this PR) to allow generating the .ptx, plus some library code which doesn't need to be part of Zig.

So to get back to the two questions raised by @Snektron:

should we allow the .PtxKernel calling convention in code compiled for x86?

  • For me, the CPU code always needs to know the signatures of the GPU functions. So we will always want to "import" the device code into the CPU compilation unit. (The alternative is to manually declare the signature of each kernel.) But it raises the question of how to handle functions with this calling convention on the CPU.

should we allow .PtxKernel functions to use Zig objects?

  • For me, yes. For example, PyTorch has a lot of GPU functions that directly take its Tensor type as input, which provides a convenient way to read from and write to them. Having to go back to raw pointers would be a serious drawback.

  • If users want a .ptx compatible with other languages, they can just restrict their functions to C types or use extern struct, which has a guaranteed layout (see the sketch below). We can introduce a .PtxKernelC calling convention if we want to be more explicit.
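
A rough sketch of that restriction (hypothetical names, not code from this PR): an extern struct has a guaranteed, C-compatible layout, so a kernel restricted to such types stays callable from other languages through the CUDA driver API.

const Pixel = extern struct { r: u8, g: u8, b: u8, a: u8 };

export fn invert(pixels: [*]Pixel, n: u32) callconv(.PtxKernel) void {
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        pixels[i].r = 255 - pixels[i].r;
        pixels[i].g = 255 - pixels[i].g;
        pixels[i].b = 255 - pixels[i].b;
    }
}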

@gwenzek (Contributor, Author) commented Jan 18, 2022

I don't understand what's going on there. I can't reproduce the CI issues locally.

sample command:

/home/guw/github/zig/stage2/bin/zig build-obj cuda_kernel.zig -target nvptx64-cuda -O ReleaseSafe
this will create a kernel.ptx

Commits:
expose PtxKernel call convention from LLVM
kernels are `export fn f() callconv(.PtxKernel)`
@gwenzek (Contributor, Author) commented Jan 23, 2022

Woohoo, the tests pass!

The previous version had inadvertently introduced a "break" statement in a switch statement that was only triggered for other archs. Similarly, I needed to put all usage of LLVMObject behind a comptime guard to keep Zig from analysing it when not building with LLVM. (Thanks Meghan for pointing me to that.)
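
A minimal, self-contained sketch of that comptime-guard pattern (illustrative names, not the actual compiler code):

const have_llvm = false; // stands in for the real build_options.have_llvm

const LlvmObject = struct {
    fn flushModule(self: *LlvmObject) void {
        _ = self;
        // LLVM-specific work would happen here.
    }
};

pub fn flush(llvm_object: ?*LlvmObject) void {
    if (have_llvm) {
        // Because the condition is comptime-known, this branch is not analyzed
        // when have_llvm is false, so the LLVM-only code never has to compile.
        if (llvm_object) |obj| obj.flushModule();
    }
}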

To come back to the design questions: in the stage 2 meeting we agreed to:

  • restrict PtxKernel call conv to the nvptx targets,
  • pass arbitrary Zig structs to those kernels

@Vexu merged commit 0e1afb4 into ziglang:master on Feb 5, 2022
@gwenzek (Contributor, Author) commented Feb 5, 2022

Thanks for the review!
