
Add Stage2 support for Nvptx target #10189

Merged (2 commits) on Feb 5, 2022

Conversation

@gwenzek (Contributor) commented Nov 20, 2021

NVPTX (NVIDIA Parallel Thread Execution) is a high-level assembly language for NVIDIA GPUs.
This PR aims to allow Stage 2 to generate this format, and therefore to enable GPU programming in Zig.
This is a follow-up to issue #10064 and contains work presented at the last Zig Rush.

Overview

To generate the .ptx file we leverage the Nvptx LLVM backend.
The main thing we need to do is ask LLVM to produce assembly instead of a binary.
That's why we have a custom Linker that passes most function calls straight to LLVM.
We only intercept flushModule and modify the Compilation object to ask for assembly instead of a binary.

In NVPTX, functions that can be launched from the host (CPU) are called kernels and need to be declared as such.
For this, LLVM expects the function to use the PTX_Kernel calling convention.
Therefore I added the .PtxKernel calling convention to Zig (both Stage1 and Stage2).

The main trick here is that kernels need to be called from CPU code, so you want to be able to access the signature of the kernel from the CPU code. That's why I convert the calling convention .PtxKernel into .Fast when compiling for another target.
The function itself will still probably be impossible to compile, but at least the CPU code can import it.
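
For illustration, device code using this calling convention looks roughly like the sketch below (the function and its body are made up for this example; the target triple matches the sample command later in this thread):

// Hypothetical kernel source, e.g. kernel.zig; built with something like:
//   zig build-obj kernel.zig -target nvptx64-cuda -O ReleaseSafe
export fn add(a: [*]const f32, b: [*]const f32, out: [*]f32, n: u32) callconv(.PtxKernel) void {
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        out[i] = a[i] + b[i];
    }
}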

Help needed

I was only able to implement the CC conversion for stage1, using get_llvm_cc.
But in Stage2, implementing the equivalent in toLlvmCallConv doesn't work as expected because
it doesn't change the .cc attribute of the function. Is this possible? Is there a better place to do it?
Right now I need to compile device code with Stage2 and host code with Stage1.

Demo

You can find sample code that leverages this PR here:

Kernel (compiled with stage2 for Nvptx target): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure_kernel.zig#L5

Caller code (compiled with stage1 for the host): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure.zig#L44

Note that PTX has a lot of intrinsics, but for now I didn't need to add them to the Zig language, because I can use Zig inline assembly to generate the corresponding PTX code directly.
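
For instance, a thread-index helper can read the PTX special register %tid.x through inline assembly. This is only a rough sketch of the idea, not code from this PR (the exact template and escaping of the special register may differ between Zig versions):

pub fn threadIdX() u32 {
    // Move the x component of the thread index into the result register.
    return asm volatile ("mov.u32 %[ret], %tid.x;"
        : [ret] "=r" (-> u32)
    );
}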

Tagging @Snektron, who helped me previously on this topic.

@Snektron (Collaborator) commented Nov 22, 2021

The main trick here is that kernels need to be called from CPU code, so you want to be able to access the signature of the kernel from the CPU code. That's why I convert the calling convention .PtxKernel into .Fast when compiling for another target.
The function itself will still probably be impossible to compile, but at least the CPU code can import it.

I don't really think that this is the right approach - a calling convention valid for only one target should not be silently translated into another calling convention for a different target. It seems to me from your examples that this is not actually required for any nvptx functionality, but is rather used for generating an interface and for calling a debug-version of the kernel compiled to CPU code.

Would this also be resolved by implementing something as follows?

const target = @import("builtin").target;
const CallingConvention = @import("std").builtin.CallingConvention;

const kernel_cc: CallingConvention = if (target.cpu.arch == .nvptx) .PtxKernel else .Unspecified;

pub export fn rgba_to_greyscale(rgbaImage: []u8, greyImage: []u8) callconv(kernel_cc) void {
  ...
}

This still leaves the problem where you want to access the kernel prototype without emitting the cpu-kernel in the binary, or actually having the compiler analyze it at all. One solution might be to make the compiler not analyze a function's body if only the prototype is required (and the prototype does not depend on the function's body, such as with generics or inferred error sets). Kernels can then be conditionally exported using @export, depending on the target. I do not know how that currently works though.
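
For illustration only (a hedged sketch, not code from this PR; the function name and signature are made up), such a conditional export could look roughly like:

const builtin = @import("builtin");

fn rgbaToGreyscale(rgba: [*]const u8, grey: [*]u8, n: u32) callconv(.PtxKernel) void {
    // Kernel body omitted in this sketch.
    _ = rgba;
    _ = grey;
    _ = n;
}

comptime {
    if (builtin.target.cpu.arch == .nvptx64) {
        @export(rgbaToGreyscale, .{ .name = "rgba_to_greyscale" });
    }
}

The prototype remains importable from host code, while the export only happens when compiling for the device.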

Resolved review threads: src/Compilation.zig, src/stage1/analyze.cpp, src/link/NvPtx.zig
@gwenzek (Contributor, Author) commented Nov 22, 2021

I don't really think that this is the right approach. [...] It seems to me from your examples that this is not actually required for any nvptx functionality, but is rather used for generating an interface and for calling a debug-version of the kernel compiled to CPU code.

Actually, one of my intermediate versions was using const kernel_cc: CallingConvention = if (target.cpu.arch == .nvptx) .PtxKernel else .Unspecified; on the client side.
But the conversion became mandatory because you'll need some CPU code to start the kernel,
and for that you need to know the kernel signature, so you need to import it.
And since compiling a callconv(.PtxKernel) function for any target besides Nvptx doesn't make sense, I thought it would be better to import it cleanly instead of yielding a compile error.

So do you think this could be a valid use case for having a calling convention that has a different meaning on different targets? (.Vectorcall seems to be a precedent.)

Otherwise, I may not even need a calling convention at all. I just want to mark some of the functions as "entry_point" in the .ptx.
LLVM generates this only for exported functions, but it forces you to choose a calling convention. So I had to create a new one to still be able to pass Zig objects. If there is another way to annotate a function as an "entry_point" in Zig, I may not have to specify the CC or the export.

@gwenzek (Contributor, Author) commented Dec 1, 2021

I probably wasn't very clear in my last post, and I've had more time to think about this.

First, what do we mean by a "Ptx" target?

  1. we want to generate a .ptx file that can be called by any language using the cuda C API
  2. we want to generate an adhoc CPU program that can interact with the GPU

For instance, Clang went with option 2. The generation of the .ptx is only an intermediate step in compiling a larger CPU program with GPU acceleration. The .ptx is embedded in the binary and isn't visible to the user.
This approach has the benefit of not having to think about the ABI: you only need the GPU code and the CPU code to agree on memory layout. The downside is that there is more work for the compiler; Clang actually runs two compilation passes. Some of the code is compiled both for the CPU and the GPU. This allows checking that the GPU code is launched with the correct parameter types.

My original approach was more like option 1: generate a .ptx file as a target, and use @embedFile to include the .ptx in another compilation unit. And I think this is the approach you're favoring, @Snektron, IIUC.

But since I want to have a type-safe way of launching the kernels, I also want to @import the Zig code targeted for the device. And I also want to be able to pass Zig structs between CPU and GPU, so my goal is to provide a user experience similar to Clang's (option 2).
The good thing is that my implementation only needs minimal changes to the Zig compiler (this PR) to allow generating the .ptx, plus some library code which doesn't need to be part of Zig.

So to get back to the two questions raised by @Snektron:

should we allow the .PtxKernel calling convention in code compiled for x86?

  • For me, the CPU code always needs to know the signatures of the GPU functions. So we will always want to "import" the device code into the CPU compilation unit. (The alternative is to manually declare the signature of each kernel.) But it raises the question of how to handle functions with this calling convention on the CPU.

should we allow .PtxKernel functions to use Zig objects?

  • For me, yes. For example, PyTorch has a lot of GPU functions that directly take its Tensor type as input, which provides a convenient way to read from and write to them. Having to go back to raw pointers would be a serious drawback.

  • If users want a .ptx compatible with other languages, they can just restrict their functions to C types or use extern struct, which has a guaranteed layout (see the sketch below). We can introduce a .PtxKernelC calling convention if we want to be more explicit.
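
A rough sketch of that restriction (hypothetical names, not code from this PR): an extern struct has a guaranteed, C-compatible layout, so a kernel restricted to such types stays callable from other languages through the CUDA driver API.

const Pixel = extern struct { r: u8, g: u8, b: u8, a: u8 };

export fn invert(pixels: [*]Pixel, n: u32) callconv(.PtxKernel) void {
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        pixels[i].r = 255 - pixels[i].r;
        pixels[i].g = 255 - pixels[i].g;
        pixels[i].b = 255 - pixels[i].b;
    }
}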

@gwenzek (Contributor, Author) commented Jan 18, 2022

I don't understand what's going on there. I can't reproduce the CI issues locally.

sample command:

/home/guw/github/zig/stage2/bin/zig build-obj cuda_kernel.zig -target nvptx64-cuda -O ReleaseSafe
this will create a kernel.ptx

Commits:
expose PtxKernel call convention from LLVM
kernels are `export fn f() callconv(.PtxKernel)`
@gwenzek (Contributor, Author) commented Jan 23, 2022

Woohoo, the tests pass!

The previous version had inadvertently introduced a "break" statement in a switch statement that was only triggered for other archs. Similarly, I needed to put all usage of LLVMObject behind a comptime guard to keep Zig from analysing it when not building with LLVM. (Thanks Meghan for pointing me to that.)
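
A minimal, self-contained sketch of that comptime-guard pattern (illustrative names, not the actual compiler code):

const have_llvm = false; // stands in for the real build_options.have_llvm

const LlvmObject = struct {
    fn flushModule(self: *LlvmObject) void {
        _ = self;
        // LLVM-specific work would happen here.
    }
};

pub fn flush(llvm_object: ?*LlvmObject) void {
    if (have_llvm) {
        // Because the condition is comptime-known, this branch is not analyzed
        // when have_llvm is false, so the LLVM-only code never has to compile.
        if (llvm_object) |obj| obj.flushModule();
    }
}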

To come back to the design questions: in the stage 2 meeting we agreed to:

  • restrict PtxKernel call conv to the nvptx targets,
  • pass arbitrary Zig structs to those kernels

@Vexu merged commit 0e1afb4 into ziglang:master on Feb 5, 2022
@gwenzek (Contributor, Author) commented Feb 5, 2022

Thanks for the review!
