Add Stage2 support for Nvptx target #10189
Conversation
I don't really think that this is the right approach - a calling convention valid for only one target should not be silently translated into another calling convention for a different target. It seems to me from your examples that this is not actually required for any nvptx functionality, but is rather used for generating an interface and for calling a debug version of the kernel compiled to CPU code. Would this also be resolved by implementing something as follows?

```zig
const kernel_cc: CallingConvention = if (target.cpu.arch == .nvptx) .PtxKernel else .Unspecified;

pub export fn rgba_to_greyscale(rgbaImage: []u8, greyImage: []u8) callconv(kernel_cc) void {
    ...
}
```

This still leaves the problem where you want to access the kernel prototype without emitting the CPU kernel in the binary, or having the compiler analyze it at all. One solution might be to make the compiler not analyze a function's body if only the prototype is required (and the prototype does not depend on the function's body, as with generics or inferred error sets). Kernels can then be conditionally exported using `@export`.
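A minimal sketch of that conditional-export idea (the exact `@export` signature and builtin fields vary across Zig versions; 0.9-era style assumed):

```zig
const std = @import("std");
const builtin = @import("builtin");

const is_device = builtin.cpu.arch == .nvptx64;
const kernel_cc: std.builtin.CallingConvention = if (is_device) .PtxKernel else .Unspecified;

fn rgba_to_greyscale(rgbaImage: []u8, greyImage: []u8) callconv(kernel_cc) void {
    _ = rgbaImage;
    _ = greyImage;
    // ... device code ...
}

comptime {
    // Only emit the symbol on the nvptx target; host builds can still
    // reference the prototype via @TypeOf(rgba_to_greyscale).
    if (is_device) @export(rgba_to_greyscale, .{ .name = "rgba_to_greyscale" });
}
```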
Actually, one of my intermediate versions was using a similar approach. So do you think it could be a valid use case to have a calling convention that has a different meaning on different targets? (`.Vectorcall` seems to be a precedent.) Otherwise I may not even need a calling convention at all; I just want to mark some of the functions as entry points in the generated `.ptx`.
I probably wasn't really clear in my last post, and I've had more time to think about this. First, what do we mean by a "Ptx" target? It can either be a standalone target whose output is a `.ptx` file that the host program loads explicitly, or only the device half of a mixed host/device compilation, where the compiler hides the `.ptx` from the user.

For instance, Clang went with option 2: the generation of the `.ptx` is an internal step of compiling CUDA code. My original approach was more like option 1: generate a standalone `.ptx` file. But since I want to have a type-safe way of launching the kernels, I also want to be able to access the kernel signatures from the host code. So to get back to the two questions raised by @Snektron:

- should we allow the `.PtxKernel` calling convention in code compiled for x86?
- should we allow `.PtxKernel` functions to use Zig objects?
I don't understand what's going on there. I can't reproduce the CI issues locally.
Sample command: `/home/guw/github/zig/stage2/bin/zig build-obj cuda_kernel.zig -target nvptx64-cuda -O ReleaseSafe`; this will create a `kernel.ptx`. This exposes the `PtxKernel` calling convention from LLVM; kernels are `export fn f() callconv(.PtxKernel)`.
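For reference, a minimal source file the command above could compile (only the declaration itself is taken from the notes above):

```zig
// cuda_kernel.zig: the smallest possible kernel for the nvptx64-cuda target.
export fn f() callconv(.PtxKernel) void {}
```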
Woohoo, tests pass! The previous version had surreptitiously introduced a `break` statement in a switch that only triggered for other archs. Similarly, I needed to put all usage of LLVMObject behind a comptime guard to avoid Zig analyzing it when not building with LLVM. (Thanks Meghan for pointing me to that.) To come back to the design questions, in the stage 2 meeting we agreed to:
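As for the comptime guard mentioned above, a minimal sketch of the pattern (the `have_llvm` build option exists in the compiler's build options; the surrounding function and fields are illustrative):

```zig
const build_options = @import("build_options");

pub fn deinit(self: *NvPtx) void {
    // The condition is comptime-known, so when LLVM support is compiled out
    // Zig never analyzes the LLVM-touching branch below.
    if (build_options.have_llvm) {
        self.llvm_object.destroy(self.base.allocator);
    }
}
```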
Thanks for the review!
NVPTX, aka NVIDIA Parallel Thread Execution, is a high-level assembly language for NVIDIA GPUs.
This PR aims to allow Stage 2 to generate this format, and therefore enable GPU programming in Zig.
This is a follow-up to issue #10064 and contains work presented at the last Zig Rush.
Overview
To generate the `.ptx` file we leverage the NVPTX LLVM backend.
The main thing we need to do is ask LLVM to produce the assembly instead of the binary.
That's why we have a custom linker that just passes most function calls through to LLVM; we only intercept `flushModule` and modify the `Compilation` object to ask for assembly instead of a binary.
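A hedged sketch of that delegation pattern (signatures modeled on the self-hosted linker interface of the time; field names like `emit_asm` are assumptions, not the PR's actual `NvPtx` linker code):

```zig
// Most linker callbacks simply forward to the wrapped LLVM object.
pub fn updateFunc(self: *NvPtx, module: *Module, func: *Module.Fn, air: Air, liveness: Liveness) !void {
    return self.llvm_object.updateFunc(module, func, air, liveness);
}

// flushModule is the one place where we step in: redirect the requested
// binary output to assembly so LLVM's NVPTX backend emits .ptx text.
pub fn flushModule(self: *NvPtx, comp: *Compilation) !void {
    comp.emit_asm = comp.emit_bin; // assumed field names on Compilation
    comp.emit_bin = null;
    return self.llvm_object.flushModule(comp);
}
```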
In NVPTX, functions that can be launched from the host (CPU) are called kernels and need to be declared as such. For this, LLVM expects the function to use the `PTX_Kernel` calling convention. Therefore I added the `.PtxKernel` calling convention to Zig (both Stage1 and Stage2).

The main trick here is that kernels need to be called from CPU code, so you want to be able to access the signature of the kernel from the CPU code. That's why I convert the `.PtxKernel` calling convention into `.Fast` when compiling for another target. The function itself will still probably be impossible to compile, but at least the CPU code can import it.
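A small sketch of what the `.Fast` fallback buys on the host side (file and kernel names follow the demo linked below):

```zig
// Host code, compiled for the CPU. Importing the kernel file is possible
// because .PtxKernel is translated to .Fast on non-nvptx targets.
const kernels = @import("hw1_pure_kernel.zig");

// The prototype is visible at comptime, so a launcher can type-check
// its arguments against the kernel's actual signature.
const KernelFn = @TypeOf(kernels.rgba_to_greyscale);
```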
Help needed
I was only able to implement the CC conversion for Stage1, using `get_llvm_cc`. But in Stage2, implementing the equivalent in `toLlvmCallConv` doesn't work as expected, because it doesn't change the `.cc` attribute of the function. Is that possible? Or is there a better place to do it? Right now I need to compile device code with Stage2 and host code with Stage1.
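For concreteness, the stage2 conversion being attempted would look roughly like this in `toLlvmCallConv` (enum values mirror LLVM's calling conventions; this is a sketch, not the actual compiler source):

```zig
fn toLlvmCallConv(cc: std.builtin.CallingConvention, target: std.Target) llvm.CallConv {
    return switch (cc) {
        // On nvptx the CC maps to LLVM's PTX_Kernel; on any other target it
        // degrades to Fast so host code can still import the prototype.
        .PtxKernel => if (target.cpu.arch == .nvptx or target.cpu.arch == .nvptx64)
            .PTX_Kernel
        else
            .Fast,
        else => .C, // remaining cases elided
    };
}
```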
Demo
You can find sample code that leverages this PR here:
Kernel (compiled with stage2 for the Nvptx target): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure_kernel.zig#L5
Caller code (compiled with stage1 for the host): https://github.com/gwenzek/cudaz/blob/4dd5a6b2eef966afa11135fbe286e51c1fe5056d/CS344/src/hw1_pure.zig#L44
Note that PTX has a lot of intrinsics, but for now I didn't need to add them to the Zig language, because I can use Zig inline assembly to generate the corresponding PTX code directly.
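For example, reading the thread index can be done with inline assembly instead of a builtin (a sketch in the spirit of the demo repo; the helper name is mine):

```zig
// Returns the x component of the CUDA thread index by emitting the
// corresponding PTX instruction directly; no compiler intrinsic needed.
inline fn threadIdX() u32 {
    return asm volatile ("mov.u32 %[r], %tid.x;"
        : [r] "=r" (-> u32),
    );
}
```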
Tagging @Snektron, who helped me previously on this topic.