-
-
Notifications
You must be signed in to change notification settings - Fork 14.4k
Adding a new offload_args intrinsic, which only maps arguments #150683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
7794484
788a31e
7f83187
22f1c8d
58d1a17
1a22802
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1609,6 +1609,7 @@ symbols! { | |
| of, | ||
| off, | ||
| offload, | ||
| offload_args, | ||
| offset, | ||
| offset_of, | ||
| offset_of_enum, | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3384,6 +3384,67 @@ pub const fn copysignf128(x: f128, y: f128) -> f128; | |
| #[rustc_intrinsic] | ||
| pub const fn autodiff<F, G, T: crate::marker::Tuple, R>(f: F, df: G, args: T) -> R; | ||
|
|
||
| /// This intrinsic maps the given args from the Host(=CPU) to a GPU device. It then calls the given | ||
| /// function. Unlike the full `offload` intrinsic, this intrinsic expects a host function, in which | ||
| /// we will replace all usages of the given host args with their device version. This enables | ||
| /// support for various GPU libraries like `cuBLAS`, `cuDNN`, or `rocBLAS`, which *must* be called | ||
| /// from the host, but expect a mixture of host and device arguments. | ||
| /// | ||
| /// Type Parameters: | ||
| /// - `F`: The kernel to call. Must be a function item. | ||
| /// - `T`: A tuple of arguments passed to `f`. | ||
| /// - `R`: The return type of the kernel. | ||
| /// | ||
| /// Arguments: | ||
| /// - `f`: The host function to be called. | ||
| /// - `args`: A tuple of arguments, will be mapped to the gpu and forwarded to `f`. | ||
| /// | ||
| /// Example usage (pseudocode): | ||
| /// | ||
| /// ```rust,ignore (pseudocode) | ||
| /// fn kernel(A: &[f32; 6], x: &[f32; 3], y: &mut [f64; 2]) { | ||
| /// core::intrinsics::offload_args(sgemv_wrapper, (A,x,y)) | ||
| /// } | ||
| /// | ||
| /// #[cfg(target_os = "linux")] | ||
| /// extern "C" { | ||
| /// pub fn rocblas_sgemv( | ||
| /// alpha: *const f32, | ||
| /// A: *const f32, | ||
| /// x: *const f32, | ||
| /// beta: *const f32, | ||
| /// y: *mut f32, | ||
| /// ); | ||
| /// } | ||
| /// | ||
| /// #[cfg(not(target_os = "linux"))] | ||
| /// fn sgemv_wrapper(A: &[f32; 6], x: &[f32; 3], y: &mut [f64; 2]) { | ||
| /// // rocblas expects scalars to be passed as host pointers. | ||
| /// let alpha = 1.0; | ||
| /// let beta = 1.0; | ||
| /// unsafe { | ||
| /// rocblas_sgemv( | ||
| /// // Host ptr | ||
| /// &alpha as *const f32, | ||
| /// // Replaced by device ptr | ||
| /// A.as_ptr(), | ||
| /// // Replaced by device ptr | ||
| /// x.as_ptr(), | ||
| /// // Host ptr | ||
| /// &beta as *const f32, | ||
| /// // Replaced by device ptr | ||
| /// y.as_mut_ptr() | ||
| /// ); | ||
| /// } | ||
| /// } | ||
| /// ``` | ||
| /// | ||
| /// For reference, see the Clang documentation on offloading: | ||
| /// <https://clang.llvm.org/docs/OffloadingDesign.html>. | ||
| #[rustc_nounwind] | ||
| #[rustc_intrinsic] | ||
| pub const fn offload_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R; | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: docs
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah that's a blocker for reviews.^^ @rustbot author
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Details ^^ Here you go. @RalfJung I'm somewhat concerned whether people could write host code in the wrapper, which breaks if we map arguments to the GPU. I will make a follow-up PR, in which I will add support for forwarding host arguments that should not be mapped. In the case of the rocBLAS test, those would be
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know enough about this to have an opinion on that question.^^ But we can bikeshed the intrinsic as a whole a bit.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Under the hood, both create an Happy to change the intrinsic name since we expect to have a pretty macro on top of it. My motivation for Does that help?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It sounds a bit like offload_args is a lower-level primitive that could be used to implement offload... but it also seems like that's not actually how it works?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They share a lot of the implementation, that's why this PR is so small, especially once the first 3 refactor commits from a different PR are merged. If you look at the commits, you'll see that I've added a boolean host flag. If it is set to true, we directly call the host function given to the intrinsic. The compilation pipeline in this case is also shorten, and a few of our optional args are set to None, since we don't need to generate all that much info for offload. If the host boolean is set to false, we generate a
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, so... I take it that it does not make sense to implement
I see. Happy to take input some more folks on that.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I’m not that familiar with offload, so I’ll need help on the details, but this is how I understood it: There are two (and a half?) primitive operations involved:
The current On a low level, 1 and 2 really are orthogonal operations, so it would make sense to split them into two separate things. The interface of doing the copy to and back around a closure seems nice, it looks like scoped threads and similar patterns. I think we can the expose the two primitive operations as intrinsics: /// Transfer memory referenced by args back and forth
pub fn offload_copy_mem_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;
/// Call a kernel.
/// Does **not** copy memory (note: also no restriction on T)
pub fn offload_call_kernel<F>(args: T) -> R;This composes nicely to implement offload in code: pub fn offload<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R {
offload_copy_mem_args(|offloaded_args| {
offload_call_kernel(f, offloaded_args)
})
}And it gets rid of the bool argument to trigger one or the other implementation.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Replying here since it seems you posted that in the wrong thread
This sounds like you are discussing the public, user-visible API. In this PR we are discussing the underlying language primitives (intrinsics) powering that API. Those don't have the same concerns. No user should ever directly call these intrinsics. The intrinsics should be designed for
|
||
|
|
||
| /// Generates the LLVM body of a wrapper function to offload a kernel `f`. | ||
| /// | ||
| /// Type Parameters: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahere is a reference to invalid memory, as the pointer behindAis only valid on the GPU but this code runs on the CPU (also thecfg(not(target_osshould probably becfg(target_oswithout the not).It works out fine here because A is never accessed, but I think it violates Rust’s guarantees.
To be safe,
offload_argsneeds to map the type as well. I.e.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right, that's what I was mentioning above to Ralf: https://github.com/rust-lang/rust/pull/150683/changes#r2706310846 That means some safe usages can be currently unsound till the follow-up pr lands, but it's a half implemented rustc intrinsic (which end-users aren't supposed to use anyway), so I'd argue it's ok. The alternative is to make this PR bigger, but it clearly already takes a while to review, so I'd rather have that separate.
Generally, I am very willing to put in significant effort to preserve noalias for the GPU kernels, which your type mapping I think does not do. If you watch the slides from my US LLVM dev talk, they show how I intend to keep it even for mut args like
&mut [f32].There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.
But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That link doesn't work here so I am not sure what you are referring to. But I definitely didn't realize that the example contains unsound code, or else I would have made a fuzz about it. ;)
The problem with the example isn't about being safe or not, it's about having UB. Examples with UB without a big fat warning sign are not okay.
Is there a better example that avoids the UB? Like, if you made all these references into raw pointers, would that work?
Also, this affects not just
Abut alsoxandy, wouldn't it? They are all GPU pointers now?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gh being lovely again. I was referring to my comment here:
Yes,
A,x,yare GPU pointers. That was why I had asked you about only supporting externForeignFnonce the follow-up pr lands. The easiest solution for now is indeed to move this to raw pointers, till we have either of the follow-up pr's landed which make it safe.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intrinsic is unsafe, so it's entirely fine if some uses of it cause UB. But the docs for the intrinsic should be clear about what exactly is or isn't UB, or if that is yet to be determined then they should be clear about that.
Also, the example shouldn't be UB. :)