Global and device kernels are unsound #11
Comments
In RustaCUDA I have a marker trait (DeviceCopy) for types which are safe to copy to the GPU, along with a custom derive macro that checks that DeviceCopy is implemented on all fields of a type. It's implemented by default on all of the primitive numeric types, but not on references (because that would be unsound) or raw pointers (because they could be pointing to host memory). Instead, the user has to use some other function to get a DevicePointer, which contains the
All of the device allocation structures require that their contents implement DeviceCopy. I haven't implemented kernel launching yet, but I was planning to have the same restriction for parameters.
Perhaps a similar approach could be taken here, at least for kernel parameters. This doesn't help limit unsafety within a kernel, though.
I chose not to have DeviceCopy be a subtrait of Copy. It's quite likely that users will want to implement DeviceCopy for large and complex structures, where it would be inefficient to pass-by-copy.
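The marker-trait idea described in this comment can be sketched in a few lines. This is a minimal illustration, not RustaCUDA's actual code: the trait name DeviceCopy and the type name DevicePointer come from the comment, while the layout of DevicePointer, the constructor from_raw_addr, the Particle struct and the kernel_param_size helper are all assumptions made up for the example.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Marker for types whose bits may be copied to the device.
/// It is `unsafe` to implement because the implementor, not the compiler,
/// vouches that the type holds no host references or host pointers.
pub unsafe trait DeviceCopy {}

// Primitive numeric types can be copied bit-for-bit.
macro_rules! impl_device_copy {
    ($($t:ty),*) => { $(unsafe impl DeviceCopy for $t {})* };
}
impl_device_copy!(u8, u16, u32, u64, usize, i8, i16, i32, i64, isize, f32, f64);

// Arrays of device-copyable elements are device-copyable.
unsafe impl<T: DeviceCopy, const N: usize> DeviceCopy for [T; N] {}

/// Hypothetical stand-in for a device pointer: it stores an address that is
/// only meaningful on the device, so it cannot be dereferenced on the host.
pub struct DevicePointer<T> {
    addr: u64,
    _marker: PhantomData<*mut T>,
}

impl<T> DevicePointer<T> {
    /// Hypothetical constructor; the caller promises `addr` is a device address.
    pub unsafe fn from_raw_addr(addr: u64) -> Self {
        DevicePointer { addr, _marker: PhantomData }
    }

    pub fn addr(&self) -> u64 {
        self.addr
    }
}

unsafe impl<T: DeviceCopy> DeviceCopy for DevicePointer<T> {}

/// A user-defined struct. A derive macro would check each field's type;
/// here the impl is written by hand after checking the fields manually.
pub struct Particle {
    pub pos: [f32; 3],
    pub id: u32,
}
unsafe impl DeviceCopy for Particle {}

/// Hypothetical helper standing in for "accept this value as a kernel
/// parameter": the bound rejects `&T`, `&mut T` and raw pointers at compile time.
pub fn kernel_param_size<T: DeviceCopy>(_param: &T) -> usize {
    size_of::<T>()
}
```

With these definitions, kernel_param_size(&1.0f32) and kernel_param_size(&Particle { pos: [0.0; 3], id: 7 }) compile, while passing a &u32 or a *mut u32 as the parameter value is rejected because neither type implements DeviceCopy.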
The function parameters to a kernel can be anything that nvcc successfully compiles, and which can be copied as bits; see the docs for So, for
@gnzlbg Also re: your An example:
fn device(a: &mut i32) { *a += 1; }
pub extern "ptx-kernel" fn kernel() { // not unsafe
let shm_size: usize = 32;
let shm_ptr: *mut i32 = /* somehow, get shared memory... */;
let shm = unsafe { CuSharedMemSlice::from_raw(shm_ptr, shm_size) }; // OK
// fn split_tid_x<'a>(&'a self) -> &'a T { ... }
// fn split_tid_x_mut<'a>(&'a self) -> &'a mut T { ... }
// internally the impl of split_tid does some pointer weakening coercions like in your last example
let x: &mut i32 = shm.split_tid_x_mut(); // OK
device(x); // OK
}
However, the issue of shared memory then seems to be a matter of higher-level APIs, rather than one of soundness, unlike the issue of parameters of
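The "one disjoint &mut per thread" idea behind split_tid_x_mut can be tried out on the host. The sketch below is an analogy only, under the assumption that host threads stand in for device threads and an ordinary slice stands in for shared memory; run_block is a made-up name, and no CuSharedMemSlice type is implemented here.

```rust
use std::thread;

/// The device function from the comment above.
fn device(a: &mut i32) {
    *a += 1;
}

/// Host-side analog of a kernel over a shared slice: every "thread" receives
/// a `&mut` to exactly one element, so no two threads alias the same data.
fn run_block(shm: &mut [i32]) {
    thread::scope(|s| {
        // `iter_mut` hands out one disjoint `&mut i32` per element, which is
        // the guarantee a `split_tid_x_mut`-style API would have to provide.
        for x in shm.iter_mut() {
            s.spawn(move || device(x));
        }
    });
}
```

Calling run_block on a vector holding 0, 10, 20, 30 leaves it holding 1, 11, 21, 31. The borrow checker accepts this because the split happens before the threads start; a device-side API would need to derive the same split from the thread index.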
Yeah, we probably want something like this, but people often like to do more "complicated" things than this, like touching multiple disjoint, not necessarily contiguous elements of an array from each of the threads. (EDIT: I always thought that providing something like Matlab's
Enabling those use cases while at the same time rejecting code that has undefined behavior is the tricky part.
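One way to picture the non-contiguous case is a strided split, where thread t owns elements t, t + n, t + 2n and so on. The following host-side sketch is an assumption about what such an API could look like, not something proposed in the thread; split_strided and add_tid_strided are hypothetical names, and host threads again stand in for device threads.

```rust
use std::thread;

/// Splits `data` into `n_threads` groups of disjoint mutable references.
/// Group `t` receives the elements at indices t, t + n_threads, t + 2 * n_threads, ...
fn split_strided<T>(data: &mut [T], n_threads: usize) -> Vec<Vec<&mut T>> {
    assert!(n_threads > 0, "need at least one thread");
    let mut groups: Vec<Vec<&mut T>> = (0..n_threads).map(|_| Vec::new()).collect();
    for (i, elem) in data.iter_mut().enumerate() {
        groups[i % n_threads].push(elem);
    }
    groups
}

/// Each "thread" adds its own index to every element it owns.
fn add_tid_strided(data: &mut [u32], n_threads: usize) {
    let groups = split_strided(data, n_threads);
    thread::scope(|s| {
        for (tid, group) in groups.into_iter().enumerate() {
            s.spawn(move || {
                for x in group {
                    *x += tid as u32;
                }
            });
        }
    });
}
```

The split is computed up front and allocates, which would not be acceptable inside a real kernel; the point is only that the access pattern itself can be expressed without aliasing `&mut` references.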
That makes sense. Sounds to me then there are at least a few separate issues all under the "soundness" umbrella:
extern "ptx-kernel" {
static shared: [u32; 0];
}
pub unsafe extern "ptx-kernel" fn foo() {
let base: *mut u32 = shared.as_ptr() as *mut u32; // is this UB? but it gives us what we want
let p: *mut u32 = base.offset(nvptx::_thread_idx_x() as isize);
*p = 42;
}
will compile to the following PTX assembly:
The PTX output above is almost correct, except where there are these references to "global" memory that should instead be "shared" memory. One problem is that Rust isn't aware of the address spaces NVPTX requires, which is necessary for supporting
Motivation
Launching NVPTX global kernels is unsafe - they are unsafe fn, and this requires the Rust program that launches them to use an unsafe block. For most examples below, this program has undefined behavior because the unsafe code it contains is incorrect.
However, most of the kernels below are never correct, so it would be very helpful for the compiler to reject them, or to at least warn about their issues.
Examples
These are some examples of code that's accepted today. Most of these examples are always UB.
Launching these global kernels is always undefined behavior: these kernels are spawned in multiple threads of execution, each containing a copy of the same &mut T to the same data. On the other hand: global kernels that are called from other kernels are executed in the same thread of execution. Device kernels as well:
We don't support static and dynamic shared arrays in kernels yet, but NVPTX does, and we'd like to support them at some point. These arrays are shared across all threads of execution without any synchronization:
Note that there are two issues with these. When a device function creates them, these are shared across all execution threads of that device function. That is, taking a &mut T to the whole array creates many copies, one on each execution thread, of the same &mut T to the exact same data. This is already undefined behavior, and can be used to introduce data-races.
We might want to support synchronized (e.g. atomic) versions of the shared memory arrays as well. While they might avoid the data-race, taking a &mut T to the array still creates multiple &mut T to the same data, which is undefined behavior. That is, just adding synchronization does not solve the problem (this is also not desirable for performance).
We'd like to accept this code:
but note that IndexMut::index_mut(&mut self) would create multiple &mut T to the shared array, one on each thread, which results in UB as well. The following example should work, but is not very nice:
Questions
What general approaches do we have to make these examples sound?
ptx-kernel) that are not unsafe fn
Should we also pursue an approach that lints on "improper global/device kernel arguments"? E.g.
Sync - probably too hard a constraint, since it does not allow raw pointers; also, we technically only require Sync for mutable references to shared memory. Mutable references that do not point to shared memory are fine.
SendGPU or similar (DeviceCopy as @bheisler put it below), since these arguments need to be sendable from the Host to the Device, and Copyable to the multiple execution threads of the device.
It might get tricky to propagate these lints through generic code, e.g., when calling Index::index as a device function. Also, Sync prevents raw pointers. A simple wrapper solves this, but we might want to allow raw pointers for convenience here.
What do we do about shared memory device arrays? Taking a &mut to them is always undefined behavior, which makes them extremely easy to use incorrectly, and very hard and unergonomic to use correctly.
Are there any other ways of tackling this problem?
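Two points from the questions above, the "simple wrapper" around raw pointers and writing to a shared array without ever taking a &mut to the whole array, can be illustrated together on the host. This is a sketch under stated assumptions: SharedBase, write_at and run_block are hypothetical names, host threads stand in for device threads, and a Vec stands in for shared memory.

```rust
use std::thread;

/// Hypothetical wrapper that lets a raw base pointer cross a thread boundary.
/// Raw pointers are neither `Send` nor `Sync`, so the wrapper opts in by hand.
#[derive(Clone, Copy)]
struct SharedBase(*mut u32);

// Safety: the code below only ever writes through offsets that are unique to
// each thread, so no two threads touch the same element.
unsafe impl Send for SharedBase {}

impl SharedBase {
    /// Writes `value` to element `tid` without creating a `&mut` to the array.
    /// Safety: `tid` must be in bounds and used by at most one thread.
    unsafe fn write_at(self, tid: usize, value: u32) {
        unsafe { self.0.add(tid).write(value) }
    }
}

/// Runs `n` "threads" over a buffer of `n` elements; thread `tid` stores `tid + 1`.
fn run_block(n: usize) -> Vec<u32> {
    let mut shared = vec![0u32; n];
    let base = SharedBase(shared.as_mut_ptr());
    thread::scope(|s| {
        for tid in 0..n {
            // Every thread gets a copy of the base pointer, never a `&mut` to the buffer.
            s.spawn(move || unsafe { base.write_at(tid, tid as u32 + 1) });
        }
    });
    shared
}
```

This mirrors the "works, but is not very nice" style: correctness rests entirely on the per-thread offsets being disjoint, which the compiler cannot check, so the unsafe block carries the whole proof obligation.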