Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global and device kernels are unsound #11

Open
gnzlbg opened this issue Nov 16, 2018 · 6 comments
Open

Global and device kernels are unsound #11

gnzlbg opened this issue Nov 16, 2018 · 6 comments

Comments

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 16, 2018

Motivation

Launching NVPTX global kernels is unsafe - they are unsafe fn, and this requires the Rust program that launches to use an unsafe block. For most examples below, this program has undefined behavior because the unsafe code it contains is incorrect.

However, must of the kernels below are never correct, so it would be very helpful for the compiler to reject them, or to at least warn about their issues.

Examples

These are some examples of code that's accepted today. Most of these examples are always UB.

Launching these global kernels

#[no_mangle]
pub unsafe extern "ptx-kernel" fn foo(a: &mut f32) {} // UB: multiple &mut

pub struct Bar<'a>(&'a mut f32);
#[no_mangle]
pub unsafe extern "ptx-kernel" fn bar<'a>(a: Bar<'a>) {}  // UB: multiple &mut

is always undefined behavior: these kernels are spawned in multiple threads of execution, each containing a copy of the same&mut T to the same data. On the other hand:

#[no_mangle]
pub unsafe extern "ptx-kernel" fn foo(a: &mut f32) {}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn bar(mut a: f32) {
    foo(&mut a); // OK - global kernels can be called from other kernels
} 

global kernels that are called from other kernels are executed in the same thread of execution. Device kernels as well:

fn device(a: &mut i32) { a += 1 } // OK

#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
   let mut a = 0;
   device(&mut a); // OK: each a is local to each thread of execution
}

We don't support static and dynamic shared arrays in kernels yet, but NVPTX does, and we'd like to support them at some point. These arrays are shared across all threads of execution without any synchronization:

fn device(a: &mut [i32; 32]) {
    a[0] += 1;  // UB: data-race
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
   let mut a = UnsyncShared::<[0_i32; 32]>::new(); // OK: create unsynchronized shared memory array
   device(&mut a); // UB: multiple &mut to same object
}

Note that there are two issues with these. When a device function creates them, these are shared across all execution threads of that device function. That is, taking a &mut T to the whole array creates many copies, one on each execution thread, of the same &mut T to the exact same data. This is already undefined behavior, and can be used to introduce data-races.

We might want to support synchronized (e.g. atomic) versions of the shared memory arrays as well. While they might avoid the data-race, taking a &mut T to the array still creates multiple &mut T to the same data, which is undefined behavior. That is, just adding synchronization does not solve the problem (this is also not desirable for performance).

We'd like to accept this code:

fn device(a: &mut i32) {  *a += 1; } 

#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
   let mut a = UnsyncShared::<[0_i32; 32]>::new(); // OK: create unsynchronized shared memory array
   device(&mut a[nvptx::_thread_idx_x()])); // UB
}

but note that IndexMut::index_mut(&mut self) would create multiple &mut T to the shared array, one on each thread, which results in UB as well. The following example should work, but is not very nice:

fn device(a: &mut i32) {  *a += 1; } 

#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
   let mut a = UnsyncShared::<[0_i32; 32]>::new(); // OK: create unsynchronized shared memory array
   let p = &mut a as *mut _ as *mut i32; // OK: &mut T as *mut T does not create a &mut T
   device(unsafe { &mut * p.add(nvptx::_thread_idx_x()) }); // OK
}

Questions

What general approaches do we have to make these examples sound?

  • trivial: reject global kernels (abi ptx-kernel) that are not unsafe fn

Should we also pursue an approach that lints on "improper global/device kernel arguments" ? E.g.

  • global and device kernel arguments require: Sync - probably as a too hard constraint, since it does not allow raw pointers, also we technically only require Sync for mutable references to shared memory. Mutable references that do not point to shared memory are fine.
  • launching a global kernel requires the types passed to the kernel to be: SendGPU or similar (DeviceCopy as @bheisler put it below), since these arguments need to be sendable from the Host to the Device, and Copyable to the multiple execution threads of the device.

It might get tricky to propagate these lints through generic code, e.g., when calling Index::index as a device function. Also, Sync prevents raw pointers. A simple wrapper solves this, but we might want to allow raw pointers for convenience here.

What do we do about shared memory device arrays? Taking a &mut to them is always undefined behavior, which makes them extremely easy to use incorrectly, and very hard and unergonomic to use correctly.

Are there any other ways of tackling this problem?

@gnzlbg gnzlbg changed the title Global kernels taking mutable references via their arguments should be illegal Global kernel arguments must be Sync Nov 17, 2018
@gnzlbg gnzlbg changed the title Global kernel arguments must be Sync Global kernel arguments must at least be Sync Nov 17, 2018
@gnzlbg gnzlbg changed the title Global kernel arguments must at least be Sync Global and device kernel argument constraints Nov 17, 2018
@gnzlbg gnzlbg changed the title Global and device kernel argument constraints Global and device kernel argument are unsound Nov 17, 2018
@gnzlbg gnzlbg changed the title Global and device kernel argument are unsound Global and device kernels are unsound Nov 17, 2018
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Nov 17, 2018

cc @rkruppe @nikomatsakis

@bheisler
Copy link

In RustaCUDA I have a marker trait (DeviceCopy) for types which are safe to copy to the GPU, along with a custom derive macro that checks that DeviceCopy must be implemented on all fields of a type. It's implemented by default on all of the primitive numeric types, but not on references (because that would be unsound) or raw pointers (because they could be pointing to host memory). Instead, the user has to use some other function to get a DevicePointer, which contains the *mut T pointing to device memory and does implement DeviceCopy.

All of the device allocation structures require that their contents implement DeviceCopy. I haven't implemented kernel launching yet, but I was planning to have the same restriction for parameters.

Perhaps a similar approach could be taken here, at least for kernel parameters. This doesn't help limit unsafety within a kernel, though.

I chose not to have DeviceCopy be a subtrait of Copy. It's quite likely that users will want to implement DeviceCopy for large and complex structures, where it would be inefficient to pass-by-copy.

@peterhj
Copy link
Contributor

peterhj commented Dec 1, 2018

The function parameters to a kernel can be anything that nvcc successfully compiles, and which can be copied as bits; see the docs for cuLaunchKernel which basically passes kernel parameters via a void ** (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15)

So, for extern "ptx-kernel" function signatures to be sound should really only require their parameters to be Copy (edit: I suppose they should be 'static as well). Once that's the case, I'm not sure the function needs to be explicitly marked unsafe (of course, invoking the kernel should still be unsafe since it's extern well I guess if we launch them using cuLaunchKernel anyway it's moot...).

@peterhj
Copy link
Contributor

peterhj commented Dec 2, 2018

@gnzlbg Also re: your UnsyncShared example, shared memory is essentially just a pointer. If we want individual threads to take mutable references of independent parts of shared memory, what we probably want is something in the same spirit as the existing split_ APIs on slices (https://doc.rust-lang.org/std/primitive.slice.html#method.split_first), except that the shared memory-related APIs would take a shared reference to the shared memory and return either shared or mutable references to independent parts of the shared memory.

An example:

fn device(a: &mut i32) {  *a += 1; }

pub extern "ptx-kernel" fn kernel() {   // not unsafe
  let shm_size: usize = 32;
  let shm_ptr: *mut i32 = /* somehow, get shared memory... */;
  let shm = unsafe { CuSharedMemSlice::from_raw(shm_ptr, shm_size) };   // OK
  // fn split_tid_x<'a>(&'a self) -> &'a T { ... }
  // fn split_tid_x_mut<'a>(&'a self) -> &'a mut T { ... }
  // internally the impl of split_tid does some pointer weakening coercions like in your last example
  let x: &mut i32 = shm.split_tid_x_mut();   // OK
  device(x);   // OK
}

However the issue of shared memory seems to then be a matter of higher-level APIs, rather than one of soundness, unlike the issue of parameters of "ptx-kernel" functions.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Dec 2, 2018

what we probably want is something in the same spirit as the existing split_ APIs on slices

Yeah, we probably want something like this, but people often like to do more "complicated" things like this - like touching multiple disjoint non-necessarily-contiguous elements of an array from each of the threads. (EDIT: I always thought that providing something like Matlab's reshape as a way to create disjoint views of 1D arrays might be worth exploring here - the ndarray crate probably already offers something like this where tuples are used to reshape 1D arrays).

Enabling those use cases while at the same time rejecting code that has undefined behavior is the tricky part.

@peterhj
Copy link
Contributor

peterhj commented Dec 3, 2018

That makes sense. Sounds to me then there are at least a few separate issues all under the "soundness" umbrella:

  • Currently, global kernel functions can have an unsound signature, e.g. your example: pub unsafe extern "ptx-kernel" fn foo(a: &mut f32) {}. One resolution is for the params to be Sync. @bheisler proposes the params to be DeviceCopy which doesn't derive Copy. Yet another alternative resolution I proposed requires the params of a extern "ptx-kernel" function to be Copy + 'static. In any case, constraining the params of an extern "ptx-kernel" function ought to be easily (?) implementable in rustc today.
  • Currently, the sound usage of a device function like fn device(a: &mut i32) { a += 1 } depends on having a sound mutable reference in the first place. However Rust does not seem to know about sound SIMT semantics for (mutably) referencing disjoint regions of memory (either global or shared) from parallel threads, using special CUDA rank registers like threadIdx.x. In other words the issue is soundly obtaining a mutable reference of thread-disjoint memory regions, while also enabling more complicated/higher-level use cases.
  • Currently, there doesn't seem to be any way (sound or not) of working with __shared__ memory. (Note that both global and __shared__ memory have similar disjoint-thread access problems.) However, today, it is almost possible to work with dynamically allocated __shared__ memory. The following Rust source (EDIT: added tid.x indexing):
extern "ptx-kernel" {
  static shared: [u32; 0];
}
pub unsafe extern "ptx-kernel" fn foo() {
  let base: *mut u32 = shared.as_ptr() as *mut u32;   // is this UB? but it gives us what we want
  let p: *mut u32 = base.offset(nvptx::_thread_idx_x() as isize);
  *p = 42;
}

will compile to the following PTX assembly:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_35
.address_size 64

	// .globl	foo
.extern .global .align 4 .b8 shared[];   // WRONG: .global -> .shared
.visible .entry foo()
{
	mov.u32 	        %r1, %tid.x;
	mul.wide.u32 	%rd1, %r1, 4;
	mov.u64 	        %rd2, shared;
	add.s64 	        %rd3, %rd2, %rd1;
	mov.u32 	        %r2, 42;
	st.global.u32 	[%rd3], %r2;   // WRONG: st.global -> st.shared
}

The PTX output above is almost correct, except where there are these references to "global" memory that should instead be "shared" memory.

One problem is Rust isn't aware of NVPTX required address spaces, which is necessary for supporting __shared__ memory. However there have been some recent PRs for rustc which are adding support for address spaces, that should also help w/ __shared__ memory and the NVPTX target. One thing that probably get __shared__ memory 100% of the way there is having some sort of address space annotation, e.g. #[space(...)] as proposed by @shepmaster for use w/ the AVR target.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants