Safety soundness of mut borrows in wavefront/warp/superslice invoked functions #10
If what you are saying is that you are somehow constructing more than one mutable reference to some memory, then this is anything but valid. Since you haven't provided a full example, I can only speculate about what you mean.
Many libraries solve this problem (e.g. rayon).
I don't think the nvptx backend currently has this problem (EDIT: as in, it doesn't have any sort of implicit arrays yet, but I think we currently allow passing references/slices as arguments, which might be incorrect).
I mean that Rust will compile it without error. And that statement actually doesn't construct more than one mut ref on SIMT machines; instead, it's conceptually multiple threads having access to a single mut borrow. On AMDGPU at least, the mut ref will live in scalar registers; all the work items will have access to it, and the hardware will scalarize their writes in an undefined order. Rust is unaware of the SIMT-y nature. I recognize that the statement itself is a bit contrived and obviously invalid (for me this is the big reason why I didn't think of this as an issue). The issue is that Rust doesn't think it's invalid either.
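To make that concrete, here is a host-runnable sketch of the statement in question; `global_id_x` is a stand-in stub for the hardware work-item id (cf. OpenCL's `get_global_id(0)`), not a real API:

```rust
fn global_id_x() -> usize {
    0 // on a real GPU, every lane of the wavefront sees a different value here
}

// The borrow checker sees exactly one `&mut [u64]`, so this is "safe" Rust.
// On a SIMT machine, every lane of the wavefront runs this body with that
// same borrow, and the hardware serializes the writes in an undefined order.
fn kernel(out: &mut [u64]) {
    out[0] = global_id_x() as u64; // every lane writes; the last one scalarized wins
}

fn main() {
    let mut data = vec![0u64; 8];
    kernel(&mut data);
    println!("{}", data[0]);
}
```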
I think I see what you mean, yeah. The SIMT model is inconsistent with the safety guarantees of Rust.

Rayon doesn't solve this problem. Rayon (or, more accurately, std) prevents this problem by using the type system and the `Send`/`Sync` traits. On the GPU, it's trivial to have multiple threads with mutable references to the same value, or with shared references to non-`Sync` values. That makes it possible to have memory unsafety and data races in Rust code without explicitly writing `unsafe`. It's unfortunate, but I think we just have to accept that GPGPU doesn't fit easily into Rust's safety guarantees.

In addition to this, there are several ways to create data races using the CUDA API - for instance, launching a kernel that writes to Unified Memory, then reading that memory on the host without waiting for the kernel launch to complete.

At worst, it just means that kernel entry functions and asynchronous launches must be marked as `unsafe`. Perhaps with time we'll think of better solutions.
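As a host-side illustration of that prevention (a sketch; the commented-out lines are the ones the compiler rejects):

```rust
fn main() {
    let mut data = vec![0u64; 8];
    let out = &mut data[..];

    // Handing the same `&mut` to two host threads does not compile:
    // `std::thread::spawn` requires `F: Send + 'static`, and the borrow
    // checker rejects the second capture outright.
    //
    // let a = std::thread::spawn(|| out[0] = 1); // error: borrowed data escapes
    // let b = std::thread::spawn(|| out[0] = 2); // error: second mutable borrow

    out[0] = 0; // on the host, only one unique borrow ever exists
}
```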
Kernels don't have shared memory arrays (today), so one would need to pass kernels a mutable reference to create undefined behavior in safe code. We just need to add a check to the …
In particular, device functions can accept …
Technically, we don't even have to do that. Launching a global kernel is `unsafe`.
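A sketch of the shape this suggests; `launch` here is a hypothetical stand-in for a real runtime's enqueue, modeled as a single host-side call so it runs:

```rust
// `launch` is an illustrative name, not a real API. Making the launch itself
// an `unsafe fn` moves the aliasing obligation from the type system to the caller.
unsafe fn launch(kernel: fn(&mut [u64]), grid: usize, args: &mut [u64]) {
    // A real runtime would enqueue `grid` work-items that all receive
    // `args`; one call stands in for that here so the sketch runs.
    let _ = grid;
    kernel(args);
}

fn kernel(out: &mut [u64]) {
    out[0] = 1;
}

fn main() {
    let mut data = vec![0u64; 8];
    // The `unsafe` block is the caller's promise that cross-lane aliasing
    // through `args` is acceptable for this kernel.
    unsafe { launch(kernel, 64, &mut data) };
    println!("{}", data[0]);
}
```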
@gnzlbg You requested a complete example on Zulip. I've pasted it below:

```rust
#![feature(allocator_api)]
extern crate hsa_rt;
extern crate runtime;
#[macro_use]
extern crate log;
extern crate env_logger;
extern crate legionella_std;
use std::alloc::{Global, Alloc, Layout, };
use std::mem::{size_of, };
use hsa_rt::agent::{DeviceType, };
use hsa_rt::queue::{FenceScope, };
use runtime::context::{Context, };
use runtime::module::{Invoc, };
use legionella_std::{dispatch_packet, mem::*, mem::slice::*, };
const WORKITEM_SIZE: usize = 8;
const X_SIZE: usize = 1000;
/// This is our kernel.
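/// Every work item in the dispatch writes to `out[0]` through the same
/// `SliceMut`, so the final value depends on which lane's write lands last.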
fn obviously_undefined_behaviour(mut out: SliceMut<u64>) {
let dispatch = dispatch_packet();
out[0] = dispatch.global_id_x() as u64;
}
pub fn main() {
env_logger::init();
let ctxt = Context::new()
.expect("create context");
let accel = ctxt
.find_accel(|accel| {
match accel.agent().device_type() {
Ok(DeviceType::Gpu) => true,
_ => false,
}
})
.expect("lock failure")
.expect("no suitable accelerators");
let workitems = accel.isa()
.map(|isa| {
isa.workgroup_max_size()
})
.unwrap_or(Ok(1))
.expect("get max workgroup size") as usize;
assert!(workitems >= WORKITEM_SIZE,
"not enough workitems per workgroup for this example");
info!("using workgroup size of {}", WORKITEM_SIZE);
let mut invoc = Invoc::new(ctxt.clone(), obviously_undefined_behaviour)
.expect("Invoc::new");
invoc.workgroup_dims((WORKITEM_SIZE, ))
.expect("Invoc::workgroup_dims");
invoc.grid_dims((X_SIZE, ))
.expect("Invoc::grid_dims");
invoc.begin_fence = Some(FenceScope::System);
invoc.end_fence = Some(FenceScope::System);
let kernel_props = invoc.precompile(&accel)
.expect("kernel compilation");
let agent = accel.agent();
let data_bytes = X_SIZE * size_of::<u64>();
let group_size = Some(kernel_props.group_segment_size());
let private_size = Some(kernel_props.private_segment_size());
let host_data = unsafe {
    // Allocate the host frame data with page alignment. This isn't
    // *required*, but it's probably nicer for the driver.
// XXX hardcoded page size.
let layout =
Layout::from_size_align(data_bytes, 4096)
.unwrap();
let data = Global.alloc(layout)
.expect("allocate kernel data");
Vec::from_raw_parts(data.as_ptr() as *mut u64,
X_SIZE, X_SIZE)
};
let mut data = host_data.lock_memory(&[agent])
.expect("lock host memory to GPU");
let queue = agent.new_kernel_multi_queue(4, group_size, private_size)
.expect("Agent::new_kernel_multi_queue");
invoc.call_accel_sync(&accel, &queue, (data.as_slice_mut(), ))
.expect("Invoc::call_accel_sync");
println!("the winning global id is: {}", data.as_slice()[0]);
}
```

I ran this a few times and the results were …
Yes, so that's what I had in mind that you were trying to say. In that example, your …

The NVPTX backend that's part of Rust does not have these problems. The only way to introduce undefined behavior is by using `unsafe`.
My fork (not really a fork, as it tracks master; I just use a few patches on top of master which haven't made it into Rust proper, nothing exotic) doesn't extend the Rust language at all (which is the point). The NVPTX target 100% has this issue too; the issue is completely target independent. My runtime doesn't support CUDA/NVPTX (and I'd have to refactor a couple of things to abstract over the HSA/CUDA libraries first, so it's not a change I could finish today), but the crate is agnostic of the specific target machine. I'd be willing to give working group members free access (possibly including source), but I haven't yet set up the infra to support that (or infra to support commercial customers, for that matter; WIP).
Can you show an example that exhibits this issue with the NVPTX target? I don't think one can be constructed without `unsafe` code.
In particular, only global NVPTX kernels can be launched from the host, and these are always `unsafe`.
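For illustration, a sketch of that shape under the nightly `abi_ptx` feature for the `nvptx64-nvidia-cuda` target (the `write_id` kernel and its parameters are made up for this example):

```rust
#![feature(abi_ptx)]
#![no_std]

// Only `extern "ptx-kernel"` globals are launchable entry points; they take
// raw pointers rather than references, and every dereference is `unsafe`,
// so the aliasing hazard never appears in safe code.
#[no_mangle]
pub unsafe extern "ptx-kernel" fn write_id(out: *mut u64, id: u64) {
    *out = id; // all threads target the same cell; the caller opted in via `unsafe`
}

#[panic_handler]
fn panic(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
```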
I believe I've at least partially solved this SIMT issue (or rather, I've found a sort of workaround for it). A rough description of the solution: require SIMT entry points to take no arguments, passing resources through globals instead (I suppose I'm writing this from my framework's POV, which transforms things so that such globals are left undefined, as expected; otherwise, you'd have to use an …).

On its face, this won't allow implicitly safe usage, and it runs afoul of device-side enqueue/kernel launch (as the to-be-launched kernel would have to be copied to a new kernel so that the requested parameter relocations could be written, including possibly to the …).

Granted, this workaround flies in the face of what is pretty much universal practice in the GPGPU space. Though it is curious that OpenGL/GLSL/Vulkan/SPIR-V use globals similarly to the above for resources provided to shaders and even kernels.
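A minimal host-runnable sketch of that globals-based shape, with illustrative names (`Binding`, `OUT`, `kernel_entry`) and a plain function call standing in for both the dispatch and the runtime's relocation step:

```rust
use std::cell::UnsafeCell;
use std::ptr;

struct Binding(UnsafeCell<*mut u64>);
// Exclusivity becomes the runtime's obligation, not the type system's.
unsafe impl Sync for Binding {}

static OUT: Binding = Binding(UnsafeCell::new(ptr::null_mut()));

fn kernel_entry() {
    // No parameters: every resource is reached through a module-scope
    // global, so no `&mut` ever crosses the launch boundary.
    unsafe {
        let out = *OUT.0.get();
        *out = 42; // lanes write through a raw pointer, not a unique borrow
    }
}

fn main() {
    let mut data = vec![0u64; 1];
    unsafe { *OUT.0.get() = data.as_mut_ptr() }; // the "parameter relocation"
    kernel_entry(); // stand-in for a device dispatch
    println!("{}", data[0]);
}
```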
Consider this valid Rust statement: `some_mut_slice[0] = global_id;` where the right hand value is basically the equivalent of `get_global_id(0)` in OpenCL. On AMDGPU at least, the value of `some_mut_slice[0]` after the kernel returns is undefined. I'll bet Nvidia is similar in this. IBM's POWER9 (as used in Summit, currently the fastest supercomputer in the world) features SMT, so for example SMT1 mode on SMT4 hardware would be four slices (threads, basically) running with a single instruction pointer; it would be the same in SMT1 or SMT2 (though I don't have documentation to back this up).

Essentially, I think the issue with the borrow model is that it assumes a single thread is a single thread, and not a wavefront/warp/superslice, and thus that a mut borrow is unique.
I have no idea how to go about solving this.