Safety soundness of mut borrows in wavefront/warp/superslice invoked functions #10
If what you are saying is that you are somehow constructing more than one mutable reference to some memory, then this is anything but valid. Since you haven't provided a full example, I can only speculate about what you mean.
Many libraries solve this problem (e.g. rayon).
I don't think the nvptx backend currently has this problem (EDIT: as in, it doesn't have any sort of implicit arrays yet, but I think we currently allow passing references/slices as arguments, which might be incorrect).
I mean that Rust will compile it without error. And that statement actually doesn't construct more than one mut ref on SIMT machines; instead, it's conceptually multiple threads having access to a single mut borrow. On AMDGPU at least, the mut ref will live in scalar registers; all the work items will have access to it, and the hardware will scalarize their writes in an undefined order. Rust is unaware of the SIMT-y nature. I recognize that the statement itself is a bit contrived and obviously invalid (for me this is the big reason why I didn't think of this as an issue). The issue is that Rust doesn't think it's invalid either.
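To make that concrete, here is a host-runnable sketch of the statement in question; `global_id_x` is a stand-in stub for the hardware work-item id (cf. OpenCL's `get_global_id(0)`), not a real API:

```rust
fn global_id_x() -> usize {
    0 // on a real GPU, every lane of the wavefront sees a different value here
}

// The borrow checker sees exactly one `&mut [u64]`, so this is "safe" Rust.
// On a SIMT machine, every lane of the wavefront runs this body with that
// same borrow, and the hardware serializes the writes in an undefined order.
fn kernel(out: &mut [u64]) {
    out[0] = global_id_x() as u64; // every lane writes; the last one scalarized wins
}

fn main() {
    let mut data = vec![0u64; 8];
    kernel(&mut data);
    println!("{}", data[0]);
}
```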
I think I see what you mean, yeah. The SIMT model is inconsistent with the safety guarantees of Rust.

Rayon doesn't solve this problem. Rayon (or, more accurately, std) prevents this problem by using the type system and the `Send`/`Sync` traits. On the GPU, it's trivial to have multiple threads with mutable references to the same value, or with shared references to non-`Sync` values. That makes it possible to have memory unsafety and data races in Rust code without explicitly writing `unsafe`. It's unfortunate, but I think we just have to accept that GPGPU doesn't fit easily into Rust's safety guarantees.

In addition to this, there are several ways to create data races using the CUDA API - for instance, launching a kernel that writes to Unified Memory, then reading that memory on the host without waiting for the kernel launch to complete.

At worst, it just means that kernel entry functions and asynchronous launches must be marked as `unsafe`. Perhaps with time we'll think of better solutions.
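As a host-side illustration of that prevention (a sketch; the commented-out lines are the ones the compiler rejects):

```rust
fn main() {
    let mut data = vec![0u64; 8];
    let out = &mut data[..];

    // Handing the same `&mut` to two host threads does not compile:
    // `std::thread::spawn` requires `F: Send + 'static`, and the borrow
    // checker rejects the second capture outright.
    //
    // let a = std::thread::spawn(|| out[0] = 1); // error: borrowed data escapes
    // let b = std::thread::spawn(|| out[0] = 2); // error: second mutable borrow

    out[0] = 0; // on the host, only one unique borrow ever exists
}
```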
Kernels don't have shared memory arrays (today), so one would need to pass kernels a mutable reference to create undefined behavior in safe code. We just need to add a check to the …
In particular, device functions can accept …
Technically, we don't even have to do that. Launching a global kernel is `unsafe`.
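A sketch of the shape this suggests; `launch` here is a hypothetical stand-in for a real runtime's enqueue, modeled as a single host-side call so it runs:

```rust
// `launch` is an illustrative name, not a real API. Making the launch itself
// an `unsafe fn` moves the aliasing obligation from the type system to the caller.
unsafe fn launch(kernel: fn(&mut [u64]), grid: usize, args: &mut [u64]) {
    // A real runtime would enqueue `grid` work-items that all receive
    // `args`; one call stands in for that here so the sketch runs.
    let _ = grid;
    kernel(args);
}

fn kernel(out: &mut [u64]) {
    out[0] = 1;
}

fn main() {
    let mut data = vec![0u64; 8];
    // The `unsafe` block is the caller's promise that cross-lane aliasing
    // through `args` is acceptable for this kernel.
    unsafe { launch(kernel, 64, &mut data) };
    println!("{}", data[0]);
}
```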
@gnzlbg You requested a complete example on Zulip. I've pasted it below:

```rust
#![feature(allocator_api)]
extern crate hsa_rt;
extern crate runtime;
#[macro_use]
extern crate log;
extern crate env_logger;
extern crate legionella_std;
use std::alloc::{Global, Alloc, Layout, };
use std::mem::{size_of, };
use hsa_rt::agent::{DeviceType, };
use hsa_rt::queue::{FenceScope, };
use runtime::context::{Context, };
use runtime::module::{Invoc, };
use legionella_std::{dispatch_packet, mem::*, mem::slice::*, };
const WORKITEM_SIZE: usize = 8;
const X_SIZE: usize = 1000;
/// This is our kernel.
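/// Every work item in the dispatch writes to `out[0]` through the same
/// `SliceMut`, so the final value depends on which lane's write lands last.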
fn obviously_undefined_behaviour(mut out: SliceMut<u64>) {
let dispatch = dispatch_packet();
out[0] = dispatch.global_id_x() as u64;
}
pub fn main() {
env_logger::init();
let ctxt = Context::new()
.expect("create context");
let accel = ctxt
.find_accel(|accel| {
match accel.agent().device_type() {
Ok(DeviceType::Gpu) => true,
_ => false,
}
})
.expect("lock failure")
.expect("no suitable accelerators");
let workitems = accel.isa()
.map(|isa| {
isa.workgroup_max_size()
})
.unwrap_or(Ok(1))
.expect("get max workgroup size") as usize;
assert!(workitems >= WORKITEM_SIZE,
"not enough workitems per workgroup for this example");
info!("using workgroup size of {}", WORKITEM_SIZE);
let mut invoc = Invoc::new(ctxt.clone(), obviously_undefined_behaviour)
.expect("Invoc::new");
invoc.workgroup_dims((WORKITEM_SIZE, ))
.expect("Invoc::workgroup_dims");
invoc.grid_dims((X_SIZE, ))
.expect("Invoc::grid_dims");
invoc.begin_fence = Some(FenceScope::System);
invoc.end_fence = Some(FenceScope::System);
let kernel_props = invoc.precompile(&accel)
.expect("kernel compilation");
let agent = accel.agent();
let data_bytes = X_SIZE * size_of::<u64>();
let group_size = Some(kernel_props.group_segment_size());
let private_size = Some(kernel_props.private_segment_size());
let host_data = unsafe {
    // Allocate the host frame data with page alignment. This isn't
    // *required*, but it's probably nicer for the driver.
// XXX hardcoded page size.
let layout =
Layout::from_size_align(data_bytes, 4096)
.unwrap();
let data = Global.alloc(layout)
.expect("allocate kernel data");
Vec::from_raw_parts(data.as_ptr() as *mut u64,
X_SIZE, X_SIZE)
};
let mut data = host_data.lock_memory(&[agent])
.expect("lock host memory to GPU");
let queue = agent.new_kernel_multi_queue(4, group_size, private_size)
.expect("Agent::new_kernel_multi_queue");
invoc.call_accel_sync(&accel, &queue, (data.as_slice_mut(), ))
.expect("Invoc::call_accel_sync");
println!("the winning global id is: {}", data.as_slice()[0]);
}
```

I ran this a few times and the results were …
Yes, so that's what I had in mind that you were trying to say. In that example, your …

The NVPTX backend that's part of Rust does not have these problems. The only way to introduce undefined behavior is by using `unsafe`.
My fork (not really a fork, as it tracks master; I just use a few patches on top of master which haven't made it into Rust proper, nothing exotic) doesn't extend the Rust language at all (which is the point). The NVPTX target 100% has this issue too; the issue is completely target independent. My runtime doesn't support CUDA/NVPTX (and I'd have to refactor a couple of things to abstract over the HSA/CUDA libraries first, so it's not a change I could finish today), but the crate is agnostic of the specific target machine. I'd be willing to give working group members free access (possibly including source), but I haven't yet set up the infra to support that (or infra to support commercial customers, for that matter; WIP).
Can you show an example that exhibits this issue with the NVPTX target? I don't think one can be constructed without `unsafe` code.
In particular, only global NVPTX kernels can be launched from the host, and these are always `unsafe`.
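For illustration, a sketch of that shape under the nightly `abi_ptx` feature for the `nvptx64-nvidia-cuda` target (the `write_id` kernel and its parameters are made up for this example):

```rust
#![feature(abi_ptx)]
#![no_std]

// Only `extern "ptx-kernel"` globals are launchable entry points; they take
// raw pointers rather than references, and every dereference is `unsafe`,
// so the aliasing hazard never appears in safe code.
#[no_mangle]
pub unsafe extern "ptx-kernel" fn write_id(out: *mut u64, id: u64) {
    *out = id; // all threads target the same cell; the caller opted in via `unsafe`
}

#[panic_handler]
fn panic(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
```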
I believe I've at least partially solved this SIMT issue (or rather, I've found a sort of workaround for it). A rough description of the solution: require SIMT entry points to take no arguments, passing resources through globals instead (I suppose I'm writing this from my framework's POV, which transforms things so that such globals are left undefined, as expected; otherwise, you'd have to use an …).

On its face, this won't allow implicitly safe usage, and it runs afoul of device-side enqueue/kernel launch (as the to-be-launched kernel would have to be copied to a new kernel so that the requested parameter relocations could be written, including possibly to the …).

Granted, this workaround flies in the face of what is pretty much universal practice in the GPGPU space. Though it is curious that OpenGL/GLSL/Vulkan/SPIR-V use globals similarly to the above for resources provided to shaders and even kernels.
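A minimal host-runnable sketch of that globals-based shape, with illustrative names (`Binding`, `OUT`, `kernel_entry`) and a plain function call standing in for both the dispatch and the runtime's relocation step:

```rust
use std::cell::UnsafeCell;
use std::ptr;

struct Binding(UnsafeCell<*mut u64>);
// Exclusivity becomes the runtime's obligation, not the type system's.
unsafe impl Sync for Binding {}

static OUT: Binding = Binding(UnsafeCell::new(ptr::null_mut()));

fn kernel_entry() {
    // No parameters: every resource is reached through a module-scope
    // global, so no `&mut` ever crosses the launch boundary.
    unsafe {
        let out = *OUT.0.get();
        *out = 42; // lanes write through a raw pointer, not a unique borrow
    }
}

fn main() {
    let mut data = vec![0u64; 1];
    unsafe { *OUT.0.get() = data.as_mut_ptr() }; // the "parameter relocation"
    kernel_entry(); // stand-in for a device dispatch
    println!("{}", data[0]);
}
```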
Consider this valid Rust statement: `some_mut_slice[0] = global_id;` where the right hand value is basically the equivalent of `get_global_id(0)` in OpenCL. On AMDGPU at least, the value of `some_mut_slice[0]` after the kernel returns is undefined. I'll bet Nvidia is similar in this. IBM's POWER9 (as used in Summit, currently the fastest supercomputer in the world) features SMT, so for example SMT1 mode on SMT4 hardware would be four slices (threads, basically) running with a single instruction pointer; it would be the same in SMT1 or SMT2 (though I don't have documentation to back this up).

Essentially, I think the issue with the borrow model is that it assumes a single thread is a single thread, and not a wavefront/warp/superslice, and thus that a mut borrow is unique.
I have no idea how to go about solving this.