NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174
@rustbot label +O-NVPTX
I did a test with the corresponding CUDA C++ code, compiled with clang into LLVM IR:
#include <stdint.h>
#include <stdio.h>
struct foo {
uint8_t a;
uint8_t b;
uint8_t c;
};
__attribute__((noinline)) __device__ struct foo device(uint8_t v) {
struct foo s = {
.a = v,
.b = v,
.c = v
};
return s;
}
extern "C" __global__ void kernel(struct foo* output, uint8_t const* input) {
*output = device(*input);
}
One similarity between the two LLVM IR outputs is that both produce the struct as a type:
A big difference is that clang returns the struct, while rustc returns an i24.
rustc:
clang:
When compiling the equivalent Rust code for
Are there any reasons for not using this struct as the LLVM return type in rustc? The
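To make the i24 mentioned above concrete, here is a minimal host-side sketch of what a 24-bit integer carrying the three bytes of such a struct would contain. The names `Foo`, `pack_i24`, and `unpack_i24` are hypothetical, and the little-endian byte order is an assumption; this only illustrates the bit layout, not rustc's actual lowering code.

```rust
/// Same shape as the struct in the CUDA example: three bytes, no padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Foo {
    pub a: u8,
    pub b: u8,
    pub c: u8,
}

/// Hypothetical helper: pack the three fields into the low 24 bits of a u32.
/// Assumes little-endian order (field `a` in the lowest byte).
pub fn pack_i24(s: Foo) -> u32 {
    u32::from(s.a) | (u32::from(s.b) << 8) | (u32::from(s.c) << 16)
}

/// Hypothetical inverse: recover the fields from the low 24 bits.
pub fn unpack_i24(bits: u32) -> Foo {
    Foo {
        a: bits as u8,
        b: (bits >> 8) as u8,
        c: (bits >> 16) as u8,
    }
}
```

The packed value always fits in 24 bits, so widening it to a 32-bit integer only adds a zero top byte.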
Opened an issue in LLVM: llvm/llvm-project#55764
Short status update: I'm looking into two alternative solutions. The most proper one is to add a
The more hacky and "rustc-centric" solution is to add a field
I wanted to find out whether the "passing as immediate" optimization also makes sense for the NVPTX backend. I ran a test comparing a compiler with the passing-as-immediate optimization disabled (no-opt) and one that promoted
No opt time - warmup: 502.139603ms, normal: 465.524613ms
I'm surprised how much of a difference the optimization makes, even on a target like NVPTX, which contains several levels of abstraction and thus further opportunities for optimization. The conclusion is that disabling the optimization is not an alternative. Why doesn't LLVM do this optimization itself? Is there no way to select an unspecified ABI, so that it always has to follow the C ABI?
Test code
Device
#![feature(abi_ptx)]
#![no_std]
#[panic_handler]
fn breakpoint_panic_handler(_: &::core::panic::PanicInfo) -> ! {
// Spin forever; nothing after this loop is reachable.
loop {}
}
#[repr(C)]
#[derive(Clone, Copy)]
pub struct ThreeU8 {
a: u8,
b: u8,
c: u8,
}
// ptx linker is inlining the device function even if it is tagged as `never`
// I have checked that a combination of --emit=llvm-ir actually produces a function
// in llvm-ir and compiling with llc keeps the functions into ptx assembly.
// TODO: verify that this function is not inlined after ptx-linker is fixed
#[inline(never)]
#[no_mangle]
pub fn device_three_u8(v: ThreeU8) -> ThreeU8 {
ThreeU8{
a: v.b,
b: v.a,
c: (v.a + v.b)/2,
}
}
#[inline(never)]
#[no_mangle]
// CHECK: kernel_three_u8
pub unsafe extern "ptx-kernel" fn kernel_three_u8(input: *const ThreeU8, output: *mut ThreeU8) {
for i in 0..1_000_000 {
output.write_volatile(device_three_u8(*input));
}
}
Host
The kernel above was launched in a single thread on a stream and timed until synchronized.
use cust::prelude::*;
use cust::stream::{
Stream,
StreamFlags
};
const NO_OPT_PTX: &str = include_str!("no_opt.ptx");
const OPT_PTX: &str = include_str!("opt.ptx");
#[repr(C)]
#[derive(Clone, Copy, Default, cust::DeviceCopy)]
pub struct ThreeU8 {
a: u8,
b: u8,
c: u8,
}
fn main() {
let ctx = cust::quick_init().unwrap();
let module_no_opt = Module::from_ptx(NO_OPT_PTX, &[]).unwrap();
let module_opt = Module::from_ptx(OPT_PTX, &[]).unwrap();
let stream = Stream::new(StreamFlags::NON_BLOCKING, None).unwrap();
let i = cust::memory::DeviceBox::new(&ThreeU8 {a: 4, b: 5, c: 6}).unwrap();
let o = cust::memory::DeviceBox::new(&ThreeU8::default()).unwrap();
let func_no_opt = module_no_opt.get_function("kernel_three_u8").unwrap();
let func_opt = module_opt.get_function("kernel_three_u8").unwrap();
// warm up
let mut before_run = std::time::Instant::now();
unsafe {
launch!(
// slices are passed as two parameters, the pointer and the length.
func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
).unwrap();
}
stream.synchronize().unwrap();
let no_opt_warmup = std::time::Instant::now() - before_run;
before_run = std::time::Instant::now();
unsafe {
launch!(
// slices are passed as two parameters, the pointer and the length.
func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
).unwrap();
}
stream.synchronize().unwrap();
let no_opt = std::time::Instant::now() - before_run;
before_run = std::time::Instant::now();
unsafe {
launch!(
// slices are passed as two parameters, the pointer and the length.
func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
).unwrap();
}
stream.synchronize().unwrap();
let opt_warmup = std::time::Instant::now() - before_run;
before_run = std::time::Instant::now();
unsafe {
launch!(
// slices are passed as two parameters, the pointer and the length.
func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
).unwrap();
}
stream.synchronize().unwrap();
let opt = std::time::Instant::now() - before_run;
println!("No opt time - warmup: {:?}, normal: {:?}", no_opt_warmup, no_opt);
println!("With opt time - warmup: {:?}, normal: {:?}", opt_warmup, opt);
}
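The four timing blocks in the host program repeat the same pattern: take a timestamp, launch, synchronize, take the difference. A rough sketch of how that could be factored out with only the standard library follows. The helper name `time_warmup_and_run` is hypothetical, and the closure stands in for the launch plus `stream.synchronize()` pair; no `cust` API is used here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: run `work` twice and return
/// (warm-up duration, second-run duration).
pub fn time_warmup_and_run<F: FnMut()>(mut work: F) -> (Duration, Duration) {
    // First run: includes any one-time start-up cost.
    let start = Instant::now();
    work();
    let warmup = start.elapsed();

    // Second run: the "normal" measurement.
    let start = Instant::now();
    work();
    let normal = start.elapsed();

    (warmup, normal)
}
```

In the host program, the closure would wrap the `launch!` invocation and the synchronize call for one module, so that each module is measured with a single call to the helper.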
A fix has been merged in LLVM (https://reviews.llvm.org/D129291). The next step is to get it into rustc.
The patch is included in the LLVM 15 upgrade currently in progress (#99464). I should add a test for this after the LLVM 15 upgrade is completed.
The LLVM 15 update has since happened. Does this work now?
Yes! This does work after LLVM 15. I just forgot to go back and close it. Thanks for the reminder!
I tried this code (compiling with rustc +nightly main_rs.rs --target nvptx64-nvidia-cuda --crate-type=cdylib --emit=asm):
I expected to see this happen: A well-formed .ptx file
Instead, this happened: The following error (from ptx linker)
Meta
rustc --version --verbose:
Comment
It seems like there are problems with NVPTXISD::StoreRetval being used with an s24. I assume the problem is that s24 is not a valid type at all. If anyone knows where this problem originates, and whether it's on the rustc or LLVM side, I would be very thankful.
I will do some experiments with clang, possibly this weekend, to see how device functions are being called with such types.