CuPBoP is a framework that supports executing unmodified CUDA source code on non-NVIDIA devices. Currently, CuPBoP supports several CPU backends, including x86, AArch64, and RISC-V. Support for Vortex (a RISC-V GPU) is a work in progress.
- Linux system
- LLVM 14.0.1
- CUDA Toolkit
Although CuPBoP does not require an NVIDIA GPU, it needs the CUDA Toolkit to compile source programs to NVVM/LLVM IR. The CUDA Toolkit can be installed on machines without NVIDIA GPUs. To install it, please refer to https://developer.nvidia.com/cuda-downloads.
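The prerequisites above can be checked before building. The following is a minimal sketch, not part of CuPBoP: `check_tools` and `check_cuda_path` are hypothetical helper names, and the test for `include/cuda.h` is an assumption about how a CUDA Toolkit installation is laid out.

```shell
# Hypothetical helpers (not shipped with CuPBoP) for checking prerequisites.

# Print "ok: TOOL" or "missing: TOOL" for every tool name passed as an argument.
# Returns non-zero if at least one tool is not found on PATH.
check_tools() {
  status=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Succeed if the directory in $1 looks like a CUDA Toolkit installation.
# Assumption: the toolkit provides include/cuda.h under its root directory.
check_cuda_path() {
  [ -n "$1" ] && [ -f "$1/include/cuda.h" ]
}
```

For example, `check_tools clang++ llc cmake make g++` lists the tools used later in this document, and `check_cuda_path "$CUDA_PATH" || echo "CUDA Toolkit not found"` reports a missing or misconfigured toolkit location. Neither helper checks the LLVM version.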
- Clone from GitHub
git clone --recursive https://github.com/cupbop/CuPBoP
cd CuPBoP
export CuPBoP_PATH=`pwd`
export LD_LIBRARY_PATH=$CuPBoP_PATH/build/runtime:$CuPBoP_PATH/build/runtime/threadPool:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda-11.7 # set to your own location
- Build CuPBoP
mkdir build && cd build
# set -DDEBUG=ON for debugging
cmake ..
make
- (Optional) Use CuPBoP to execute the Hetero-Mark benchmark for verification
make test
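A common source of runtime failures is a shell session in which `LD_LIBRARY_PATH` no longer contains the two runtime directories exported during setup. The sketch below checks for them; `path_list_contains` and `check_runtime_paths` are hypothetical helper names, not part of CuPBoP.

```shell
# Hypothetical helpers (not shipped with CuPBoP) for checking LD_LIBRARY_PATH.

# Succeed if directory $1 is an entry of the colon-separated list $2.
path_list_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether the CuPBoP runtime directories are present in a path list.
#   $1: CuPBoP root directory (the value of CuPBoP_PATH)
#   $2: colon-separated list to inspect (defaults to LD_LIBRARY_PATH)
check_runtime_paths() {
  root=$1
  list=${2:-${LD_LIBRARY_PATH:-}}
  for dir in "$root/build/runtime" "$root/build/runtime/threadPool"; do
    if path_list_contains "$dir" "$list"; then
      echo "ok: $dir"
    else
      echo "missing: $dir"
    fi
  done
}
```

Running `check_runtime_paths "$CuPBoP_PATH"` in the shell used for building prints one line per runtime directory. The comparison is a plain string match, so a trailing slash or a symlinked path counts as a different entry.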
In this section, we provide an example of how to use CuPBoP to execute a CUDA program.
cd examples/vecadd
# Compile CUDA source code (both host and kernel) to bitcode files
clang++ -std=c++11 vecadd.cu \
-I../.. --cuda-path=$CUDA_PATH \
--cuda-gpu-arch=sm_50 -L$CUDA_PATH/lib64 \
-lcudart_static -ldl -lrt -pthread -save-temps -v || true
# Apply compilation transformations on the kernel bitcode file
$CuPBoP_PATH/build/compilation/kernelTranslator \
vecadd-cuda-nvptx64-nvidia-cuda-sm_50.bc kernel.bc
# Apply compilation transformations on the host bitcode file
$CuPBoP_PATH/build/compilation/hostTranslator \
vecadd-host-x86_64-unknown-linux-gnu.bc host.bc
# Generate object files
llc --relocation-model=pic --filetype=obj kernel.bc
llc --relocation-model=pic --filetype=obj host.bc
# Link with runtime libraries and generate the executable file
g++ -o vecadd -fPIC -no-pie \
-I$CuPBoP_PATH/runtime/threadPool/include \
-L$CuPBoP_PATH/build/runtime \
-L$CuPBoP_PATH/build/runtime/threadPool \
host.o kernel.o \
-I../.. -lc -lCPUruntime -lthreadPool -lpthread
# Execute
./vecadd
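The steps above can be wrapped in a small script to reuse them for other single-file CUDA programs. This is a sketch under stated assumptions, not a tool provided by CuPBoP: the function names are hypothetical, the bitcode file names are inferred from the vecadd example and assume an x86-64 Linux host, and the `-I../..` include path assumes the source file sits two directories below the repository root, as `examples/vecadd` does.

```shell
# Hypothetical wrapper around the vecadd steps; not shipped with CuPBoP.

# Kernel bitcode name produced by -save-temps for NAME.cu and GPU arch $2.
# Assumption: the naming pattern matches the one seen in the vecadd example.
kernel_bc_name() {
  echo "$1-cuda-nvptx64-nvidia-cuda-$2.bc"
}

# Host bitcode name produced by -save-temps for NAME.cu on an x86-64 Linux host.
host_bc_name() {
  echo "$1-host-x86_64-unknown-linux-gnu.bc"
}

# Build NAME.cu from the current directory into an executable called NAME.
#   $1: source file (NAME.cu or NAME)
#   $2: GPU architecture passed to clang (defaults to sm_50, as in the example)
cupbop_build() {
  name=${1%.cu}
  arch=${2:-sm_50}
  # As in the original example, a failure of this clang++ command is ignored;
  # the later steps only use the bitcode files written by -save-temps.
  clang++ -std=c++11 "$name.cu" \
    -I../.. --cuda-path="$CUDA_PATH" \
    --cuda-gpu-arch="$arch" -L"$CUDA_PATH/lib64" \
    -lcudart_static -ldl -lrt -pthread -save-temps -v || true
  # Transform the kernel and host bitcode files.
  "$CuPBoP_PATH/build/compilation/kernelTranslator" \
    "$(kernel_bc_name "$name" "$arch")" kernel.bc || return 1
  "$CuPBoP_PATH/build/compilation/hostTranslator" \
    "$(host_bc_name "$name")" host.bc || return 1
  # Generate object files.
  llc --relocation-model=pic --filetype=obj kernel.bc || return 1
  llc --relocation-model=pic --filetype=obj host.bc || return 1
  # Link with the runtime libraries.
  g++ -o "$name" -fPIC -no-pie \
    -I"$CuPBoP_PATH/runtime/threadPool/include" \
    -L"$CuPBoP_PATH/build/runtime" \
    -L"$CuPBoP_PATH/build/runtime/threadPool" \
    host.o kernel.o \
    -I../.. -lc -lCPUruntime -lthreadPool -lpthread
}
```

With `CuPBoP_PATH` and `CUDA_PATH` set as in the installation steps, `cupbop_build vecadd.cu && ./vecadd` run from `examples/vecadd` is intended to reproduce the manual sequence. Unlike the manual steps, the function stops at the first failing translator, `llc`, or link command.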
Contributions of any kind are welcome. Please refer to Contribution.md for more details.
If you refer to CuPBoP in your projects, please cite the related papers:
- Unleashing CPU Potential for Executing GPU Programs through Compiler/Runtime Optimizations, MICRO 2024
- CuPBoP: Making CUDA a Portable Language, TODAES 2024
- COX: Exposing CUDA Warp-Level Functions to CPUs, TACO 2022
Contributors:
- Ruobing Han
- Jun Chen
- Bhanu Garg
- Xule Zhou
- John Lu
- Chihyo Ahn
- Haotian Sheng
- Blaise Tine
- Hyesoon Kim
- POCL is an open-source OpenCL implementation based on LLVM. We reuse some of its code (e.g., for applying optimizations and loading/storing LLVM IR).
- Hetero-Mark and Rodinia Benchmark are two benchmark suites for heterogeneous system computation. CuPBoP uses them as integration tests to verify correctness.
- moodycamel::ConcurrentQueue is a fast multi-producer, multi-consumer lock-free concurrent queue for C++11. CuPBoP uses it as the task queue for launching and executing kernels.