This repository contains CUDA kernel examples implemented using the CUTLASS and CuTe abstractions.
To download the CUTLASS-Examples repository, please run the following commands.
$ git clone --recursive https://github.com/leimao/CUTLASS-Examples
$ cd CUTLASS-Examples
# If you are updating the submodules of an existing checkout.
$ git submodule sync
$ git submodule update --init --recursive
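To confirm that the submodules were checked out as expected, their status can be inspected with the following command.
$ git submodule status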
Docker is used to build and run the CUTLASS CUDA kernels. The custom Docker container is based on the NVIDIA NGC CUDA 12.4.1 Docker container.
Please adjust the base Docker container CUDA version if the host computer has a different CUDA version; otherwise, unexpected compilation and runtime errors may occur.
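The CUDA version supported by the host driver can be checked with nvidia-smi before choosing the base image version.
$ nvidia-smi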
To build the custom Docker image, please run the following command.
$ docker build -f docker/cuda.Dockerfile --no-cache --tag cuda:12.4.1 .
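Once the build completes, the presence of the image can be verified with the following command.
$ docker image ls cuda:12.4.1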
To run the custom Docker container, please run the following command.
$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt -w /mnt cuda:12.4.1
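To make all GPUs visible inside the container instead of a single device, the standard Docker --gpus all flag can be used instead, for example:
$ docker run -it --rm --gpus all -v $(pwd):/mnt -w /mnt cuda:12.4.1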
To run the custom Docker container with NVIDIA Nsight Compute, please run the following command.
$ xhost +
$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt -w /mnt -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix --cap-add=SYS_ADMIN --security-opt seccomp=unconfined --network host cuda:12.4.1
$ xhost -
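Once inside the container, the Nsight Compute GUI can be launched with ncu-ui, assuming Nsight Compute is installed in the image. A kernel executable can also be profiled from the command line with ncu; the executable path below is only a placeholder.
$ ncu-ui
$ ncu --set full -o report <path-to-kernel-executable>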
To build the CUDA kernels, please run the following commands.
$ export NUM_CMAKE_JOBS=4
$ cmake -B build
$ cmake --build build --config Release --parallel ${NUM_CMAKE_JOBS}
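If the kernels need to be compiled for a specific GPU architecture, the standard CMake variable CMAKE_CUDA_ARCHITECTURES can typically be set at configuration time; this assumes the project's CMake configuration does not hard-code the target architectures. For example, targeting compute capability 8.0:
$ cmake -B build -DCMAKE_CUDA_ARCHITECTURES=80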
To run the unit tests, please run the following command.
$ ctest --test-dir build/ --tests-regex "Test.*" --verbose
To run the performance measurements, please run the following command.
$ ctest --test-dir build/ --tests-regex "Profile.*" --verbose
The performance measurements run the selected CUDA kernels on large problem sizes multiple times and can therefore take a long time to complete.
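To list the available test and profile targets without running them, the ctest --show-only option can be used.
$ ctest --test-dir build/ --show-only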