Kernel code for a particular block is executed in groups of threads known as wavefronts (AMD) or warps (NVIDIA). Each block is divided into as many warps as the block's size allows. If the block size is less than the warp size, part of the warp simply stays idle (as happens in this example). AMD GPUs use 64 threads per wavefront for architectures prior to RDNA™ 1, while RDNA architectures support both 32- and 64-thread wavefront sizes.
Warps are executed in lockstep, i.e. all the threads in each warp execute the same instruction at the same time, but on different data. This type of parallel processing is also known as Single Instruction, Multiple Data (SIMD). A block contains several warps, and while the warp size depends on the architecture, the block size does not. Blocks and warps also differ in the way they are executed, and thus they may produce different results when used in the same piece of code. For instance, the kernel code of this example would not work as-is with block-level execution and shared memory access, e.g. because some synchronization would be needed to ensure that every thread has written its corresponding value before any other thread tries to read it.
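As a hedged illustration of that last point (the kernel and buffer names below are made up for this sketch and are not part of the example): exchanging values through shared memory requires an explicit barrier so that all writes complete before any thread reads.

```cpp
// Illustrative sketch only: reversing the values held by a block of threads
// through shared memory requires __syncthreads() between the write and the
// read. A warp shuffle within a single warp avoids this barrier entirely.
__global__ void reverse_via_shared_memory(float* out, const float* in)
{
    __shared__ float buffer[64]; // assumes blockDim.x <= 64
    const unsigned int tid = threadIdx.x;

    buffer[tid] = in[tid];
    __syncthreads(); // wait until every thread has written its value

    out[tid] = buffer[blockDim.x - 1 - tid];
}
```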
Higher performance in the execution of kernels can be achieved with explicit warp-level programming. This can be done by using some collective operations, known as warp shuffles, that allow exchanging data between threads in the same warp without the need for shared memory. This exchange occurs simultaneously for all the active threads in the warp.
This example showcases how to use the above-mentioned operations by implementing a simple matrix transpose kernel.
- A number of constants are defined for the kernel launch parameters.
- The input and output matrices are allocated and initialized in host memory.
- The necessary amount of device memory for the input and output matrices is allocated and the input data is copied to the device.
- A trace message is printed to the standard output.
- The GPU kernel is then launched with the previously defined arguments.
- The transposed matrix is copied back to host memory.
- All device memory is freed.
- The expected transposed matrix is calculated with a CPU version of the transpose kernel and the transposed matrix obtained from the kernel execution is then compared with it. The result of the comparison is printed to the standard output.
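As an illustration of the kernel mentioned above, the following is a hedged sketch of what a shuffle-based transpose could look like, assuming the whole `width` × `width` matrix fits within a single warp; the kernel name and signature are illustrative and not necessarily those of the example's source.

```cpp
// Sketch: transpose a width x width matrix held by a single warp.
// Each thread loads one element in row-major order and then fetches the
// element of the transposed position directly from the lane that holds it.
__global__ void transpose_with_shuffle(float* out, const float* in, const unsigned int width)
{
    const unsigned int tid = threadIdx.x; // assumes width * width <= warp size
    const float value     = in[tid];      // element (tid / width, tid % width)

    const unsigned int row = tid / width;
    const unsigned int col = tid % width;

    // out(row, col) = in(col, row), which is held by lane col * width + row.
    out[tid] = __shfl(value, col * width + row);
}
```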
Warp shuffle is a warp-level primitive that allows communication between the threads of a warp. Below is a simple example that shows how the value of the thread with index 2 is copied to all other threads within the warp.
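A minimal kernel sketch of that broadcast (the kernel and buffer names are assumptions for illustration, not the example's code):

```cpp
// Sketch: every thread of the warp ends up with the value originally held by
// the thread with warp index 2.
__global__ void broadcast_from_lane_2(float* out, const float* in)
{
    const unsigned int tid = threadIdx.x;
    float value = in[tid];    // each thread loads its own element

    value = __shfl(value, 2); // read the value held by lane 2

    out[tid] = value;         // every element of out now equals in[2]
}
```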
`__shfl(var, src_lane, width = warp_size)` copies the value of `var` from the thread with index `src_lane` within the warp. This operation admits a third parameter (not used in this example), `width`, which defaults to the warp size and allows restricting the number of threads of the warp from which values are read. Values are copied from threads with an ID in the range `[0, width-1]`; if `src_lane` is out of that range, the thread accessed is the one with ID `src_lane` modulo `width`. The `src_lane` may also vary per thread, as shown below.
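For instance, a per-thread source lane can be used to rotate the values of a warp by one position. A hedged sketch (illustrative names, with the warp size assumed to be passed in by the host):

```cpp
// Sketch: each thread reads the value of its right-hand neighbour. The last
// lane asks for lane 'warp_size', which wraps around to lane 0 because
// out-of-range lane IDs are taken modulo the width.
__global__ void rotate_left(float* out, const float* in, const int warp_size)
{
    const unsigned int tid = threadIdx.x;
    float value = in[tid];

    value = __shfl(value, tid + 1, warp_size);

    out[tid] = value;
}
```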
- `hipGetDeviceProperties` gets the properties of the specified device. In this example, it is used to get the warp size of the device (GPU) used.
- `hipMalloc` allocates memory in the global memory of the device, and with `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others.
- `myKernelName<<<...>>>` queues the execution of a kernel on a device (GPU).
- `hipGetLastError` gets the last error returned by any HIP runtime API call.
- `hipFree` deallocates device memory allocated with `hipMalloc`.
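A hedged host-side sketch of how these calls fit together (buffer sizes, names, and the trivial kernel are illustrative, not the example's actual code):

```cpp
#include <hip/hip_runtime.h>

#include <iostream>
#include <vector>

// Trivial illustrative kernel: broadcast the value held by lane 2.
__global__ void broadcast_from_lane_2(float* out, const float* in)
{
    out[threadIdx.x] = __shfl(in[threadIdx.x], 2);
}

int main()
{
    // Query the warp size of device 0.
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);
    const unsigned int size = props.warpSize;

    // Host-side input and output.
    std::vector<float> h_in(size), h_out(size);
    for(unsigned int i = 0; i < size; ++i)
    {
        h_in[i] = static_cast<float>(i);
    }

    // Allocate device memory and copy the input data to the device.
    float* d_in{};
    float* d_out{};
    hipMalloc(&d_in, size * sizeof(float));
    hipMalloc(&d_out, size * sizeof(float));
    hipMemcpy(d_in, h_in.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // Queue the kernel on the default stream: one block of one warp.
    broadcast_from_lane_2<<<1, size, 0, hipStreamDefault>>>(d_out, d_in);
    if(hipGetLastError() != hipSuccess)
    {
        std::cerr << "Kernel launch failed\n";
    }

    // Copy the result back to the host and free the device memory.
    hipMemcpy(h_out.data(), d_out, size * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d_in);
    hipFree(d_out);

    std::cout << "First output element: " << h_out[0] << '\n';
}
```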
- `__global__`
- `threadIdx`
- `__shfl`
- `hipFree`
- `hipGetDeviceProperties`
- `hipGetLastError`
- `hipMalloc`
- `hipMemcpy`
- `hipMemcpyDeviceToHost`
- `hipMemcpyHostToDevice`
- `hipStreamDefault`