This repository contains CUDA implementations of the Bitonic Sort algorithm optimized for GPU architectures. The project explores six versions of Bitonic Sort, each progressively improving performance by addressing specific GPU programming challenges:
- V0: A naive baseline in which each thread performs a single compare-exchange in global memory, with one kernel launch per sorting step (sketched below this list).
- V1: Introduces kernel fusion by looping multiple comparison steps within a single kernel, reducing kernel launch overhead and improving efficiency.
- V2: Utilizes shared memory for intra-block sorting, significantly reducing memory access latency compared to global memory.
- V3: Leverages warp-level shuffle instructions for intra-warp comparisons, minimizing shared memory traffic and synchronization barriers (sketched below this list).
- V4: Cuts the number of active threads in half to eliminate in-warp branching and idle threads, improving thread utilization and execution efficiency.
- V5: Applies an indexing technique to reorder shared memory accesses, effectively eliminating bank conflicts and maximizing shared memory bandwidth.
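For reference, here is a minimal sketch of the V0 baseline, assuming a power-of-two array size `n`; the names (`bitonicStepKernel`, `k`, `j`) are illustrative, not necessarily this repository's API:

```cuda
#include <cuda_runtime.h>

// One compare-exchange step of bitonic sort in global memory.
// k = current bitonic block size, j = comparison distance; the host
// launches this kernel once per (k, j) pair, as in the V0 baseline.
__global__ void bitonicStepKernel(int *data, int n, int k, int j) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;                      // XOR yields the partner index
    if (i >= (unsigned)n || partner <= i) return;  // handle each pair once
    bool ascending = ((i & k) == 0);               // direction alternates per bitonic block
    if ((data[i] > data[partner]) == ascending) {
        int tmp = data[i]; data[i] = data[partner]; data[partner] = tmp;
    }
}

// Host-side driver: one kernel launch per sorting step.
void bitonicSortV0(int *d_data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int k = 2; k <= n; k <<= 1)             // grow bitonic blocks
        for (int j = k >> 1; j > 0; j >>= 1)     // shrink comparison distance
            bitonicStepKernel<<<blocks, threads>>>(d_data, n, k, j);
    cudaDeviceSynchronize();
}
```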
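The warp-shuffle idea behind V3 can be illustrated as follows; this is a sketch of the general technique, not necessarily the repository's exact code:

```cuda
// Compare-exchange between lanes whose IDs differ in the bits of `mask`,
// using a register-to-register shuffle instead of shared memory. No
// __syncthreads() is needed for exchanges within a single warp.
__device__ int warpCompareExchange(int val, int mask, bool ascending) {
    int partner = __shfl_xor_sync(0xffffffff, val, mask); // read lane ^ mask
    bool keepSmaller = (((threadIdx.x & mask) == 0) == ascending);
    return keepSmaller ? min(val, partner) : max(val, partner);
}
```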
This repository is designed for testing, benchmarking, and profiling the Bitonic Sort CUDA implementations on the Aristotelis HPC Cluster, enabling efficient evaluation of performance across multiple GPU architectures.
- Project Structure
- Version Summary
- Performance Comparison
- Requirements
- Local Setup
- Setup on Aristotelis HPC
- Acknowledgments
## Project Structure

- `benchmarks/` - Benchmark scripts and performance results (HPC)
- `debug/` - Scripts for `memcheck`, `racecheck`, `initcheck` (HPC)
- `docs/` - Project report
- `include/` - Header files
- `profile/` - Nsight Systems profiling scripts (HPC)
- `src/` - CUDA source files
- `tests/` - Unit test scripts (HPC)
- `.vscode/` - VSCode settings
## Version Summary

| Version | Optimization |
|---|---|
| V0 | Naive implementation |
| V1 | Fused steps into a single kernel |
| V2 | Used shared memory for intra-block sort |
| V3 | Warp shuffles for intra-warp sorting |
| V4 | Half the threads, no warp branching |
| V5 | Avoided shared memory bank conflicts |
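To illustrate the bank-conflict problem that V5 targets: once each thread handles an element pair (as in V4), accesses such as `s[2*t]` hit each even shared-memory bank twice per warp. A common remedy, sketched below under the assumption that the repository uses a comparable skewed-index scheme, is to pad the index by one word every 32 entries:

```cuda
#include <cuda_runtime.h>

// Illustrative padding macro: one extra word per 32 entries shifts
// conflicting accesses onto distinct banks.
#define SKEW(i) ((i) + ((i) >> 5))

// Minimal sketch (not the repo's exact kernel): thread t loads the pair
// (2t, 2t+1) through skewed indices so a warp's 32 loads touch 32 banks.
__global__ void pairedCompareExchange(int *data, bool ascending) {
    __shared__ int s[1024 + (1024 >> 5)];       // 1024 elements plus padding
    unsigned t = threadIdx.x;                   // assumes blockDim.x == 512
    unsigned base = blockIdx.x * 1024;
    s[SKEW(2 * t)]     = data[base + 2 * t];
    s[SKEW(2 * t + 1)] = data[base + 2 * t + 1];
    __syncthreads();
    int a = s[SKEW(2 * t)];                     // conflict-free thanks to SKEW
    int b = s[SKEW(2 * t + 1)];
    if ((a > b) == ascending) { int tmp = a; a = b; b = tmp; }
    data[base + 2 * t]     = a;
    data[base + 2 * t + 1] = b;
}
```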
## Performance Comparison

Sorting time for 2^29 32-bit integers.

### GPU partition
| Version | Time (ms) | Speedup vs. previous | Speedup vs. serial |
|---|---|---|---|
| SERIAL | 110711.37 | -- | 1.00 |
| V0 | 4475.87 | 24.74 | 24.74 |
| V1 | 3967.72 | 1.13 | 27.90 |
| V2 | 3220.96 | 1.23 | 34.37 |
| V3 | 3028.74 | 1.06 | 36.55 |
| V4 | 2627.18 | 1.15 | 42.14 |
| V5 | 2596.56 | 1.01 | 42.63 |
### Ampere partition
| Version | Time (ms) | Speedup vs. previous | Speedup vs. serial |
|---|---|---|---|
| SERIAL | 93705.61 | -- | 1.00 |
| V0 | 2127.30 | 44.05 | 44.05 |
| V1 | 1769.15 | 1.20 | 52.97 |
| V2 | 1672.39 | 1.06 | 56.03 |
| V3 | 1644.31 | 1.02 | 56.99 |
| V4 | 1562.46 | 1.05 | 59.97 |
| V5 | 1543.99 | 1.01 | 60.69 |
*(Figures: execution time vs. array size for the GPU and Ampere partitions.)*
## Requirements

To build and run locally you will need:
- CUDA-capable NVIDIA GPU (Compute Capability >= 6.0 recommended)
- At least 8 GB GPU memory (for large benchmarks)
- Sufficient RAM for allocating large host arrays (up to ~12 GB)
- CUDA Toolkit (nvcc, libcudart)
- C++ compiler (g++ >= 5.0)
- Make
- Linux or WSL2 (tested on Ubuntu 22.04)
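A quick way to verify the GPU and toolkit requirements above (standard NVIDIA tools; the `compute_cap` query field needs a reasonably recent driver):

```bash
# Report GPU name, compute capability, and memory, then the toolkit version
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
nvcc --version
```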
## Local Setup

Clone the repository:

```bash
git clone https://github.com/georrous6/bitonic-sort-cuda.git
cd bitonic-sort-cuda
```

Set the `CUDA_HOME` environment variable:

```bash
export CUDA_HOME=/path/to/cuda
```

Build the project (the default build type is `release`):

```bash
make BUILD_TYPE=release
```

Valid build types:

- `release`: optimized binary
- `debug`: debugging symbols and checks enabled
After building, run the tests:

```bash
cd build/debug
./tests
```

Run the benchmarks:

```bash
cd build/release
./benchmarks <q> [options]
```

where 2^q is the array size.

Options:

- `--version serial|v0|v1|v2|v3|v4|v5` (default: `serial`)
- `--desc`: sort in descending order
- `--timing-file <file>`: save timing to file
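For example, to sort 2^29 integers in descending order with V5 and save the timing (the output file name here is just illustrative):

```bash
./benchmarks 29 --version v5 --desc --timing-file timings.txt
```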
## Setup on Aristotelis HPC

Clone the repository on the HPC:

```bash
git clone https://github.com/georrous6/bitonic-sort-cuda.git
cd bitonic-sort-cuda
```

Or transfer it from your local machine:

```bash
scp -r /path/to/bitonic-sort-cuda [user]@aristotle.it.auth.gr:/destination/path/
```

Replace `[user]` with your institutional username.

Submit the tests:

```bash
cd tests
chmod +x submit_tests.sh
./submit_tests.sh gpu  # or ampere
```

Submit the benchmarks:

```bash
cd benchmarks
chmod +x submit_benchmarks.sh
./submit_benchmarks.sh ampere  # or gpu
```

## Acknowledgments

This work was carried out using the computational resources of the Aristotelis HPC Cluster at Aristotle University of Thessaloniki.