CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

  • Yan Wu
  • Tested on: Windows 10 Education, i7-8750H @ 2.2GHz 16GB, GTX 1060 6GB (Personal Laptop)

Project Description

This project implements a few different versions of the Scan (Prefix Sum) algorithm:

  1. A CPU version of the algorithm.
  2. Two GPU implementations: "naive" and "work-efficient."
  3. GPU stream compaction built on some of these scans (see the sketches following the list below).
  • Algorithm Slides:
  • Algorithm Demonstration (CUDA sketches of each stage follow this list):
    • Naive Inclusive Scan:

    • Work-Efficient Scan (up-sweep):

    • Work-Efficient Scan (down-sweep):

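As a concrete reference for the demonstrations above, here is a minimal CUDA sketch of one pass of the naive scan. The kernel name, the ping-pong buffers `idata`/`odata`, and the host-side loop are illustrative assumptions, not the repository's actual code:

```cuda
// One pass of the naive inclusive scan (illustrative sketch).
// The host calls this ceil(log2(n)) times with offset = 2^(d-1),
// swapping idata/odata between passes.
__global__ void kernNaiveScanPass(int n, int offset,
                                  const int* idata, int* odata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i >= offset) {
        odata[i] = idata[i - offset] + idata[i];  // add the partner 'offset' away
    } else {
        odata[i] = idata[i];                      // this prefix is already final
    }
}
```

After the final pass each element holds the sum of everything up to and including itself (an inclusive scan); shifting the result right by one and inserting a leading 0 yields the exclusive scan that compaction uses.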
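The work-efficient scan starts with an up-sweep, a parallel reduction performed in place. This sketch follows the slide-style formulation, assuming the array has been padded to a power of two; `kernUpSweep` and `stride` (equal to 2^d at depth d) are assumed names:

```cuda
// Up-sweep pass at depth d (stride = 2^d), slide-style formulation:
// n threads are launched, but only those whose index is a multiple of
// 2 * stride fold their left partial sum into the right one.
// Assumes n is padded to a power of two.
__global__ void kernUpSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k % (stride * 2) == 0) {
        data[k + stride * 2 - 1] += data[k + stride - 1];
    }
}
```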
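After the up-sweep, the host zeroes the last element and runs the down-sweep from the root back to the leaves, which turns the partial sums into an exclusive scan. Again a sketch under the same assumptions:

```cuda
// Down-sweep pass at depth d (stride = 2^d), run for d = log2(n)-1 .. 0.
// Before the first pass the host sets data[n - 1] = 0.
__global__ void kernDownSweep(int n, int stride, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k % (stride * 2) == 0) {
        int t = data[k + stride - 1];                     // save left child
        data[k + stride - 1] = data[k + stride * 2 - 1];  // left child = parent
        data[k + stride * 2 - 1] += t;                    // right = parent + old left
    }
}
```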
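Stream compaction then needs only two small kernels around the exclusive scan: a map to booleans and a scatter. The kernel names here are assumptions; the predicate keeps nonzero elements, matching the project's compaction tests:

```cuda
// Map each input element to 1 if it should be kept (nonzero), else 0.
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) bools[i] = (idata[i] != 0) ? 1 : 0;
}

// After an exclusive scan of bools into indices, every surviving element
// already knows its output slot: scatter it there.
__global__ void kernScatter(int n, int* odata, const int* idata,
                            const int* bools, const int* indices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && bools[i]) odata[indices[i]] = idata[i];
}
```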
Result and Performance Analysis

  • Sample test output:

    • Scan Test:

    • Stream Compaction Test:
  • Result Sheets:

    • Scan Test Result:

    • Compaction Test Result:
  • Analysis:

    • Scan Test Charts:

    • Compaction Test Charts:

      Note that the two red lines labeled "cpu compact with scan" in the two charts above are the same line.
    • Observations:
      First, the GPU does not perform significantly better when the test array is short. But as the array grows very large, the CPU becomes the bottleneck and its execution time climbs steeply; at that point the GPU shows a much greater performance advantage.
      Taking a closer look, my work-efficient implementation is slower than the naive method on relatively short arrays, but it trends toward better performance as the array size grows. The Thrust implementation is steady: it may seem slow at first, but its run time stays nearly constant as the array grows, making Thrust scan clearly the best choice for very large arrays.
  • Q & A:

    • Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
      My block size was 128.
    • Why is the work-efficient method slower than the naive one?
      The slide-style implementation does reduce the amount of computation, since only half of the remaining elements need to be processed each round. However, the active elements stay scattered across the full thread range, so most threads in every launched warp do no work at all. Remapping thread indices as in a standard parallel reduction, so that active threads are packed into fewer warps, would save a lot of time here; a sketch of that fix follows this list.
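A minimal sketch of that fix, using the slide-style up-sweep as the example: launch only n / (2 * stride) threads and remap each dense thread id to the pair it owns, so the active threads fill consecutive warps instead of leaving most of every warp idle. The kernel name is an assumption:

```cuda
// Compacted up-sweep: the host launches only n / (2 * stride) threads.
// Each dense thread id is remapped to the sparse array index it owns,
// so active threads occupy consecutive warps rather than 1-in-2^(d+1)
// lanes of every warp.
__global__ void kernUpSweepCompacted(int n, int stride, int* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int k = tid * stride * 2;              // dense thread id -> sparse index
    if (k + stride * 2 - 1 < n) {
        data[k + stride * 2 - 1] += data[k + stride - 1];
    }
}
```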
