Project 2: Henry Zhu #3

Open — wants to merge 3 commits into master
103 changes: 94 additions & 9 deletions README.md
@@ -1,14 +1,99 @@
CUDA Stream Compaction
======================
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 2 - Stream-Compaction**

* Henry Zhu
* [Github](https://github.com/Maknee), [LinkedIn](https://www.linkedin.com/in/henry-zhu-347233121/), [personal website](https://maknee.github.io/), [twitter](https://twitter.com/maknees1), etc.
* Tested on: Windows 10 Home, Intel i7-4710HQ @ 2.50GHz 22GB, GTX 870M (Own computer)

## Scanning and stream-compaction

![](scan_all_things.png)

### What are scanning and stream compaction?

Scanning (prefix sum) computes, for each index, the sum of all values at the preceding indices. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how scan works:

![](scan.png)

Stream compaction removes all elements with a particular value (here, zeros) from an array, packing the remaining elements together. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how stream compaction works:

![](stream-compaction.png)

## Answers to Questions

### Scan performance by array size

![](scan_speeds.png)

### To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy), take a look at the Nsight timeline for its execution. Your analysis here doesn't have to be detailed, since you aren't even looking at the code for the implementation.

The Thrust implementation is much better optimized for large inputs; it does not, however, perform well on small arrays.
This is most likely due to its implementation: it chooses a launch configuration suited to a larger number of elements, which is why its performance is much better at large sizes. The timeline also hints at one-time setup costs such as temporary-memory allocation: the first Thrust call (power-of-two, 4.93 ms) is far slower than the second (non-power-of-two, 0.25 ms) on nearly the same input size.
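For reference, the timed Thrust path reduces to a single library call (a sketch, assuming `d_in`/`d_out` are device pointers already allocated with `cudaMalloc`; Thrust's internal scratch allocations are part of what shows up in the Nsight timeline):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Exclusive scan of n ints already resident on the device. Thrust picks its
// own block/grid configuration and may allocate temporary storage internally.
void thrustScan(int n, int* d_out, const int* d_in) {
    thrust::exclusive_scan(thrust::device, d_in, d_in + n, d_out);
}
```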

### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?

I did find performance bottlenecks, and they differ by implementation. For the work-efficient scan, the bottleneck appears to be memory I/O, since data must be moved to the GPU and back for each iteration over d. For the CPU implementation, as the graph above shows, the bottleneck is computation, since it must iterate over every single element. The work-efficient implementation beats the naive implementation at small sizes, but at larger sizes the naive implementation wins, as the memory-I/O overhead grows to dominate the computation cost.

## Output of program

```
****************
** SCAN TESTS **
****************
[ 24 39 5 4 7 25 26 45 30 32 42 22 35 ... 19 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.865021ms (std::chrono Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.826429ms (std::chrono Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801241 801278 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.67424ms (CUDA Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.50448ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.03907ms (CUDA Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.866112ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 4.9281ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.254016ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 0 1 1 2 1 0 2 0 2 3 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.12029ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.11988ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1.28747ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.34822ms (CUDA Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.15571ms (CUDA Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ]
passed
```


Binary file added scan.png
Binary file added scan_all_things.png
Binary file added scan_speeds.png
12 changes: 6 additions & 6 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 13; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -44,14 +44,14 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -71,7 +71,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
@@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
Binary file added stream-compaction.png
20 changes: 17 additions & 3 deletions stream_compaction/common.cu
@@ -23,16 +23,30 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
const int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) // index == n would read past the end of the array
return;

bools[index] = idata[index] ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
const int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) // index == n would read past the end of the array
return;

//filter only elements that are not zero in the bool map.
if (bools[index])
{
//get the index where the element is supposed to go in the final array
const int index_of_filtered_element = indices[index];
odata[index_of_filtered_element] = idata[index];
}
}

}
192 changes: 146 additions & 46 deletions stream_compaction/cpu.cu
@@ -1,50 +1,150 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include <cmath> // std::ceil, std::log2
#include <cstring> // memcpy, memset
#include <memory>
#include <iostream>

namespace StreamCompaction
{
namespace CPU
{
using StreamCompaction::Common::PerformanceTimer;

PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

//actual implementation of scan
//because timer().startCpuTimer() is called inside
//scan(...) from scatter(...), causing an abort
void scan_impl(int n, int* odata, const int* idata)
{
/// super naive cpu implementation ///
// memset(odata, 0, n * sizeof(int));
//
// for(int k = 1; k < n; k++)
// {
// odata[k] = odata[k - 1] + idata[k - 1];
// }

/// pseudo-parallel implementation ///

//make sure the data is set first before beginning
memcpy(odata, idata, sizeof(int) * n);

for (int d = 1; d <= static_cast<int>(std::ceil(std::log2(n))); d++)
{
//make a copy, because the naive scan can't be done in place
auto temp = std::make_unique<int[]>(n);
memcpy(temp.get(), odata, n * sizeof(int));
const int offset = 1 << (d - 1); // 2^(d-1) without floating-point pow
for (int k = 0; k < n; k++)
{
//follow the formula: add in the element 2^(d-1) positions back
if (k >= offset)
{
odata[k] = temp[k - offset] + temp[k];
}
}
}

//copy the data back
auto temp = std::make_unique<int[]>(n);
memcpy(temp.get(), odata, n * sizeof(int));

//shift right by 1
for (int i = 1; i < n; i++)
{
odata[i] = temp[i - 1];
}
//set first element to 0
odata[0] = 0;
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

scan_impl(n, odata, idata);

timer().endCpuTimer();
}

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

memset(odata, 0, n * sizeof(int));

int index = 0;

//iterate through and count
for (int i = 0; i < n; i++)
{
if (idata[i])
{
odata[index] = idata[i];
index++;
}
}

timer().endCpuTimer();
return index;
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

auto counters = std::make_unique<int[]>(n);

int count = 0;

//iterate through and count
for (int i = 0; i < n; i++)
{
counters[i] = idata[i] ? 1 : 0;
if (counters[i])
{
count++;
}
}

auto indices = std::make_unique<int[]>(n);

//scan the boolean map to get each kept element's output index
//(scan_impl copies its input internally, so no memcpy is needed first)
scan_impl(n, indices.get(), counters.get());

//now scatter each kept element to its scanned index
for (int i = 0; i < n; i++)
{
if (counters[i])
{
odata[indices[i]] = idata[i];
}
}

timer().endCpuTimer();
return count;
}
}
}