Project 2: Salaar Kohari #13

Open · wants to merge 8 commits into base: master
Binary file added Analysis.xlsx
Binary file not shown.
32 changes: 26 additions & 6 deletions README.md
@@ -3,12 +3,32 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Salaar Kohari
* [LinkedIn](https://www.linkedin.com/in/salaarkohari), [personal website](http://salaar.kohari.com)
* Tested on: Windows 10, Intel Xeon @ 3.7GHz 32GB, GTX 1070 8GB (SIG Lab)

### (TODO: Your README)
### Description
Scan and stream compaction are two commonly parallelized GPU problems that come in handy as library building blocks for more sophisticated parallel algorithms. The goal of this project, in addition to implementing CPU, naive, and work-efficient scan and stream compaction, is to compare the performance of the different methods across various array sizes and analyze the results.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Scan turns a normally sequential operation (such as computing the running sums of an array) into a parallel one. The naive implementation sums adjacent elements, then adjacent partial sums at doubling offsets, and so on until the tree collapses, as sketched below.
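
For intuition, here is a minimal host-side simulation of the naive (Hillis-Steele) passes; `naiveScanInclusive` is an illustrative name and the inclusive-scan framing is an assumption, while the real CUDA kernel lives in `stream_compaction/naive.cu` further down.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Host-side simulation of the naive scan: on the pass with stride
// `offset`, every element at index >= offset adds in the element
// `offset` positions behind it; strides double until they cover n.
std::vector<int> naiveScanInclusive(std::vector<int> in) {
    std::vector<int> out(in.size());
    for (int offset = 1; offset < (int)in.size(); offset *= 2) {
        for (int i = 0; i < (int)in.size(); ++i) {
            out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
        }
        std::swap(in, out);  // ping-pong buffers, like the GPU version
    }
    return in;
}

int main() {
    std::vector<int> v = naiveScanInclusive({1, 2, 3, 4});
    for (int x : v) printf("%d ", x);  // prints: 1 3 6 10
}
```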

The work-efficient implementation performs an up-sweep and a down-sweep over a balanced binary tree, minimizing the number of additions and threads needed. It can be optimized further by shrinking the launch size at each iteration to avoid idle threads; see the sketch below.
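
A minimal host-side sketch of the two sweeps, assuming a power-of-two array length (the GPU version must pad to one); `blellochScan` is an illustrative name, not code from this PR.

```cpp
// Work-efficient (Blelloch) exclusive scan over a power-of-two length n.
void blellochScan(int *data, int n) {
    // Up-sweep: each level sums pairs of partial sums up the tree,
    // leaving the total in data[n - 1].
    for (int offset = 1; offset < n; offset *= 2) {
        for (int i = 0; i < n; i += 2 * offset) {
            data[i + 2 * offset - 1] += data[i + offset - 1];
        }
    }
    // Down-sweep: zero the root, then at each level swap the left
    // child into the right while accumulating the prefix.
    data[n - 1] = 0;
    for (int offset = n / 2; offset >= 1; offset /= 2) {
        for (int i = 0; i < n; i += 2 * offset) {
            int t = data[i + offset - 1];
            data[i + offset - 1] = data[i + 2 * offset - 1];
            data[i + 2 * offset - 1] += t;
        }
    }
}
```

On {1, 2, 3, 4} this produces {0, 1, 3, 6}: each output element is the sum of everything before it.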

Stream compaction maps an array to 0s and 1s depending on whether each value meets a condition (e.g. is non-zero). An exclusive scan of that boolean array yields the output index of every kept element, and a scatter step builds the new array with the zero elements removed, as traced below.
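
A worked trace on a hypothetical five-element input:

```
input:    [3, 0, 5, 0, 2]
bools:    [1, 0, 1, 0, 1]    map: 1 where the element is non-zero
scan:     [0, 1, 1, 2, 2]    exclusive prefix sum = output indices
scatter:  out[0]=3, out[1]=5, out[2]=2  ->  [3, 5, 2]
```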

### Analysis

![Scan, (2^x)-3 Array Size](img/scan.png)
Scan, (2^x)-3 Array Size

![Scan, 2^x Array Size](img/scan2.png)
Scan, 2^x Array Size

![Stream Compact](img/compact.png)
Stream Compact

In the graphs above, scan and compact are compared across various array sizes and algorithm implementations. All numbers are averages of two runs, since the random input arrays cause variance in runtime. The graphs are split into power-of-two array sizes and power-of-two minus 3, since the behavior of certain algorithms differs between these conditions. One thing to note: the power-of-two thrust run took 4-5 seconds, so it was excluded from the results. This may have been because thrust is slow the first time it is called; a warm-up sketch follows.
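
If first-call initialization is the culprit (an assumption, not something verified in this PR), a minimal warm-up sketch would run one untimed scan before measuring:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Pay the one-time CUDA/thrust initialization cost outside the timer.
void warmUpThrust() {
    thrust::device_vector<int> warm(1, 0);
    thrust::exclusive_scan(warm.begin(), warm.end(), warm.begin());
}
// Call warmUpThrust() once, then time the real thrust::exclusive_scan.
```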

The CPU implementations perform better at small array sizes but become much slower at large ones. The CPU enjoys a faster clock and faster memory access on small arrays, but its runtime grows linearly with array size, whereas the parallel approach's step count grows only logarithmically. The CPU also requires no memory transfer to the GPU, which can be an additional bottleneck at small array sizes.

In my implementation, the naive approach performs better than the work-efficient one. This is likely because I did not shrink the launch size at each iteration, so many threads are wasted on every pass. There is also a modulus operation in the kernel to check whether a thread is active, which is computationally expensive on the GPU. With those optimizations, the work-efficient implementation would likely come out ahead.
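
A sketch of the thread-compaction idea, assuming the same data layout as `kernUpScan` in `stream_compaction/efficient.cu` below; `kernUpScanCompactThreads` and `numActive` are hypothetical names and this variant is not part of the PR:

```cuda
// Launch exactly one thread per active tree node and reconstruct the
// array index, eliminating the `index % offset2` test entirely.
__global__ void kernUpScanCompactThreads(int numActive, int *data,
                                         const int offset, const int offset2) {
    int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (tid >= numActive)
        return;
    int index = tid * offset2;  // each active thread owns one node
    data[index + offset2 - 1] += data[index + offset - 1];
}
```

The host loop would launch `size / offset2` threads at depth d, so the grid shrinks geometrically as the sweep climbs the tree.
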
Binary file added img/compact.png
Binary file added img/scan.png
Binary file added img/scan2.png
2 changes: 1 addition & 1 deletion src/main.cpp
@@ -58,7 +58,7 @@ int main(int argc, char* argv[]) {
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
printArray(SIZE, c, true);*/

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_50
)
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

bools[index] = idata[index] != 0 ? 1 : 0;  // non-zero (not just positive) values are kept, matching the CPU path
}

/**
@@ -32,7 +36,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
104 changes: 63 additions & 41 deletions stream_compaction/cpu.cu
@@ -1,50 +1,72 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
}
/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scanNoTimer(int n, int *odata, const int *idata) {
odata[0] = 0;
for (int i = 1; i < n; ++i) {
odata[i] = idata[i - 1] + odata[i - 1];
}
}

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
}
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
scanNoTimer(n, odata, idata);
timer().endCpuTimer();
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
}
}
/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
int count = 0;
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
odata[count++] = idata[i];
}
}
timer().endCpuTimer();
return count;
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    int* mapped = new int[n];
    for (int i = 0; i < n; ++i) {
        mapped[i] = idata[i] != 0 ? 1 : 0;
    }
    int* scanned = new int[n];
    scanNoTimer(n, scanned, mapped);
    for (int i = 0; i < n; ++i) {
        if (mapped[i] == 1) {
            odata[scanned[i]] = idata[i];
        }
    }
    // The scan is exclusive, so add the last flag to get the true count.
    int count = scanned[n - 1] + mapped[n - 1];
    timer().endCpuTimer();
    delete[] mapped;
    delete[] scanned;
    return count;
}
}
}
92 changes: 82 additions & 10 deletions stream_compaction/efficient.cu
@@ -5,20 +5,68 @@

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

// Kernels for efficient prefix scan
__global__ void kernUpScan(int n, int *data, const int offset, const int offset2) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index % offset2 != 0)
return;

data[index + offset2 - 1] += data[index + offset - 1];
}

__global__ void kernDownScan(int n, int *data, const int offset, const int offset2) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index % offset2 != 0)
return;

int temp = data[index + offset - 1];
data[index + offset - 1] = data[index + offset2 - 1];
data[index + offset2 - 1] += temp;
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scanNoTimer(int n, int *odata, const int *idata) {
    // Pad to the next power of two so the tree sweeps never index
    // past the end of the buffer for non-power-of-two n.
    int size = 1 << ilog2ceil(n);
    int *dev_data;
    cudaMalloc((void**)&dev_data, size * sizeof(int));
    cudaMemset(dev_data, 0, size * sizeof(int));
    // cudaMemcpyDefault infers the direction under unified virtual
    // addressing, so this helper also works when compact() passes
    // device pointers.
    cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyDefault);

    const int blockSize = 256;
    dim3 fullBlocksPerGrid((size + blockSize - 1) / blockSize);

    for (int d = 0; d < ilog2ceil(n); ++d) {
        int offset = 1 << d;
        kernUpScan << <fullBlocksPerGrid, blockSize >> > (size, dev_data, offset, offset * 2);
    }

    // Zero the root (all sizeof(int) bytes, not just one) before the down-sweep.
    cudaMemset(dev_data + size - 1, 0, sizeof(int));
    for (int d = ilog2ceil(n) - 1; d >= 0; --d) {
        int offset = 1 << d;
        kernDownScan << <fullBlocksPerGrid, blockSize >> > (size, dev_data, offset, offset * 2);
    }

    cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDefault);
    cudaFree(dev_data);
}

void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
timer().endGpuTimer();
timer().startGpuTimer();
scanNoTimer(n, odata, idata);
timer().endGpuTimer();
}

/**
@@ -31,10 +79,34 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
int *dev_idata, *dev_odata, *dev_mapped, *dev_scanned;
cudaMalloc((void**)&dev_idata, n * sizeof(int));
cudaMalloc((void**)&dev_odata, n * sizeof(int));
cudaMalloc((void**)&dev_mapped, n * sizeof(int));
cudaMalloc((void**)&dev_scanned, n * sizeof(int));
cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

const int blockSize = 256;
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

timer().startGpuTimer();
// TODO
Common::kernMapToBoolean << <fullBlocksPerGrid, blockSize >> > (n, dev_mapped, dev_idata);
scanNoTimer(n, dev_scanned, dev_mapped);
Common::kernScatter << <fullBlocksPerGrid, blockSize >> > (n, dev_odata, dev_idata, dev_mapped, dev_scanned);
timer().endGpuTimer();
return -1;

int count, lastbool;
cudaMemcpy(&count, dev_scanned + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&lastbool, dev_mapped + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
count += lastbool;

cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
cudaFree(dev_odata);
cudaFree(dev_mapped);
cudaFree(dev_scanned);

return count;
}
}
}
47 changes: 39 additions & 8 deletions stream_compaction/naive.cu
@@ -5,21 +5,52 @@

namespace StreamCompaction {
namespace Naive {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

// TODO: __global__
// Kernel for naive prefix scan
__global__ void kernNaiveScan(int n, int *odata, const int *idata, const int offset) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index >= offset) {
odata[index] = idata[index - offset] + idata[index];
}
else {
odata[index] = idata[index];
}
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
void scan(int n, int *odata, const int *idata) {
int *dev_idata, *dev_odata;
cudaMalloc((void**)&dev_idata, n * sizeof(int));
cudaMalloc((void**)&dev_odata, n * sizeof(int));
cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

timer().startGpuTimer();

const int blockSize = 512;
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

for (int d = 1; d <= ilog2ceil(n); ++d) {
    // Stride doubles every pass: 1, 2, 4, ...
    // (1 << (d - 1) avoids the float round-trip of pow()).
    kernNaiveScan << <fullBlocksPerGrid, blockSize >> > (n, dev_odata, dev_idata, 1 << (d - 1));
    std::swap(dev_odata, dev_idata);  // ping-pong buffers between passes
}

timer().endGpuTimer();

odata[0] = 0;  // exclusive scan: the identity goes in the first slot
cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
cudaFree(dev_odata);
}
}
}