Project 2: Salaar Kohari #13

Open · wants to merge 8 commits into base: master
Binary file added Analysis.xlsx
Binary file not shown.
32 changes: 26 additions & 6 deletions README.md
@@ -3,12 +3,32 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Salaar Kohari
* [LinkedIn](https://www.linkedin.com/in/salaarkohari), [personal website](http://salaar.kohari.com)
* Tested on: Windows 10, Intel Xeon @ 3.7GHz 32GB, GTX 1070 8GB (SIG Lab)

### (TODO: Your README)
### Description
Scan and stream compaction are two commonly parallelized GPU problems that come in handy as library building blocks for more sophisticated parallel algorithms. The goal of this project, in addition to implementing CPU, naive, and work-efficient scan and stream compaction, is to compare the performance of the different methods across various array sizes and analyze the results.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Scan turns a normally sequential operation (such as computing the running sums of an array) into a parallel one. The naive implementation sums adjacent elements, then adjacent partial sums at doubling offsets, and so on until the tree collapses, as sketched below.
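
For intuition, here is a minimal host-side simulation of the naive (Hillis-Steele) passes; `naiveScanInclusive` is an illustrative name and the inclusive-scan framing is an assumption, while the real CUDA kernel lives in `stream_compaction/naive.cu` further down.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Host-side simulation of the naive scan: on the pass with stride
// `offset`, every element at index >= offset adds in the element
// `offset` positions behind it; strides double until they cover n.
std::vector<int> naiveScanInclusive(std::vector<int> in) {
    std::vector<int> out(in.size());
    for (int offset = 1; offset < (int)in.size(); offset *= 2) {
        for (int i = 0; i < (int)in.size(); ++i) {
            out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
        }
        std::swap(in, out);  // ping-pong buffers, like the GPU version
    }
    return in;
}

int main() {
    std::vector<int> v = naiveScanInclusive({1, 2, 3, 4});
    for (int x : v) printf("%d ", x);  // prints: 1 3 6 10
}
```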

The work-efficient implementation performs an up-sweep and a down-sweep over a balanced binary tree, minimizing the number of additions and threads needed. It can be optimized further by shrinking the launch size at each iteration to avoid idle threads; see the sketch below.
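
A minimal host-side sketch of the two sweeps, assuming a power-of-two array length (the GPU version must pad to one); `blellochScan` is an illustrative name, not code from this PR.

```cpp
// Work-efficient (Blelloch) exclusive scan over a power-of-two length n.
void blellochScan(int *data, int n) {
    // Up-sweep: each level sums pairs of partial sums up the tree,
    // leaving the total in data[n - 1].
    for (int offset = 1; offset < n; offset *= 2) {
        for (int i = 0; i < n; i += 2 * offset) {
            data[i + 2 * offset - 1] += data[i + offset - 1];
        }
    }
    // Down-sweep: zero the root, then at each level swap the left
    // child into the right while accumulating the prefix.
    data[n - 1] = 0;
    for (int offset = n / 2; offset >= 1; offset /= 2) {
        for (int i = 0; i < n; i += 2 * offset) {
            int t = data[i + offset - 1];
            data[i + offset - 1] = data[i + 2 * offset - 1];
            data[i + 2 * offset - 1] += t;
        }
    }
}
```

On {1, 2, 3, 4} this produces {0, 1, 3, 6}: each output element is the sum of everything before it.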

Stream compaction maps an array to 0s and 1s depending on whether each value meets a condition (e.g. is non-zero). An exclusive scan of that boolean array yields the output index of every kept element, and a scatter step builds the new array with the zero elements removed, as traced below.
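
A worked trace on a hypothetical five-element input:

```
input:    [3, 0, 5, 0, 2]
bools:    [1, 0, 1, 0, 1]    map: 1 where the element is non-zero
scan:     [0, 1, 1, 2, 2]    exclusive prefix sum = output indices
scatter:  out[0]=3, out[1]=5, out[2]=2  ->  [3, 5, 2]
```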

### Analysis

![Scan, (2^x)-3 Array Size](img/scan.png)
Scan, (2^x)-3 Array Size

![Scan, 2^x Array Size](img/scan2.png)
Scan, 2^x Array Size

![Stream Compact](img/compact.png)
Stream Compact

In the graphs above, scan and compact are compared across various array sizes and algorithm implementations. All numbers are averages of two runs, since the random input arrays cause variance in runtime. The graphs are split into power-of-two array sizes and power-of-two minus 3, since the behavior of certain algorithms differs between these conditions. One thing to note: the power-of-two thrust run took 4-5 seconds, so it was excluded from the results. This may have been because thrust is slow the first time it is called; a warm-up sketch follows.
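
If first-call initialization is the culprit (an assumption, not something verified in this PR), a minimal warm-up sketch would run one untimed scan before measuring:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Pay the one-time CUDA/thrust initialization cost outside the timer.
void warmUpThrust() {
    thrust::device_vector<int> warm(1, 0);
    thrust::exclusive_scan(warm.begin(), warm.end(), warm.begin());
}
// Call warmUpThrust() once, then time the real thrust::exclusive_scan.
```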

The CPU implementations perform better at small array sizes but become much slower at large ones. The CPU enjoys a faster clock and faster memory access on small arrays, but its runtime grows linearly with array size, whereas the parallel approach's step count grows only logarithmically. The CPU also requires no memory transfer to the GPU, which can be an additional bottleneck at small array sizes.

In my implementation, the naive approach performs better than the work-efficient one. This is likely because I did not shrink the launch size at each iteration, so many threads are wasted on every pass. There is also a modulus operation in the kernel to check whether a thread is active, which is computationally expensive on the GPU. With those optimizations, the work-efficient implementation would likely come out ahead.
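
A sketch of the thread-compaction idea, assuming the same data layout as `kernUpScan` in `stream_compaction/efficient.cu` below; `kernUpScanCompactThreads` and `numActive` are hypothetical names and this variant is not part of the PR:

```cuda
// Launch exactly one thread per active tree node and reconstruct the
// array index, eliminating the `index % offset2` test entirely.
__global__ void kernUpScanCompactThreads(int numActive, int *data,
                                         const int offset, const int offset2) {
    int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (tid >= numActive)
        return;
    int index = tid * offset2;  // each active thread owns one node
    data[index + offset2 - 1] += data[index + offset - 1];
}
```

The host loop would launch `size / offset2` threads at depth d, so the grid shrinks geometrically as the sweep climbs the tree.
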
Binary file added img/compact.png
Binary file added img/scan.png
Binary file added img/scan2.png
2 changes: 1 addition & 1 deletion src/main.cpp
@@ -58,7 +58,7 @@ int main(int argc, char* argv[]) {
onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
printArray(SIZE, c, true);*/

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_50
)
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

bools[index] = idata[index] != 0 ? 1 : 0;  // non-zero (not just positive) values are kept, matching the CPU path
}

/**
@@ -32,7 +36,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
104 changes: 63 additions & 41 deletions stream_compaction/cpu.cu
@@ -1,50 +1,72 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
}
/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scanNoTimer(int n, int *odata, const int *idata) {
odata[0] = 0;
for (int i = 1; i < n; ++i) {
odata[i] = idata[i - 1] + odata[i - 1];
}
}

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
}
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
scanNoTimer(n, odata, idata);
timer().endCpuTimer();
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
}
}
/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
int count = 0;
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
odata[count++] = idata[i];
}
}
timer().endCpuTimer();
return count;
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
    timer().startCpuTimer();
    int* mapped = new int[n];
    for (int i = 0; i < n; ++i) {
        mapped[i] = idata[i] != 0 ? 1 : 0;
    }
    int* scanned = new int[n];
    scanNoTimer(n, scanned, mapped);
    for (int i = 0; i < n; ++i) {
        if (mapped[i] == 1) {
            odata[scanned[i]] = idata[i];
        }
    }
    // The scan is exclusive, so add the last flag to get the true count.
    int count = scanned[n - 1] + mapped[n - 1];
    timer().endCpuTimer();
    delete[] mapped;
    delete[] scanned;
    return count;
}
}
}
92 changes: 82 additions & 10 deletions stream_compaction/efficient.cu
@@ -5,20 +5,68 @@

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

// Kernels for efficient prefix scan
__global__ void kernUpScan(int n, int *data, const int offset, const int offset2) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index % offset2 != 0)
return;

data[index + offset2 - 1] += data[index + offset - 1];
}

__global__ void kernDownScan(int n, int *data, const int offset, const int offset2) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index % offset2 != 0)
return;

int temp = data[index + offset - 1];
data[index + offset - 1] = data[index + offset2 - 1];
data[index + offset2 - 1] += temp;
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scanNoTimer(int n, int *odata, const int *idata) {
    // Pad to the next power of two so the tree sweeps never index
    // past the end of the buffer for non-power-of-two n.
    int size = 1 << ilog2ceil(n);
    int *dev_data;
    cudaMalloc((void**)&dev_data, size * sizeof(int));
    cudaMemset(dev_data, 0, size * sizeof(int));
    // cudaMemcpyDefault infers the direction under unified virtual
    // addressing, so this helper also works when compact() passes
    // device pointers.
    cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyDefault);

    const int blockSize = 256;
    dim3 fullBlocksPerGrid((size + blockSize - 1) / blockSize);

    for (int d = 0; d < ilog2ceil(n); ++d) {
        int offset = 1 << d;
        kernUpScan << <fullBlocksPerGrid, blockSize >> > (size, dev_data, offset, offset * 2);
    }

    // Zero the root (all sizeof(int) bytes, not just one) before the down-sweep.
    cudaMemset(dev_data + size - 1, 0, sizeof(int));
    for (int d = ilog2ceil(n) - 1; d >= 0; --d) {
        int offset = 1 << d;
        kernDownScan << <fullBlocksPerGrid, blockSize >> > (size, dev_data, offset, offset * 2);
    }

    cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDefault);
    cudaFree(dev_data);
}

void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
timer().endGpuTimer();
timer().startGpuTimer();
scanNoTimer(n, odata, idata);
timer().endGpuTimer();
}

/**
@@ -31,10 +79,34 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
int *dev_idata, *dev_odata, *dev_mapped, *dev_scanned;
cudaMalloc((void**)&dev_idata, n * sizeof(int));
cudaMalloc((void**)&dev_odata, n * sizeof(int));
cudaMalloc((void**)&dev_mapped, n * sizeof(int));
cudaMalloc((void**)&dev_scanned, n * sizeof(int));
cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

const int blockSize = 256;
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

timer().startGpuTimer();
// TODO
Common::kernMapToBoolean << <fullBlocksPerGrid, blockSize >> > (n, dev_mapped, dev_idata);
scanNoTimer(n, dev_scanned, dev_mapped);
Common::kernScatter << <fullBlocksPerGrid, blockSize >> > (n, dev_odata, dev_idata, dev_mapped, dev_scanned);
timer().endGpuTimer();
return -1;

int count, lastbool;
cudaMemcpy(&count, dev_scanned + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&lastbool, dev_mapped + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
count += lastbool;

cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
cudaFree(dev_odata);
cudaFree(dev_mapped);
cudaFree(dev_scanned);

return count;
}
}
}
47 changes: 39 additions & 8 deletions stream_compaction/naive.cu
@@ -5,21 +5,52 @@

namespace StreamCompaction {
namespace Naive {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

// TODO: __global__
// Kernel for naive prefix scan
__global__ void kernNaiveScan(int n, int *odata, const int *idata, const int offset) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;

if (index >= offset) {
odata[index] = idata[index - offset] + idata[index];
}
else {
odata[index] = idata[index];
}
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
void scan(int n, int *odata, const int *idata) {
int *dev_idata, *dev_odata;
cudaMalloc((void**)&dev_idata, n * sizeof(int));
cudaMalloc((void**)&dev_odata, n * sizeof(int));
cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

timer().startGpuTimer();

const int blockSize = 512;
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

for (int d = 1; d <= ilog2ceil(n); ++d) {
    // Stride doubles every pass: 1, 2, 4, ...
    // (1 << (d - 1) avoids the float round-trip of pow()).
    kernNaiveScan << <fullBlocksPerGrid, blockSize >> > (n, dev_odata, dev_idata, 1 << (d - 1));
    std::swap(dev_odata, dev_idata);  // ping-pong buffers between passes
}

timer().endGpuTimer();

odata[0] = 0;  // exclusive scan: the identity goes in the first slot
cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_idata);
cudaFree(dev_odata);
}
}
}