Project 2: Henry Zhu #3

Open — wants to merge 3 commits into master
103 changes: 94 additions & 9 deletions README.md
@@ -1,14 +1,99 @@
CUDA Stream Compaction
======================
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 2 - Stream-Compaction**

* Henry Zhu
* [Github](https://github.com/Maknee), [LinkedIn](https://www.linkedin.com/in/henry-zhu-347233121/), [personal website](https://maknee.github.io/), [twitter](https://twitter.com/maknees1), etc.
* Tested on: Windows 10 Home, Intel i7-4710HQ @ 2.50GHz 22GB, GTX 870M (Own computer)

## Scanning and stream-compaction

![](scan_all_things.png)

### What are scanning and stream compaction?

Scanning (prefix sum) computes, for each index, the sum of all values at the preceding indices. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how scan works:

![](scan.png)

Stream compaction removes all elements with a particular value (here, zeros) from an array, packing the remaining elements together. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how stream compaction works:

![](stream-compaction.png)

## Answers to Questions

### Scan performance by array size

![](scan_speeds.png)

### To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy), take a look at the Nsight timeline for its execution. Your analysis here doesn't have to be detailed, since you aren't even looking at the code for the implementation.

The Thrust implementation is much better optimized for large inputs; it does not, however, perform well on small arrays.
This is most likely due to its implementation: it chooses a launch configuration suited to a larger number of elements, which is why its performance is much better at large sizes. The timeline also hints at one-time setup costs such as temporary-memory allocation: the first Thrust call (power-of-two, 4.93 ms) is far slower than the second (non-power-of-two, 0.25 ms) on nearly the same input size.
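For reference, the timed Thrust path reduces to a single library call (a sketch, assuming `d_in`/`d_out` are device pointers already allocated with `cudaMalloc`; Thrust's internal scratch allocations are part of what shows up in the Nsight timeline):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Exclusive scan of n ints already resident on the device. Thrust picks its
// own block/grid configuration and may allocate temporary storage internally.
void thrustScan(int n, int* d_out, const int* d_in) {
    thrust::exclusive_scan(thrust::device, d_in, d_in + n, d_out);
}
```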

### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?

I did find performance bottlenecks, and they differ by implementation. For the work-efficient scan, the bottleneck appears to be memory I/O, since data must be moved to the GPU and back for each iteration over d. For the CPU implementation, as the graph above shows, the bottleneck is computation, since it must iterate over every single element. The work-efficient implementation beats the naive implementation at small sizes, but at larger sizes the naive implementation wins, as the memory-I/O overhead grows to dominate the computation cost.

## Output of program

```
****************
** SCAN TESTS **
****************
[ 24 39 5 4 7 25 26 45 30 32 42 22 35 ... 19 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.865021ms (std::chrono Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.826429ms (std::chrono Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801241 801278 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.67424ms (CUDA Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.50448ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.03907ms (CUDA Measured)
[ 0 24 63 68 72 79 104 130 175 205 237 279 301 ... 801298 801317 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.866112ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 4.9281ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.254016ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 0 1 1 2 1 0 2 0 2 3 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.12029ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.11988ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1.28747ms (std::chrono Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.34822ms (CUDA Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.15571ms (CUDA Measured)
[ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ]
passed
```


Binary file added scan.png
Binary file added scan_all_things.png
Binary file added scan_speeds.png
12 changes: 6 additions & 6 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 13; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -44,14 +44,14 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -71,7 +71,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
@@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
Binary file added stream-compaction.png
20 changes: 17 additions & 3 deletions stream_compaction/common.cu
@@ -23,16 +23,30 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
const int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) // index == n would read past the end of the array
return;

bools[index] = idata[index] ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
const int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) // index == n would read past the end of the array
return;

//filter only elements that are not zero in the bool map.
if (bools[index])
{
//get the index where the element is supposed to go in the final array
const int index_of_filtered_element = indices[index];
odata[index_of_filtered_element] = idata[index];
}
}

}
192 changes: 146 additions & 46 deletions stream_compaction/cpu.cu
@@ -1,50 +1,150 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include <cmath> // std::ceil, std::log2
#include <cstring> // memcpy, memset
#include <memory>
#include <iostream>

namespace StreamCompaction
{
namespace CPU
{
using StreamCompaction::Common::PerformanceTimer;

PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

//actual implementation of scan
//because timer().startCpuTimer() is called inside
//scan(...) from scatter(...), causing an abort
void scan_impl(int n, int* odata, const int* idata)
{
/// super naive cpu implementation ///
// memset(odata, 0, n * sizeof(int));
//
// for(int k = 1; k < n; k++)
// {
// odata[k] = odata[k - 1] + idata[k - 1];
// }

/// pseudo-parallel implementation ///

//make sure the data is set first before beginning
memcpy(odata, idata, sizeof(int) * n);

for (int d = 1; d <= static_cast<int>(std::ceil(std::log2(n))); d++)
{
//make a copy, because the naive scan can't be done in place
auto temp = std::make_unique<int[]>(n);
memcpy(temp.get(), odata, n * sizeof(int));
const int offset = 1 << (d - 1); // 2^(d-1) without floating-point pow
for (int k = 0; k < n; k++)
{
//follow the formula: add in the element 2^(d-1) positions back
if (k >= offset)
{
odata[k] = temp[k - offset] + temp[k];
}
}
}

//copy the data back
auto temp = std::make_unique<int[]>(n);
memcpy(temp.get(), odata, n * sizeof(int));

//shift right by 1
for (int i = 1; i < n; i++)
{
odata[i] = temp[i - 1];
}
//set first element to 0
odata[0] = 0;
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

scan_impl(n, odata, idata);

timer().endCpuTimer();
}

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

memset(odata, 0, n * sizeof(int));

int index = 0;

//iterate through and count
for (int i = 0; i < n; i++)
{
if (idata[i])
{
odata[index] = idata[i];
index++;
}
}

timer().endCpuTimer();
return index;
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int* odata, const int* idata)
{
timer().startCpuTimer();

auto counters = std::make_unique<int[]>(n);

int count = 0;

//iterate through and count
for (int i = 0; i < n; i++)
{
counters[i] = idata[i] ? 1 : 0;
if (counters[i])
{
count++;
}
}

auto indices = std::make_unique<int[]>(n);

//scan the boolean map to get each kept element's output index
//(scan_impl copies its input internally, so no memcpy is needed first)
scan_impl(n, indices.get(), counters.get());

//now scatter each kept element to its scanned index
for (int i = 0; i < n; i++)
{
if (counters[i])
{
odata[indices[i]] = idata[i];
}
}

timer().endCpuTimer();
return count;
}
}
}