- This assignment helps you get familiar with MPI by implementing a parallel odd-even sort algorithm.
- Experimental evaluation of your implementation is required to guide you in analyzing the performance and scalability of a parallel program.
- You are also encouraged to explore performance optimizations and parallel computing techniques to earn a higher score.
- You are required to implement a parallel version of odd-even sort under the given restrictions.
- Your goal is to optimize the performance and reduce the total execution time.
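As a starting point, the sketch below shows one common exchange scheme for parallel odd-even sort, assuming every MPI rank already holds an equal-sized, locally sorted chunk; input/output handling, uneven chunk sizes, and further optimizations are omitted, and the function names are illustrative.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Merge our sorted chunk with the neighbor's sorted chunk and keep either the
// smaller or the larger half, depending on which side of the pair we sit on.
static void exchange_and_keep(std::vector<float>& local, int partner,
                              bool keep_low, MPI_Comm comm) {
    int n = static_cast<int>(local.size());
    std::vector<float> recv(n), merged(2 * n);
    MPI_Sendrecv(local.data(), n, MPI_FLOAT, partner, 0,
                 recv.data(),  n, MPI_FLOAT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    std::merge(local.begin(), local.end(), recv.begin(), recv.end(), merged.begin());
    if (keep_low) std::copy(merged.begin(), merged.begin() + n, local.begin());
    else          std::copy(merged.end() - n, merged.end(), local.begin());
}

// local: this rank's chunk, already sorted (e.g. with std::sort).
// After `size` phases, every element has reached its final rank.
void odd_even_sort(std::vector<float>& local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int phase = 0; phase < size; ++phase) {
        // Even phases pair ranks (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ...
        int partner = (phase % 2 == rank % 2) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= size) continue;   // no partner at the ends
        exchange_and_keep(local, partner, /*keep_low=*/rank < partner, comm);
    }
}
```

A common optimization is to skip the merge when the two chunks are already in order, which can be detected by comparing the boundary elements before exchanging full chunks.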
- You are asked to parallelize the sequential Mandelbrot Set program.
- Get familiar with thread programming using Pthread and OpenMP.
- Combine processes and threads to implement a hybrid parallelism solution.
- Understand the importance of load balancing.
- You are asked to parallelize the sequential Mandelbrot Set program by implementing the following two versions:
- pthread: Single-node shared-memory programming using Pthread. This program only needs to be run on a single node.
- hybrid: Multi-node hybrid parallel programming using MPI + OpenMP. This program must be run across multiple nodes.
- MPI processes are used to balance tasks among nodes, and OpenMP threads are used to perform computations.
- The Pthread library may also be used to create additional threads for communication.
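As a sketch of the hybrid version, one simple decomposition interleaves image rows across MPI ranks and lets OpenMP schedule a rank's rows dynamically across its threads; the pixel-to-plane mapping (x0, y0, dx, dy) and the image output are assumed to be handled elsewhere, and all names below are illustrative.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>

// Iteration count for one pixel of the Mandelbrot set.
static int mandel(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    int it = 0;
    while (zr * zr + zi * zi < 4.0 && it < max_iter) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++it;
    }
    return it;
}

// Each MPI rank takes every `size`-th row (interleaving balances work across
// nodes); within a rank, OpenMP hands rows to threads with a dynamic schedule
// so that expensive rows do not pile up on one thread.
void render(int width, int height, int max_iter,
            double x0, double y0, double dx, double dy,
            std::vector<int>& image /* width * height, valid on rank 0 only */) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> local(width * height, 0);
    #pragma omp parallel for schedule(dynamic)
    for (int row = rank; row < height; row += size)
        for (int col = 0; col < width; ++col)
            local[row * width + col] =
                mandel(x0 + col * dx, y0 + row * dy, max_iter);

    // Combine the partial images on rank 0; untouched pixels are 0 and every
    // computed iteration count is at least 1, so MPI_MAX recovers the image.
    MPI_Reduce(local.data(), image.data(), width * height, MPI_INT,
               MPI_MAX, 0, MPI_COMM_WORLD);
}
```

Gathering only the rows each rank owns (for example with MPI_Gatherv) would move less data than this full-size reduce; the version above simply keeps the sketch short.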
- This assignment asks you to solve the all-pairs shortest path problem with CPU threads and then further accelerate the program with CUDA using the blocked Floyd-Warshall algorithm.
- We encourage you to optimize your program by exploring different optimization strategies to earn performance points.
- You are asked to implement 3 versions of programs that solve the all-pairs shortest path problem.
- CPU version (hw3-1)
- You are required to use threading to parallelize the computation in your program.
- You can choose any threading library or framework you like (pthread, std::thread, OpenMP, Intel TBB, etc).
- You can choose any algorithm to solve the problem.
- You must implement the shortest path algorithm yourself.
- Single-GPU version (hw3-2)
- Should be optimized to get the performance points.
- Multi-GPU version (hw3-3)
- Must use 2 GPUs.
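For hw3-1, any threading approach is acceptable; a minimal OpenMP sketch of the plain Floyd-Warshall recurrence is shown below. (The blocked variant targeted by the CUDA versions partitions the matrix into B x B tiles and, in each round, updates the pivot tile, then the tiles in the pivot row and column, then the remaining tiles.)

```cpp
#include <omp.h>
#include <vector>

// dist is an n x n row-major matrix; dist[i*n + j] holds the currently known
// shortest distance from i to j (use a large sentinel for "no edge", small
// enough that adding two sentinels does not overflow an int).
void floyd_warshall_omp(std::vector<int>& dist, int n) {
    for (int k = 0; k < n; ++k) {
        // For a fixed k, the rows can be updated independently in parallel.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            int dik = dist[i * n + k];
            for (int j = 0; j < n; ++j) {
                int through_k = dik + dist[k * n + j];
                if (through_k < dist[i * n + j])
                    dist[i * n + j] = through_k;
            }
        }
    }
}
```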
- This assignment provides an opportunity for you to practice your parallel programming skills by implementing the scheduling and parallel programming model of the well-known big data processing framework, MapReduce.
- You are asked to implement a parallel program that mimics the data locality-aware scheduling policy and the functional level programming model of MapReduce.
- You will implement the parallel program using MPI and the Pthread library. The jobtracker (scheduler) and tasktrackers (workers) are implemented as MPI processes, and threads are used for executing computing tasks and I/O.
- The jobtracker is responsible for generating the map tasks and reduce tasks of a MapReduce job and for following the data-locality scheduling principle to dispatch tasks to worker nodes for execution.
- Each node runs a tasktracker, which is responsible for creating and managing a set of mapper and reducer threads to execute the received map and reduce tasks and for writing the intermediate and final output files.
- Worker nodes are NOT expected to join, leave, or fail during job execution.
- You are required to implement the MapReduce system architecture, programming model, and scheduling algorithm described in Sections 3, 4, and 5, respectively.
- You are required to implement a WordCount example to demonstrate your implementation.
- All the code should be compiled into a single MPI program, and you should make sure the program terminates properly after all computing tasks are completed.
- Performance is not the primary concern in this assignment, but you are still encouraged to improve the code's efficiency.
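The sketch below shows only the pull-based dispatch loop, with rank 0 acting as the jobtracker and every other rank as a tasktracker; the locality-aware scheduling decision, the mapper/reducer thread pools, and the intermediate/final file output required by the assignment are reduced to comments, and the tags and chunk count are illustrative placeholders.

```cpp
#include <mpi.h>
#include <cstdio>

// Message tags between the jobtracker (rank 0) and the tasktrackers (ranks > 0).
enum { TAG_REQUEST = 1, TAG_TASK = 2 };

static void run_jobtracker(int num_workers, int num_chunks) {
    int next_chunk = 0, finished_workers = 0;
    while (finished_workers < num_workers) {
        int dummy;
        MPI_Status st;
        // A tasktracker asks for work.  A real jobtracker would first look for
        // a chunk stored locally on node st.MPI_SOURCE (data-locality scheduling).
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        int chunk_id = (next_chunk < num_chunks) ? next_chunk++ : -1;  // -1 = no work left
        if (chunk_id < 0) ++finished_workers;
        MPI_Send(&chunk_id, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
    }
}

static void run_tasktracker(int rank) {
    while (true) {
        int dummy = 0, chunk_id;
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&chunk_id, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (chunk_id < 0) break;   // no more tasks: terminate cleanly
        // Here a pool of mapper/reducer threads (Pthread) would process the
        // chunk and write the intermediate / final output files.
        printf("rank %d processed chunk %d\n", rank, chunk_id);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int num_chunks = 16;   // placeholder; a real run reads this from the job configuration
    if (rank == 0) run_jobtracker(size - 1, num_chunks);
    else           run_tasktracker(rank);
    MPI_Finalize();
    return 0;
}
```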
- Suppose we want to draw a filled circle of radius r on a 2D monitor; how many pixels will be filled?
- We fill a pixel when any part of the circle overlaps with the pixel.
- We also assume that the circle center lies at the corner shared by 4 pixels.
- Parallelize the calculation using MPI.
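By symmetry it suffices to count one quadrant and multiply by 4; with the center on a pixel corner, column x of a quadrant overlaps ceil(sqrt(r^2 - x^2)) pixels. Below is a minimal MPI sketch under those assumptions (columns interleaved across ranks; argument checking, overflow handling, and any required output format are omitted).

```cpp
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned long long r = std::strtoull(argv[1], nullptr, 10);
    unsigned long long local = 0;
    // Interleave the quadrant's columns across ranks: rank, rank+size, rank+2*size, ...
    for (unsigned long long x = rank; x < r; x += size)
        local += (unsigned long long)std::ceil(
            std::sqrt((long double)(r * r - x * x)));

    unsigned long long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("%llu\n", 4 * total);   // four symmetric quadrants
    MPI_Finalize();
    return 0;
}
```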
- Same as Lab1.
- In this lab, we are going to approximate the pixel count using Pthread, OpenMP, and a hybrid of MPI and OpenMP.
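A sketch of the hybrid combination: MPI still interleaves the quadrant's columns across processes, and OpenMP splits each process's share across its threads with a reduction (same simplifications as the Lab1 sketch).

```cpp
#include <mpi.h>
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned long long r = std::strtoull(argv[1], nullptr, 10);
    unsigned long long local = 0;

    // Threads within this process share the columns assigned to the process.
    #pragma omp parallel for reduction(+ : local) schedule(static)
    for (long long x = rank; x < (long long)r; x += size)
        local += (unsigned long long)std::ceil(
            std::sqrt((long double)(r * r - (unsigned long long)x * x)));

    unsigned long long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("%llu\n", 4 * total);
    MPI_Finalize();
    return 0;
}
```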
- Edge detection: identifying points in a digital image at which the image brightness changes sharply.
- Sobel Operator: Used in image processing and computer vision, particularly within edge detection algorithms.
- Uses two 5 x 5 kernels gx, gy which are convolved with the original image to calculate approximations of the derivatives: one for horizontal changes and one for vertical changes.
- Parallelize the calculation using the GPU and CUDA.
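A minimal per-pixel CUDA sketch for a single-channel image; the lab's actual 5 x 5 mask coefficients and the image I/O are omitted, and the constant-memory masks are assumed to be filled by the host with cudaMemcpyToSymbol.

```cpp
#define MASK_N 2      // gx and gy
#define MASK_DIM 5    // 5 x 5 masks

// Mask coefficients are copied here by the host before the launch (values omitted).
__constant__ int d_mask[MASK_N][MASK_DIM][MASK_DIM];

// One thread per output pixel: convolve the image with gx and gy, then
// combine the two responses into a gradient magnitude.
__global__ void sobel(const unsigned char* in, unsigned char* out,
                      int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float val[MASK_N] = {0.0f, 0.0f};
    for (int m = 0; m < MASK_N; ++m)
        for (int v = -2; v <= 2; ++v)
            for (int u = -2; u <= 2; ++u) {
                int xi = min(max(x + u, 0), width - 1);   // clamp at the image border
                int yi = min(max(y + v, 0), height - 1);
                val[m] += in[yi * width + xi] * d_mask[m][u + 2][v + 2];
            }

    float g = sqrtf(val[0] * val[0] + val[1] * val[1]);
    out[y * width + x] = (unsigned char)fminf(g, 255.0f);
}

// Launch sketch: 16 x 16 thread blocks covering the whole image.
// dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
// sobel<<<grid, block>>>(d_in, d_out, width, height);
```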
- Same as Lab3.
- Further optimize the code with the following techniques:
- Coalesced Memory Access
- Lower Precision
- Shared Memory
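As an example of the shared-memory technique, each thread block can stage its tile of the image plus a 2-pixel halo (needed by the 5 x 5 masks) into shared memory with coalesced global reads, then convolve entirely from shared memory; the tile size is illustrative and d_mask is the same constant-memory array as in the Lab3 sketch.

```cpp
#define TILE 16
#define HALO 2                              // radius of a 5 x 5 mask

__constant__ int d_mask[2][5][5];           // filled by the host, values omitted

// Shared-memory variant: the block cooperatively loads a (TILE+2*HALO)^2 staging
// area with coalesced reads, then every thread convolves from shared memory only.
__global__ void sobel_shared(const unsigned char* in, unsigned char* out,
                             int width, int height) {
    __shared__ unsigned char tile[TILE + 2 * HALO][TILE + 2 * HALO];

    int x = blockIdx.x * TILE + threadIdx.x;   // this thread's output pixel
    int y = blockIdx.y * TILE + threadIdx.y;

    // The staging area is larger than the block, so each thread loads a few
    // elements in a strided loop; out-of-image reads are clamped to the border.
    for (int j = threadIdx.y; j < TILE + 2 * HALO; j += TILE)
        for (int i = threadIdx.x; i < TILE + 2 * HALO; i += TILE) {
            int sx = min(max((int)(blockIdx.x * TILE) + i - HALO, 0), width - 1);
            int sy = min(max((int)(blockIdx.y * TILE) + j - HALO, 0), height - 1);
            tile[j][i] = in[sy * width + sx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    float val[2] = {0.0f, 0.0f};
    for (int m = 0; m < 2; ++m)
        for (int v = -HALO; v <= HALO; ++v)
            for (int u = -HALO; u <= HALO; ++u)
                val[m] += tile[threadIdx.y + HALO + v][threadIdx.x + HALO + u]
                          * d_mask[m][u + HALO][v + HALO];

    float g = sqrtf(val[0] * val[0] + val[1] * val[1]);
    out[y * width + x] = (unsigned char)fminf(g, 255.0f);
}
```

In this context, "lower precision" typically means accumulating the responses in float rather than double, as in the sketch above.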
- Hands-on Lab
- Self-studying
- Make the model distributed with Horovod
- Initialize Horovod
- Adjust the learning rate based on the number of processes
- Broadcast mnist_model.variables and opt.variables() to all other processes
- You are given n balloons, each painted with a number; these numbers are represented by an array nums. You are asked to burst all of the balloons.
- If you burst the ith balloon, you will get nums[i-1]*nums[i]*nums[i+1] coins. If i-1 or i+1 goes out of bounds of the array, then treat it as if there is a balloon with a 1 painted on it.
- Return the maximum coins you can collect by bursting the balloons wisely.
- Constraints: 1 <= n <= 10000, 1 <= nums[i] <= 50.
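A minimal sequential interval-DP sketch of the CPU baseline listed below: pad nums with 1 at both ends and let best[l][r] be the maximum coins obtainable by bursting every balloon strictly between positions l and r, choosing which balloon inside the interval is burst last.

```cpp
#include <vector>
#include <algorithm>

// best[l][r]: maximum coins from bursting every balloon strictly between
// positions l and r of the padded array a (a[0] = a[n+1] = 1).
// If balloon k is the last one burst inside (l, r), it is worth a[l]*a[k]*a[r].
long long max_coins(const std::vector<int>& nums) {
    int n = (int)nums.size();
    std::vector<long long> a(n + 2, 1);
    std::copy(nums.begin(), nums.end(), a.begin() + 1);

    std::vector<std::vector<long long>> best(n + 2, std::vector<long long>(n + 2, 0));
    for (int len = 2; len <= n + 1; ++len)            // interval length r - l
        for (int l = 0; l + len <= n + 1; ++l) {
            int r = l + len;
            for (int k = l + 1; k < r; ++k)           // the balloon burst last
                best[l][r] = std::max(best[l][r],
                                      best[l][k] + best[k][r] + a[l] * a[k] * a[r]);
        }
    return best[0][n + 1];
}
```

The triple loop is O(n^3), which is why, for n up to 10000, the DP is moved to the GPU and optimized step by step.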
- CPU - sequential
- GPU - optimize step by step
- Baseline (21.92s)
- DP reindexing (7.57s)
- Parallel maxreduce (6.82s)
- Two data per thread (5.99s)
- Multiple data per thread (4.07s)
- Unroll last warp (3.93s)
- Unroll all (3.90s)
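As a reference for the max-reduction steps above, the sketch below shows a generic block-level parallel max reduction: each thread loads two elements, a shared-memory tree reduces down to one warp, and the final warp is reduced with register shuffles, a modern stand-in for the classic unrolled last warp. It reduces a plain integer array rather than the actual DP row, and assumes blockDim.x is a power of two of at least 64.

```cpp
#include <climits>

__global__ void max_reduce(const int* in, int* out, int n) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;

    // "Two data per thread": halve the number of blocks by loading two values each.
    int v = (i < n) ? in[i] : INT_MIN;
    if (i + (int)blockDim.x < n) v = max(v, in[i + (int)blockDim.x]);
    sdata[tid] = v;
    __syncthreads();

    // Shared-memory tree reduction until only the first 64 values remain.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] = max(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Final warp: fold 64 values to 32, then reduce inside the warp with
    // shuffles; no further __syncthreads is needed.
    if (tid < 32) {
        v = max(sdata[tid], sdata[tid + 32]);
        for (int offset = 16; offset > 0; offset >>= 1)
            v = max(v, __shfl_down_sync(0xffffffffu, v, offset));
        if (tid == 0) out[blockIdx.x] = v;      // one partial maximum per block
    }
}

// Launch sketch: max_reduce<<<blocks, 256, 256 * sizeof(int)>>>(d_in, d_out, n);
// the per-block results are then reduced again (or finished on the host).
```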