Skip to content

Latest commit

 

History

History
82 lines (54 loc) · 22.9 KB

Homework2.md

File metadata and controls

82 lines (54 loc) · 22.9 KB

Homework 2

The goal of this homework is for students to expand their experience with solving distributed computational problems using cloud computing technologies. The main textbook group (option 1) will design and implement the project using the Spark/GraphX computational model and deploy it on AWS EMR whereas the alternative textbook group (option 2) will use the Facebook/Meta Wangle framework and deploy it on AWS EC2. You can check your textbook option in the corresponding column of the gradebook on the Blackboard.

Grade: 15%

This Git repo contains the homework description that uses an open-source implementation of a network simulator in Scala.

Preliminaries

As part of this first homework assignment students have learnt how to create and manage Git project repository, create an application in Scala, create tests using widely popular Scalatest framework, and expand on the provided SBT build and run script for their applications based on randomly generated graphs that represent big data.

First things first, if you haven't done so as part of the first homework, you must create your account at Github, a Git repo management system. Please make sure that you write your name in your README.md in your repo as it is specified on the class roster. Since it is a large class, please use your UIC email address for signing your projects and you should avoid emails from other accounts like funnybunny2000@gmail.com. As always, the homeworks class' Teams channel is the preferred way to exchange information and ask questions. If you don't receive a response within 12 hours, please contact your TA or me by tagging our names. If you use emails it may be a case that your direct emails went to the spam folder.

Next, if you haven't done so as part of your first homework, you will install IntelliJ with your academic license, the JDK, the Scala runtime and the IntelliJ Scala plugin and the Simple Build Toolkit (SBT) and make sure that you can create, compile, and run Java and Scala programs. Please make sure that you can run various Java tools from your chosen JDK between versions 8 and 19.

In all homeworks you should use logging and configuration management frameworks. You will comment your code extensively to describe your design choices and intricate, i.e., nontrivial details of your implementation, and you will add logging statements at different logging levels (e.g., TRACE, INFO, WARN, ERROR) to record information at some salient points in the executions of your programs. All input configuration variables/parameters must be supplied through configuration files -- hardcoding these values in the source code is prohibited and will be punished by taking a large percentage of points from your total grade! You are expected to use Logback and SLFL4J for logging and Typesafe Conguration Library for managing configuration files. These and other libraries should be imported into your project using your script build.sbt. These libraries and frameworks are widely used in the industry, so learning them is the time well spent to improve your resumes. Also, please set up your account with AWS Educate. Using your UIC email address may enable you to receive free credits for running your jobs in the cloud. Preferably, you should create your developer account for $29 per month to enjoy the full range of AWS services, if you haven't already done so.

As you already know the implementation of the network simulator demonstrates how to use Scala to create a fully functional (not imperative) implementation with subprojects and tests. As you see from the StackOverflow survey, knowledge of Scala is highly paid and in great demand, and it is expected that you pick it relatively fast, especially since it is tightly integrated with Java. I recommend using the book on Programming in Scala Fourth and Fifth Editions by Martin Odersky et al. You can obtain this book using the academic subscription on Safari Books Online. There are many other books and resources available on the Internet to learn Scala. Those who know more about functional programming can use the book on Functional Programming in Scala published in 2023 by Michael Pilquist, Rúnar Bjarnason, and Paul Chiusano.

As always when creating your application in Scala you should avoid using vars and while/for loops that iterate over collections using induction variables. Instead, you should learn to use collection methods map, flatMap, foreach, filter and many others with lambda functions, which make your code linear and easy to understand as we studied it in class. Also, avoid mutable variables that expose the internal states of your modules at all cost. Points will be deducted for having unreasonable vars and inductive variable loops without explanation why mutation is needed in your code unless it is confined to method scopes - you can always do without it.

Overview

In this homework, you will create a distributed program for parallel random walks on the large graphs that are generated using the network graph simulation plaform (NetGraphSim). Readme.md for NetGraphSim contains detailed instructions for cloning, building and running the program if you haven't done so. Depending on the power of your computer you can generate large-scale pairs of graphs with tens of thousand nodes where the original graph and its perturbed equivalent differ in some nodes and edges depending on the chosen parameter. The differences are produced in the yaml format that you will use in this homework to determine the precision of your approach of computing differences between the original and the perturbed graph.

A random walk is a stochastic process defined as a sequence of random variables for choosing an object of some data structure based on the previous choice and a function of time. For graphs it is defined as randomly choosing the next node connected by an edge to the current node as a function of some random variable. Random choices of nodes and edges starting with some initial node result in a path in a graph that is an instance of the random walk. Random walks are used extensively in finance, electrical engineering, physics, chemistry, genetics and brain research to model cascading events at a large scale. In this homework our goal is to use parallel random walks to model an inside attacker in a large enterprise network whose goal is to investigate network nodes, i.e., computers under the guise of providing services to them to obtain credentials to computers that store sensitive high-value data, e.g., credit card information.

Consider a real-world situation where a computer network at a large company is represented as a graph of connected computers and company employees have limited privileges to access these computers. The goal of the insider Man-in-the-Middle (MitM) attack is to create and deploy a program that provides some services within the network to add value to communications among these computers, e.g., message auditing or caching. Obtaining permissions for such a program allows this attacker to observe communications among computers to capture credentials embedded in these communications. Accessing computers directly as part of the transaction string on the company network may trigger alarms, however, if a program provides useful services that accessing computers that exchange messages because of the provided services is justified.

To prevent MitM attack companies use randomization of the network and defense by deception where some computers are simulated as honeypots to catch attackers by allowing them to log in. That is, if a program attempts to log into a simulated honeypot it is immediately detected as malicious. In addition, some computers are modified or dissimulated, e.g., their labels or other attributes are changed to confound attackers or guide them into believing that these legitimate computers are honeypots. In general, computer networks are modeled as graphs and their modifications can be viewed as perturbed graphs. Communications among these nodes can be modeled as random walks on these graphs. The problem is the following: by capturing information about nodes, i.e., computers in random walks is it possible for an attacker to tell authentic computers from honeypots automatically?

All homework scripts are written using a retroscripting technique, in which the homework outlines are generally and loosely drawn, and the individual students improvise to create the implementation that fits their refined objectives. In doing so, students are expected to stay within the basic requirements of the homework and they are free to experiments. Asking questions is important, so please ask away at MS Teams!

Functionality

Your homework assignment is to create a program for parallel random walks on large graphs to simulate how MitM attackers work. Each node in the original graph has the attribute Valuable Data that specifies whether a given computer holds some sensitive data. Of course, the insider MitM attacker knows the original graph because the attacker is an insider, however, s/he does not know whether the nodes that represent computers in the perturbed graph are authentic or not. Making a mistake and choosing a honeypot believing that it is the valuable data computer results in the security systems detecting the attacker and disabling the malicious program.

Starting with the initial node of the perturbed graph a random walk should be performed on this graph with a termination criteria that may be based on the maximum size of the random walk as a multiple of the total number of nodes in the graph or based on some random variable that is a function of time, i.e., the number of steps in the walk. A random walk on the graph simulates a communication path in the actual computer network where malicious programs provide some services and they can eavesdrop on the computers engaged into the communication path. Once a walk is completed the attacker applies a SimRank algorithm to determine what nodes in the original graph match the nodes on the walk path in the perturbed graph. Based on the outcome of this comparison the attacker decides which nodes on the path represent authentic computers and which ones are simulated honeypots.

Your goal is to produce the MitM attacker statistics about simulated attacks using the generated graphs and their perturbed counterparts using the notion of pipelined execution process in the cloud. First step is to create a number of random walks on the perturbed graph in parallel. Second, for each node in a computed path of the random walk you will compute the likelihood that a node matches a specific node in the original graph using the same algorithm for SimRank you created in the first homework. If the likelihood score is above your chosen threshold then a matching node/computer is found in the original graph and if there are more then one matches then you can make a choice at random. This determination will depend on your own uniquely fine-tuned graph matching algorithm. If a match between nodes is determined and it is actually verified as correct and if the node contains valuable data then the attack is considered successful. However, if a match is found but it is a false positive and the node's attribute specifies that it contains valuable data then the attacker is destroyed. Next, if no match is found then the iteration loop continues by going to the first step and repeating a random walk using the information about retrieved nodes from the previous iterations. The number of iterations is a configurable parameter that may designate a threshold for the number experiments or it can be based on the threshold for the total elapsed running time. Finally, you will produce an estimate of the goodness of your attack algorithm similar to precision and recall ratios based on your experiments.

The main metric that your algorithm issues is the number of successful attacks and the failed attacks for a given number of iterations. Other metrics can include the statistics about random walks, e.g., min/max/median/mean number of nodes in these walks and the ratio of the number of random walks resulting in successful attacks to the total number of random walks.

Assignment for the main textbook group

Your job is to map the process described above to the constraints of Apache Spark/GraphX framework, explain how your design and architecture, and then to implement it and to run on the big data graphs that you will generate using your predefined configuration parameters for NetGraphSim. The output of your program is a data file in some format of your choice, e.g., Yaml or CSV with the required statistics. The explanation of the Spark model is given in the main textbook and covered in class lectures. After creating and testing your Spark/GraphX program locally, you will deploy it and run it on the AWS Spark EMR - you can find plenty of documentation online. Just like for the first homework you will produce a short movie that documents all steps of the deployment and execution of your program with your narration and you will upload this movie to youtube and you will submit a link to your movie as part of your submission in the README.md file. To produce a movie, you may use an academic version of Camtasia or Zoom or some other cheap/free screen capture technology from the UIC webstore or an application for a movie capture of your choice. The captured web browser content should show your login name in the upper right corner of the AWS application and you should introduce yourself in the beginning of the movie speaking into the camera. The display of your passwords and your credit card numbers should be avoided :-).

Assignment for the alternative textbook group

Your job is to create the distributed objects using the Wangle framework for each task, explain how they work, and then to implement them and run on the generated graphs. In your documentation you should explain the design of your data processing pipeline and how it uses the provisioned cloud resources. You can complete your implementation using C++.

Next, after creating and testing your program locally, you will deploy it and run it on the AWS EC2 IaaS. You will produce a short movie that documents all steps of the deployment and execution of your program with your narration and you will upload this movie to youtube and you will submit a link to your movie as part of your submission in the README.md file. To produce a movie, you may use an academic version of Camtasia or some other cheap/free screen capture technology from the UIC webstore or an application for a movie capture of your choice. The captured web browser content should show your login name in the upper right corner of the AWS application and you should introduce yourself in the beginning of the movie speaking into the camera.

Baseline Submission

Your baseline project submission should include your implementation, a conceptual explanation in the document or in the comments in the source code of how your Spark/GraphX processing components work to solve the problem for Option 1 group or how your Wangle distributed object pipeline works for Option 2 group, and the documentation that describe the build and runtime process, to be considered for grading. Your should use markdown for your project's Readme.md. Your project submission should include all your source code as well as non-code artifacts (e.g., configuration files), your project should be buildable using the SBT, and your documentation must specify how you paritioned the data and what input/outputs are.

Collaboration

You can post questions and replies, statements, comments, discussion, etc. on Teams using the corresponding channel. For this homework, feel free to share your ideas, mistakes, code fragments, commands from scripts, and some of your technical solutions with the rest of the class, and you can ask and advise others using Teams on where resources and sample programs can be found on the Internet, how to resolve dependencies and configuration issues. When posting question and answers on Teams, please make sure that you selected the appropriate channel, to ensure that all discussion threads can be easily located. Active participants and problem solvers will receive bonuses from the big brother :-) who is watching your exchanges (i.e., your class instructor and your TA). However, you must not describe intricate details of your architecture or your models!

Git logistics

This is an individual homework. Please remember to grant a read access to your repository to your TA and your instructor. You can commit and push your code as many times as you want. Your code will not be visible and it should not be visible to other students - your repository should be private. Announcing a link to your public repo for this homework or inviting other students to join your fork for an individual homework before the submission deadline will result in losing your grade. For grading, only the latest commit timed before the deadline will be considered. If your first commit will be pushed after the deadline, your grade for the homework will be zero. For those of you who struggle with the Git, I recommend a book by Ryan Hodson on Ry's Git Tutorial. The other book called Pro Git is written by Scott Chacon and Ben Straub and published by Apress and it is freely available. There are multiple videos on youtube that go into details of the Git organization and use.

Please follow this naming convention to designate your authorship while submitting your work in README.md: "Firstname Lastname" without quotes, where you specify your first and last names exactly as you are registered with the University system, as well as your UIC.EDU email address, so that we can easily recognize your submission. I repeat, make sure that you will give both your TA and the course instructor the read/write access to your private forked repository so that we can leave the file feedback.txt in the root of your repo with the explanation of the grade assigned to your homework.

Discussions and submission

As it is mentioned above, you can post questions and replies, statements, comments, discussion, etc. on Teams. Remember that you cannot share your code and your solutions privately, but you can ask and advise others using Teams and StackOverflow or some other developer networks where resources and sample programs can be found on the Internet, how to resolve dependencies and configuration issues. Yet, your implementation should be your own and you cannot share it. Alternatively, you cannot copy and paste someone else's implementation and put your name on it. Your submissions will be checked for plagiarism. Copying code from your classmates or from some sites on the Internet will result in severe academic penalties up to the termination of your enrollment in the University.

Submission deadline and logistics

Wednesday, November 1, 2023 at 11PM CST by submitting the link to your homework repo in the Teams Assignments channel. Your submission repo will include the code for the program, your documentation with instructions and detailed explanations on how to assemble and deploy your program along with the results of your program execution, the link to the video and a document that explains these results based on the characteristics and the configuration parameters of your log generator, and what the limitations of your implementation are. Again, do not forget, please make sure that you will give both your TAs and your instructor the read access to your private repository. Your code should compile and run from the command line using the commands sbt clean compile test and sbt clean compile run. Also, you project should be IntelliJ friendly, i.e., your graders should be able to import your code into IntelliJ and run from there. Use .gitignore to exlude files that should not be pushed into the repo.

Evaluation criteria

  • the maximum grade for this homework is 15%. Points are subtracted from this maximum grade: for example, saying that 2% is lost if some requirement is not completed means that the resulting grade will be 15%-2% => 13%; if the core homework functionality does not work or it is not implemented as specified in your documentation, your grade will be zero;
  • only some basic GraphX or Wangle examples from some repos are given and nothing else is done: zero grade;
  • not implementing a random walk algorithm: 7% penalty;
  • homework submissions for an incorrectly chosen textbook assignment option will be desk-rejected with the grade zero;
  • having less than five unit and/or integration scalatests: up to 10% lost;
  • missing comments and explanations from your program with clarifications of your design rationale: up to 10% lost;
  • logging is not used in your programs: up to 5% lost;
  • hardcoding the input values in the source code instead of using the suggested configuration libraries: up to 5% lost;
  • for each used var for heap-based shared variables or mutable collections: 0.3% lost;
  • for each used while or for or other loops with induction variables to iterate over a collection: 0.5% lost;
  • no instructions in README.md on how to install and run your program: up to 10% lost;
  • the program crashes without completing the core functionality: up to 15% lost;
  • the documentation exists but it is insufficient to understand your program design and models and how you assembled and deployed all components of your mappers and reducers: up to 15% lost;
  • the minimum grade for this homework cannot be less than zero.

That's it, folks!