Senior thesis project for the 2016-2017 academic year. Uses the Hadoop ecosystem: Java MapReduce, Apache Pig, and Apache Hive.
This project was completed at Rose-Hulman Institute of Technology for the Computer Science and Software Engineering Department, under the advisement of Dr. Sriram Mohan.
The repository is organized into the following folders:
- Diagrams: Contains various diagrams used in papers I wrote and in presentations.
- Example_Data: Contains subfolders that hold example data:
  - AdvertisementData: Contains example output of the advertisement data randomly generated by the GenerateThoroughBillboardData project. Contains `regionAds.csv` for the 29 regions and `segmentAds.csv` for the ~1300 segments, each covering the time period coded into the program.
  - RegionData: 3 files (`csv`, `xml`, and `json`) that were downloaded from the Chicago Data Portal using the automated download script. They show what the 'Congestion Estimates by Regions' dataset looks like.
  - SegmentData: 3 files (`csv`, `xml`, and `json`) that were downloaded from the Chicago Data Portal using the automated download script. They show what the 'Congestion Estimates by Segments' dataset looks like.
  - Visual Data: Contains the datasets used to find the coordinates of the regions and segments. Its subfolder holds an HTML file that shows a map of Chicago zip codes with 58 blimps, where blimps of matching color mark the north-east and south-west corners of a region's square.
- Misc: Various files and projects
- Projects/GenerateBillboardData: Project that was initially used to create advertisement data. (Eclipse Project)
- Scripts: Scripts that I used to keep track of commands.
- Summaries: Documents related to the progress of the thesis during the academic school year.
- Paper: Document used to create the final thesis paper for the class.
- Presentations: Various files used for a poster presentation and for quarterly updates.
- Projects: Eclipse projects used for the thesis.
- DataCollection: Uses jsoup to download Census data per zip code.
- GenerateThoroughBillboardData: Generates the advertisement data for segments and regions. Each row holds the segment or region identifier with an advertisement, a rating, and the length of time for the advertisement. Example output is available in Example_Data/AdvertisementData.
- SegmentsInRegions: Quick program to determine which segments fall in a region. It was an idea for determining how much segments factor into a region.
- Traffic Analysis: Main aggregation of data. Contains 4 separate projects within the project itself.
- CongestionFinder: Original prototype for day analysis. Attributes from this project are used in the DayAnalysis project, which is described below.
- DayAnalysis: Takes all the data for a region or segment on a given day, computes congestion statistics, and finds the longest period of congestion.
- FlowAnalysis: Original prototype for flow analysis. The code was reused in the FlowAnalysisGrouping project with added features.
- FlowAnalysisGrouping: Takes advertisement data joined with DayAnalysis output and determines whether a given area gets an advertisement and what the best advertisement would be for the region or segment. Advertisement placement also takes into account the length of time over which the congestion calculations took place. The time splits are week of year, month of year, year, and all time (same as FlowAnalysis).
- Results: Final results after running all the commands found in trafficAnalysisCommands.txt.
- Scripts: Contains all the Pig, Hive, and Bash scripts that clean and parse the data, store it in a SQL format, or download the data. The order in which to run them is given in trafficAnalysisCommands.txt.
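
To illustrate the idea behind SegmentsInRegions, here is a minimal sketch of a bounding-box test. It is not the code from the project. It assumes each region is a square described by its south-west and north-east corners (as in the Visual Data map) and that a segment counts as "in" a region when both of its endpoints fall inside the square. The class name, method names, and coordinates are hypothetical.

```java
/** Axis-aligned region box defined by its south-west and north-east corners. */
final class RegionBoxSketch {
    private final double swLat;
    private final double swLon;
    private final double neLat;
    private final double neLon;

    RegionBoxSketch(double swLat, double swLon, double neLat, double neLon) {
        this.swLat = swLat;
        this.swLon = swLon;
        this.neLat = neLat;
        this.neLon = neLon;
    }

    /** True when the point lies inside the box (edges included). */
    boolean contains(double lat, double lon) {
        return lat >= swLat && lat <= neLat && lon >= swLon && lon <= neLon;
    }

    /** Assumed rule: a segment belongs to the region when both endpoints are inside. */
    boolean containsSegment(double startLat, double startLon, double endLat, double endLon) {
        return contains(startLat, startLon) && contains(endLat, endLon);
    }

    public static void main(String[] args) {
        // Illustrative coordinates only, not taken from the datasets.
        RegionBoxSketch region = new RegionBoxSketch(41.80, -87.70, 41.90, -87.60);
        System.out.println(region.containsSegment(41.85, -87.65, 41.86, -87.64));
        System.out.println(region.containsSegment(41.85, -87.65, 41.95, -87.64));
    }
}
```

The first call prints `true` because both endpoints are inside the box; the second prints `false` because the end point lies north of the box. A looser rule, such as counting a segment when either endpoint is inside, would only change `containsSegment`.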
trafficAnalysisCommands.txt:
Lists, in order, all the commands needed to run the project. The results should be similar to mine unless the advertisement data is randomly regenerated.
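
As a rough illustration of the "longest period of congestion" step described for DayAnalysis, the sketch below scans one day of speed readings in time order and reports the longest run of consecutive congested readings. It is not the project's MapReduce code. It assumes a reading is congested when its speed is below a threshold; the threshold value, class name, and method name are hypothetical.

```java
final class LongestCongestionSketch {

    /**
     * Returns {firstIndex, lastIndex} of the longest run of consecutive
     * readings below thresholdMph, or null when no reading is congested.
     * When two runs tie in length, the earlier one is kept.
     */
    static int[] longestCongestedRun(double[] speedsMph, double thresholdMph) {
        int bestStart = -1;
        int bestLength = 0;
        int runStart = 0;
        int runLength = 0;
        for (int i = 0; i < speedsMph.length; i++) {
            if (speedsMph[i] < thresholdMph) {
                if (runLength == 0) {
                    runStart = i; // a new congested run begins here
                }
                runLength++;
                if (runLength > bestLength) {
                    bestLength = runLength;
                    bestStart = runStart;
                }
            } else {
                runLength = 0; // run broken by a non-congested reading
            }
        }
        return bestLength == 0 ? null : new int[] {bestStart, bestStart + bestLength - 1};
    }

    public static void main(String[] args) {
        // Made-up speeds for one day, ordered by time; 10 mph is an assumed threshold.
        double[] speeds = {25, 8, 9, 30, 7, 6, 5, 28};
        int[] run = longestCongestedRun(speeds, 10.0);
        System.out.println(run[0] + ".." + run[1]);
    }
}
```

With the made-up readings above, the longest run covers indices 4 through 6. In a MapReduce setting, a scan like this would typically sit in the reducer after the readings for one region or segment and day have been grouped and sorted by time.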
Please feel free to contact me for more information through GitHub.