This summer internship project explores questions about large datasets, using MongoDB for both storage and computation. It shows how to answer such questions about data stored in MongoDB using MongoDB's frameworks and connector. Both the native MongoDB aggregation framework and Hadoop were used to explore the data.
The data for this project comes from two major sources:
- The Bureau of Transportation Statistics provided our Flights dataset, which contains domestic flight schedules for the past year.
- The Stanford Network Analysis Project provided our Twitter-Memes dataset, which contains blog posts and news articles covering the 2008 presidential election.
This project is divided into three sections, each with in-depth wiki pages describing our steps and observations:
- Basic-Flights - Basic analysis of the Flights dataset using the MongoDB Aggregation Framework
- PageRank-Flights - Computing PageRank over the Flights dataset using the MongoDB MapReduce Framework
- Twitter-Memes - Computing PageRank over the Twitter-Memes dataset using Hadoop and associated tools (such as Apache Pig and Amazon EMR)
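
Two of the three sections compute PageRank, so a small illustration of the idea may help. The following is a minimal in-memory sketch in plain Python, not the MapReduce or Hadoop implementation described in the wiki pages. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the sample `routes` list are all assumptions made for illustration; they are not taken from the project's data or code.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (origin, destination) pairs.

    Illustrative only: parameter defaults are assumptions, not project settings.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node splits its damped rank equally among its destinations.
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank

    return rank


# Hypothetical routes between airports, used only to exercise the function.
routes = [
    ("JFK", "LAX"), ("LAX", "JFK"),
    ("SFO", "JFK"), ("SFO", "LAX"),
    ("JFK", "ORD"), ("ORD", "JFK"),
]
ranks = pagerank(routes)
for airport in sorted(ranks, key=ranks.get, reverse=True):
    print(airport, round(ranks[airport], 4))
```

In this toy graph the airport with the most incoming routes ends up with the highest rank, and an airport that no route flies into keeps only the baseline share. On the real datasets the same iteration is expressed as repeated map and reduce passes, since the graph is too large to hold in a single Python dictionary.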