Full-fledged data analysis project using the Hadoop stack
Steps performed in the project:
- Acquire the top 200,000 posts by view count
- Using Pig or MapReduce, extract, transform, and load the data as applicable
- Using Hive Query Language, compute:
  - I. The top 10 posts by score
  - II. The top 10 users by post score
  - III. The number of distinct users who used the word “Hadoop” in one of their posts
- Using MapReduce, calculate the per-user TF-IDF and find the 10 most used words, excluding stop words
Refer to "Documentation" for a step-by-step guide.
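
As a rough illustration of the per-user TF-IDF step listed above, here is a minimal in-memory Python sketch written in a map/reduce style. It is not the project's actual MapReduce job. It assumes each user is treated as one "document", that posts arrive as `(user, body)` pairs, that the top words are ranked by TF-IDF score rather than raw count, and it uses a tiny placeholder stop-word list; all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; a real job would load a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "on", "with", "i"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def map_phase(posts):
    """Emit ((user, word), 1) pairs, as a word-count mapper would."""
    for user, body in posts:
        for word in tokenize(body):
            yield (user, word), 1


def reduce_phase(pairs):
    """Sum the counts for each (user, word) key, as a reducer would."""
    counts = defaultdict(Counter)
    for (user, word), n in pairs:
        counts[user][word] += n
    return counts


def tf_idf(counts):
    """Compute TF-IDF per user, treating each user's posts as one document."""
    n_users = len(counts)
    # Document frequency: the number of users who used each word at least once.
    df = Counter()
    for words in counts.values():
        df.update(words.keys())
    scores = {}
    for user, words in counts.items():
        total = sum(words.values())
        scores[user] = {
            w: (c / total) * math.log(n_users / df[w]) for w, c in words.items()
        }
    return scores


def top_words(scores, user, k=10):
    """Return up to k words for a user, highest TF-IDF first, ties broken alphabetically."""
    ranked = sorted(scores[user].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

For example, `top_words(tf_idf(reduce_phase(map_phase(posts))), "some_user")` returns that user's highest-scoring words. A word used by every user gets an IDF of zero under this formula, so it sinks to the bottom of each ranking.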