- Data Resource
- Data to be stored
- movie ID
- commenting user's ID
- commenting user's profile name
- comment helpfulness
- comment score from each user
- comment time
- comment summary
- comment text
- movie actors
- movie show time
- movie genre
- movie director
- movie starring actors
- movie version
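The fields above can be modeled as plain record types; a minimal Python sketch (the class and field names here are my own, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    """One user comment on a movie (hypothetical field names)."""
    movie_id: str
    user_id: str
    profile_name: str
    helpfulness: str   # kept as scraped, e.g. "7/9"
    score: float       # the commenting user's rating
    time: int          # Unix timestamp of the comment
    summary: str
    text: str

@dataclass
class Movie:
    """One movie (hypothetical field names)."""
    movie_id: str
    actors: list
    show_time: str     # release date, e.g. "2015-01-10"
    genre: str
    director: str
    starring: list
    version: str       # e.g. DVD, Blu-ray
```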
- Most frequent queries
- query by time
- the number of movies in a given year, month, or season
- how many new movies opened on a Tuesday
- query by movie name
- how many versions a movie has
- query by director / actor / genre
- combined queries
- check the time each query takes
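The frequent queries above are ordinary SQL; the sketch below exercises them on SQLite with assumed table and column names (the MySQL/HiveQL versions would differ only in dialect details):

```python
import sqlite3

# Illustrative only: assumed schema and toy rows, run on in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (
    movie_id TEXT, name TEXT, show_time TEXT,  -- 'YYYY-MM-DD'
    genre TEXT, director TEXT, version TEXT
);
INSERT INTO movies VALUES
 ('m1', 'Alpha', '2015-01-10', 'thriller', 'Lee', 'DVD'),
 ('m1', 'Alpha', '2015-01-10', 'thriller', 'Lee', 'Blu-ray'),
 ('m2', 'Beta',  '2015-02-03', 'comedy',   'Kim', 'DVD');
""")

# Query by time: how many movies came out in January 2015?
n, = conn.execute(
    "SELECT COUNT(DISTINCT movie_id) FROM movies "
    "WHERE show_time LIKE '2015-01%'").fetchone()

# Query by movie name: how many versions does 'Alpha' have?
v, = conn.execute(
    "SELECT COUNT(DISTINCT version) FROM movies "
    "WHERE name = 'Alpha'").fetchone()

# Combined query: thrillers in the first season of 2015.
rows = conn.execute(
    "SELECT DISTINCT name FROM movies WHERE genre = 'thriller' "
    "AND show_time BETWEEN '2015-01-01' AND '2015-03-31'").fetchall()
```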
- Process the Data from Amazon
- write crawling scripts in Python plus simple Bash
- got 230 thousand items from Amazon in one night, with three servers running multiple threads
- clean the data
- done with the help of http://www.crummy.com/software/BeautifulSoup/
- extract the fields to be stored from the raw HTML
- the cleaning is likewise driven by Python and simple Bash scripts
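The multi-threaded fetching described above can be sketched with a thread pool; `fetch` below is a stand-in for the real HTTP request, so this illustrates the concurrency pattern rather than the actual crawler:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for the real HTTP GET (e.g. urllib.request.urlopen)."""
    return "<html>page for %s</html>" % url

def crawl(urls, workers=16):
    """Fetch many pages concurrently; one such pool would run per server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(fetch, urls))

pages = crawl(["http://example.com/item/%d" % i for i in range(10)])
```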
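Extracting the stored fields from raw HTML could look like the following; the project used BeautifulSoup, but this stdlib-only sketch with made-up class names shows the same idea:

```python
from html.parser import HTMLParser

class ReviewParser(HTMLParser):
    """Pull text out of tags like <span class="summary">.

    The class names are assumptions for illustration, not Amazon's real
    markup; BeautifulSoup's find(class_=...) does the same with less code.
    """
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("summary", "review-text"):
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = ('<span class="summary">Great film</span>'
        '<div class="review-text">Loved it.</div>')
p = ReviewParser()
p.feed(html)
```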
- design the storage plan
- the logical plan: an ERD (entity-relationship diagram)
- the physical plan: database table design
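One possible physical table design is sketched below; the column names are illustrative, exercised here on SQLite (real MySQL/Hive DDL would add engine- or storage-format clauses):

```python
import sqlite3

# Illustrative table design; actual column types differ between MySQL and Hive.
DDL = """
CREATE TABLE movie (
    movie_id   TEXT PRIMARY KEY,
    name       TEXT,
    show_time  TEXT,
    genre      TEXT,
    director   TEXT,
    starring   TEXT,
    version    TEXT
);
CREATE TABLE comment (
    movie_id     TEXT REFERENCES movie(movie_id),
    user_id      TEXT,
    profile_name TEXT,
    helpfulness  TEXT,
    score        REAL,
    time         INTEGER,
    summary      TEXT,
    text         TEXT
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```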
- Build Hive clusters
- I temporarily rented three servers from https://www.digitalocean.com/ to play the roles of namenode, edgenode, and datanode. Their configurations are as follows:
- We need to compare the time consumed by MySQL and HDFS for one complex search.
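One way to run that comparison is to wrap each query in a timer; in this sketch `run_query` is a placeholder for actually issuing the SQL to MySQL or Hive:

```python
import time

def run_query(backend, sql):
    """Placeholder for executing sql against 'mysql' or 'hive'."""
    time.sleep(0.01)  # pretend work; a real call would hit the database
    return []

def timed(backend, sql):
    """Return (rows, seconds elapsed) for one query on one backend."""
    start = time.perf_counter()
    rows = run_query(backend, sql)
    return rows, time.perf_counter() - start

sql = "SELECT genre, COUNT(*) FROM movie GROUP BY genre"
rows_mysql, t_mysql = timed("mysql", sql)
rows_hive, t_hive = timed("hive", sql)
```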
- Search by multiple conditions, with conditions added automatically
- For example, search for the thrillers in the first season of 2015
- The results are shown in a table
- You can click on any item to search further
- Displayed the time-consumption comparison between MySQL and HDFS as histograms and pie charts.
- platform : Mac OS X
- ETL Tool : http://www.pentaho.com/
- Hive : three servers from https://www.digitalocean.com/ (thanks to https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 and Stack Overflow tips!)
- MySQL : configured on one of the servers
- skills for crawling large amounts of data online
- ETL skills
- building Hive clusters
- displaying search results, possibly with https://www.joomla.org/