- Data Resource
- Data to be stored
- movie ID
- commenting user's ID
- commenting user's profile name
- comment helpfulness
- comment score from each user
- comment time
- comment summary
- comment text
- movie actors
- movie show time
- movie genre
- movie director
- movie starring actors
- movie version
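The fields above can be modeled as plain record types; a minimal Python sketch (the class and field names here are my own, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    """One user comment on a movie (hypothetical field names)."""
    movie_id: str
    user_id: str
    profile_name: str
    helpfulness: str   # kept as scraped, e.g. "7/9"
    score: float       # the commenting user's rating
    time: int          # Unix timestamp of the comment
    summary: str
    text: str

@dataclass
class Movie:
    """One movie (hypothetical field names)."""
    movie_id: str
    actors: list
    show_time: str     # release date, e.g. "2015-01-10"
    genre: str
    director: str
    starring: list
    version: str       # e.g. DVD, Blu-ray
```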
- Most frequent queries
- query by time
- the number of movies in a given year, month, or season
- how many new movies opened on a Tuesday
- query by movie name
- how many versions a movie has
- query by director / actor / genre
- combined queries
- check the time each query takes
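The frequent queries above are ordinary SQL; the sketch below exercises them on SQLite with assumed table and column names (the MySQL/HiveQL versions would differ only in dialect details):

```python
import sqlite3

# Illustrative only: assumed schema and toy rows, run on in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (
    movie_id TEXT, name TEXT, show_time TEXT,  -- 'YYYY-MM-DD'
    genre TEXT, director TEXT, version TEXT
);
INSERT INTO movies VALUES
 ('m1', 'Alpha', '2015-01-10', 'thriller', 'Lee', 'DVD'),
 ('m1', 'Alpha', '2015-01-10', 'thriller', 'Lee', 'Blu-ray'),
 ('m2', 'Beta',  '2015-02-03', 'comedy',   'Kim', 'DVD');
""")

# Query by time: how many movies came out in January 2015?
n, = conn.execute(
    "SELECT COUNT(DISTINCT movie_id) FROM movies "
    "WHERE show_time LIKE '2015-01%'").fetchone()

# Query by movie name: how many versions does 'Alpha' have?
v, = conn.execute(
    "SELECT COUNT(DISTINCT version) FROM movies "
    "WHERE name = 'Alpha'").fetchone()

# Combined query: thrillers in the first season of 2015.
rows = conn.execute(
    "SELECT DISTINCT name FROM movies WHERE genre = 'thriller' "
    "AND show_time BETWEEN '2015-01-01' AND '2015-03-31'").fetchall()
```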
- Process the Data from Amazon
- write crawling scripts in Python plus simple Bash
- got 230 thousand items from Amazon in one night, with three servers running multiple threads
- clean the data
- done with the help of http://www.crummy.com/software/BeautifulSoup/
- extract the fields to be stored from the raw HTML
- the cleaning is likewise driven by Python and simple Bash scripts
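The multi-threaded fetching described above can be sketched with a thread pool; `fetch` below is a stand-in for the real HTTP request, so this illustrates the concurrency pattern rather than the actual crawler:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for the real HTTP GET (e.g. urllib.request.urlopen)."""
    return "<html>page for %s</html>" % url

def crawl(urls, workers=16):
    """Fetch many pages concurrently; one such pool would run per server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(fetch, urls))

pages = crawl(["http://example.com/item/%d" % i for i in range(10)])
```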
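Extracting the stored fields from raw HTML could look like the following; the project used BeautifulSoup, but this stdlib-only sketch with made-up class names shows the same idea:

```python
from html.parser import HTMLParser

class ReviewParser(HTMLParser):
    """Pull text out of tags like <span class="summary">.

    The class names are assumptions for illustration, not Amazon's real
    markup; BeautifulSoup's find(class_=...) does the same with less code.
    """
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("summary", "review-text"):
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = ('<span class="summary">Great film</span>'
        '<div class="review-text">Loved it.</div>')
p = ReviewParser()
p.feed(html)
```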
- design the storage plan
- the logical plan: an ERD (entity-relationship diagram)
- the physical plan: database table design
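One possible physical table design is sketched below; the column names are illustrative, exercised here on SQLite (real MySQL/Hive DDL would add engine- or storage-format clauses):

```python
import sqlite3

# Illustrative table design; actual column types differ between MySQL and Hive.
DDL = """
CREATE TABLE movie (
    movie_id   TEXT PRIMARY KEY,
    name       TEXT,
    show_time  TEXT,
    genre      TEXT,
    director   TEXT,
    starring   TEXT,
    version    TEXT
);
CREATE TABLE comment (
    movie_id     TEXT REFERENCES movie(movie_id),
    user_id      TEXT,
    profile_name TEXT,
    helpfulness  TEXT,
    score        REAL,
    time         INTEGER,
    summary      TEXT,
    text         TEXT
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```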
- Build Hive clusters
- I temporarily rented three servers from https://www.digitalocean.com/ to play the roles of namenode, edgenode, and datanode. Their configurations are as follows:
- We need to compare the time consumed by MySQL and HDFS for one complex search.
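One way to run that comparison is to wrap each query in a timer; in this sketch `run_query` is a placeholder for actually issuing the SQL to MySQL or Hive:

```python
import time

def run_query(backend, sql):
    """Placeholder for executing sql against 'mysql' or 'hive'."""
    time.sleep(0.01)  # pretend work; a real call would hit the database
    return []

def timed(backend, sql):
    """Return (rows, seconds elapsed) for one query on one backend."""
    start = time.perf_counter()
    rows = run_query(backend, sql)
    return rows, time.perf_counter() - start

sql = "SELECT genre, COUNT(*) FROM movie GROUP BY genre"
rows_mysql, t_mysql = timed("mysql", sql)
rows_hive, t_hive = timed("hive", sql)
```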
- Search by multiple conditions, with conditions added automatically
- For example, search for the thrillers in the first season of 2015
- The results are shown in a table
- You can click on any item to search further
- Displayed the time-consumption comparison between MySQL and HDFS as histograms and pie charts.
- platform : Mac OS X
- ETL Tool : http://www.pentaho.com/
- Hive : three servers from https://www.digitalocean.com/ (thanks to https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 and Stack Overflow tips!)
- MySQL : configured on one of the servers
- skills for crawling large amounts of data online
- ETL skills
- building Hive clusters
- displaying search results, possibly with https://www.joomla.org/