===============================================================================================================================================
- Description
- Files Organisation
- Example of Results
===============================================================================================================================================
TVSS recommends TV shows based on their similarity, computed from an analysis of their subtitles. It is a content-based recommender: the similarity between shows is derived from their distributions over topics, which are obtained with a Java MapReduce (MR) implementation of the LDA algorithm. The current version of the resulting graph shows clusters of similar shows, along with some labelled topics we identified in them.
The goal of the project is to analyze the content of TV shows in terms of topics, via an analysis of their subtitles. To achieve that, we acquired a large, good-quality data set of subtitles (~1100 shows) and then used a Hadoop implementation of the LDA algorithm to identify the topics present in each show. For example, for the show "Homeland", the resulting topic scores could be: 60% terrorism, 20% psychology, 10% espionage and 10% romance.
As a final result, we have two things:
- For each TV show, a detailed information page listing the topics the show is made of and their weights.
- A content-based recommender system for TV shows: given one TV show, the system proposes the shows most similar to it.
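The recommendation step can be sketched as follows. This is only an illustration, not the project's actual code (which is a Java MapReduce pipeline): it assumes each show is represented by a topic-to-weight mapping and that similarity is measured with cosine similarity, which this README does not specify. Apart from the "Homeland" weights taken from the example above, the show names and numbers are hypothetical.

```python
import math

# Hypothetical topic distributions (topic -> weight). Only the "Homeland"
# weights come from the README example; "Show A" and "Show B" are made up.
SHOWS = {
    "Homeland": {"terrorism": 0.6, "psychology": 0.2, "espionage": 0.1, "romance": 0.1},
    "Show A": {"terrorism": 0.5, "espionage": 0.4, "romance": 0.1},
    "Show B": {"romance": 0.7, "psychology": 0.3},
}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse topic distributions.

    Assumption: the README does not name the similarity measure used by
    TVSS; cosine similarity is a common choice for topic vectors.
    """
    dot = sum(weight * q.get(topic, 0.0) for topic, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def recommend(title, shows, top_n=15):
    """Return up to top_n (title, similarity) pairs, most similar first."""
    target = shows[title]
    scored = [
        (other, cosine_similarity(target, dist))
        for other, dist in shows.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Print in the same "Title - Similarity" layout used in the results section.
    for other, score in recommend("Homeland", SHOWS):
        print(f"{other} - {score * 100:.2f} %")
```

Keeping each show as a sparse mapping means topics with zero weight never need to be stored, which matters when the number of LDA topics is large.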
===============================================================================================================================================
The project is divided into 4 parts:
- Crawling
- Pre-processing (Cleaning the data)
- Processing (LDA)
- Post-processing (Website & Recommender System)
In each subfolder you can find a README that explains how to run that part.
===============================================================================================================================================
Here is an example of results:
- Game of Thrones:
  - Top Words/Topics
  - Visuals
  - Recommendations: (Title - Similarity)
    - Crusoe - 90.4 %
    - Krod Mandoon and the Flaming Sword of Fire - 90.36 %
    - 1066 The Battle for Middle Earth - 89.49 %
    - Divine? The Series - 89.01 %
    - Kung Fu - 87.93 %
    - Roar - 86.81 %
    - Neverwhere - 85.84 %
    - The Pillars Of The Earth - 84.46 %
    - Rome - 83.69 %
    - Poltergeist The Legacy - 83.54 %
    - Atlantis - 83.27 %
    - Reign - 81.94 %
    - Ancient Rome - The Rise and Fall of an Empire - 81.49 %
    - Kings - 80.69 %
    - Thor & Loki Blood Brothers - 80.58 %
- The Simpsons:
  - Recommendations: (Title - Similarity)
    - Men Behaving Badly - 99.39 %
    - Beavis and Butt-Head - 99.22 %
    - The Cleveland Show - 99.03 %
    - Family Guy - 98.04 %
    - The Penguins Of Madagascar - 97.59 %
    - Robot Chicken - 97.45 %
    - American Dad! - 97.32 %
    - Futurama - 97.28 %
    - My Name Is Earl - 95.2 %
    - South Park - 95.2 %
    - Raising Hope - 94.98 %
    - Neighbors From Hell - 94.75 %
    - The Ren & Stimpy Show - 93.94 %
    - Clerks - 93.63 %
    - Key And Peele - 93.46 %
===============================================================================================================================================