Implementation of HITS algorithm in Hive, Spark and SparkSQL

Description

This project implements the HITS algorithm on Wikipedia data in three different distributed computing environments: Hive, Spark and SparkSQL. The goal is to analyze the links between articles to determine the most important Wikipedia pages.

Data

The dataset consists of two files:

  • Titles: A list of titles of Wikipedia articles (one title per row)
  • Links: A list of links, one source page per row, in the format "from1: to11 to12 ..." (the source page followed by the pages it links to)
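As a minimal sketch (in plain Python, whereas the repo's actual jobs run in Hive/Spark), each row of the Links file can be expanded into (source, target) edge pairs; the function name `parse_links` is illustrative, not from the repo:

```python
def parse_links(line):
    """Parse one row of the Links file, "from1: to11 to12 ...",
    into a list of (source, target) edge pairs."""
    src, _, targets = line.partition(":")
    return [(src.strip(), t) for t in targets.split()]

# Example: one row produces one edge per outgoing link.
edges = parse_links("PageA: PageB PageC")
```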

Methods

Two scores were computed for each page, the authority score and the hub score (both initialized to 1). A good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.

After initialization, the process consists of:

  • Updating the authority scores: auth(p) = Σ hub(q), summed over all pages q that link to p

  • Updating the hub scores: hub(p) = Σ auth(q), summed over all pages q that p links to

  • Normalizing the authority scores: auth(p) ← auth(p) / N_a,

    where N_a = sqrt(Σ auth(p)²), summed over all pages p

  • Normalizing the hub scores: hub(p) ← hub(p) / N_h,

    where N_h = sqrt(Σ hub(p)²), summed over all pages p

  • Iterating until the scores converge (or for a fixed number of iterations)
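The steps above can be sketched in plain Python (the repo's actual implementations run in Hive, Spark and SparkSQL; the function name `hits` and the iteration count are assumptions for illustration):

```python
import math
from collections import defaultdict

def hits(edges, iterations=20):
    """Run the HITS iteration on a list of (source, target) edges.

    Returns (auth, hub): dicts mapping each page to its score."""
    pages = set()
    out_links = defaultdict(list)   # page -> pages it points to
    in_links = defaultdict(list)    # page -> pages pointing to it
    for src, dst in edges:
        pages.update((src, dst))
        out_links[src].append(dst)
        in_links[dst].append(src)

    # Both scores initialized to 1, as described above.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub update: sum the (new) authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize each score vector by its Euclidean norm.
        n_a = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        n_h = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / n_a for p, v in auth.items()}
        hub = {p: v / n_h for p, v in hub.items()}
    return auth, hub
```

On a toy graph where two pages both link to a third, that third page ends up with the highest authority score, while the two linkers score as hubs.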

References

  • Maria Daltayanni - Distributed Computing - Course notes