Implementation of HITS algorithm in Hive, Spark and SparkSQL

Description

This project implements the HITS algorithm on Wikipedia data in three different distributed computing environments: Hive, Spark and SparkSQL. The goal is to analyze the links between articles to determine the most important Wikipedia pages.

Data

The dataset consists of two files:

  • Titles: A list of titles of Wikipedia articles (one title per row)
  • Links: A list of links, one source page per row, in the format "from1: to11 to12 ..." (the source page followed by the pages it links to)
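As a minimal sketch (in plain Python, whereas the repo's actual jobs run in Hive/Spark), each row of the Links file can be expanded into (source, target) edge pairs; the function name `parse_links` is illustrative, not from the repo:

```python
def parse_links(line):
    """Parse one row of the Links file, "from1: to11 to12 ...",
    into a list of (source, target) edge pairs."""
    src, _, targets = line.partition(":")
    return [(src.strip(), t) for t in targets.split()]

# Example: one row produces one edge per outgoing link.
edges = parse_links("PageA: PageB PageC")
```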

Methods

Two scores were computed for each page, the authority score and the hub score (both initialized to 1). A good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.

After initialization, the process consists of:

  • Updating the authority scores: auth(p) = Σ hub(q), summed over all pages q that link to p

  • Updating the hub scores: hub(p) = Σ auth(q), summed over all pages q that p links to

  • Normalizing the authority scores: auth(p) ← auth(p) / N_a,

    where N_a = sqrt(Σ auth(p)²), summed over all pages p

  • Normalizing the hub scores: hub(p) ← hub(p) / N_h,

    where N_h = sqrt(Σ hub(p)²), summed over all pages p

  • Iterating until the scores converge (or for a fixed number of iterations)
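The steps above can be sketched in plain Python (the repo's actual implementations run in Hive, Spark and SparkSQL; the function name `hits` and the iteration count are assumptions for illustration):

```python
import math
from collections import defaultdict

def hits(edges, iterations=20):
    """Run the HITS iteration on a list of (source, target) edges.

    Returns (auth, hub): dicts mapping each page to its score."""
    pages = set()
    out_links = defaultdict(list)   # page -> pages it points to
    in_links = defaultdict(list)    # page -> pages pointing to it
    for src, dst in edges:
        pages.update((src, dst))
        out_links[src].append(dst)
        in_links[dst].append(src)

    # Both scores initialized to 1, as described above.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub update: sum the (new) authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize each score vector by its Euclidean norm.
        n_a = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        n_h = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / n_a for p, v in auth.items()}
        hub = {p: v / n_h for p, v in hub.items()}
    return auth, hub
```

On a toy graph where two pages both link to a third, that third page ends up with the highest authority score, while the two linkers score as hubs.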

References

  • Maria Daltayanni - Distributed Computing - Course notes