The HITS algorithm is being used on the Twitter follower network to find important hubs and authorities, where good hubs are people who follow good authorities and good authorities are people who are followed by good hubs. In this real-life scenario, a good authority could be a popular music artist and a good hub could be a music lover who follows many accomplished artists.
The dataset can be viewed as a directed graph. Each node in the graph represents a Twitter user and an edge from user A to user B implies that A is a “follower” of B and B is a “friend” of A.
The graph consists of 500 nodes with edges between two nodes if one is a follower/friend of another. The graph is stored as an adjacency list the first time it is prepared but then converted to an adjacency matrix immediately (thus requiring to store a map from matrix index to user id) for repeated use with the HITS algorithm.
src (directory) – Contains python source files
/hits.py – Implements the HITS algorithm
/dataset_fetcher.py – Fetches the dataset using the Twitter API
data (directory) – Contains the structures saved after obtaining the dataset
/adj_list – Adjacency list representing the fetched dataset
/dense_link_matrix – Link matrix using non sparse representation
/sparse_link_matrix – Link matrix using sparse representation
/map – Map from user id to matrix index
/users – Users information
docs (directory) - Contains the documentation for the various components
/dataset_fetcher.html – doc for dataset_fetcher.py
/hits.html – doc for hits.py
/requirements.txt – Contains requirements to run the python code
- Download/Clone this repository
git clone https://github.com/nikhil-iyer-97/HITS-Algorithm-implementation.git
- Change working directory to the where the repository is located
cd HITS-Algorithm-implementation
- Install dependencies:
pip install -r requirements.txt
- Change working directory to
src
:
cd src
Now enter python3 hits.py
for the program to run and display outputs.
Some example outputs for hubbiness scores and authority scores for the first 30 nodes in the graph are shown below:
The change in hub score and authority score with respect to a few selected entities were measured, resulting in:
And finally, the algorithm was benchmarked on Sparse Matrix vs Normal Matrix implementations for various values of :