Rotakka is a distributed Akka cluster application designed for scalable Twitter crawling. It avoids IP-based blocking by exploiting public web proxies.

Miroka96/Rotakka

Rotakka

Rotakka is a distributed cluster application designed for scalable Twitter crawling. Its main advantage is that it avoids IP-based blocking by exploiting publicly available web proxies. In contrast to API-based approaches, Rotakka uses browser emulation enabled by Selenium to visit and download Twitter user profiles. It is built on the Akka framework and consists of

  • a proxy-collecting module,
  • a proxy-checking module,
  • a Twitter-crawling module,
  • and a graph-storing module.

Requirements

  • Java 8
  • Maven
  • a working Selenium driver, in our case:
    • an installed Google Chrome or Chromium browser
    • a downloaded chromedriver binary of the same version as the Chrome browser
  • the following environment variables must be set:
    • CHROME_DRIVER_PATH
      • our value: /usr/bin/chromedriver
    • CHROME_BINARY_PATH
      • our value: /usr/bin/google-chrome-stable
    • CHROME_HEADLESS_MODE
      • on servers: true
      • for visual development: false
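
As a rough illustration of how the three variables above could be consumed, the following sketch reads and validates them before a browser is started. The class ChromeEnv and its methods are hypothetical and not part of Rotakka; treating CHROME_HEADLESS_MODE as a "true"/"false" flag is an assumption based on the values listed above.

```java
import java.util.Map;

// Hypothetical helper (not part of Rotakka): collects the Chrome-related
// environment variables listed in the requirements.
final class ChromeEnv {
    final String driverPath;
    final String binaryPath;
    final boolean headless;

    ChromeEnv(String driverPath, String binaryPath, boolean headless) {
        this.driverPath = driverPath;
        this.binaryPath = binaryPath;
        this.headless = headless;
    }

    // Builds the settings from an environment map such as System.getenv().
    static ChromeEnv fromEnvironment(Map<String, String> env) {
        String driver = require(env, "CHROME_DRIVER_PATH");
        String binary = require(env, "CHROME_BINARY_PATH");
        // Assumption: any value other than "true" (case-insensitive) means non-headless.
        boolean headless = Boolean.parseBoolean(require(env, "CHROME_HEADLESS_MODE"));
        return new ChromeEnv(driver, binary, headless);
    }

    // Fails early with a readable message when a variable is missing or empty.
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Environment variable " + name + " must be set");
        }
        return value.trim();
    }

    public static void main(String[] args) {
        try {
            ChromeEnv env = fromEnvironment(System.getenv());
            System.out.println("driver=" + env.driverPath
                    + ", binary=" + env.binaryPath
                    + ", headless=" + env.headless);
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Checking the variables up front like this reports a missing setting before any Selenium driver is launched, instead of as a later driver start-up error.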

For further instructions, have a look at the scripts in the "deployment" directory.

Usage

Building a Fat-JAR

mvn package

The JAR is created in the "target" directory.

Running the Fat-JAR

java [-Drotakka.config.parameter="whatever"] -jar rotakka-1.0.jar

Multiple config parameters can be passed, each prefixed with "-D".

The command above must end with either "master" or "slave"; otherwise the help text is printed.
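
Flags passed with "-D" become JVM system properties. How Rotakka reads them internally is not described here, so the following is only a minimal sketch of the general mechanism: a system property, when present, takes precedence over a default value. The class ConfigOverride and the default string are hypothetical; the key rotakka.config.parameter is the placeholder from the command above.

```java
import java.util.Properties;

// Hypothetical illustration (not Rotakka code): a -D flag overrides a default.
final class ConfigOverride {

    // Returns the value set via -D<key>=..., or the given default if absent.
    static String resolve(Properties systemProperties, String key, String defaultValue) {
        return systemProperties.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        // Run with: java -Drotakka.config.parameter="whatever" ConfigOverride
        String value = resolve(System.getProperties(),
                "rotakka.config.parameter", "default-from-config-file");
        System.out.println("rotakka.config.parameter = " + value);
    }
}
```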

Developing with IntelliJ

Just import the project as a Maven project.

Master configuration

  • ProgramArguments="master"
  • EnvironmentVariables=... (set them as mentioned above)

Slave configuration

  • ProgramArguments="slave -mh 127.0.0.1"
  • EnvironmentVariables=... (set them as mentioned above)
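
The two run configurations differ only in their program arguments. The sketch below shows one way such arguments could be interpreted. It is not Rotakka's actual parser: the class LaunchArgs is hypothetical, and reading "-mh" as the master host is an assumption inferred from the example value 127.0.0.1.

```java
// Hypothetical argument parser (not Rotakka code) for "master" and "slave -mh <host>".
final class LaunchArgs {
    enum Role { MASTER, SLAVE, HELP }

    final Role role;
    final String masterHost; // null unless "-mh <host>" was given

    LaunchArgs(Role role, String masterHost) {
        this.role = role;
        this.masterHost = masterHost;
    }

    static LaunchArgs parse(String[] args) {
        // Without a role the README says the help is printed.
        if (args.length == 0) {
            return new LaunchArgs(Role.HELP, null);
        }
        Role role;
        if ("master".equals(args[0])) {
            role = Role.MASTER;
        } else if ("slave".equals(args[0])) {
            role = Role.SLAVE;
        } else {
            return new LaunchArgs(Role.HELP, null);
        }
        // Assumption: "-mh" is followed by the host of the master node.
        String host = null;
        for (int i = 1; i + 1 < args.length; i++) {
            if ("-mh".equals(args[i])) {
                host = args[i + 1];
            }
        }
        return new LaunchArgs(role, host);
    }

    public static void main(String[] args) {
        LaunchArgs parsed = parse(args);
        System.out.println("role=" + parsed.role + ", masterHost=" + parsed.masterHost);
    }
}
```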

Useful Config Parameters

All other parameters regarding the system can be found in the rotakka.conf file within the resource folder. Each parameter is described in the config file itself.

Project Structure

As mentioned above, Rotakka is split into several parts. In this section, we will examine each package and explain the most important facts.

Top Level Files

On the top level there are several files associated with starting Rotakka. Most importantly, we see the MainApp class which is responsible for starting the system.

Cluster

This package contains the ClusterListener and the MetricsListener. Both actors are used mainly for logging, so that results such as Total Tweets can be extracted from the logs. Note that they are not cluster singletons; an instance exists on each node. The outputs therefore have to be aggregated manually across the nodes to get a complete picture.
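
The manual aggregation could look roughly like the sketch below. Two assumptions are made that the source does not confirm: the log line format "Total Tweets: <number>" is hypothetical, and each node is assumed to log a running total, so that only its last matching line counts. Adapt the pattern to the actual log output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical aggregation helper (not Rotakka code).
final class TweetTotals {
    // Assumed log format; the real MetricsListener output may differ.
    static final Pattern TOTAL = Pattern.compile("Total Tweets: (\\d+)");

    // Returns the last total found in one node's log, or 0 if none is present.
    static long lastTotal(List<String> nodeLog) {
        long last = 0;
        for (String line : nodeLog) {
            Matcher m = TOTAL.matcher(line);
            if (m.find()) {
                last = Long.parseLong(m.group(1));
            }
        }
        return last;
    }

    // Sums the per-node totals to obtain a cluster-wide figure.
    static long clusterTotal(List<List<String>> logsPerNode) {
        long sum = 0;
        for (List<String> log : logsPerNode) {
            sum += lastTotal(log);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> nodeA = Arrays.asList("INFO Total Tweets: 10", "INFO Total Tweets: 25");
        List<String> nodeB = Arrays.asList("INFO node up", "INFO Total Tweets: 7");
        System.out.println(clusterTotal(Arrays.asList(nodeA, nodeB))); // prints 32
    }
}
```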

Graph

This package implements the Graph Building and Storing. It will not be further explained here because it has a separate and very detailed README.

Proxy

This part of the codebase is responsible for crawling public proxies and for checking whether these proxies fulfil the quality requirements we impose on them. It contains the checking package, the crawling package, and some data classes. Both packages contain the actors already known from the paper; the crawling package additionally includes the code specific to the public proxy websites.

Twitter

This package is responsible for crawling Twitter. It includes both the scheduler and the worker.

Utils

This package includes several utility classes which are used throughout the project. Most importantly, it also includes the code to start the Selenium WebDriver.

Disclaimer

Rotakka is a powerful system and can be used to scrape huge amounts of data within a short time frame. We encourage any potential user to comply with the limitations set by the service which they intend to crawl. Most of these limitations can be found in the Terms of Service. We are not responsible for any damage caused by the misuse of our system.
