Rotakka is a distributed cluster application designed for scalable Twitter crawling. Its main advantage is that it avoids IP-based blocking by exploiting publicly available web proxies. In contrast to API-based approaches, Rotakka uses browser emulation enabled by Selenium to visit and download Twitter user profiles. It is built on the Akka framework and consists of
- a proxy-collecting module,
- a proxy-checking module,
- a Twitter-crawling module,
- and a graph-storing module.
- Java 8
- Maven
- a working Selenium driver, in our case:
- an installed Google Chrome or Chromium browser
- a downloaded chromedriver binary of the same version as the Chrome browser
- the following environment variables must be set (see the example export commands after this list):
- CHROME_DRIVER_PATH
- our value: /usr/bin/chromedriver
- CHROME_BINARY_PATH
- our value: /usr/bin/google-chrome-stable
- CHROME_HEADLESS_MODE
- on servers: true
- for visual development: false
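A minimal example of setting these variables in a shell before starting Rotakka, using the values above (adjust the paths to your installation):

```sh
export CHROME_DRIVER_PATH=/usr/bin/chromedriver
export CHROME_BINARY_PATH=/usr/bin/google-chrome-stable
export CHROME_HEADLESS_MODE=true   # use false for visual development
```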
For further instructions, have a look at the scripts in the "deployment" directory.
mvn package
The JAR will be created in the "target" directory.
java [-Drotakka.config.parameter="whatever"] -jar rotakka-1.0.jar <master|slave>
Multiple config parameters can be added, each prefixed with "-D".
The command must end with either "master" or "slave"; otherwise, the help text is printed.
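For example, to start a master node and then a slave node connecting to it on the same machine (the "-mh" option passes the master's host, as also used in the run configuration below):

```sh
java -jar rotakka-1.0.jar master
java -jar rotakka-1.0.jar slave -mh 127.0.0.1
```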
Just import the project as a Maven project.
- Master run configuration:
  - ProgramArguments="master"
  - EnvironmentVariables=... (set them as mentioned above)
- Slave run configuration:
  - ProgramArguments="slave -mh 127.0.0.1"
  - EnvironmentVariables=... (set them as mentioned above)
All other parameters regarding the system can be found in the rotakka.conf file within the resource folder. A description of each parameter can be found in the config itself.
As mentioned above, Rotakka is split into several parts. In this section, we will examine each package and explain the most important facts.
On the top level, there are several files associated with starting Rotakka. Most importantly, the `MainApp` class is responsible for starting the system.
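A minimal sketch of how such an Akka-based entry point might look. This is illustrative only and not the actual `MainApp`; the actor system name and the role-specific setup are assumptions:

```java
import akka.actor.ActorSystem;
import com.typesafe.config.ConfigFactory;

public class MainAppSketch {
    public static void main(String[] args) {
        // Print usage if no valid role argument was supplied.
        if (args.length == 0 || !(args[0].equals("master") || args[0].equals("slave"))) {
            System.out.println("Usage: java -jar rotakka-1.0.jar master|slave [options]");
            return;
        }
        // Load the HOCON configuration (rotakka.conf) and start the actor system.
        ActorSystem system = ActorSystem.create("rotakka", ConfigFactory.load("rotakka"));
        if (args[0].equals("master")) {
            // master role: start scheduling and coordination actors here
        } else {
            // slave role: start worker actors and join the cluster at the given master host
        }
    }
}
```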
Within this package we have the `ClusterListener` and the `MetricsListener`. Both actors are mostly used for logging and for extracting results, such as Total Tweets, from the logs. It is important to note that these are not cluster singletons but exist on each node, so the outputs have to be aggregated manually across the different nodes to get a complete picture.
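As an illustration of this per-node listener pattern, a classic Akka cluster listener typically subscribes to cluster membership events and logs them. The sketch below shows the general shape, not Rotakka's actual listener:

```java
import akka.actor.AbstractLoggingActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;

public class ClusterListenerSketch extends AbstractLoggingActor {
    private final Cluster cluster = Cluster.get(getContext().getSystem());

    @Override
    public void preStart() {
        // Subscribe to membership events so every node logs joins and removals.
        cluster.subscribe(getSelf(), ClusterEvent.MemberUp.class, ClusterEvent.MemberRemoved.class);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(ClusterEvent.MemberUp.class, e -> log().info("Member up: {}", e.member()))
            .match(ClusterEvent.MemberRemoved.class, e -> log().info("Member removed: {}", e.member()))
            .build();
    }
}
```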
This package implements the Graph Building and Storing. It will not be further explained here because it has a separate and very detailed README.
This part of the codebase is responsible for crawling public proxies as well as for checking whether these proxies fulfil the quality requirements which we impose on them. It includes both the `checking` package and the `crawling` package, as well as some data classes. While both packages contain the actors already known from the paper, the `crawling` package also includes the code specific to the public proxy websites.
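As an illustration of what such a proxy quality check can look like, the sketch below issues a request through a candidate proxy and treats a timely, successful response as "usable". The target URL, timeout, and pass/fail criterion are assumptions and may differ from the actual checks in the `checking` package:

```java
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

public class ProxyCheckSketch {
    /** Returns true if the proxy answers a simple HTTP request within the timeout. */
    static boolean isUsable(String host, int port, int timeoutMs) {
        try {
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("https://twitter.com").openConnection(proxy);
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            return conn.getResponseCode() < 400;
        } catch (Exception e) {
            // Any connection failure or timeout disqualifies the proxy.
            return false;
        }
    }
}
```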
This package is responsible for crawling Twitter. It includes both the scheduler and the worker.
This package includes several utility classes which are used throughout the project. Most importantly, it contains the code to start the Selenium WebDriver.
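A minimal sketch of how a ChromeDriver can be started from the environment variables described above. This is illustrative only; the project's actual driver setup may differ:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class WebDriverFactorySketch {
    static WebDriver createDriver() {
        // Point Selenium at the chromedriver binary configured via the environment.
        System.setProperty("webdriver.chrome.driver", System.getenv("CHROME_DRIVER_PATH"));

        ChromeOptions options = new ChromeOptions();
        options.setBinary(System.getenv("CHROME_BINARY_PATH"));
        if (Boolean.parseBoolean(System.getenv("CHROME_HEADLESS_MODE"))) {
            // Run without a visible browser window, e.g. on servers.
            options.addArguments("--headless");
        }
        return new ChromeDriver(options);
    }
}
```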
Rotakka is a powerful system and can be used to scrape huge amounts of data within a short time frame. We encourage any potential user to comply with the limitations set by the service which they intend to crawl. Most of these limitations can be found in the Terms of Service. We are not responsible for any damage caused by the misuse of our system.