Getting Crawljax

You can use Crawljax either via Maven or by downloading the zip.

Maven

If you are familiar with Maven, you can declare Crawljax as a Maven dependency in your pom.xml.

It is also possible to automatically generate a Crawljax skeleton project. The generated sample project includes the running example used in this tutorial.

For details, see this page.

Downloading Crawljax

If you are not a Maven user, you can also download Crawljax and all of its dependencies as a zip. Extract the zip file and add all the jar libraries in the “/lib” directory to your classpath.

Command-line

For running Crawljax from the command-line see this page.

Crawljax API

The JavaDoc API of the latest release is available online.

Using Crawljax Programmatically

The preferred way of configuring Crawljax is to use a simple “runner class” written in Java (see the examples package in the Crawljax code). Create a class with a main method that configures Crawljax through its configuration API. To change the default crawling behaviour, you can add a CrawlSpecification in which you configure how Crawljax should crawl your site. In this example we will crawl www.google.com:

CrawlSpecification crawler = new CrawlSpecification("http://www.google.com");
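
To see the overall shape of such a runner class up front, here is a minimal sketch that combines the calls explained in the rest of this tutorial. The class name and import paths are illustrative and may differ slightly between Crawljax versions:

import com.crawljax.core.CrawljaxController;
import com.crawljax.core.configuration.CrawlSpecification;
import com.crawljax.core.configuration.CrawljaxConfiguration;

public class GoogleCrawlRunner {

    public static void main(String[] args) throws Exception {
        // define what to crawl and how (detailed in the sections below)
        CrawlSpecification crawler = new CrawlSpecification("http://www.google.com");
        crawler.click("a");

        // wrap the specification in a configuration and start crawling
        CrawljaxConfiguration config = new CrawljaxConfiguration();
        config.setCrawlSpecification(crawler);

        CrawljaxController crawljax = new CrawljaxController(config);
        crawljax.run();
    }
}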

Clicking Elements

The general method to define which elements Crawljax should click and which it should not click is:

  1. Select a (large) set of elements to click to get good coverage. For example: often all anchor tags should be clicked.
  2. Exclude elements from the set which should NOT be clicked.

We want Crawljax to do a full exploration of the website. Crawljax should therefore click on all the anchor elements (we can define any element type as a candidate clickable in Crawljax, for example div, span, etc.).

// click all anchor elements
crawler.click("a");

When you inspect the Google page you also see buttons (Google Search and I’m Feeling Lucky), which should also be clicked:

// click the submit buttons
crawler.click("input").withAttribute("type", "submit");

If we ran Crawljax with this specification, it would click all the anchor elements and input buttons. There are often some links we don’t want to click, so we should exclude these elements. For example, we are not interested in Language Tools, so we specify that Crawljax should not click this link:

// don't click elements with the text Language Tools
crawler.dontClick("a").withText("Language Tools");

Often you want Crawljax not to click links in a certain area. In this example we do not want to click any links in the top bar (e.g. Sign out and iGoogle). With the underXPath() function you can select elements under a certain element. Here we do not wish to click the links that are under divs with id='guser':

// don't click elements in the top bar
crawler.dontClick("a").underXPath("//DIV[@id='guser']");

Now we have specified that Crawljax should click all the anchor tags and submit buttons, except the Language Tools link and the anchor elements in the top bar. Note: external links to other websites are automatically ignored by Crawljax.
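
Putting the clicking rules together, the crawl specification so far looks like this:

// click all anchor elements and submit buttons
crawler.click("a");
crawler.click("input").withAttribute("type", "submit");

// but exclude the Language Tools link and the links in the top bar
crawler.dontClick("a").withText("Language Tools");
crawler.dontClick("a").underXPath("//DIV[@id='guser']");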

Specifying Input Data

By default Crawljax enters random values into input fields. Some fields may need specific values to enable the user to proceed (e.g. a valid phone number or URL). To specify manual values you can use the InputSpecification class:

InputSpecification input = new InputSpecification();

To specify that Crawljax should enter manual values in form fields, you need to know the id or name of the input elements. Whenever Crawljax encounters a form field, it checks whether there is manual input specified for this field by checking its id and name (in that order).

On www.google.com the input field for the search query has the name “q”, for which we want to specify a value:

// when Crawljax encounters a form element with the id or name "q", enter "Crawljax"
input.field("q").setValue("Crawljax");

When you have specified all the form input values, add the InputSpecification to the CrawlSpecification:

crawler.setInputSpecification(input);

Limiting the Crawling Scope

When you are testing your crawl specification or want to limit the crawling scope/time, you can limit Crawljax:

// limit the crawling scope
crawler.setMaximumStates(5);
crawler.setDepth(2);

Running Crawljax

Crawljax needs a configuration object in which the various Crawljax settings can be set.

CrawljaxConfiguration config = new CrawljaxConfiguration();

Add the created CrawlSpecification to the configuration:

config.setCrawlSpecification(crawler);

We can specify the browser (ie, chrome, firefox, ...) that should be used for crawling:

config.setBrowser(BrowserType.firefox);

Firefox is the default browser in Crawljax (see also WebDriver browsers).

Now we are ready to run Crawljax:

CrawljaxController crawljax = new CrawljaxController(config);
crawljax.run();

The Crawljax core crawls the application and builds an internal state machine of the states and transitions; note that it does not produce any output by itself. To generate output, you can use the SaveCrawlSession plugin (see the SaveCrawlSessionExample) or write your own plugin.
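
As a rough sketch of what a custom plugin can look like, the class below hooks into the crawl and prints every newly discovered state. It assumes the OnNewStatePlugin interface and the session API of the Crawljax version this page targets; exact package and method names may differ in your release:

import com.crawljax.core.CrawlSession;
import com.crawljax.core.plugin.OnNewStatePlugin;

public class PrintNewStatePlugin implements OnNewStatePlugin {

    @Override
    public void onNewState(CrawlSession session) {
        // called every time Crawljax detects a new state; here we just log it
        System.out.println("New state found: " + session.getCurrentState().getUrl());
    }
}

Register the plugin on the configuration before starting the controller, for example with config.addPlugin(new PrintNewStatePlugin());.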

You can also check out our Advanced examples page.

Happy crawling!