
Commit

Merge pull request #45 from peterbencze/development
Serritor 2.0.0
peterbencze authored May 30, 2019
2 parents 10ed6f8 + a23b29e commit eaac224
Showing 88 changed files with 8,052 additions and 1,656 deletions.
45 changes: 26 additions & 19 deletions README.md
@@ -1,7 +1,9 @@
Serritor
========

-Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/) and written in Java. It can be used to crawl dynamic web pages that use JavaScript.
+Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/)
+and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render
+data.

## Using Serritor in your build
### Maven
@@ -11,45 +13,47 @@ Add the following dependency to your pom.xml:
<dependency>
<groupId>com.github.peterbencze</groupId>
<artifactId>serritor</artifactId>
-<version>1.6.0</version>
+<version>2.0.0</version>
</dependency>
```

### Gradle

Add the following dependency to your build.gradle:
```groovy
-compile group: 'com.github.peterbencze', name: 'serritor', version: '1.6.0'
+compile group: 'com.github.peterbencze', name: 'serritor', version: '2.0.0'
```

### Manual dependencies

-The standalone JAR files are available on the [releases](https://github.com/peterbencze/serritor/releases) page.
+The standalone JAR files are available on the
+[releases](https://github.com/peterbencze/serritor/releases) page.

## Documentation
* The [Wiki](https://github.com/peterbencze/serritor/wiki) contains usage information and examples
* The Javadoc is available [here](https://peterbencze.github.io/serritor/)

## Quickstart
-The `BaseCrawler` abstract class provides a skeletal implementation of a crawler to minimize the effort to create your own. The extending class should define the logic of the crawler.
+The `Crawler` abstract class provides a skeletal implementation of a crawler to minimize the effort
+to create your own. The extending class should implement the logic of the crawler.

Below you can find a simple example that is enough to get you started:
```java
-public class MyCrawler extends BaseCrawler {
+public class MyCrawler extends Crawler {

private final UrlFinder urlFinder;

public MyCrawler(final CrawlerConfiguration config) {
super(config);

-// Extract URLs from links on the crawled page
+// A helper class that is intended to make it easier to find URLs on web pages
urlFinder = UrlFinder.createDefault();
}

@Override
-protected void onPageLoad(final PageLoadEvent event) {
-// Crawl every URL that match the given pattern
-urlFinder.findUrlsInPage(event)
+protected void onResponseSuccess(final ResponseSuccessEvent event) {
+// Crawl every URL found on the page
+urlFinder.findUrlsInPage(event.getCompleteCrawlResponse())
.stream()
.map(CrawlRequest::createDefault)
.forEach(this::crawl);
@@ -58,38 +62,40 @@ public class MyCrawler extends BaseCrawler {
}
}
```
-By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
+By default, the crawler uses the [HtmlUnit](http://htmlunit.sourceforge.net/) headless browser:
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
-.setOffsiteRequestFiltering(true)
+.setOffsiteRequestFilterEnabled(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

-// Start it
+// Start crawling with HtmlUnit
crawler.start();
```
-Of course, you can also use any other browsers by specifying a corresponding `WebDriver` instance:
+Of course, you can also use other browsers. Currently Chrome and Firefox are supported.
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
-.setOffsiteRequestFiltering(true)
+.setOffsiteRequestFilterEnabled(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

-// Start it
-crawler.start(new ChromeDriver());
+// Start crawling with Chrome
+crawler.start(Browser.CHROME);
```

-That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the `WebDriver` instance, so you can use all the features that are provided by Selenium.
+That's it! In just a few lines you can create a crawler that crawls every link it finds, while
+filtering duplicate and offsite requests. You also get access to the `WebDriver`, so you can use
+all the features that are provided by Selenium.
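The duplicate and offsite request filtering mentioned above can be pictured with a minimal plain-Java sketch. This is not Serritor's actual implementation; the class and method names below are hypothetical stand-ins that only illustrate the idea behind the two filters:

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; Serritor's real filters live inside the framework.
class RequestFilterSketch {

    private final Set<String> visited = new HashSet<>();
    private final Set<String> allowedDomains;

    RequestFilterSketch(final Set<String> allowedDomains) {
        this.allowedDomains = allowedDomains;
    }

    // A URL is crawled only if it is on an allowed domain (offsite filter)
    // and has not been seen before (duplicate filter).
    boolean shouldCrawl(final String url) {
        String host = URI.create(url).getHost();
        boolean onsite = host != null && allowedDomains.stream()
                .anyMatch(domain -> host.equals(domain) || host.endsWith("." + domain));
        return onsite && visited.add(url); // Set.add returns false for duplicates
    }

    public static void main(final String[] args) {
        RequestFilterSketch filter = new RequestFilterSketch(Set.of("example.com"));
        System.out.println(filter.shouldCrawl("http://example.com/a")); // true
        System.out.println(filter.shouldCrawl("http://example.com/a")); // false: duplicate
        System.out.println(filter.shouldCrawl("http://other.org/b"));   // false: offsite
    }
}
```

With `setOffsiteRequestFilterEnabled(true)` and the allowed crawl domains set in the configuration, Serritor applies this kind of check for you automatically.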

## Support
If this framework helped you in any way, or you would like to support the development:
@@ -99,4 +105,5 @@ If this framework helped you in any way, or you would like to support the develo
Any amount you choose to give will be greatly appreciated.

## License
-The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+The source code of Serritor is made available under the
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
80 changes: 66 additions & 14 deletions pom.xml
@@ -3,7 +3,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>com.github.peterbencze</groupId>
<artifactId>serritor</artifactId>
-<version>1.6.0</version>
+<version>2.0.0</version>
<packaging>jar</packaging>

<name>Serritor</name>
@@ -54,17 +54,63 @@
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
-<version>3.14.0</version>
+<version>3.141.59</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>htmlunit-driver</artifactId>
-<version>2.33.0</version>
+<version>2.35.1</version>
</dependency>
+<dependency>
+<groupId>net.lightbody.bmp</groupId>
+<artifactId>browsermob-core</artifactId>
+<version>2.1.5</version>
+</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
-<version>27.0-jre</version>
+<version>27.1-jre</version>
</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-server</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-servlet</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-servlets</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty.websocket</groupId>
+<artifactId>websocket-server</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>com.fasterxml.jackson.datatype</groupId>
+<artifactId>jackson-datatype-jdk8</artifactId>
+<!-- browsermob-core depends on 2.8.9, do not upgrade version -->
+<version>2.8.9</version>
+</dependency>
+<dependency>
+<groupId>org.slf4j</groupId>
+<artifactId>slf4j-api</artifactId>
+<version>1.7.26</version>
+</dependency>
+<dependency>
+<groupId>com.auth0</groupId>
+<artifactId>java-jwt</artifactId>
+<version>3.8.0</version>
+</dependency>
+<dependency>
+<groupId>org.mindrot</groupId>
+<artifactId>jbcrypt</artifactId>
+<version>0.4</version>
+</dependency>
<dependency>
<groupId>junit</groupId>
@@ -79,15 +125,21 @@
<scope>test</scope>
</dependency>
<dependency>
-<groupId>net.lightbody.bmp</groupId>
-<artifactId>browsermob-core</artifactId>
-<version>2.1.5</version>
+<groupId>com.github.tomakehurst</groupId>
+<artifactId>wiremock-jre8-standalone</artifactId>
+<version>2.23.2</version>
<scope>test</scope>
</dependency>
<dependency>
-<groupId>com.github.tomakehurst</groupId>
-<artifactId>wiremock</artifactId>
-<version>2.19.0</version>
+<groupId>org.awaitility</groupId>
+<artifactId>awaitility</artifactId>
+<version>3.1.6</version>
<scope>test</scope>
</dependency>
+<dependency>
+<groupId>net.jodah</groupId>
+<artifactId>failsafe</artifactId>
+<version>2.0.1</version>
+<scope>test</scope>
+</dependency>
</dependencies>
@@ -97,7 +149,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
-<version>3.0.1</version>
+<version>3.1.0</version>
<executions>
<execution>
<id>attach-source</id>
@@ -110,7 +162,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
-<version>3.0.1</version>
+<version>3.1.0</version>
<executions>
<execution>
<id>attach-javadoc</id>
@@ -134,7 +186,7 @@
<dependency>
<groupId>com.puppycrawl.tools</groupId>
<artifactId>checkstyle</artifactId>
-<version>8.14</version>
+<version>8.20</version>
</dependency>
</dependencies>
<configuration>
@@ -152,7 +204,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-failsafe-plugin</artifactId>
-<version>2.22.1</version>
+<version>2.22.2</version>
<configuration>
<argLine>-Djdk.net.URLClassPath.disableClassPathURLCheck=true</argLine>
</configuration>