
Commit

Merge pull request #45 from peterbencze/development
Serritor 2.0.0
peterbencze authored May 30, 2019
2 parents 10ed6f8 + a23b29e commit eaac224
Showing 88 changed files with 8,052 additions and 1,656 deletions.
45 changes: 26 additions & 19 deletions README.md
@@ -1,7 +1,9 @@
Serritor
========

-Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/) and written in Java. It can be used to crawl dynamic web pages that use JavaScript.
+Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/)
+and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render
+data.

## Using Serritor in your build
### Maven
@@ -11,45 +13,47 @@ Add the following dependency to your pom.xml:
<dependency>
<groupId>com.github.peterbencze</groupId>
<artifactId>serritor</artifactId>
-<version>1.6.0</version>
+<version>2.0.0</version>
</dependency>
```

### Gradle

Add the following dependency to your build.gradle:
```groovy
-compile group: 'com.github.peterbencze', name: 'serritor', version: '1.6.0'
+compile group: 'com.github.peterbencze', name: 'serritor', version: '2.0.0'
```

### Manual dependencies

-The standalone JAR files are available on the [releases](https://github.com/peterbencze/serritor/releases) page.
+The standalone JAR files are available on the
+[releases](https://github.com/peterbencze/serritor/releases) page.

## Documentation
* The [Wiki](https://github.com/peterbencze/serritor/wiki) contains usage information and examples
* The Javadoc is available [here](https://peterbencze.github.io/serritor/)

## Quickstart
-The `BaseCrawler` abstract class provides a skeletal implementation of a crawler to minimize the effort to create your own. The extending class should define the logic of the crawler.
+The `Crawler` abstract class provides a skeletal implementation of a crawler to minimize the effort
+to create your own. The extending class should implement the logic of the crawler.

Below you can find a simple example that is enough to get you started:
```java
-public class MyCrawler extends BaseCrawler {
+public class MyCrawler extends Crawler {

private final UrlFinder urlFinder;

public MyCrawler(final CrawlerConfiguration config) {
super(config);

-// Extract URLs from links on the crawled page
+// A helper class that is intended to make it easier to find URLs on web pages
urlFinder = UrlFinder.createDefault();
}

@Override
-protected void onPageLoad(final PageLoadEvent event) {
-// Crawl every URL that match the given pattern
-urlFinder.findUrlsInPage(event)
+protected void onResponseSuccess(final ResponseSuccessEvent event) {
+// Crawl every URL found on the page
+urlFinder.findUrlsInPage(event.getCompleteCrawlResponse())
.stream()
.map(CrawlRequest::createDefault)
.forEach(this::crawl);
@@ -58,38 +62,40 @@ public class MyCrawler extends BaseCrawler {
}
}
```
-By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
+By default, the crawler uses the [HtmlUnit](http://htmlunit.sourceforge.net/) headless browser:
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
-.setOffsiteRequestFiltering(true)
+.setOffsiteRequestFilterEnabled(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

-// Start it
+// Start crawling with HtmlUnit
crawler.start();
```
-Of course, you can also use any other browsers by specifying a corresponding `WebDriver` instance:
+Of course, you can also use other browsers. Currently Chrome and Firefox are supported.
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
-.setOffsiteRequestFiltering(true)
+.setOffsiteRequestFilterEnabled(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

-// Start it
-crawler.start(new ChromeDriver());
+// Start crawling with Chrome
+crawler.start(Browser.CHROME);
```

-That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the `WebDriver` instance, so you can use all the features that are provided by Selenium.
+That's it! In just a few lines you can create a crawler that crawls every link it finds, while
+filtering duplicate and offsite requests. You also get access to the `WebDriver`, so you can use
+all the features that are provided by Selenium.
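The duplicate and offsite request filtering mentioned above can be pictured with a minimal plain-Java sketch. This is not Serritor's actual implementation; the class and method names below are hypothetical stand-ins that only illustrate the idea behind the two filters:

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; Serritor's real filters live inside the framework.
class RequestFilterSketch {

    private final Set<String> visited = new HashSet<>();
    private final Set<String> allowedDomains;

    RequestFilterSketch(final Set<String> allowedDomains) {
        this.allowedDomains = allowedDomains;
    }

    // A URL is crawled only if it is on an allowed domain (offsite filter)
    // and has not been seen before (duplicate filter).
    boolean shouldCrawl(final String url) {
        String host = URI.create(url).getHost();
        boolean onsite = host != null && allowedDomains.stream()
                .anyMatch(domain -> host.equals(domain) || host.endsWith("." + domain));
        return onsite && visited.add(url); // Set.add returns false for duplicates
    }

    public static void main(final String[] args) {
        RequestFilterSketch filter = new RequestFilterSketch(Set.of("example.com"));
        System.out.println(filter.shouldCrawl("http://example.com/a")); // true
        System.out.println(filter.shouldCrawl("http://example.com/a")); // false: duplicate
        System.out.println(filter.shouldCrawl("http://other.org/b"));   // false: offsite
    }
}
```

With `setOffsiteRequestFilterEnabled(true)` and the allowed crawl domains set in the configuration, Serritor applies this kind of check for you automatically.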

## Support
If this framework helped you in any way, or you would like to support the development:
@@ -99,4 +105,5 @@ If this framework helped you in any way, or you would like to support the develo
Any amount you choose to give will be greatly appreciated.

## License
-The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+The source code of Serritor is made available under the
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
80 changes: 66 additions & 14 deletions pom.xml
@@ -3,7 +3,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>com.github.peterbencze</groupId>
<artifactId>serritor</artifactId>
-<version>1.6.0</version>
+<version>2.0.0</version>
<packaging>jar</packaging>

<name>Serritor</name>
@@ -54,17 +54,63 @@
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
-<version>3.14.0</version>
+<version>3.141.59</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>htmlunit-driver</artifactId>
-<version>2.33.0</version>
+<version>2.35.1</version>
</dependency>
+<dependency>
+<groupId>net.lightbody.bmp</groupId>
+<artifactId>browsermob-core</artifactId>
+<version>2.1.5</version>
+</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
-<version>27.0-jre</version>
+<version>27.1-jre</version>
</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-server</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-servlet</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty</groupId>
+<artifactId>jetty-servlets</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>org.eclipse.jetty.websocket</groupId>
+<artifactId>websocket-server</artifactId>
+<version>9.4.18.v20190429</version>
+</dependency>
+<dependency>
+<groupId>com.fasterxml.jackson.datatype</groupId>
+<artifactId>jackson-datatype-jdk8</artifactId>
+<!-- browsermob-core depends on 2.8.9, do not upgrade version -->
+<version>2.8.9</version>
+</dependency>
+<dependency>
+<groupId>org.slf4j</groupId>
+<artifactId>slf4j-api</artifactId>
+<version>1.7.26</version>
+</dependency>
+<dependency>
+<groupId>com.auth0</groupId>
+<artifactId>java-jwt</artifactId>
+<version>3.8.0</version>
+</dependency>
+<dependency>
+<groupId>org.mindrot</groupId>
+<artifactId>jbcrypt</artifactId>
+<version>0.4</version>
+</dependency>
<dependency>
<groupId>junit</groupId>
@@ -79,15 +125,21 @@
<scope>test</scope>
</dependency>
<dependency>
-<groupId>net.lightbody.bmp</groupId>
-<artifactId>browsermob-core</artifactId>
-<version>2.1.5</version>
+<groupId>com.github.tomakehurst</groupId>
+<artifactId>wiremock-jre8-standalone</artifactId>
+<version>2.23.2</version>
<scope>test</scope>
</dependency>
<dependency>
-<groupId>com.github.tomakehurst</groupId>
-<artifactId>wiremock</artifactId>
-<version>2.19.0</version>
+<groupId>org.awaitility</groupId>
+<artifactId>awaitility</artifactId>
+<version>3.1.6</version>
<scope>test</scope>
</dependency>
+<dependency>
+<groupId>net.jodah</groupId>
+<artifactId>failsafe</artifactId>
+<version>2.0.1</version>
+<scope>test</scope>
+</dependency>
</dependencies>
@@ -97,7 +149,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
-<version>3.0.1</version>
+<version>3.1.0</version>
<executions>
<execution>
<id>attach-source</id>
@@ -110,7 +162,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
-<version>3.0.1</version>
+<version>3.1.0</version>
<executions>
<execution>
<id>attach-javadoc</id>
@@ -134,7 +186,7 @@
<dependency>
<groupId>com.puppycrawl.tools</groupId>
<artifactId>checkstyle</artifactId>
-<version>8.14</version>
+<version>8.20</version>
</dependency>
</dependencies>
<configuration>
@@ -152,7 +204,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-failsafe-plugin</artifactId>
-<version>2.22.1</version>
+<version>2.22.2</version>
<configuration>
<argLine>-Djdk.net.URLClassPath.disableClassPathURLCheck=true</argLine>
</configuration>