Skip to content

External Browser Extractor Processor for heritrix3. Execute an external browser via command line and parse JSON results

Notifications You must be signed in to change notification settings

adam-miller/ExternalBrowserExtractorHTML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ExternalBrowserExtractorHTML

External Browser Extractor Processor for heritrix3. Execute an external browser via command line and parse JSON results

Build: mvn -Dmaven.test.skip=false clean install Result: target/ExternalBrowserExtractorHTML-0.1.jar

Configuration crawler-beans.cxml

Define your extractor bean (You can put this near the settings for the other extractors: extractorHTML, extractorCSS, etc). The optional proxy mode configuration is shown below:

<bean id="browserExtractorHtml" class="org.archive.modules.extractor.ExternalBrowserExtractorHtml" >
  <property name="executionString" value="nice -n4 /phantomjs/phantomjs-1.6.1-linux-i686-dynamic/bin/phantomjs --proxy=127.0.0.1:3128 /phantomjs/phantomBrowserExtractor/phantomBrowserExtractor.js --userAgent _USERAGENT_ --url _URI_ 2>/dev/null"/>
  <property name="shouldBrowserReload" value="True" />
  <property name="commandTimeout" value="30000" />
  <!-- <property name="treatFramesAsEmbedLinks" value="true" /> -->
  <!-- <property name="treatXHRAsEmbedLinks" value="true" /> -->
</bean>

Add it to the processor chain: inside Add it before the extractorHTML processor:

<!-- ...extract outlinks from HTML content via external browser process... -->
<ref bean="browserExtractorHtml"/>

When using a caching proxy like Squid with the browser, you will want to direct H3 to request the discovered items through the proxy. Create a sheet assocation with the configuration shown below. The ViaContextMatchesRegexDecide rule is included in this package:

<bean class='org.archive.crawler.spring.DecideRuledSheetAssociation'>
  <property name='rules'>
    <bean class="org.archive.modules.deciderules.ViaContextMatchesRegexDecideRule">
      <property name="regex" value=".*BROWSER_MISC.*" />
    </bean>
  </property>
  <property name='targetSheetNames'>
   <list>
    <value>cacheProxy</value>
   </list>
  </property>
</bean>
<bean id='cacheProxy' class='org.archive.spring.Sheet'>
  <property name='map'>
    <map>
      <entry key='fetchHttp.httpProxyHost' value='127.0.0.1' />
      <entry key='fetchHttp.httpProxyPort' value='3128' />
    </map>
  </property>
</bean>

About

External Browser Extractor Processor for heritrix3. Execute an external browser via command line and parse JSON results

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages