Initial tests with latest Heritrix
machawk1 committed Feb 10, 2019
1 parent 3de983c commit 5cf1067
Showing 136 changed files with 172 additions and 160 deletions.
2 changes: 1 addition & 1 deletion .codeclimate.yml
@@ -7,6 +7,6 @@ exclude_paths:
- "build/*"
- "support/*"
- "WAIL.spec"
-- "bundledApps/heritrix-3.2.0/*"
+- "bundledApps/heritrix-3.4.0-20190207/*"
- "bundledApps/html/*"
- "bundledApps/tomcat/*"
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@

Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.

-Tools included and accessible through the GUI are <a href="https://github.com/internetarchive/heritrix3">Heritrix 3.2.0</a> and <a href="https://github.com/iipc/openwayback">OpenWayback 2.3.2</a>. Support packages include Apache Tomcat, <a href="https://github.com/pyinstaller/pyinstaller/">pyinstaller</a>, and <a href="https://github.com/oduwsdl/memgator">MemGator</a>.
+Tools included and accessible through the GUI are <a href="https://github.com/internetarchive/heritrix3">Heritrix heritrix-3.4.0-20190207</a> and <a href="https://github.com/iipc/openwayback">OpenWayback 2.3.2</a>. Support packages include Apache Tomcat, <a href="https://github.com/pyinstaller/pyinstaller/">pyinstaller</a>, and <a href="https://github.com/oduwsdl/memgator">MemGator</a>.

WAIL is written in Python and compiled to a native executable using <a href="http://www.pyinstaller.org/">PyInstaller</a>.

288 changes: 151 additions & 137 deletions bundledApps/HeritrixJob.py

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions bundledApps/WAILConfig.py
@@ -182,7 +182,7 @@
if 'darwin' in sys.platform: # OS X Specific Code here
# This should be dynamic but doesn't work with WAIL binary
wailPath = "/Applications/WAIL.app"
-heritrixPath = wailPath + "/bundledApps/heritrix-3.2.0/"
+heritrixPath = wailPath + "/bundledApps/heritrix-3.4.0-20190207/"
heritrixBinPath = "sh " + heritrixPath+"bin/heritrix"
heritrixJobPath = heritrixPath + "jobs/"
fontSize = 10
@@ -215,7 +215,7 @@

aboutWindow_iconPath = wailPath + aboutWindow_iconPath

-heritrixPath = wailPath + "\\bundledApps\\heritrix-3.2.0\\"
+heritrixPath = wailPath + "\\bundledApps\\heritrix-3.4.0-20190207\\"
heritrixBinPath = heritrixPath + "\\bin\\heritrix.cmd"
heritrixJobPath = heritrixPath + "\\jobs\\"
tomcatPath = wailPath + "\\bundledApps\\tomcat"
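
Review note: the Heritrix directory name is repeated as a literal in `.codeclimate.yml`, `README.md`, and both platform branches of `WAILConfig.py`, so each upgrade touches every copy. A minimal sketch of one way the Python side could derive its paths from a single constant follows. `HERITRIX_DIR_NAME` and `heritrix_paths` are hypothetical names for illustration and are not part of this commit.

```python
import ntpath
import posixpath

# Hypothetical constant: the commit hard-codes this string in each path instead.
HERITRIX_DIR_NAME = "heritrix-3.4.0-20190207"


def heritrix_paths(wail_path, windows=False):
    """Build the three Heritrix settings from one version constant.

    Illustrative helper only; the key names mirror the variables that
    WAILConfig.py assigns in the diff above.
    """
    # ntpath/posixpath let either platform's separators be produced anywhere.
    pathmod = ntpath if windows else posixpath
    root = pathmod.join(wail_path, "bundledApps", HERITRIX_DIR_NAME)
    launcher = pathmod.join(root, "bin", "heritrix.cmd" if windows else "heritrix")
    return {
        # Trailing separator kept, as in the original string concatenation.
        "heritrixPath": root + pathmod.sep,
        # The macOS branch of the diff runs the launcher through "sh".
        "heritrixBinPath": launcher if windows else "sh " + launcher,
        "heritrixJobPath": pathmod.join(root, "jobs") + pathmod.sep,
    }


if __name__ == "__main__":
    for name, value in heritrix_paths("/Applications/WAIL.app").items():
        print(name, "=", value)
```

On macOS this yields the same strings as the hunk above. On Windows it produces single separators, whereas the committed concatenation (`heritrixPath + "\\bin\\heritrix.cmd"`) yields a doubled backslash after the directory name; Windows generally tolerates that, so the difference is cosmetic.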
Binary file removed bundledApps/heritrix-3.2.0/adhoc.keystore
1 change: 0 additions & 1 deletion bundledApps/heritrix-3.2.0/conf/jobs/.gitignore

This file was deleted.

1 change: 0 additions & 1 deletion bundledApps/heritrix-3.2.0/jobs/.gitignore

This file was deleted.

Binary file removed bundledApps/heritrix-3.2.0/lib/guava-r08.jar
Binary file removed bundledApps/heritrix-3.2.0/lib/jets3t-0.5.0.jar
Binary file removed bundledApps/heritrix-3.2.0/lib/jetty-ajp-6.1.11.jar
Binary file removed bundledApps/heritrix-3.2.0/lib/json-20090211.jar
File renamed without changes.
@@ -9,8 +9,8 @@ Readme for Heritrix
6. License


1. Introduction
----------------
## 1. Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale,
archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or
misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word
@@ -19,8 +19,8 @@ preserve the digital artifacts of our culture for the benefit of future
researchers and generations, this name seemed apt.


2. Crawl Operators!
--------------------
## 2. Crawl Operators!

Heritrix is designed to respect the robots.txt
<http://www.robotstxt.org/wc/robots.html> exclusion directives and META robots
tags <http://www.robotstxt.org/wc/exclusion.html#meta>. Please consider the
@@ -30,25 +30,25 @@ User-Agent so sites that may be adversely affected by your crawl can contact
you or adapt their server behavior accordingly.


3. Getting Started
-------------------
See the User Manual at <https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.0+and+3.1+User+Guide>
## 3. Getting Started

See the User Manual, available from <https://github.com/internetarchive/heritrix3/wiki>


## 4. Developer Documentation

4. Developer Documentation
---------------------------
See <http://crawler.archive.org/articles/developer_manual/index.html>.
For API documentation, see <https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide>
and <http://builds.archive.org/javadoc/heritrix-3.2.0/>
For REST API documentation, see <https://heritrix.readthedocs.io/en/latest/api.html>
and for JavaDoc see <http://builds.archive.org/javadoc/heritrix-3.2.0/> (n.b. Javadoc currently out of date).


5. Release History
-------------------
See the Heritrix Release Notes at
<https://webarchive.jira.com/wiki/display/Heritrix/Release+Notes+-+Heritrix+3.2.0>
## 5. Latest Releases

Information about releases can be found at <https://github.com/internetarchive/heritrix3/wiki#latest-releases>


## 6. License

6. License
-----------
Heritrix is free software; you can redistribute it and/or modify it
under the terms of the Apache License, Version 2.0:

File renamed without changes.
