Releases: iipc/webarchive-commons
webarchive-commons-1.3.0
URL Canonicalization Changed
The output of WaybackURLKeyMaker and other canonicalizers based on BasicURLCanonicalizer has changed for URLs that
contain percent-encoded sequences that are not valid UTF-8. For example, when a URL contains "%C3%23" it will now be
normalised to "%c3%23", whereas previous releases produced "%25c3%23". This change brings webarchive-commons more in
line with pywb, surt (Python), warcio.js and RFC 3986. While CDX file compatibility with these newer tools should
improve, note that CDX files generated by the new release which contain such URLs may not work correctly with existing
versions of OpenWayback that use the older webarchive-commons. #102
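To gauge which URLs in an existing CDX file might be affected, one option is to check whether the percent-encoded runs
in a URL decode as valid UTF-8. The following is a minimal standalone sketch using only the JDK. The class and method
names are hypothetical, and it is not the logic used by BasicURLCanonicalizer; it only flags candidate URLs.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of webarchive-commons.
public class NonUtf8EscapeCheck {

    // Strictly decode the bytes as UTF-8; any malformed sequence means "invalid".
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Value of an ASCII hex digit, or -1 if the character is not one.
    static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Returns true if any run of consecutive %XX escapes is not valid UTF-8.
    static boolean hasNonUtf8Escapes(String url) {
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < url.length()) {
            boolean escape = url.charAt(i) == '%' && i + 2 < url.length()
                    && hex(url.charAt(i + 1)) >= 0 && hex(url.charAt(i + 2)) >= 0;
            if (escape) {
                // Collect the decoded byte and keep extending the current run.
                run.write(hex(url.charAt(i + 1)) * 16 + hex(url.charAt(i + 2)));
                i += 3;
            } else {
                // A literal character ends the run; validate what was collected.
                if (!isValidUtf8(run.toByteArray())) return true;
                run.reset();
                i++;
            }
        }
        return !isValidUtf8(run.toByteArray());
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte sequence, but 0x23 ("#") is not a continuation byte.
        System.out.println(hasNonUtf8Escapes("http://example.com/?q=%C3%23")); // true
        // 0xC3 0xA9 is the valid UTF-8 encoding of "é".
        System.out.println(hasNonUtf8Escapes("http://example.com/?q=%C3%A9")); // false
    }
}
```

URLs for which this check returns true are the ones whose keys may differ between 1.3.0 and earlier releases.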
Bug fixes
- WAT: Duplicated payload metadata values for "Actual-Content-Length" and "Trailing-Slop-Length" #103
- ObjectPlusFilesOutputStream.hardlinkOrCopy now uses Files.createLink() instead of executing ln. This prevents the
  potential for security vulnerabilities from command line option injection and improves portability.
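As an illustration of the general technique, the sketch below tries java.nio.file.Files.createLink() first and falls
back to a plain copy when a hard link cannot be created. It is an assumption-laden example, not the actual
ObjectPlusFilesOutputStream code; the class name and method signature here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical standalone helper, not the webarchive-commons implementation.
public class HardlinkOrCopySketch {

    // Create 'target' as a hard link to 'source'; copy the file if linking fails.
    static void hardlinkOrCopy(Path source, Path target) throws IOException {
        try {
            // No external process is started, so file names are never parsed as
            // command line options.
            Files.createLink(target, source);
        } catch (UnsupportedOperationException | IOException e) {
            // Hard links may be unsupported, or the paths may be on different
            // file systems; fall back to copying the contents.
            Files.copy(source, target);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hardlink-sketch");
        Path source = Files.write(dir.resolve("source.txt"), "example".getBytes());
        Path target = dir.resolve("target.txt");
        hardlinkOrCopy(source, target);
        System.out.println(Files.exists(target));
    }
}
```

Because the paths are passed directly to the file system API, a file name beginning with "-" is treated as an
ordinary name rather than as an option to an external command.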
Dependency upgrades
- fastutil removed
- dsiutils removed
Deprecations
The following classes and enum members have been marked deprecated as a step towards removal of the dependency on
Apache Commons HttpClient 3.1.
- org.archive.httpclient.HttpRecorderGetMethod
- org.archive.httpclient.HttpRecorderMethod
- org.archive.httpclient.HttpRecorderPostMethod
- org.archive.httpclient.SingleHttpConnectionManager
- org.archive.httpclient.ThreadLocalHttpConnectionManager
- org.archive.util.binsearch.impl.http.ApacheHttp31SLR
- org.archive.util.binsearch.impl.http.ApacheHttp31SLRFactory
- org.archive.util.binsearch.impl.http.HTTPSeekableLineReaderFactory.HttpLibs.APACHE_31
webarchive-commons-1.2.0
New features
- MetaData is now multivalued to support repeated WARC and HTTP headers. #98
Dependency upgrades
- commons-io 2.18.0
- commons-lang 2.6
- guava 33.3.1-jre
- hadoop 3.4.1
- htmlparser 2.1
- httpcore 4.4.16
- json 20240303
- junit 4.13.2
webarchive-commons-1.1.11
Bug fixes
- Fixed URLParser and WaybackURLKeyMaker failing on URLs with IPv6 address hostnames #100
webarchive-commons-1.1.10
Fixes
- WAT extractor: do not fail on missing WARC-Filename in warcinfo record
- ExtractingParseObserver: extract rel, hreflang and type attributes
- ExtractingParseObserver: extract links from onClick attributes
Dependency Upgrades
- commons-collections 3.2.2
- commons-io 2.14.0
- dsiutils 2.2.8
- guava 33.3.0-jre
- hadoop 3.4.0 (now optional)
- pig 0.17.0
- org.json 20231013
Dependency Removals
- joda-time (was unused)
webarchive-commons-1.1.9 (2019-05-07)
Closed issues:
- CompressedWARCReader does not work for Common Crawl WARC files. #81
- Fixing bad dates in WARC file #80
- upgrade to commons-collections.jar 3.2.2 #76
Merged pull requests: