Releases: openzim/warc2zim
2.1.3
2.1.2
2.1.1
2.1.0
Added
- New fuzzy rules for cheatography.com (#342), der-postillon.com (#330), and iranwire.com (#363)
- Properly rewrite redirect target url when present in HTML tag (#237)
- New `--encoding-aliases` argument to pass encoding/charset aliases (#331)
- Add support for SVG favicon (#148)
- Automatically index PDF content and use PDF title (#289 and #290)
Changed
- Upgrade to python-scraperlib 4.0.0
- Generate fuzzy rules tests in Python and Javascript (#284)
- Refactor HTML rewriter class to make it more open to change and expressive (#305)
- Detect charset in document header only for HTML documents (#331)
- Use `software` property from `warcinfo` record to set ZIM `Scraper` metadata (#357)
- Store `ContentDate` as metadata, based on `WARC-Date` (#358)
- Remove domain specific rules (#328)
- Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
- Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)
Fixed
- Handle case where the redirect target is bad / unsupported (#332 and #356)
- Fixed WARC files handling order to follow creation order (#366)
- Remove subsequent slashes in URLs, both in Python and JS (#365)
- Ignore non HTTP(S) WARC records (#351)
- Fix `vimeo_cdn_fix` fuzzy rule for proper operation in Javascript (#348)
- Performance issue linked to new "extensible" HTML rewriting rules (#370)
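The slash clean-up mentioned in #365 can be pictured with a minimal sketch. This is not warc2zim's actual code: the function name `collapse_slashes` is hypothetical, and the choice to touch only the URL path (leaving the query string and fragment alone) is an assumption made for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path of a URL (hypothetical helper).

    Assumption: only the path is normalized; the '//' after the scheme,
    the query string and the fragment are left untouched.
    """
    parts = urlsplit(url)
    # Replace any run of two or more slashes in the path with a single one
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(collapse_slashes("https://example.com//a///b/c.html?x=1//2"))
```

Applying the same normalization on both the Python side and the JavaScript side matters because a URL rewritten at scrape time and a URL computed in the browser have to resolve to the same ZIM entry.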
2.0.3
Changed
- Moved rules definition from JSON to YAML and documented update process (#216)
- Upgrade to wombat.js 3.7.11
Added
- Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
- Add debug log for items whose content is empty (#344)
Fixed
- The rewrite mode of some resources is still not correctly identified (#326)
2.0.2
Added
- Add `--ignore-content-header-charsets` option to disable automatic retrieval of content charsets from content first bytes (#318)
- Add `--content-header-bytes-length` option to specify how many first bytes to consider when searching for content charsets in header (#320)
- Add `--ignore-http-header-charsets` option to disable automatic retrieval of content charsets from content HTTP `Content-Type` headers (#318)
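How the three options above could interact is sketched below. This is a rough illustration, not the warc2zim implementation: the function `pick_charset`, its parameter names, the 1024-byte default, the `utf-8` fallback, and the order of precedence (content header first, then HTTP header) are all assumptions.

```python
import re
from typing import Optional

# Matches e.g. <meta charset="utf-8"> or content="text/html; charset=utf-8"
CONTENT_CHARSET_RE = re.compile(rb"charset\s*=\s*['\"]?([A-Za-z0-9_-]+)", re.IGNORECASE)
HTTP_CHARSET_RE = re.compile(r"charset\s*=\s*['\"]?([A-Za-z0-9_-]+)", re.IGNORECASE)


def pick_charset(
    content: bytes,
    http_content_type: Optional[str],
    *,
    ignore_content_header: bool = False,  # mirrors --ignore-content-header-charsets
    ignore_http_header: bool = False,  # mirrors --ignore-http-header-charsets
    header_bytes_length: int = 1024,  # mirrors --content-header-bytes-length (default assumed)
    default: str = "utf-8",  # assumed fallback
) -> str:
    """Pick a charset for a document (hypothetical helper, assumed precedence)."""
    if not ignore_content_header:
        # Only look at the first bytes of the content
        match = CONTENT_CHARSET_RE.search(content[:header_bytes_length])
        if match:
            return match.group(1).decode("ascii").lower()
    if not ignore_http_header and http_content_type:
        match = HTTP_CHARSET_RE.search(http_content_type)
        if match:
            return match.group(1).lower()
    return default


html = b'<html><head><meta charset="ISO-8859-1"></head></html>'
print(pick_charset(html, "text/html; charset=UTF-8"))
print(pick_charset(html, "text/html; charset=UTF-8", ignore_content_header=True))
```

In this sketch, disabling one source simply falls through to the next, which is one plausible reading of why the options are useful when a site declares a wrong charset in either place.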
Changed
- Simplify logic deciding content charset, stop guessing with chardet (#312)
Fixed
- Rewrite only content with mimetype `text-html` when `WARC-Resource-Type` is `html` (#313)
2.0.1
2.0.0
Added
- Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (#168)
- New test website to test many known situations supposed to be handled (#166)
Changed
- Replace Service Worker approach by scraper-side rewriting of static content (kiwix/overview#95)
- Adopted Python bootstrap conventions (#152)
- Upgrade dependencies, especially move to Python 3.12 (only) and zimscraperlib 3.3.2
- Change wording in logs about the return code 100 (which is not an error code)
- Added checks in `converter.py` to verify that the output directory exists, log appropriate error messages, and exit cleanly if the checks fail (#106)
- Added check for invalid zim file names (#232)
- Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)