tool-v0.19.0
[tool-v0.19.0] - 2024-12-07: Powerful filtering, exporting of different URL visits, hybrid export modes
Changed: Semantics
- `*`:
  - In `--expr` expressions, `sha256` function changed semantics.
    From now on it returns the raw hash digest instead of the hexadecimal one.
    To get the old value, use `sha256|to_hex`.
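
The difference between a raw digest and its hexadecimal form can be illustrated with Python's standard `hashlib`. This is only a sketch of the distinction: the helper names `sha256_raw` and `to_hex` below are hypothetical stand-ins, not the tool's actual `--expr` implementation, and whether the tool's `to_hex` yields text or bytes is an assumption here.

```python
import hashlib


def sha256_raw(data: bytes) -> bytes:
    """Return the raw 32-byte SHA-256 digest (the new behavior)."""
    return hashlib.sha256(data).digest()


def to_hex(digest: bytes) -> str:
    """Hex-encode a raw digest (hypothetical stand-in for `to_hex`)."""
    return digest.hex()


body = b"hello"
raw = sha256_raw(body)       # 32 raw bytes
old_style = to_hex(raw)      # 64 hexadecimal characters, like the old output
print(len(raw), len(old_style))
```

Chaining the two helpers reproduces the hexadecimal value, which mirrors what `sha256|to_hex` is described to do.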
Added
- `*` except `organize --move`, `organize --hardlink`, `organize --symlink`, `get`, and `run`:
  - From now on, all sub-commands except for the above can take inputs in all supported file formats.
    I.e., you can now do `hoardy-web export mirror --to ~/hoardy-web/mirror1 mitmproxy.*.dump` on `mitmproxy` dumps without even `import`ing them first.
  - By default, the above commands now also automatically dispatch between loaders of different file formats based on file extensions.
    So you can mix and match different file formats on the same command line.
  - Added a bunch of `--load-*` options that force a specific loader instead, e.g. `--load-wrrb`, `--load-mitmproxy`.
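
Dispatching on file extensions, with an optional forced loader, could look roughly like the sketch below. Everything in it is hypothetical: the loader functions, the `LOADERS_BY_SUFFIX` table, and the assumption that `.wrrb` and `.dump` are the relevant extensions are illustrative only and do not describe the tool's real internals.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

Loader = Callable[[Path], str]


def load_wrrb(path: Path) -> str:
    # Placeholder loader: a real one would parse the file.
    return f"wrrb:{path.name}"


def load_mitmproxy(path: Path) -> str:
    # Placeholder loader: a real one would parse the dump.
    return f"mitmproxy:{path.name}"


# Hypothetical extension table; the actual mapping is an assumption.
LOADERS_BY_SUFFIX: Dict[str, Loader] = {
    ".wrrb": load_wrrb,
    ".dump": load_mitmproxy,
}


def pick_loader(path: Path, forced: Optional[Loader] = None) -> Loader:
    """Use the forced loader if given, else dispatch on the file extension."""
    if forced is not None:
        return forced
    try:
        return LOADERS_BY_SUFFIX[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no loader known for {path}") from None


# Mixed inputs on one "command line" each get their own loader.
inputs = [Path("a.wrrb"), Path("mitmproxy.2024.dump")]
results = [pick_loader(p)(p) for p in inputs]
print(results)
```

Passing `forced=load_wrrb` plays the role of a `--load-*` option in this sketch: it bypasses the extension lookup entirely.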
- `*`:
  - Added a ton of new filtering options.
    For example, you can now do:

        hoardy-web find --method GET --method DOM --status-re .200C --response-mime text/html \
          --response-body-grep-re "\bPotter\b" ~/hoardy-web/raw

    As before, these filters can still be used with other commands, like `stream`, or `export mirror`, etc.

    `--root-*` options of `export mirror` now use the same syntax and machinery as the normal input filters.

    Also, the overall filtering semantics changed a bit.
    The top-level logical expression the filters compute is now a large conjunction.
    I.e. the above example now compiles to, a bit simplified, `(response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body)`.
  - Added a bunch of new `--output` formats.
    Mostly, this adds a bunch of output formats that refer to `stime`s.
    Mainly, to simplify `export mirror --all` usage, described below.
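
The "large conjunction" semantics can be sketched in Python as follows: repeated values of the same option are OR-ed together, and the resulting groups are AND-ed. The `Visit` record, the helper names, and the `C200C`-style status strings are assumptions made for illustration; the body filter uses `re.search` on the guess that a grep-style option matches anywhere in the body.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Visit:
    # Hypothetical, heavily simplified record of one archived visit.
    method: str
    status: str
    response_mime: str
    response_body: str


Predicate = Callable[[Visit], bool]


def any_of(preds: List[Predicate]) -> Predicate:
    """Disjunction: values given to the same option are alternatives."""
    return lambda v: any(p(v) for p in preds)


def all_of(groups: List[Predicate]) -> Predicate:
    """Conjunction: every option group must accept the visit."""
    return lambda v: all(g(v) for g in groups)


matches = all_of([
    any_of([lambda v: v.method == "GET", lambda v: v.method == "DOM"]),
    lambda v: re.match(".200C", v.status) is not None,
    lambda v: v.response_mime == "text/html",
    lambda v: re.search(r"\bPotter\b", v.response_body) is not None,
])

print(matches(Visit("GET", "C200C", "text/html", "Harry Potter")))
print(matches(Visit("POST", "C200C", "text/html", "Harry Potter")))
```

Adding another `--method` value widens one group, while adding a different option narrows the overall result.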
- `export mirror`:
  - Implemented exporting of different URL visits.
    I.e., you can now export not just the `--latest` visit to each URL, but an `--oldest` one, or one `--nearest` to a given date, or `--all` of them.
  - Implemented `--latest-hybrid`, `--oldest-hybrid`, and `--nearest-hybrid` options.

    These allow you to export each page with resource requisites that are date-wise closest to the `stime` of the page itself, instead of taking globally `--latest`, `--oldest`, or `--nearest` versions of all requisite URLs.

    At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.
  - Implemented `--hardlink` and `--symlink` options, which allow exporting into content-addressed destinations.

    I.e. `export mirror --hardlink` will render and write each exported file to `<--to>/_content/<hash/based/path>.<ext>` and only then hardlink the result to `<--to>/<output/format/based/path>.<ext>` target destination.
    And similarly for `--symlink`.

    Typically, doing this saves quite a bit of space, e.g., when pages refer to the same resource requisites by slightly different URLs, same images and fonts get distributed via different CDN hosts, when you export `--all` visits to some URLs and many of those are absolutely identical, etc.

    So, from now on, `--hardlink` is the default.
    The old behavior can be achieved by running it with `--copy` instead.
  - Implemented `--relative` and `--absolute` options, which control if URLs should be remapped to relative or absolute `file:` URLs, respectively.
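
The difference between a global `--latest` selection and a hybrid one can be sketched as below. The `visits` table, the numeric timestamps, and the `nearest` helper are hypothetical; in particular, how ties between equally distant visits are broken is an assumption of this sketch (the earlier visit wins).

```python
from typing import Dict, List

# Hypothetical archive: URL -> ascending `stime`s of its visits.
visits: Dict[str, List[float]] = {
    "page.html": [100.0, 200.0],
    "style.css": [90.0, 210.0, 400.0],
}


def nearest(stimes: List[float], when: float) -> float:
    """Pick the visit time closest to `when` (first one wins on ties)."""
    return min(stimes, key=lambda t: abs(t - when))


# Global selection: every page visit would get the same, newest stylesheet.
latest_css = max(visits["style.css"])

# Hybrid selection: each page visit gets the stylesheet closest to itself.
hybrid = {t: nearest(visits["style.css"], t) for t in visits["page.html"]}

print(latest_css)
print(hybrid)
```

In this toy data the page visit at `100.0` is paired with the stylesheet from `90.0` rather than the one from `400.0`, which is the kind of consistency the hybrid options are described as providing.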
- Documented all the new things.
- Added a bunch of new `test-cli.sh` tests.
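
The content-addressed layout used by `export mirror --hardlink`, as described in this section, can be approximated with the following sketch. The choice of SHA-256, the two-character directory split under `_content`, and the function name `export_hardlinked` are assumptions made for illustration only; the tool's real hash-based path scheme is not specified here.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def export_hardlinked(root: Path, rel_target: str, data: bytes, ext: str) -> Path:
    """Write `data` once under root/_content, then hardlink it to the target."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical hash-based path: first two hex characters as a directory.
    content = root / "_content" / digest[:2] / f"{digest[2:]}{ext}"
    if not content.exists():
        content.parent.mkdir(parents=True, exist_ok=True)
        content.write_bytes(data)
    target = root / rel_target
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        target.unlink()
    os.link(content, target)  # both names now share one copy of the data
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The same font served by two different (made-up) CDN hosts.
    a = export_hardlinked(root, "cdn1.example.org/font.woff", b"same bytes", ".woff")
    b = export_hardlinked(root, "cdn2.example.org/font.woff", b"same bytes", ".woff")
    same_file = os.path.samefile(a, b)
    link_count = a.stat().st_nlink
    print(same_file, link_count)
```

Identical content exported under different paths ends up as extra names for a single file, which is where the space savings come from. A `--symlink` variant would call `os.symlink` instead of `os.link`.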
Changed
- `export mirror`:
  - Switched default `--output` to `hupq_n` to prevent collisions when using `--*-hybrid` and `--all`.
  - Improved handling of `base` HTML tags; `_target`s are supported now.
  - Links that reference a page from itself will no longer refer to the page's filename, even when the link has no `fragment`.

    The results can be a bit confusing, but this makes the new content de-duplication options much more effective.
  - Made `export mirror` default filters explicit and changed them from `--method "GET" --status-re ".200C"` to `--method "GET" --method "DOM" --status-re ".200C"`.
  - Implemented `--ignore-bad-inputs` and `--index-all-inputs` options to allow you to change the above default.
  - Improved output log format.
- Improved file loading performance a bit.
- Improved documentation.
Fixed
- Added a bunch of new tests for `organize`, which cover the `organize --symlink --latest` bug of `tool-v0.18.0`.
  Won't happen again.
- Fixed a couple of silly filtering-related bugs.