-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
38 lines (30 loc) · 1.51 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
NAME
wk2txt - A Website to Text Converter
SYNOPSIS
wk2txt *options*
Options:
--input URL convert *URL* to text
--urlfile file convert URLs in *file* to text
--output directory save output in directory *directory*
--xml dump in XML format
--timeout inverval timeout after *interval* seconds (NOT IMPLEMENTED)
--debug print diagnostic information
--help print help on options
--version print version number
DESCRIPTION
wk2txt is a command line tool based on the WebKit Engine to get the text
content of a websites as seen by current webbrowsers which includes the
evaluation of javascript code present in many modern webpages.
wk2txt can process a whole list of URLs, which can be built by using the
"--input" option multiple times or by passing the filename of a list of
URLs to the "--urlfile". This file is expected to have 1 URL per line.
The contents of the webpages is written to one file per webpage. The
files are stored in the directory specified by the "--output" option.
The filename is derived from the URL, it is the SHA1 hash code for the
URL.
By default, wk2txt dumps the contents and the meta data of the webpage
in plain text format. If the "--xml" option is used, content and meta
data is dumped in XML format.
AUTHOR
Christian Plessl <christian.plessl@oetiker.ch>
Roman Plessl <roman.plessl@oetiker.ch>