You can download an executable static binary from https://github.com/kitsuyui/scraper/releases/latest .
- Linux ... download scraper_linux_amd64
- macOS ... download scraper_darwin_amd64
- Windows ... download scraper_windows_amd64.exe
Then all you have to do is put the binary somewhere on your PATH and make it executable:
$ target=scraper_darwin_amd64 # executable binary you downloaded.
$ cp "./$target" /usr/bin/scraper
$ chmod +x /usr/bin/scraper
Alternatively, if you have a Go toolchain installed, you can install it with go install:
$ go install github.com/kitsuyui/scraper@latest
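(go install places the binary in $(go env GOPATH)/bin, so make sure that directory is on your PATH.)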
Basic usage: write scraping recipes into scraper-config.json and feed an HTML document to scraper on STDIN:
$ wget http://example.com/ -O index.html
$ cat <<'EOF'> scraper-config.json
[
{"type": "css", "label": "Title", "query": "h1"},
{"type": "xpath", "label": "LinkURL", "query": "//a/@href"}
]
EOF
$ scraper < index.html
[
{
"type": "css",
"label": "Title",
"query": "h1",
"results": [
"Example Domain"
]
},
{
"type": "xpath",
"label": "LinkURL",
"query": "//a/@href",
"results": [
"http://www.iana.org/domains/example"
]
}
]
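Because the result is plain JSON, it composes well with other tools. For example, assuming jq is installed, you can extract just the first Title result:
$ scraper < index.html | jq -r '.[0].results[0]'
Example Domain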
Table element scraping is also available, and it supports colspan/rowspan attributes:
$ cat <<'EOF'> table.html
<table border=1>
<tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
<tr><td>e</td><td rowspan="2" colspan="2">f</td><td>g</td></tr>
<tr><td>h</td><td>i</td></tr>
</table>
EOF
$ cat <<'EOF'> table-scraper-config.json
[
{"type": "table-xpath", "label": "Tables", "query": "//table"}
]
EOF
$ scraper -c table-scraper-config.json < table.html
[
{
"type": "table-xpath",
"label": "Tables",
"query": "//table",
"results": [
[
["a", "b", "c", "d"],
["e", "f", "f", "g"],
["h", "f", "f", "i"]
]
]
}
]
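Note how the spanned cell f is expanded into every position it covers, so each row keeps the same number of columns. This makes post-processing straightforward; for example, assuming jq is installed, the first table can be converted to CSV:
$ scraper -c table-scraper-config.json < table.html | jq -r '.[0].results[0][] | @csv'
"a","b","c","d"
"e","f","f","g"
"h","f","f","i"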
Regular-expression scraping works the same way:
$ wget http://example.com/ -O index.html
$ cat <<'EOF'> scraper-config.json
[
{"type": "regex", "label": "Description", "query": "This domain is .+? to be used for .+?\\."}
]
EOF
$ scraper < index.html
[
{
"type": "regex",
"label": "Description",
"query": "This domain is .+? to be used for .+?\\.",
"results": [
"This domain is established to be used for illustrative examples in documents."
]
}
]
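Note: scraper is written in Go, so regex queries are presumably interpreted by Go's regexp package (RE2 syntax, e.g. no backreferences); this is an assumption, so check your patterns with the validate subcommand described below.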
These recipe types can be mixed: by listing several recipes in one configuration file, you can scrape everything in a single pass, as in the sketch below.
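For example, a combined configuration reusing the queries above:
$ cat <<'EOF'> scraper-config.json
[
{"type": "css", "label": "Title", "query": "h1"},
{"type": "xpath", "label": "LinkURL", "query": "//a/@href"},
{"type": "regex", "label": "Description", "query": "This domain is .+? to be used for .+?\\."}
]
EOF
$ scraper < index.html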
Configurations can be checked with the validate subcommand. It also checks the format of each recipe type: for example, a regex query must be a valid regular expression, and an xpath query must be valid XPath. Running it against an invalid configuration reports the error:
$ scraper validate
Error: Recipe[0]: expected identifier, found < instead
exit status 1
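A specific file can be checked with the -c flag (see the options below); a minimal sketch:
$ scraper validate -c table-scraper-config.json
As in the example above, the command exits with a non-zero status when the configuration is invalid.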
scraper can also run as an HTTP server. Register a configuration with PUT, read it back with GET, and POST an HTML document to scrape it:
$ scraper server
$ curl -X PUT localhost:8080/example.com -d@- <<'EOT'
[{"type": "xpath", "label": "LinkURL", "query": "//a/@href"}]
EOT
$ curl -X GET localhost:8080/example.com
[{"type": "xpath", "label": "LinkURL", "query": "//a/@href"}]
$ curl example.com | curl -X POST localhost:8080/example.com -d @-
[
{
"type": "xpath",
"label": "LinkURL",
"query": "//a/@href",
"results": [
"http://www.iana.org/domains/example"
]
}
]
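The listen address and configuration directory can be changed with the server options listed below, e.g.:
$ scraper server -d ./configs -H 0.0.0.0 -p 3000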
A Docker image is available for server mode: kitsuyui/scraper
$ docker run -p 8080:8080 -it kitsuyui/scraper
The full command-line interface:
$ scraper -h
scraper: Swiss army knife for web scraping
Usage:
scraper [-c=<conf>] [-i=<input>]
scraper validate [-c=<conf>]
scraper server [-d=<conf-dir>] [-H=<host>] [-p=<port>]
Options:
-c=<conf> --config=<conf> Configuration file [default: scraper-config.json].
-i=<input> --input=<input> Input file specified instead of STDIN.
-o=<output> --output=<output> Output file specified instead of STDOUT.
-H=<host> --host=<host> Server mode host [default: 127.0.0.1].
-p=<port> --port=<port> Server mode port [default: 8080].
-d=<conf-dir> --conf-dir=<conf-dir> Configuration directory for server mode [default: .].
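For example, input and output files can be given explicitly instead of using pipes (a sketch using the flags above):
$ scraper -c scraper-config.json -i index.html -o results.json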
To build from source:
$ go get -d ./src/...
$ go build -o build/scraper main.go
Release binaries for each platform can be built with the bundled script, either directly or via the tcnksm/gox Docker image:
$ ./bin/build.sh
$ docker run --rm -v "$(pwd)":/root -w /root tcnksm/gox sh -c "./bin/build.sh"
This project is licensed under the 3-Clause BSD License. See the LICENSE file.