Stream docs from Elasticsearch to stdout for ad-hoc data mangling using the Scroll API. Just like solrdump, but for Elasticsearch. Since esdump 0.1.11, the default operator for query string queries can be set explicitly; the default changed from OR to AND.
Client libraries use both GET and POST requests to issue scroll requests:

- elasticsearch-py uses POST
- esapi uses GET

This tool uses HTTP GET only and does not clear scrolls (which would require DELETE), so it works with read-only servers that only allow GET.
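For illustration, here is a minimal sketch of such a GET-only scroll loop in Go. This is not esdump's actual source; the server, index, query, and the 5m/1000 scroll parameters are placeholders that mirror the defaults documented below.

```go
// Sketch: stream documents via the Elasticsearch scroll API using only HTTP GET.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// scrollResponse holds the parts of the scroll response we care about.
type scrollResponse struct {
	ScrollID string `json:"_scroll_id"`
	Hits     struct {
		Hits []json.RawMessage `json:"hits"`
	} `json:"hits"`
}

// fetch issues a GET request and decodes the JSON response.
func fetch(link string) (*scrollResponse, error) {
	resp, err := http.Get(link)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var sr scrollResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		return nil, err
	}
	return &sr, nil
}

func main() {
	// Placeholder values, matching the defaults in the usage section below.
	server, index, query := "https://search.fatcat.wiki", "fatcat_release", "web archiving"

	// The initial search request opens the scroll context.
	link := fmt.Sprintf("%s/%s/_search?scroll=5m&size=1000&q=%s",
		server, index, url.QueryEscape(query))
	for {
		sr, err := fetch(link)
		if err != nil {
			log.Fatal(err)
		}
		if len(sr.Hits.Hits) == 0 {
			break // scroll exhausted; the context simply expires (no DELETE issued)
		}
		for _, doc := range sr.Hits.Hits {
			fmt.Println(string(doc))
		}
		// Follow-up requests pass the scroll id as a query parameter, GET only.
		link = fmt.Sprintf("%s/_search/scroll?scroll=5m&scroll_id=%s",
			server, url.QueryEscape(sr.ScrollID))
	}
}
```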
$ go install github.com/miku/esdump/cmd/esdump@latest
Or via a release.
esdump uses the Elasticsearch scroll API to stream documents to stdout.
Originally written to extract samples from https://search.fatcat.wiki (a
scholarly communications preservation and discovery project).
$ esdump -s https://search.fatcat.wiki -i fatcat_release -q 'web archiving'
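Documents are streamed to stdout as JSON, so the output composes with standard tools. For example, grab a small sample and pretty-print it with jq:

$ esdump -s https://search.fatcat.wiki -i fatcat_release -q 'web archiving' -l 10 | jq .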
Usage of ./esdump:
-i string
index name (default "fatcat_release")
-ids string
a path to a file with one id per line to fetch
-l int
limit number of documents fetched, zero means no limit
-mq string
path to file, one lucene query per line
-op string
default operator for query string queries (default "AND")
-q string
lucene syntax query to run, example: 'affiliation:"alberta"' (default "*")
-s string
elasticsearch server (default "https://search.fatcat.wiki")
-scroll string
context timeout (default "5m")
-size int
batch size (default 1000)
-v show version
-verbose
be verbose
925636 docs in 4m47.460217252s (3220 docs/s)
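Two more examples, using the -ids and -mq flags described above (ids.txt and queries.txt are placeholder files, one id or one lucene query per line):

$ esdump -i fatcat_release -ids ids.txt
$ esdump -i fatcat_release -mq queries.txt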
- move to search_after