giashard is a tool for batching web-crawled data for later processing. It is designed as part of a corpus creation pipeline in projects like Paracrawl and HPLT.
giashard is written in Go. To install it, clone the repository and then build the application:
git clone https://github.com/paracrawl/giashard.git
cd giashard/cmd/giashard
go build
giashard can accept three input formats:
- A directory (or list of directories) in bitextor/Paracrawl column storage format: each directory contains three files named url.gz, mime.gz and plain_text.gz (by default). A different number of files and different names for these files can be specified with the -f flag.
- A zstd-compressed file (or list of files) in the JSONL format where each record contains at minimum one field named u containing a URL and one field named text containing the extracted content in plain text.
- An uncompressed stream to stdin in the above JSONL format (indicated by a single - as the input file, e.g. cat myfile.jsonl | giashard -o myoutput -).
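To make the JSONL record shape concrete, here is a minimal Go sketch that reads records carrying the u and text fields described above. It is an illustration only, not giashard's own reader: the Record type, the parseJSONL and readRecords helpers, the sample URLs, and the 64 MiB line limit are all assumptions made for this example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// Record holds the two fields the JSONL input format requires.
// Any other fields present in a record are ignored by this sketch.
type Record struct {
	URL  string `json:"u"`
	Text string `json:"text"`
}

// readRecords decodes one JSON object per line and skips blank lines.
func readRecords(r io.Reader) ([]Record, error) {
	var recs []Record
	sc := bufio.NewScanner(r)
	// Crawled documents can be long; allow lines up to 64 MiB (arbitrary choice).
	sc.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var rec Record
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return nil, fmt.Errorf("bad JSONL line: %w", err)
		}
		recs = append(recs, rec)
	}
	return recs, sc.Err()
}

// parseJSONL is a convenience wrapper for in-memory input.
func parseJSONL(s string) ([]Record, error) {
	return readRecords(strings.NewReader(s))
}

func main() {
	sample := `{"u": "https://example.is/a", "text": "first page"}
{"u": "https://example.is/b", "text": "second page", "extra": 1}`
	recs, err := parseJSONL(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, rec := range recs {
		fmt.Printf("%s\t%d bytes of text\n", rec.URL, len(rec.Text))
	}
}
```

Passing os.Stdin to readRecords instead of the sample string would read a piped stream in the same way.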
giashard uses the following flags:
-o: Output directory location (default: current directory)
-l: Input file containing a list of files/directories to shard (default: "")
-f: Comma-separated list of files to shard for bitextor/Paracrawl column storage format input (default: "url,mime,plaintext")
-n: Exponent to calculate number of shards (2^n) (default: 8)
-b: Batch size in MB (default: 100)
-d: Additional public suffix entries (default: "")
-jsonl: Boolean indicating data is in JSONL format (default: False)
ls -1d output_wide15_filtered/*/is | xargs giashard/cmd/giashard/giashard -n 8 -o output_wide15_sharded -f text,url -b 1024
This runs giashard on all Icelandic data in the output_wide15_filtered directory (in bitextor/Paracrawl column storage format) where each input directory contains two files: text.gz and url.gz. It sorts this data into 2^8 numbered shards, where shard membership is assigned based on a hash of the URL. The data in each shard is split into numbered batches of approximately 1024 MB. Output text is base64 encoded.
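The three steps described above (hash into 2^n shards, split into size-limited batches, base64-encode the text) can be sketched in Go as follows. This is a rough illustration under stated assumptions rather than giashard's implementation: FNV-1a stands in for whatever hash giashard really uses, batches are assumed to be numbered from 1, and shardOf, assignBatches and encodeText are hypothetical helper names.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"hash/fnv"
)

// shardOf maps a key (here, a URL) to one of 2^n shards.
// ASSUMPTION: FNV-1a is a stand-in hash, so shard numbers from this
// sketch will not necessarily match what giashard or giashardid report.
func shardOf(key string, n uint) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	mask := (uint64(1) << n) - 1
	return h.Sum64() & mask
}

// assignBatches returns a batch number for each record size, starting a
// new batch whenever adding a record would push the current batch past
// limit. ASSUMPTION: batches are numbered from 1; an oversized record
// gets a batch of its own, which is why sizes are only approximate.
func assignBatches(sizes []int, limit int) []int {
	batches := make([]int, len(sizes))
	batch, used := 1, 0
	for i, sz := range sizes {
		if used > 0 && used+sz > limit {
			batch++
			used = 0
		}
		used += sz
		batches[i] = batch
	}
	return batches
}

// encodeText base64-encodes document text so it fits on a single line.
func encodeText(text string) string {
	return base64.StdEncoding.EncodeToString([]byte(text))
}

func main() {
	for _, u := range []string{"https://example.is/a", "https://example.is/b"} {
		fmt.Printf("%s -> shard %d of %d\n", u, shardOf(u, 8), 1<<8)
	}
	fmt.Println(assignBatches([]int{40, 40, 40, 200, 10}, 100)) // [1 1 2 3 4]
	fmt.Println(encodeText("hello"))                            // aGVsbG8=
}
```

Masking with 2^n - 1 keeps the low n bits of the hash, which is why the shard count is always a power of two.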
cat icelandic.jsonl | giashard -o output -jsonl -
This runs giashard on the JSONL file icelandic.jsonl, which is in the format described above. It writes the resulting shards to the output directory. Note the trailing - to indicate reading from stdin. Other parameters are set to their default values.
There is a companion tool called giashardid. Given a URL, either on the command line or on stdin, it prints the ID of the shard that the URL will be sorted into. With the -s flag, it prints the slug derived from the hostname in the URL instead of the shard ID.
So, for example, we can find out which shard Google lives in:
$ giashardid google.com
48
And then, if we are curious, we can find out what other domains containing Dutch text live in that shard:
$ find wide00006-shards/nl/48 -name url.gz | xargs cat | gzip -dc | \
giashardid -s | sort | uniq -c | sort -nr | head -10
6483 google
 855 paginamarkt
 604 vikingdirect
 592 ajax1
 392 jijislief
 277 ixina
 209 punkyfish
 182 bongo
 154 ooyyo
 150 ledlampendirect
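For intuition about what a hostname slug looks like, here is a deliberately naive Go sketch that takes the label just before the last dot of the hostname. It is not how giashardid computes slugs: the -d flag suggests the real tool consults a public suffix list, which this sketch ignores, so multi-part suffixes such as co.uk would come out wrong here. The naiveSlug name and the example path are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// naiveSlug returns the hostname label immediately before the last dot.
// ASSUMPTION: a single-label suffix (.com, .nl); no public suffix list
// is consulted, unlike what the real tool appears to do.
func naiveSlug(raw string) string {
	raw = strings.TrimSpace(raw)
	if !strings.Contains(raw, "://") {
		// Bare hostnames such as "google.com" need a scheme to parse as a URL.
		raw = "http://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	host := strings.ToLower(u.Hostname())
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return host
	}
	return labels[len(labels)-2]
}

func main() {
	inputs := []string{
		"google.com",
		"https://www.ledlampendirect.nl/lampen?page=2",
	}
	for _, in := range inputs {
		fmt.Printf("%s -> %s\n", in, naiveSlug(in))
	}
}
```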
giashardid should be easily installable using:
go get github.com/paracrawl/giashardid/cmd/...