This repo contains the code corresponding to the SIGIR 2021 short paper Faster Index Reordering with Bipartite Graph Partitioning by Joel Mackenzie, Matthias Petri, and Alistair Moffat.
If you use this code in your own work or research, please consider citing our work:
@inproceedings{mpm21-sigir,
title = {Faster Index Reordering with Bipartite Graph Partitioning},
author = {J. Mackenzie and M. Petri and A. Moffat},
booktitle = {Proc. SIGIR},
pages = {1910--1914},
year = {2021},
}
The paper can be found at the following DOI: https://doi.org/10.1145/3404835.3462991
This work builds on previous work by Dhulipala et al.: Compressing Graphs and Indexes with Recursive Graph Bisection, ACM Proceedings.
We also used the reproducibility study by Mackenzie et al.: Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study, Springer Proceedings.
Our codebase is based on the implementation found in the PISA search engine, which corresponds to the reproducibility study discussed above. The codebase works with the Common Index File Format, an open-source index exchange format for information retrieval experimentation.
You can build the code using Cargo:
cargo build --release
However, a binary built with the command above exits with an error when run:
./target/release/create-rgb
03:09:14 [INFO] Error: A gain function needs to be passed at compile time via the environment variable `GAIN` -- Please recompile...
Since we experimented with three different gain functions, the desired gain function must be passed in at compile time via the GAIN environment variable. The valid options are default, approx_1, or approx_2. So, recompile as follows:
GAIN=approx_1 cargo build --release
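To illustrate how a compile-time choice like this can work in Rust, here is a minimal sketch that maps an optional string to one of the three gain functions. It is not the repository's code: the `Gain` enum and the `parse_gain` function are hypothetical names, and the only real API used is the standard `option_env!` macro, which reads an environment variable while the crate is being compiled.

```rust
/// Hypothetical enum naming the three gain functions listed above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Gain {
    Default,
    Approx1,
    Approx2,
}

/// Map the value of a GAIN-style variable to a variant.
/// `None` models the variable being unset when the crate was compiled.
fn parse_gain(value: Option<&str>) -> Result<Gain, String> {
    match value {
        Some("default") => Ok(Gain::Default),
        Some("approx_1") => Ok(Gain::Approx1),
        Some("approx_2") => Ok(Gain::Approx2),
        Some(other) => Err(format!("unknown gain function `{}`", other)),
        None => Err("no gain function set; recompile with GAIN=...".to_string()),
    }
}

fn main() {
    // `option_env!` is expanded at compile time, so changing GAIN only takes
    // effect after a rebuild, which matches the behaviour described above.
    let compiled_in: Option<&'static str> = option_env!("GAIN");
    match parse_gain(compiled_in) {
        Ok(gain) => println!("Using {:?}", gain),
        Err(msg) => println!("Error: {}", msg),
    }
}
```

Because the value is fixed at build time, the compiler can specialise the hot loop for a single gain function instead of branching at run time; that motivation is our reading of the design, not something the sketch proves.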
You will need, at bare minimum, a CIFF index corresponding to whatever data you wish to reorder. Some pre-generated CIFF files can be found here.
For our following example, let's grab the Robust04 CIFF file.
mkdir example
cd example
wget "https://www.dropbox.com/s/rph6udiqs2k7bfo/robust04-complete-20200306.ciff.gz?dl=0" -O robust-ciff.gz
gunzip robust-ciff.gz
mv robust-ciff robust.ciff
Now, let's run the BP algorithm, output a reordered CIFF file, and compute the loggap improvement.
../target/release/create-rgb --input robust.ciff --output-ciff robust-reordered.ciff --loggap
03:28:02 [INFO] Using the `approx_1` gain function.
03:28:02 [INFO] Opt { input: "robust.ciff", output_ciff: Some("robust-reordered.ciff"), min_len: 4096, cutoff_frequency: 0.1, recursion_stop: 16, swap_iterations: 20, loggap: true, sort_leaf: false, max_depth: 100, input_fidx: None, output_fidx: None, output_mapping: None }
03:28:02 [INFO] (1) building forward index
create_fwd: ⠙ [00:00:06] [████████████████████████████████░░░░░░░░] (748000/923436, ETA 2s, SPEED: 122324/s)
03:28:10 [INFO] forward index stats:
03:28:10 [INFO] total terms: 923436
03:28:10 [INFO] discarded frequent terms: 384
03:28:10 [INFO] discarded infrequent terms: 920126
03:28:10 [INFO] remaining terms: 2927
03:28:10 [INFO] (2) sort empty docs to the back
03:28:11 [INFO] fwd duration: 8.87 secs
03:28:11 [INFO] docs 528030 non_empty 527908
03:28:11 [INFO] put docs back into default order...
03:28:11 [INFO] (3) perform graph bisection
03:28:20 [INFO] rgb duration: 9.57 secs
03:28:20 [INFO] (4) clear forward index
03:28:21 [INFO] (5) starting output operations...
03:28:21 [INFO] --> (5.2) write new ciff file
03:28:21 [INFO] writing to ciff file: robust-reordered.ciff
03:28:39 [INFO] write duration: 18.80 secs
03:28:39 [INFO] (6) compute loggap cost
03:28:44 [INFO] before reorder: 3.975 BPI
03:28:49 [INFO] after reorder: 2.968 BPI
03:28:49 [INFO] ALL DONE! duration: 47.67 secs
So, with this configuration:
- the approx_1 gain function,
- a minimum postings length of 4096,
- maximum postings length of 0.1 * N (where N is the number of documents in the collection),
- 20 iterations per level, and
- the recursion depth fixed by only recursing while there are more than 16 elements within each partition,
we observe the RGB process taking about 10 seconds, improving loggap from 3.975 to 2.968.
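The two postings-length bounds in the configuration above decide which terms take part in the bisection. The following is a minimal sketch of such a filter, assuming both bounds are inclusive; `keep_term` is a hypothetical helper, not a function from the codebase, and its parameters simply mirror the `--min-len` and `--cutoff-frequency` options.

```rust
/// Decide whether a postings list takes part in the bisection.
/// Assumption: both bounds are inclusive; the real tool may differ.
fn keep_term(list_len: usize, num_docs: usize, min_len: usize, cutoff_frequency: f64) -> bool {
    // Longest list we still consider, as a fraction of the collection size.
    let max_len = (cutoff_frequency * num_docs as f64) as usize;
    list_len >= min_len && list_len <= max_len
}

fn main() {
    let num_docs = 528_030; // documents in the Robust04 example
    for len in [100, 4_096, 30_000, 60_000] {
        let keep = keep_term(len, num_docs, 4_096, 0.1);
        println!("{:>6} postings -> keep: {}", len, keep);
    }
}
```

Very short lists carry little signal about document similarity and very long lists are expensive to process, which is why the log above reports most terms as discarded.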
Running the same configuration with the default gain function takes about 20 seconds, and yields a final loggap of 2.989. Similarly, using approx_2 takes 6 seconds, and yields a loggap of 3.038.
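For readers unfamiliar with the loggap measure, the sketch below computes an average log2 gap cost (bits per integer) over a toy index before and after a reordering. It rests on assumptions: the first gap of each list is taken as the first document identifier plus one, and `log_gap_cost` and `remap` are hypothetical names. The tool's own computation may differ in these details.

```rust
/// Average log2 of the d-gaps over all postings, in bits per integer.
/// Assumption: each list is strictly increasing, and the first gap of a
/// list is measured as `first_doc_id + 1`.
fn log_gap_cost(postings: &[Vec<u32>]) -> f64 {
    let mut bits = 0.0;
    let mut count = 0usize;
    for list in postings {
        let mut prev: Option<u32> = None;
        for &doc in list {
            let gap = match prev {
                Some(p) => doc - p,
                None => doc + 1,
            };
            bits += (gap as f64).log2();
            count += 1;
            prev = Some(doc);
        }
    }
    if count == 0 { 0.0 } else { bits / count as f64 }
}

/// Rewrite every list under a permutation `new_id[old_id]`, then re-sort it.
fn remap(postings: &[Vec<u32>], new_id: &[u32]) -> Vec<Vec<u32>> {
    postings
        .iter()
        .map(|list| {
            let mut out: Vec<u32> = list.iter().map(|&d| new_id[d as usize]).collect();
            out.sort_unstable();
            out
        })
        .collect()
}

fn main() {
    // Two terms over 8 documents; each term's documents are spread out.
    let postings = vec![vec![0, 2, 4, 6], vec![1, 3, 5, 7]];
    // Hypothetical reordering that clusters each term's documents together.
    let new_id = [0, 4, 1, 5, 2, 6, 3, 7];
    let before = log_gap_cost(&postings);
    let after = log_gap_cost(&remap(&postings, &new_id));
    println!("before: {:.3} BPI, after: {:.3} BPI", before, after);
}
```

Clustering documents that share terms shrinks the gaps inside each postings list, which is what lowers the cost.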
The full set of options can be shown using the --help flag, and is listed below:
create-rgb 0.1.0
Reorders documents using recursive graph bisection and ciff files.
USAGE:
create-rgb [FLAGS] [OPTIONS] --input <input>
FLAGS:
-h, --help Prints help information
-l, --loggap Show loggap cost
--sort-leaf Sort leaf by identifier
-V, --version Prints version information
OPTIONS:
-c, --cutoff-frequency <cutoff-frequency>
Maximum length to consider in percentage of documents in the index [default: 0.1]
-i, --input <input> Input file ciff file
--input-fidx <input-fidx> Read forward index
--max-depth <max-depth> Maximum depth [default: 100]
-m, --min-len <min-len> Minimum number of occurrences to consider [default: 4096]
-o, --output-ciff <output-ciff> Output ciff file
--output-fidx <output-fidx> Output forward index
--output-mapping <output-mapping> Dump the document map
-r, --recursion-stop <recursion-stop> Min partition size [default: 16]
-s, --swap-iterations <swap-iterations> Swap iterations [default: 20]
For example, you can save a forward index using the --output-fidx option, and read a saved forward index with the --input-fidx option. If you only wish to dump the reordered document map, use the --output-mapping option.
Other algorithmic configurations can be made inside the codebase (sorry). For example, the default behavior uses the Floyd-Rivest median partitioning approach to "sort" documents between partitions. You can instead invoke sorting behavior (and, indeed, parallel sorts) by modifying the flags on lines 564 and 565 of the src/rgb.rs file.
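The difference between median partitioning and sorting can be sketched with the standard library alone. Note the substitution: the example uses `select_nth_unstable_by`, the standard library's general-purpose selection routine, and not the Floyd-Rivest algorithm used in the codebase. The function names and the `(doc_id, gain)` layout are hypothetical and chosen only for illustration.

```rust
/// Partition `(doc_id, gain)` pairs so that the half with the largest gains
/// comes first, without fully ordering either half.
fn partition_by_median(docs: &mut [(u32, f32)]) {
    if docs.len() < 2 {
        return;
    }
    let mid = docs.len() / 2;
    // Descending by gain: afterwards every element before `mid` has a gain
    // >= the element at `mid`, and every element after it has a gain <= it.
    docs.select_nth_unstable_by(mid, |a, b| b.1.total_cmp(&a.1));
}

/// The sorting alternative: same split point, but each half is also ordered.
fn partition_by_sort(docs: &mut [(u32, f32)]) {
    docs.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
}

fn main() {
    let docs: Vec<(u32, f32)> =
        vec![(0, 0.5), (1, -1.0), (2, 3.0), (3, 0.1), (4, 2.0), (5, -0.3)];

    let mut selected = docs.clone();
    partition_by_median(&mut selected);
    let (left, right) = selected.split_at(selected.len() / 2);
    println!("selection, higher-gain half: {:?}", left);
    println!("selection, lower-gain half:  {:?}", right);

    let mut sorted = docs.clone();
    partition_by_sort(&mut sorted);
    println!("full sort: {:?}", sorted);
}
```

Selection runs in expected linear time, whereas a full sort costs O(n log n); the trade-off is that the order inside each half is unspecified after selection.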
You can also explore the simulated annealing techniques we tested by providing a tolerance parameter to either the partition_quickselect or swap_documents functions, depending on which you are using. Some examples are shown in the codebase (see src/rgb.rs, lines 591 and 619).
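As a rough illustration of what a tolerance can do in a swap step, the sketch below exchanges candidate pairs only while their combined gain exceeds a threshold. This is one plausible reading, not the repository's implementation: `swap_with_tolerance` is a hypothetical function, and how the real `partition_quickselect` and `swap_documents` functions use their tolerance parameter should be checked in src/rgb.rs.

```rust
/// Swap the i-th candidates of each side while their combined gain exceeds
/// `tolerance`. Both slices hold `(doc_id, gain)` pairs and are assumed to
/// be sorted by descending gain. Returns the number of swaps performed.
fn swap_with_tolerance(
    left: &mut [(u32, f32)],
    right: &mut [(u32, f32)],
    tolerance: f32,
) -> usize {
    let mut swaps = 0;
    for (l, r) in left.iter_mut().zip(right.iter_mut()) {
        if l.1 + r.1 > tolerance {
            std::mem::swap(l, r);
            swaps += 1;
        } else {
            // Gains are sorted, so no later pair can pass the threshold.
            break;
        }
    }
    swaps
}

fn main() {
    let left: Vec<(u32, f32)> = vec![(0, 4.0), (1, 1.0), (2, -2.0)];
    let right: Vec<(u32, f32)> = vec![(3, 3.0), (4, 0.5), (5, -1.0)];
    for tolerance in [0.0, 2.0] {
        let (mut l, mut r) = (left.clone(), right.clone());
        let swaps = swap_with_tolerance(&mut l, &mut r, tolerance);
        println!("tolerance {}: {} swaps", tolerance, swaps);
    }
}
```

In this sketch a tolerance of zero accepts every swap with positive combined gain, while a larger tolerance skips marginal swaps and so trades a little quality for less work per iteration.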
Feel free to raise issues here; we'll do our best to assist.