This repository contains code used for our PVLDB paper "The next 50 Years in Database Indexing or: The Case for Automatically Generated Index Structures".
To use this code, we recommend cloning the repository using git. Use the --recurse-submodules flag in your git clone command to automatically download the submodules while cloning the repository.
git clone --recurse-submodules git@github.com:BigDataAnalyticsGroup/GENE.git
Executing and visualizing the experiments requires the following tools:
- C++ compiler supporting C++17
- bash>=4
- cmake>=3.1
- Python>=3.5
- md5sum
- wget
- zstd
- graphviz (optional)
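Before building, it can help to confirm that the tools above are available. The following is a minimal sketch, not part of the repository: the binary names (for example python3, dot for graphviz, and the C++ compiler candidates) are assumptions, and the script only checks that each tool is on the PATH, not that its version meets the minimum listed above.

```python
import shutil

# Binary names are assumptions; the list above names tools, not executables.
REQUIRED = ["bash", "cmake", "python3", "md5sum", "wget", "zstd"]
OPTIONAL = ["dot"]  # graphviz provides the `dot` binary
CXX_CANDIDATES = ["c++", "g++", "clang++"]  # any one of these is enough


def find_missing(tools):
    """Return the entries of `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_environment():
    """Return (missing_required, missing_optional) as two lists."""
    missing = find_missing(REQUIRED)
    # A C++ compiler counts as present if at least one candidate is found.
    if len(find_missing(CXX_CANDIDATES)) == len(CXX_CANDIDATES):
        missing.append("C++ compiler")
    return missing, find_missing(OPTIONAL)


if __name__ == "__main__":
    missing, missing_optional = check_environment()
    for tool in missing:
        print("missing (required): {}".format(tool))
    for tool in missing_optional:
        print("missing (optional): {}".format(tool))
```

An empty output means every tool was found; version requirements such as bash>=4 or C++17 support still need to be checked by hand.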
To install the required Python modules, you can use Python's package installer pip in combination with the requirements.txt file.
pip install -r requirements.txt
To download the data required by the experiments, run the following shell script from the root folder of the project.
./data/download.sh
Our build system is based on cmake. See the following instructions for an example build process in release mode.
NOTE: The following example must be executed from the root folder of the project. In addition, reproducing our experimental results with the provided scripts requires exactly this folder structure.
mkdir -p build/release
cd build/release
cmake -DCMAKE_BUILD_TYPE=Release ../..
make
cd ../..
After executing the build instructions, the generated binaries are located in ./build/release/bin/. The executable for the genetic search is called main. It offers a multitude of parameters to specify nearly all important hyperparameters and inputs. Calling the executable with the -h or --help flag will display all available options.
With the default settings, the executable will run a minimal example with 100 keys and 10 generations, which lets you test whether everything works as expected.
./build/release/bin/main
In the ./experiments/ folder, we provide scripts to reproduce the experimental results from our paper. In particular, we provide the following experiments:
- hyperparameter-tuning: Section 6.1 "Hyperparameter Tuning", computes the best hyperparameters for our genetic search.
- rediscover-baselines: Section 6.2 "Rediscover Suitable Baseline Indexes", demonstrates that our genetic algorithm is capable of reproducing the performance of various baseline indexes.
- optimized-vs-heuristic: Section 6.3 "Optimized vs Heuristic Indexes", compares the performance of GENE with representatives of different prevalent heuristic index types.
To run the different experiments, navigate to the corresponding folder and execute the Python run_experiment.py or Shell run_experiment.sh script inside the respective folder. For example, to run the hyperparameter search:
cd experiments/hyperparameter-tuning/
python3 run_experiment.py
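To run all three experiments one after another, a small driver can repeat the steps above for each folder. This is a hedged sketch, not a script shipped with the repository: the helper names pick_command and run_all are hypothetical, and preferring the Python script when a folder contains both variants is an assumption.

```python
import subprocess
import sys
from pathlib import Path

# Folder names as listed in the experiments overview of this README.
EXPERIMENTS = [
    "hyperparameter-tuning",
    "rediscover-baselines",
    "optimized-vs-heuristic",
]


def pick_command(folder):
    """Return the command that launches the experiment in `folder`, or None."""
    folder = Path(folder)
    # Assumption: prefer the Python script if both variants are present.
    if (folder / "run_experiment.py").is_file():
        return [sys.executable, "run_experiment.py"]
    if (folder / "run_experiment.sh").is_file():
        return ["bash", "run_experiment.sh"]
    return None


def run_all(root="experiments"):
    """Run every experiment under `root`, stopping at the first failure."""
    for name in EXPERIMENTS:
        folder = Path(root) / name
        command = pick_command(folder)
        if command is None:
            print("skipping {}: no run script found".format(name))
            continue
        # cwd=folder mirrors the `cd` step of the manual instructions.
        subprocess.run(command, cwd=str(folder), check=True)


if __name__ == "__main__":
    run_all()
```

The driver is meant to be started from the root folder of the project, since the default root path is relative.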
Depending on the specific experiment, the script generates the following folders:
- data: contains the underlying dataset files
- workloads: contains the underlying workload files
- results: contains the corresponding result files in csv and/or dot format
To visualize the results, execute the accompanying visualization.py script (except for the hyperparameter-tuning experiment). This will produce pdf files containing the plots as shown in the paper.
To export the dot files as pdf using graphviz, execute the following command:
dot -Tpdf path/to/dot/file -o /path/to/pdf/file
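When an experiment produces several dot files, the command above can be generated for each of them. The sketch below assumes that the files carry a .dot extension and sit directly inside a single results folder; neither is confirmed by this README, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def dot_commands(results_dir):
    """Build one `dot -Tpdf` command per *.dot file found in `results_dir`."""
    commands = []
    for dot_file in sorted(Path(results_dir).glob("*.dot")):
        # Write the pdf next to its source file, with the same base name.
        pdf_file = dot_file.with_suffix(".pdf")
        commands.append(["dot", "-Tpdf", str(dot_file), "-o", str(pdf_file)])
    return commands


def export_all(results_dir="results"):
    """Run the generated commands; requires graphviz's `dot` on PATH."""
    for command in dot_commands(results_dir):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    export_all()
```

Separating command construction from execution makes it possible to print and inspect the commands before running them.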