Skip to content

Benchmark for Sampling Applied Learned Indexes (SIGMOD '24)

License

GPL-3.0, GPL-3.0 licenses found

Licenses found

GPL-3.0
LICENSE
GPL-3.0
COPYING
Notifications You must be signed in to change notification settings

DKU-StarLab/BASIL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

BASIL

Benchmark for Sampling Applied (Learned) Indexes

| Paper | Slides | Poster | Video |


🌿 BASIL is a benchmark that evaluates the performance of 8 state-of-the-art learned and traditional indexes using sampling on sorted datasets. It is an artifact of "Choi et al. Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offs. SIGMOD 2024". It is built on the earlier benchmark SOSD.

1. Contents

LIST
β”œβ”€β”€ ...
β”œβ”€β”€ benchmarks            // configurations to be compiled and executed
β”œβ”€β”€ data                  // datasets
β”œβ”€β”€ indexes               // source code of sampling applied indexes
β”œβ”€β”€ results               // results of benchmark execution
β”œβ”€β”€ scripts               // scripts for benchmark automation
β”‚   └── reproduce         // scripts for reproduction
└── wrappers              // interface of indexes
     └── meta             // interface of indexes for measuring internal changes

2. Build

2.1. Clone the repository and install dependencies.

On vanilla Ubuntu 20.04 LTS:

git clone https://github.com/DKU-StarLab/BASIL.git
cd BASIL
sudo apt -y update
sudo apt -y install zstd python3-pip m4 cmake clang libboost-all-dev 
pip3 install -r requirements.txt

2.2. Download datasets, build benchmarks, and generate query files.

./scripts/download_data.sh
./scripts/prepare.sh
./scripts/generate.sh

3. Run

3.1. Benchmarks Types

Five types of benchmarks can be conducted using the following scripts. All benchmarks measure the build time and index size according to the hyper-parameters of each index. The results from the scripts are saved as CSV files in the results/ folder.

  • scripts/excute_avg.sh: Measures the average lookup latency.
  • scripts/excute_tail.sh: Measures the lookup latency at 0/50/90/99/99.9/99.99%th percentiles.
  • scripts/excute_sep.sh: Measures the average latency of prediction and correction separately.
  • scripts/excute_perf.sh: Measures the perf counter (LLC, TLB, Branch, Instructions) of prediction and correction separately.
  • scripts/excute_build.sh: Does not perform lookups.

3.2. Dataset

  • scripts/datasets_list.txt: List of all available datasets.
  • scripts/datasets_run.txt: List of datasets to be run by the benchmark script in section 3.1.
  • Modify scripts/datasets_run.txt to change the datasets for benchmarking.

3.3. Index Hyper-parameter

  • Benchmarks can be performed by setting each index using the templates in the benchmarks/ folder. This method of using templates is inherited from SOSD to avoid accidentally measuring vtable lookup times. However, this can lead to long benchmark build times.
  • The benchmarks/ folder contains parameters used in the paper's experiments. To avoid unnecessary long build times by compiling unnecessary templates, conditional compilation directives (#ifdef, #endif) were used to restrict compilation to only the required templates.
  • ⚠️ Note: When changing conditional compilation directives (#ifdef, #endif) for a build, please reset the macro options stored in the CMake cache (rm -rf build/CMakeCache.txt build/CMakeFiles/). Otherwise, previously used conditional compilation directives may be included in the build.

3.4. "User-defined" Hyper-parameter Setup

  • If a user wants to perform a benchmark with newly defined parameters, they config parameters in the section defined by #ifdef USER in the templates in the benchmarks/ folder, compile with scripts/prepare.sh, and then run the benchmark execution scripts described above 3.1.
  • Parameter Description for Each Index Template
// Parameters that can be modified in the template example code below are marked with ''.
// Refer to sections 2.2, 4.1, and Table 1 of the paper for a description of each parameter.

// binary search (no parameter)
benchmark.template Run<BinarySearch<uint64_t>>();

// traditional indexes
benchmark.template Run<ARTPrimaryLB<'sampling interval (I)'>>();
benchmark.template Run<STXBTree<uint64_t, 'sampling interval (I)'>>();
benchmark.template Run<InterpolationBTree<uint64_t, 'sampling interval (I)'>>();

// learned & histogram indexes
benchmark.template Run<sRadixTable<uint64_t, 'radix bits (r)', 'sampling interval (I)'>>();
benchmark.template Run<
    sRMICppRobust<uint64_t, srmi::LinearSpline, srmi::LinearRegression,
                srmi::sRmiRobust, '# of segments (branching factor, b)', 0, 1, 'sampling interval (I)'>>();
// Since '𝛿(= πœ€ βˆ’ 𝐼 + 1) β‰₯ 0', it must hold that '𝐼 ≀ πœ€ + 1'. (Theorem 3.1 / 3.2)
benchmark.template Run<sCHT<uint64_t, '# of bins (2^r)', 'error bound (πœ€)', 'sampling interval (I)'>>();
benchmark.template Run<sRS<uint64_t, 'radix bits (r)', 'error bound (πœ€)', 'sampling interval (I)'>>();
benchmark.template Run<sPGM<uint64_t, '(bottom) error bound (πœ€_b)', '(bottom) sampling interval (I)',
                '(upper) error bound (πœ€_u)', '(upper) sampling interval (I)'>>();

3.5. Reproduction

For reproduction, please refer to reproduce.md file.

4. References

We have developed the 🌿 BASIL based on various previous benchmarks, open-source projects, and papers. We thank the researchers for their valuable contributions.

Benchmark

  • Kipf, Andreas, et al. "SOSD: A benchmark for learned indexes." (MLSys@NeurIPS '19) :octocat:
  • Marcus, Ryan, et al. "Benchmarking learned indexes." (VLDB '20) :octocat:
  • Wongkham, Chaichon, et al. "Are updatable learned indexes ready?." (VLDB '22) :octocat:

Index

  • Kraska, Tim, et al. "The case for learned index structures." (SIGMOD '18) :octocat:
  • Marcus, Ryan, et al. "Cdfshop: Exploring and optimizing learned index structures." (SIGMOD '20) :octocat:
  • Maltry, Marcel, et al. "A critical analysis of recursive model indexes." (VLDB '22) :octocat:
  • Ferragina, Paolo, et al. "The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds." (VLDB '20) :octocat:
  • Kipf, Andreas, et al. "RadixSpline: a single-pass learned index." (aiDM@SIGMOD '20) :octocat:
  • Crotty, Andrew. "Hist-Tree: Those Who Ignore It Are Doomed to Learn." (CIDR '21)
  • Stoian, Mihail, et al. "Towards Practical Learned Indexing." (AIDB@VLDB '21) :octocat:

5. Citation

@article{choi2024can,
  title={Can Learned Indexes be Built Efficiently? A Deep Dive into Sampling Trade-offs},
  author={Choi, Minguk and Yoo, Seehwan and Choi, Jongmoo},
  journal={Proceedings of the ACM on Management of Data},
  volume={2},
  number={3},
  pages={1--25},
  year={2024},
  publisher={ACM New York, NY, USA}
}