PostBOUND

PostBOUND is a framework for studying query optimization algorithms for (relational) database systems. It provides tools to easily implement prototypes of new optimization algorithms and to compare them in a transparent and reproducible way. It is implemented as a Python tool that takes an input query, applies a user-configured optimization pipeline to that query and produces an annotated output query that enforces the selected query plan during execution on a native database system. This repository provides the actual Python implementation of the PostBOUND framework (located in the postbound directory), along with a number of utilities that automate some of the tedious parts of evaluating an optimization algorithm on a specific benchmark. Those utilities are mostly focused on setting up different popular database management systems and loading commonly used databases and benchmarks for them.
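
To make the idea of an "annotated output query" more tangible, here is a minimal, hand-written sketch. It assumes hint comments in the style of the pg_hint_plan extension for Postgres. Whether and how PostBOUND itself emits such hints is not covered here, and render_join_order and annotate_query are hypothetical helper names that are not part of the PostBOUND API.

```python
def render_join_order(order) -> str:
    """Render a nested join order such as (("t", "mi"), "ci") as "((t mi) ci)"."""
    if isinstance(order, str):
        # A single table alias is a leaf of the join tree.
        return order
    left, right = order
    return f"({render_join_order(left)} {render_join_order(right)})"


def annotate_query(sql: str, join_order, join_operators: dict) -> str:
    """Prefix a SQL query with a pg_hint_plan-style hint comment (assumed format).

    join_order is a nested 2-tuple of table aliases, join_operators maps a tuple
    of aliases to a hint name such as "HashJoin" or "NestLoop".
    """
    hints = [f"Leading({render_join_order(join_order)})"]
    for aliases, operator in join_operators.items():
        hints.append(f"{operator}({' '.join(aliases)})")
    return "/*+ " + " ".join(hints) + " */ " + sql
```

For the join order (("t", "mi"), "ci"), a hash join between t and mi, and a nested-loop join on top, the sketch prefixes the query with /*+ Leading(((t mi) ci)) HashJoin(t mi) NestLoop(t mi ci) */. The query text itself stays untouched; only the comment tells the database system which plan to use.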

Overview

The repository is structured as follows. The postbound directory contains the actual source code; all other folders are concerned with "supporting" aspects (which are important nevertheless). Almost all of the subdirectories contain further READMEs that explain their purpose and structure in more detail.

| Folder | Description |
|---|---|
| postbound | Contains the source code of the PostBOUND framework |
| docs | Contains the high-level documentation as well as infrastructure to export the source code documentation |
| examples | Contains general examples for typical usage scenarios. These should be run from the root directory, e.g. as python3 -m examples.example-01-basic-workflow |
| tests | Contains the unit tests and integration tests for the framework implementation. These should also be run from the root directory, e.g. as python3 -m unittest tests |
| db-support | Contains utilities to set up instances of the respective database systems, along with system-specific scripts to import popular benchmarks for them |
| workloads | Contains the raw SQL queries of some popular benchmarks |
| tools | Provides other utilities that are not directly concerned with specific database systems, but rather with common problems encountered when benchmarking query optimizers |

Getting started

All package requirements can be installed from the requirements.txt file. To use PostBOUND in different projects, it can also be built and installed as a local package using pip. This is generally the recommended way to go and can be automated using the tools/setup-py-venv.sh script.

In addition to the Python packages, PostBOUND also needs a database connection in order to optimize and execute queries. Currently, PostgreSQL and MySQL are supported, with the MySQL features being a bit more limited due to restrictions of the system. The root directory of the PostBOUND repository contains setup utilities for some database systems, databases and workloads.

The best way to familiarize yourself with PostBOUND is to study the examples and the documentation of the classes and functions they use. High-level documentation is also in the works, but it is still subject to change and not entirely up to date. The Python documentation in the source code is therefore more extensive. Consult it for the details of how to use a particular feature, and look at the examples to get an idea of when to use it and which features are available. The best starting point for the in-code documentation is the __init__.py file in the postbound source directory.

We also published a paper [1] which explains the concepts that motivated the initial versions of PostBOUND. Note, however, that at the time of its publication the framework had a much more limited scope and has been heavily expanded since then. More specifically, PostBOUND is no longer limited to upper bound-driven optimization strategies and is much more independent of specific database systems.

Example

The following snippet gives a glimpse of the different parts of the framework and how they can interact. The specific example uses the UES upper-bound optimization algorithm [2] to obtain an optimized join order for a query of the Join Order Benchmark [3] and executes the optimized query on a Postgres database instance.

```python
##
## Step 0: imports
##

import postbound as pb
from postbound.optimizer import presets

##
## Step 1: System setup
##
postgres_instance = pb.db.postgres.connect()
presets.apply_standard_system_options()
job_workload = pb.workloads.job()
ues_settings = presets.fetch("ues")

##
## Step 2: Optimization pipeline setup
##
optimization_pipeline = pb.TwoStageOptimizationPipeline(postgres_instance)
optimization_pipeline.load_settings(ues_settings)
optimization_pipeline.build()

##
## Step 3: Query optimization
##
input_query = job_workload["1a"]
optimized_query = optimization_pipeline.optimize_query(input_query)

##
## Step 4: Query execution
##
query_result = postgres_instance.execute_query(optimized_query)
print(query_result)
```

This example can almost be executed as-is; only one setup step is missing. In Step 1, PostBOUND is asked to connect to a Postgres database, but no information is provided on how this connection should be obtained. By default, PostBOUND reads this information from a hidden config file. In the case of Postgres, this file is called .psycopg_connection and it has to be placed in the directory from which the code is executed. The file has to contain a connection string that can be used to establish a database connection, such as dbname=<my db> user=<my user> host=localhost. Consult the documentation on the database systems for more details on which information is required and how it should be stored. As a final note, the database setup utilities that are shipped with PostBOUND also contain scripts that automatically generate valid connection files for the system. These files then only need to be moved to the correct location.
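
If you prefer to create the connection file by hand, the following sketch shows one way to do it. It only relies on the file name and the connection string format described above. write_connection_file is a hypothetical helper, not something shipped with PostBOUND, and the database and user names you pass in are your own.

```python
from pathlib import Path


def write_connection_file(dbname: str, user: str, host: str = "localhost",
                          directory: str = ".") -> Path:
    """Write a .psycopg_connection file into the given directory.

    Hypothetical helper for illustration; the file only contains a
    connection string of the form "dbname=... user=... host=...".
    """
    connect_string = f"dbname={dbname} user={user} host={host}"
    target = Path(directory) / ".psycopg_connection"
    target.write_text(connect_string, encoding="utf-8")
    return target
```

Calling write_connection_file("imdb", "postgres") from the directory in which the example is executed would create a file containing dbname=imdb user=postgres host=localhost. Both names are placeholders; depending on your setup, further parameters such as a port or password may be needed.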

Package structure

The postbound directory contains the actual source code of the framework. At a high level, the PostBOUND framework is structured as follows:

| Package | Description |
|---|---|
| optimizer | Provides the different optimization strategies, interfaces and some pre-defined algorithms |
| qal | Provides the query abstraction used throughout PostBOUND, as well as logic to parse and transform query instances |
| db | Contains all parts of PostBOUND that concern database interaction. That includes retrieving data from different database systems, as well as generating optimized queries to execute on the database system |
| experiments | Provides tools to conveniently load benchmarks and to measure their execution time for different optimization settings |
| util | Contains algorithms and types that do not belong to specific parts of PostBOUND and are more general in nature |
| vis | Contains utilities to visualize different concepts in query optimization (join orders, join graphs, query execution plans, ...) |

The actual optimization pipelines are defined in the postbound module at the package root. Depending on the specific use-case, different pipelines are available.
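
As a rough illustration of the kind of measurement that the experiments package is meant to automate, the sketch below optimizes and executes a handful of queries and records the wall-clock time of each execution. It only uses the calls that appear in the example above (optimize_query, execute_query and label-based workload access). run_workload and the layout of its result are hypothetical and do not reflect the actual API of the experiments package.

```python
import time


def run_workload(pipeline, database, workload, labels):
    """Optimize and execute the selected queries, timing each execution.

    pipeline needs an optimize_query method, database an execute_query
    method, and workload must support lookup by label (e.g. "1a").
    """
    measurements = {}
    for label in labels:
        optimized_query = pipeline.optimize_query(workload[label])
        # Only the execution is timed, not the optimization step.
        start = time.perf_counter()
        result = database.execute_query(optimized_query)
        elapsed = time.perf_counter() - start
        measurements[label] = {"result": result, "seconds": elapsed}
    return measurements
```

With the objects from the example, this could be called as run_workload(optimization_pipeline, postgres_instance, job_workload, ["1a", "1b"]). The sketch deliberately ignores concerns such as repeated runs, warm-up and caching, which matter for a serious benchmark.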


Literature

Footnotes

  1. Bergmann et al.: "PostBOUND: PostgreSQL with Upper Bound SPJ Query Optimization", BTW'23

  2. Hertzschuch et al.: "Simplicity Done Right for Join Ordering", CIDR'21

  3. Leis et al.: "How Good are Query Optimizers, Really?", PVLDB'15
