This implementation of Kanren supports defining, and efficiently querying, large-scale relations.
Typical use:
(require "PATH-TO-DBKANREN/dbk.rkt")
-
Replace higher-order micro implementation with formula building, rewriting, and an interpreter
-
Datalog performance improvements
- Semi-naive evaluation
- Lowering to relational algebra operations
- e.g., projecting away irrelevant variables to reduce intermediate duplication
- Magic sets transformation
-
Relation definitions and queries involving semantic extensions
- A relation defined with
define-relation
will raise an error if it does not support a finite mode- For Racket-embedded definitions, the error cannot be issued immediately due to mutually-recursive relations that may not be fully defined. Instead, the error will be issued at the earliest query or request for compilation.
- Use
define-search-relation
to suppress the error - e.g.,
- Each pure relation has one finite mode: all attributes can be safely unknown, aka
#f
(appendo xs ys xsys)
has two finite modes:(#t #t #f)
and(#f #f #t)
(membero x xs)
has one finite mode:(#f #t)
(evalo expr env result)
has no finite modes, so it's considered a search-based relation- Because exprs like
((lambda (d) (d d)) (lambda (d) (d d)))
are possible
- Because exprs like
- Each pure relation has one finite mode: all attributes can be safely unknown, aka
- A query executed with
run
will raise an error if it cannot guarantee termination- It may also provide diagnostics about worst-case behavior and execution progress
- Use
run-search
to suppress the error
- A relation defined with
-
Database improvements
- Maybe we don't need the augmented 0th column for relations and incomplete indexes
- Only index the listed columns? Consider missing columns to have been projected away.
- Try switching from fxvectors to using bytes where possible, and compress fixnums otherwise
id=>id
remapping, andcol<->row
transposition with varying byte widths: annoying?
- Improve checkpointing storage
- Consolidate block files to use fewer
- Make use of
(file-stream-buffer-mode in 'none)
file portsread-bytes
is as efficient asfile->bytes
if you know thefile-size
up front
- May need a protocol for scattered/interrupted/fragmented writes to block ids
- Will need a compaction protocol
- Make use of
- Should we move away from block names and just use block ids?
- Consolidate block files to use fewer
- Vectorized relational operators
- Consider radix burst sort for ingesting text more quickly
- Also consider alternatives to multi-merge for text
- Maybe we don't need the augmented 0th column for relations and incomplete indexes
-
Data access interface for queries
- What form should RA-level (table-like) entities take?
- Include database-path and a database-local UID for recognizing identitical entities
- Important for using functional dependencies and other optimizations
- Relation type (for eliminating pointless joins/unions/differences)
index-ordering => (cons key-columns position-columns)
- Once a query is planned, this info is used to build the necessary dicts
- Include database-path and a database-local UID for recognizing identitical entities
- What form should RA-level (table-like) entities take?
-
Try a simple text search strategy using a triples-containment relation
- Don't build this into the database, this can all be user-defined
- Filter name text to retain only lower-case alphanumeric (plus an empty value for 37 total values)
- Build the relation:
(text-contains text c0 c1 c2)
or(text-contains text compact-c012)
- Latter approach compacts c0 c1 c2 to a single 2-byte value
- c0 c1 c2 are treated as digits
- compact-c012 = c037^2 + c137 + c2
- which is < 52059, which is < 2^16
- Indexing
"foobar"
would produce these tuples:(text-contains "foobar" "f" "o" "o") (text-contains "foobar" "o" "o" "b") (text-contains "foobar" "o" "b" "a") (text-contains "foobar" "b" "a" "r") (text-contains "foobar" "a" "r" "_") (text-contains "foobar" "r" "_" "_")
- Searching for
"test"
would use this query to produce a tractable number of candidates:(run* candidate (text-contains candidate "t" "e" "s") (text-contains candidate "e" "s" "t"))
- Candidates can then be filtered using arbitrary rules defined in Racket
- Latter approach compacts c0 c1 c2 to a single 2-byte value
-
a smaller extended syntax
- logical connectives (not, and, or)
- apply for relations
- anonymous vars
-
explain
for extracting the database subset needed to reproduce a query's results- with each result, run an instrumented query to record supporting facts
- there may be redundancy (one result could be computed in multiple ways)
- analyze, simplify, and plan with query structure
- GOO join heuristic
- index-based path-finding to find candidate variable orderings
-
redesign states, strategies to support adaptive analysis and optimization
- decorate unexpanded user-defined relation constraints with the chain of
already-expanded parent/caller relations that led to this constraint
- to identify nonterminating loops
- to help measure progress
- optional bottom-up evaluation strategy
- safety analysis of relations referenced during query/materialization
- randomized variants of interleaving or depth-first search
- decorate unexpanded user-defined relation constraints with the chain of
already-expanded parent/caller relations that led to this constraint
-
use small tables for finite domains
-
fixed point computation
- binding signatures with constraints
- share work by generalizing signatures
- binding signatures with constraints
- improve
bounds
algebra by adding slightly-less and slightly-greater values for open interval endpoints- think of these as +epsilon, -epsilon
- define comparison operators, min, and max, over these values
-
place-based concurrency/parallelism?
-
thread-safe table retrieval
-
background worker threads/places for materialization
-
documentation and examples
- small tsv data example for testing materialization
- LiveJournal, Orkut graph benchmark examples for
- https://github.com/frankmcsherry/blog/blob/master/posts/2019-09-06.md
- reachability
- connected components
- single-source shortest path
- http://yellowstone.cs.ucla.edu/papers/rasql.pdf
- http://pages.cs.wisc.edu/~aws/papers/vldb19.pdf
-
support an interactive stepping/user-choice "strategy"
- both for debugging and devising new strategies
run/interactive
- same as a
run^
withcurrent-config
search-strategy
set tointeractive
- non-stream, first-order representation of strategy-independent states
- also support embedding and combining sub-states (hypothetical along a disjunct)
- states are just fancy representations of conjunctions, and should be combinable/disolvable
- also support embedding and combining sub-states (hypothetical along a disjunct)
- same as a
-
relation compilation to remove interpretive overhead
- mode analysis
- represent low-level operation representation depending on mode
- e.g., == can act like an assignment, or a type/equality test
- likewise for other constraint evaluations
- e.g., == can act like an assignment, or a type/equality test
- partial evaluation
- e.g., unify on partially-known term structures can be unrolled
- constraint evaluations can be pre-simplified and reordered
- code generation
-
eventually, make sure relation metadata contains information for analysis
- e.g., degree constraints, fast column ordering, subsumption tag/rules
- descriptions used for subsumption
- #(,relation ,attributes-satisfied ,attributes-pending)
- within a relation, table constraint A subsumes B if B's attributes-pending is a prefix of A's AND B does not have any attributes-satisfied that A does not have
- schema: heading, constraints, other dependencies
- heading: set of attributes and their types
- degree constraints (generalized functional dependencies)
- interpret degree constraints to find useful special cases
- functional dependency
- bijection (one-to-one mapping via opposing functional dependencies)
- uniqueness (functional dependency to full set of of attributes)
- interpret degree constraints to find useful special cases
- maybe join and inclusion dependencies
- body: finite set of tuples
-
floating point numbers are not valid terms
- reordering operations endangers soundness
- detect float literals when doing so won't hurt performance
- types
- text suffix
- suffix:
(ID . start)
- suffix:
- bigint, rational
- text suffix
- ingestion
- parsing: nq (n-quads)
- gathering statistics
- reservoir sampling
- optional column type inference
- per-type statistics for polymorphic columns
- count, min, max, sum, min-length, max-length
- histogram up to distinct element threshold
- range bucketing
-
intensional relation search strategy: backward or forward chaining
- safety-type could be inferred, or forced by parameter choice
- can use a begin-like macro to set the safety parameter:
(dbk-datalog-begin definitions ...)
(dbk-total-begin definitions ...)
- forward-chaining supports stratified negation and aggregation
-
metadata for extensional relations (user-level)
- constraints:
- degree (subsumes uniqueness, functional dependency, cardinality)
- in relation R, given (A B C), how many (D E F)s are there?
- lower and upper bounds, i.e., 0 to 2, exactly 1, at least 5, etc.
- maybe support more precise constraints given specific field values
- in relation R, given (A B C), how many (D E F)s are there?
- monotone dependencies
- e.g., C is nondecreasing when sorting by (A B C D)
- implies sorted rows for (C A B D) appear in same order as (A B C D)
- one index can support either ordering
- generalized: (sorted-columns (A B . C) (B . D) B) means:
- B is already nondecreasing
- C becomes nondecreasing once both A and B are chosen
- D becomes nondecreasing once B is chosen (regardless of A)
- degree (subsumes uniqueness, functional dependency, cardinality)
- statistics (derived from table statistics)
- constraints:
-
indicate or infer aggregation monotonicity for efficient incremental update
-
misc conveniences
apply
to supply a single argument/variable to an n-ary relation- optional keyword argument calling convention for relations
- user can implement this as Racket procedures that accept plist-style keyword args
-
relations
- local relation definitions to share work (cached results) during aggregation
(let-relations (((name param ...) formula ...) ...) formula ...)
- express subqueries by capturing other vars
- if
r
is a relation:(r arg+ ...)
relatesarg+ ...
byr
(relations-ref r)
accesses a metaprogramming control structure- configure persistence, representation, uniqueness and type constraints
- indicate evaluation preferences or hints, such as indexing
- dynamically disable/enable
- view source, statistics, configuration, preferences, other metadata
- local relation definitions to share work (cached results) during aggregation
-
evaluation
- query evaluation
- precompute relevant persistent/cached safe relations via forward-chaining
- safe/finite results loop (mostly backward-chaining)
- prune search space top-down to shrink scale of bulk operations
- expand recursive relation calls while safe
- safety: at least one never-increasing parameter is decreasing
- this safety measure still allows polynomial-time computations
- may prefer restricting to linear-time computations
- or a resource budget to limit absolute cost for any complexity
- track call history to measure this
- recursive calls of exactly the same size are considered failures
- safety: at least one never-increasing parameter is decreasing
- keep results of branching relation calls independent until join phase
- expand recursive relation calls while safe
maintain constraint satisfiabilityconstraint propagation loop:state-enforce-local-consistency
cheap first-pass via domain consistency, then arc consistency- backjump and learn clauses when conflict is detected
then global satisfiability via searchchoose candidate for the most-constrained variablemaybe the variable participating in largest number of constraints
interleave search candidate choice with cheap first-pass methods
- perform the (possibly multi-way) join with lowest estimated cost
- repeat until either all results computed or only unsafe calls remain
- prune search space top-down to shrink scale of bulk operations
- perform unsafe interleaving until finished or safe calls reappear
- like typical miniKanren search
- if safe calls reappear, re-enter safe/finite results loop
- safety (and computation) categories for relations
- simple
- persistent (extensional or precomputed intensional)
- compute via retrieval
- nonrecursive
- compute via naive expansion
- "safe" in the datalog sense
- compute fixed point via semi-naive eval after top-down rewriting
- all relations used in the body are safe
- all head parameter vars mentioned in positive position in body
- i.e., all head parameters will be bound to ground values
- no constructors containing vars in "dangerous" positions
- head parameters of recursive relation
- arguments to recursive relations in body
- aggregation is stratified
- "unsafe": top-down interleaving search
- no safety guarantees at all
- refutational incompleteness
- unlimited answers
- answers with unbound variables, possibly with constraints
- no safety guarantees at all
- persistent (extensional or precomputed intensional)
- more complex alternatives
- "safe" for forward chaining only
- compute fixed point via semi-naive eval
- generalizes "safe" datalog slightly
- constructors w/ vars allowed as args to recursive relations in body
- "safe" for backward chaining only
- compute via DFS
- may or may not guarantee all variables are bound (groundness)
- termination guaranteed by structural recursion argument metric
- "safe" for forward chaining only
- simple
- query evaluation
-
mode analysis
- modes as binding patterns with result cardinalities
- parameter binding classes from least to most constraining
- f: free (no guarantees of any kind)
- x: constrained (if it's a variable, it at least has some constraints)
- c: constructed (not a variable, but not necessarily fully ground)
- b: bound (fully ground)
- appendo: (b f f) -> (0 1) (b f b)
- bound first param leads to 0 or 1 result with bound third param
- parameter binding classes from least to most constraining
- mode-specific safety
- appendo termination requires (c f f) or (f f c) at every recursive call
- mode-specific cost
- different parameter bindings determine effectiveness of indices
- modes as binding patterns with result cardinalities
-
lazy population of text/non-atomic fields
- equality within the same shared-id column can be done by id
- also possible with foreign keys
- analogous to pointer address equality
- equality within the same nonshared-id column must be done by value
- equal if ids are the same, but may still be equal with different ids
- equality between incomparable columns must be done by value
- address spaces are different
- how is this integrated with mk search and query evaluation?
- can do this in general for functional dependencies
(conj (relate-function x a) (relate-function x b))
implies(== a b)
- can do this in general for functional dependencies
- would like similar "efficient join" behavior for text suffix indexes
- given multiple text constraints, find an efficient way to filter
- will involve intersecting a matching suffix list for each constraint
- one suffix list may be tiny, another may be huge
- rather than finding their intersection, it's likely more efficient to just throw the huge one away and run the filter directly on the strings for each member of the tiny suffix set
- how is this integrated with mk search and query evaluation?
- text search can be expressed with
string-appendo
constraints(conj (string-appendo t1 needle t2) (string-appendo t2 t3 hay))
- suffix index can be searched via
(string-appendo needle t1 hay-s)
- multiple needles
(conj (string-appendo n1 t1 hay-s) (string-appendo n2 t2 hay-s))
- maybe best done explicitly as aggregation via
:==
- text search can be expressed with
- given multiple text constraints, find an efficient way to filter
- equality within the same shared-id column can be done by id
-
finite relations may relate their tuples to an extra position parameter
(apply-relation-position R pos tuple)
-
aggregation via negation
-
can this be done efficiently?
-
edb:
(edge x y)
-
(define-relation (1-hops x targets) (=/= targets '()) (:== targets (query y (edge x y)))) or (define-relation (1-hops x targets) (=/= targets '()) (query== targets y (edge x y))) vs. (define-relation (1-hops-rest x prev targets) (conde ((== targets '()) (not (fresh (z) (any< prev z) (edge x z)))) ((fresh (y targets.rest) (== targets (cons y targets.rest)) (edge x y) (not (fresh (z) (any< prev z) (any< z y) (edge x z))) (1-hops-rest x y targets.rest))))) (define-relation (1-hops x targets) (fresh (y targets.rest) (== targets (cons y targets.rest)) (edge x y) (not (fresh (z) (any< z y) (edge x z))) (1-hops-rest x y targets.rest)))
(define-relation (max+o threshold x xs) (conde ((== xs '()) (== x threshold)) ((fresh (y ys) (== xs (cons y ys)) (conde ((any<= y threshold) (max+o threshold x ys)) ((any< threshold y) (max+o y x ys))))))) (define-relation (maxo x xs) (fresh (y ys) (== xs (cons y ys)) (max+o y x ys))) (define-relation (edge-largest x largest) (fresh (targets) (query== targets y (edge x y)) (maxo largest targets))) vs. (define-relation (edge-largest x largest) (edge x largest) (not (fresh (y) (any< largest y) (edge x y))))
-
We need some low-level language for transformation (not for humans or direct interpretation). Is this the right one?
- is there a more operational datalog-like notation? e.g., datalog + ordering hints?
- (and _ ...) and (or _ ...) can be used to specify specific join/union order and nesting
- but these seem hard, so this idea is probably not enough:
- what about specifying particular kinds of joins?
- what about projection and filtering?
- what about aggregation?
- how about sub-computation sharing and scope?
Initial sketch of relational algebra language for lower-level optimization transformations:
position ::= <natural number>
count ::= <natural number>
merge-op ::= + | * | min | max | set-union | (min-union count) | (max-union count)
op ::= cons | vector | list->vector | car | cdr | vector-ref | vector-length | merge-op
expr ::= (quote value) | (ref position position) | (app op expr ...)
cmp ::= == | any<= | any< | =/=
cx ::= (cmp expr expr) | (and cx ...) | (or cx ...)
relation ::= (quote ((value ...) ...)) ; What operator does this correspond to? Does it assume sorting/deduping? Probably no assumptions, just enumerate raw tuples.
| (table rid) ; This corresponds to a basic table scan, use a different operator for more efficient access.
| (rvar name)
| (filter (lambda 1 cx) relation)
| (project (lambda 1 (expr ...)) relation)
| (union relation ...)
| (subtract relation relation)
| (cross relation ...) ; TODO: replace with nested [for-]loop notation.
;; How do we specify n-way join shared keys?
;; New notation for join-key specification?
;; Operators for sorting (with direction), deduplication.
;; Operators for skipping (dropping), and limiting (taking).
| (agg relation (lambda 1 (expr ...)) value (lambda 1 (expr ...)) (merge-op ...)) ; rel group init map merge (map exprs and merge ops must be same length)
| (let ((name relation) ...) relation) ; Relations are materialized, nesting describes stratification.
| (letrec ((name relation) ...) relation) ; Relations computed as fixed points, relations are materialized, nesting describes stratification.
;; Let-bound relations should be specific about what physical tables/indexes are formed?
direction ::= < | >
direction-suffix ::= () | #f
offset ::= <natural number>
limit ::= <natural number> | #f
query ::= (query offset limit (direction ... . direction-suffix) relation) ; Do we need this? Should we embed sorting, skipping, and limiting as operators instead?
Example query thought experiments:
Maybe bounds feedback doesn't make sense, and should be part of the plan ordering, where bounds actually come from the bottom rather than the top? e.g., consider variable-centric plans for:
(run* (a b d e)
(exist (c)
(and (P a b c)
(Q c d e)
(R c d e))))
- c then (c d e) then c: (P ((P Q) (P R))) ; aka semijoins
- (join P [2] [1 2]
(join (join P [2] [] Q [0] [0 1 2]) [0 1 2] [1 2] ; this is a semijoin
(join P [2] [] R [0] [0 1 2]) [0 1 2] []) ; this is also a semijoin
[0] [1 2])
- (c d e) then c: (P (Q R))
- (join P [2] [1 2]
(join Q [0 1 2] [1 2] R [0 1 2] []) ; no semijoining performed
[0] [1 2])
But neither of these gives adaptive feedback ...
Upstream and downstream can't communicate bounds with this approach.
Ideally we would choose an ordering like:
- c d e
or
- d e c
Binary joins seem too inflexible here.
Multi-join GOO perspective:
- (run* (a c) (exist (b) (P a b) (Q b c)))
- max cardinality: (* (cardinality P a) (cardinality Q c))
- (run* (a b c) (P a b) (Q b c))
- max cardinality: (min (* (cardinality P a)
(cardinality Q b c))
(* (cardinality Q c)
(cardinality P a b))
(* (cardinality P a)
(cardinality Q c)
(min (cardinality P b)
(cardinality Q b))))
Naive selectivity for join cardinality estimation:
- |S join.x T| ~= (/ (* |S||T|) (max (cardinality S x) (cardinality T x)))
- can generalize for multi-way joins:
- if joining k relations
- numerator is a product of the k relation sizes
- denominator is a product of the (- k 1) largest join-column cardinalities
Non-materializing multi-way joins may be wasteful:
- (P a b) (Q b c) (R c d) (S d e)
- joining variables: b c d
- for each choice of b, we learn a set of cs and repeat some of the work of joining (R c d) (S d e) on d
- the repeated work has two parts:
- finding values of c where a satisfying (d e) exists
- bisection revisitation overhead in enumerating these
- if the result cardinality of (R c d) (S d e) is small, materializing this sub-join may be better
- semijoins as degenerate joins
- (P a b) (Q b c) where P is used to reduce the size of Q based on b
- equivalent to (P _ b) (Q b c)
- Q^ = (P _ b) (Q b c)
- R^ = (Q^ _ c) (R c d)
- S^ = (R^ _ d) (S d e)
- R^^ = (R^ c d) (S^ d _)
- Q^^ = (Q^ b c) (R^^ c _)
- P^ = (P a b) (Q^^ b _)
- arc-constrained query: (P^ a b) (Q^^ b c) (R^^ c d) (S^ d e)
- benefit: increased cardinality estimation accuracy
Multi-way joins are not wasteful when all relations share the same join attributes (assuming proper indexes):
- (P x y a) (Q x b y) (R c y x)
- can intersect on either (x y) or (y x), and simply enumerate crosses of the non-join attributes
Multi-way joins may improve over binary joins whose expected result cardinality is greater than the max cardinality of either input.
_ = blank
x-y = x <space> y ; for multi-word phrases, e.g., launch-all-missiles
x/y = x with y
x^y = x superscript y
x.y = x subscript y ; emphasizing grouping by x
y:x = x subscript y ; emphasizing grouping by y, the subscript; e.g., a type
; y with constructors x, or implementing a common
; interface of operations x, or other situations where
; if the context is clear, the y: could be dropped from
; the name
x@y = x at y ; result of projection or access using address/key y
x->y = x to y ; procedure mapping type x to y
x=>y = x to y ; finite map (e.g., hash or vector) with key type x
x&y = x and y ; a pair, or sometimes a 2-element list or vector
x* = 0 or more xs ; (typically homogeneous) lists, sometimes vectors
x? = x huh ; x is either a boolean(-ish) value itself, or a
; predicate (procedure returning a boolean(-ish) value)
x! = x bang ; x may cause important side effects, such as mutation
; or throwing an error; I/O operations often don't use
; this convention
x?! = assert x ; a predicate/guard that throws an error if false
- miniKanren: an embedded DSL for logic programming
- Bloom Programming Language
- Dedalus: Datalog in Time and Space
- Datafun Programming Language
- Flix Programming Language
- What You Always Wanted to Know About Datalog (And Never Dared to Ask)
- Integrating Datalog and Constraint Solving
- Compilation Of Bottom-Up Evaluation For a Pure Logic Programming Language
- Bottom-Up Evaluation
- Magic Sets
- Automatic Reordering for Dataflow Safety of Datalog
- Slides for Automatic Reordering for Dataflow Safety of Datalog
- A Cost Estimation Technique for Recursive Relational Algebra
- On the Optimization of Recursive Relational Queries: Application to Graph Queries
- Cost-Based Optimization for Magic: Algebra and Implementation