opt: add support for calculating selectivity from multi-column stats
This commit adds support for calculating selectivity from multi-column
statistics. It changes selectivityFromDistinctCounts to have the following
semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter
by using estimated distinct counts of each constrained column before
and after the filter was applied. We can perform this calculation in
two different ways: (1) by treating the columns as completely independent,
or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity
    by taking the product of selectivities of each constrained column. In
    the general case, this can be represented by the formula:
```
                     ┬-┬ ⎛ new_distinct(i) ⎞
      selectivity =  │ │ ⎜ --------------- ⎟
                     ┴ ┴ ⎝ old_distinct(i) ⎠
                    i in
                 {constrained
                   columns}
```
(2) If useMultiCol is true, we assume there is some correlation between
    columns. In this case, we calculate the selectivity using multi-column
    statistics.
```
                    ⎛ new_distinct({constrained columns}) ⎞
      selectivity = ⎜ ----------------------------------- ⎟
                    ⎝ old_distinct({constrained columns}) ⎠
```
    This formula looks simple, but the challenge is that it is difficult
    to determine the correct value for new_distinct({constrained columns})
    if each column is not constrained to a single value. For example, if
    new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2,
    3 or 4. We estimate the new distinct count as follows, using the concept
    of "soft functional dependency (FD) strength":
```
      new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

    where

      min_value = max(new_distinct(x), new_distinct(y))
      max_value = new_distinct(x) * new_distinct(y)
      range     = max_value - min_value

                    ⎛ max(old_distinct(x),old_distinct(y)) ⎞
      FD_strength = ⎜ ------------------------------------ ⎟
                    ⎝         old_distinct({x,y})          ⎠

                        ⎛ max(old_distinct(x), old_distinct(y)) ⎞
      min_FD_strength = ⎜ ------------------------------------- ⎟
                        ⎝   old_distinct(x) * old_distinct(y)   ⎠

                           ⎛ FD_strength - min_FD_strength ⎞
      FD_strength_scaled = ⎜ ----------------------------- ⎟
                           ⎝      1 - min_FD_strength      ⎠
```
    Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are
    perfectly correlated, old_distinct({x,y})=100. Using the example from
    above, new_distinct(x)=2 and new_distinct(y)=2. Plugging these values
    into the equation, we get:
```
      FD_strength_scaled  = 1
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```
    If x and y are completely independent, however, old_distinct({x,y})=1000.
    In this case, we get:
```
      FD_strength_scaled  = 0
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```
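The estimate in (2) can be sketched in Go as follows. This is only an
illustration of the formulas above, not the code added by this commit:
the function name estimateNewDistinctPair, its parameters, the guard for
min_FD_strength = 1, and the clamping of FD_strength_scaled to [0, 1] are
all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// estimateNewDistinctPair is a hypothetical helper that estimates
// new_distinct({x,y}) from single-column and multi-column distinct counts,
// following the "soft FD strength" formulas in the text.
func estimateNewDistinctPair(oldX, oldY, oldXY, newX, newY float64) float64 {
	minValue := math.Max(newX, newY)
	maxValue := newX * newY
	valueRange := maxValue - minValue

	maxOld := math.Max(oldX, oldY)
	fdStrength := maxOld / oldXY
	minFDStrength := maxOld / (oldX * oldY)

	// Assumption: when min_FD_strength is 1 (one column had a single
	// distinct value), the scaled strength is undefined, so treat the
	// columns as fully dependent.
	fdStrengthScaled := 1.0
	if minFDStrength < 1 {
		fdStrengthScaled = (fdStrength - minFDStrength) / (1 - minFDStrength)
	}
	// Assumption: clamp to [0, 1] to tolerate inconsistent estimates.
	fdStrengthScaled = math.Min(1, math.Max(0, fdStrengthScaled))

	return minValue + valueRange*(1-fdStrengthScaled)
}

func main() {
	// Perfectly correlated: old_distinct({x,y}) = 100.
	correlated := estimateNewDistinctPair(100, 10, 100, 2, 2)
	// Completely independent: old_distinct({x,y}) = 1000.
	independent := estimateNewDistinctPair(100, 10, 1000, 2, 2)

	// Equation (2): new_distinct({x,y}) / old_distinct({x,y}).
	fmt.Printf("correlated:  new_distinct=%.0f selectivity=%.3f\n", correlated, correlated/100)
	fmt.Printf("independent: new_distinct=%.0f selectivity=%.3f\n", independent, independent/1000)
}
```

With the inputs from the worked example, the sketch yields 2 distinct
values in the correlated case and 4 in the independent case, matching the
calculations above.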
Note that even if useMultiCol is true and we calculate the selectivity
based on equation (2) above, we still want to take equation (1) into
account. This is because it is possible that there are two predicates that
each have selectivity s, but the multi-column selectivity is also s. In
order to ensure that the cost model considers the two predicates combined
to be more selective than either one individually, we must give some weight
to equation (1). Therefore, instead of equation (2) we actually return the
following selectivity:
```
  selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```
where w is currently set to 0.9.
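
A minimal Go sketch of the blended calculation is shown here. The names
multiColWeight, independentSelectivity, and blendedSelectivity are
hypothetical and chosen for illustration; only the weight 0.9 and the two
equations come from the description above.

```go
package main

import "fmt"

// multiColWeight is w in the text: the weight given to equation (2).
const multiColWeight = 0.9

// independentSelectivity implements equation (1): the product of
// new_distinct(i)/old_distinct(i) over the constrained columns.
// Columns without a usable old distinct count are skipped (an assumption
// made for this sketch).
func independentSelectivity(oldDistinct, newDistinct []float64) float64 {
	selectivity := 1.0
	for i := range oldDistinct {
		if i >= len(newDistinct) || oldDistinct[i] <= 0 {
			continue
		}
		selectivity *= newDistinct[i] / oldDistinct[i]
	}
	return selectivity
}

// blendedSelectivity combines equation (1) and equation (2) as
// (1 - w) * eq1 + w * eq2.
func blendedSelectivity(eq1, eq2 float64) float64 {
	return (1-multiColWeight)*eq1 + multiColWeight*eq2
}

func main() {
	// Two predicates, each with selectivity s = 0.1, on columns that are
	// perfectly correlated, so the multi-column selectivity is also 0.1.
	eq1 := independentSelectivity([]float64{10, 10}, []float64{1, 1})
	eq2 := 0.1
	fmt.Printf("%.3f\n", blendedSelectivity(eq1, eq2)) // prints 0.091
}
```

In this example the blended value (about 0.091) is slightly below the
selectivity of either predicate alone (0.1), which is the property the
weighting is meant to preserve.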

This selectivity will be used later to update the row count and the
distinct count for the unconstrained columns.

Fixes #34422

Release note (performance improvement): Added support for calculating the
selectivity of filter predicates in the optimizer using multi-column
statistics. This improves the cardinality estimates of the optimizer when
a query has filter predicates constraining multiple columns. As a result,
the optimizer may choose a better query plan in some cases.
rytaft committed May 29, 2020
1 parent dbcde09 commit df34c56
Showing 23 changed files with 1,800 additions and 485 deletions.
79 changes: 40 additions & 39 deletions pkg/sql/opt/exec/execbuilder/testdata/upsert
@@ -528,42 +528,43 @@ EXPLAIN (VERBOSE)
 INSERT INTO target SELECT x, y, z FROM source WHERE (y IS NULL OR y > 0) AND x <> 1
 ON CONFLICT (b, c) DO UPDATE SET b=5
 ----
-· distributed false · ·
-· vectorized false · ·
-count · · () ·
-└── upsert · · () ·
-│ into target(a, b, c) · ·
-│ strategy opt upserter · ·
-│ auto commit · · ·
-└── render · · (x, y, z, a, b, c, upsert_b, a) ·
-│ render 0 x · ·
-│ render 1 y · ·
-│ render 2 z · ·
-│ render 3 a · ·
-│ render 4 b · ·
-│ render 5 c · ·
-│ render 6 upsert_b · ·
-│ render 7 a · ·
-└── render · · (upsert_b, x, y, z, a, b, c) ·
-│ render 0 CASE WHEN a IS NULL THEN y ELSE 5 END · ·
-│ render 1 x · ·
-│ render 2 y · ·
-│ render 3 z · ·
-│ render 4 a · ·
-│ render 5 b · ·
-│ render 6 c · ·
-└── lookup-join · · (x, y, z, a, b, c) ·
-│ table target@target_b_c_key · ·
-│ type left outer · ·
-│ equality (y, z) = (b, c) · ·
-│ equality cols are key · · ·
-│ parallel · · ·
-└── distinct · · (x, y, z) ·
-│ distinct on y, z · ·
-│ nulls are distinct · · ·
-│ error on duplicate · · ·
-│ order key y, z · ·
-└── scan · · (x, y, z) +y,+z
-· table source@source_y_z_idx · ·
-· spans /NULL-/!NULL /1- · ·
-· filter x != 1 · ·
+· distributed false · ·
+· vectorized false · ·
+count · · () ·
+└── upsert · · () ·
+│ into target(a, b, c) · ·
+│ strategy opt upserter · ·
+│ auto commit · · ·
+└── render · · (x, y, z, a, b, c, upsert_b, a) ·
+│ render 0 x · ·
+│ render 1 y · ·
+│ render 2 z · ·
+│ render 3 a · ·
+│ render 4 b · ·
+│ render 5 c · ·
+│ render 6 upsert_b · ·
+│ render 7 a · ·
+└── render · · (upsert_b, x, y, z, a, b, c) ·
+│ render 0 CASE WHEN a IS NULL THEN y ELSE 5 END · ·
+│ render 1 x · ·
+│ render 2 y · ·
+│ render 3 z · ·
+│ render 4 a · ·
+│ render 5 b · ·
+│ render 6 c · ·
+└── merge-join · · (a, b, c, x, y, z) ·
+│ type right outer · ·
+│ equality (b, c) = (y, z) · ·
+│ mergeJoinOrder +"(b=y)",+"(c=z)" · ·
+├── scan · · (a, b, c) +b,+c
+│ table target@target_b_c_key · ·
+│ spans FULL SCAN · ·
+└── distinct · · (x, y, z) +y,+z
+│ distinct on y, z · ·
+│ nulls are distinct · · ·
+│ error on duplicate · · ·
+│ order key y, z · ·
+└── scan · · (x, y, z) +y,+z
+· table source@source_y_z_idx · ·
+· spans /NULL-/!NULL /1- · ·
+· filter x != 1 · ·