
opt: add support for multi-column statistics #34422

Closed
rytaft opened this issue Jan 30, 2019 · 0 comments · Fixed by #49134
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


rytaft commented Jan 30, 2019

Currently, we only support creating table statistics on single columns. Although the syntax for creating statistics on multiple columns is accepted, executing such a statement results in an error. For example:

> CREATE STATISTICS employee_multi_col_stats ON first_name, last_name FROM employees;
ERROR:  multi-column sketches not supported yet

This issue will track support for this feature.

@rytaft rytaft self-assigned this Jan 30, 2019
@awoods187 awoods187 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. labels Mar 6, 2019
craig bot pushed a commit that referenced this issue Mar 6, 2019
34428: sql: Update error message for multi-column stats r=rmloveland a=rmloveland

Updated the error message as follows:

- Changed from referencing "multi-column sketches" (which are not
  mentioned in our docs anywhere) to "multi-column statistics"

- Added link to CockroachDB issue discussing multi-column stats for the
  user's reference.

See also: #34422.

Release note: None

35410: sqlsmith: major refactor r=mjibson a=mjibson

Refactor sqlsmith to use internal cockroach types for everything. Change reference handling to be more explicit and correct. These should enable us to more easily and correctly add 

35427: roachtest: Fix blacklist lookups to handle change from v2.2 to v19.1 r=BramGruneir a=BramGruneir

I've been working on this fix as part of #35423, but that's been taking too long and I don't want the tests to keep failing. So here is a direct fix that also updates the blacklist for psycopg, as a new test is now passing.

Fixes #35271.
Fixes #35285.

Release note: None

Co-authored-by: Rich Loveland <rich@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Bram Gruneir <bram@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 20, 2020
Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns.
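The prefix enumeration described above can be sketched as follows. This is a hypothetical `index_prefixes` helper for illustration only, not the actual CockroachDB implementation:

```python
def index_prefixes(index_cols):
    """Return the column sets to collect stats on for one index:
    every prefix of the index columns, e.g. (a), (a, b), (a, b, c)."""
    return [tuple(index_cols[:i]) for i in range(1, len(index_cols) + 1)]

# For an index on (a, b, c), stats are now collected on (a, b) and
# (a, b, c); the single-column stat on (a,) was already collected.
print(index_prefixes(["a", "b", "c"]))
```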
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 21, 2020
Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

A new cluster setting sql.stats.multi_column_collection.enabled
controls this feature. It is currently enabled by default, but automatic
collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns. This
feature can be disabled by setting the cluster setting
sql.stats.multi_column_collection.enabled to false.
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 21, 2020
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 22, 2020
craig bot pushed a commit that referenced this issue Apr 22, 2020
47729: sql: collect multi-column stats on index prefixes r=rytaft a=rytaft

Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

A new cluster setting `sql.stats.multi_column_collection.enabled`
controls this feature. It is currently enabled by default, but automatic
collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs #34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns. This
feature can be disabled by setting the cluster setting
sql.stats.multi_column_collection.enabled to false.

47788: vendor: bump pebble to b976c257531658d4b96577393866c42d7b4ca801 r=petermattis a=petermattis

* internal/manifest: Fix merge skew with Formatter -> FormatKey change
* internal/manifest: Introduce L0SubLevels data structure
* *: introduce Comparer.FormatValue
* *: rename base.Formatter to base.FormatKey
* tool: enhance `sstable check` to catch bloom filter problems
* tool: add db properties subcommand
* tool: fix sstable panic

Enabled pretty-printing of values in `debug pebble` commands by hooking
up `MVCCComparer.FormatValue`.

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 28, 2020
Prior to this commit, histograms and multi-column stats were always used
by the optimizer for cardinality estimation if they were available. This
commit adds two new session settings along with two corresponding cluster
settings to control whether or not the optimizer should use histograms or
multi-column stats.

Note that even if the settings are enabled, the optimizer cannot use these
statistics if they haven't been collected. Collection of histograms and
multi-column stats is still controlled by the existing cluster settings
`sql.stats.histogram_collection.enabled` and
`sql.stats.multi_column_collection.enabled`.

The new settings to control usage by the optimizer are:

Cluster settings:
```
sql.defaults.optimizer_use_histograms.enabled
sql.defaults.optimizer_use_multicol_stats.enabled
```

Session settings:
```
optimizer_use_histograms
optimizer_use_multicol_stats
```

Both settings are enabled by default.

Fixes cockroachdb#43308
Informs cockroachdb#38082
Informs cockroachdb#34422

Release note (sql change): Added two new session settings and corresponding
cluster settings to control whether the optimizer uses histograms and
multi-column statistics for cardinality estimation. The session settings
are optimizer_use_histograms and optimizer_use_multicol_stats, with
corresponding cluster settings sql.defaults.optimizer_use_histograms.enabled
and sql.defaults.optimizer_use_multicol_stats.enabled. Both settings are
enabled by default.
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 28, 2020
craig bot pushed a commit that referenced this issue Apr 28, 2020
48105: opt: add settings to enable usage of histograms and multi-column stats r=rytaft a=rytaft


Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue May 4, 2020
This commit updates the injected stats for the TPC-C and TPC-H stats
quality tests to include multi-column stats. Since multi-column stats
are currently only used by the optimizer in a few limited cases, none
of the plans changed.

This commit is in preparation for work that will be done in the optimizer
to use multi-column stats for selectivity estimation.

Informs cockroachdb#34422

Release note: None
craig bot pushed a commit that referenced this issue May 5, 2020
48374: opt: update stats quality tests with multi-column stats r=rytaft a=rytaft

This commit updates the injected stats for the TPC-C and TPC-H stats
quality tests to include multi-column stats. Since multi-column stats
are currently only used by the optimizer in a few limited cases, none
of the plans changed.

This commit is in preparation for work that will be done in the optimizer
to use multi-column stats for selectivity estimation.

Informs #34422

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue May 15, 2020
This commit adds support for calculating selectivity from multi-column
statistics. It changes selectivityFromDistinctCounts to have the following
semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter
by using estimated distinct counts of each constrained column before
and after the filter was applied. We can perform this calculation in
two different ways: (1) by treating the columns as completely independent,
or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity
    by taking the product of selectivities of each constrained column. In
    the general case, this can be represented by the formula:
```
                     ┬-┬ ⎛ new_distinct(i) ⎞
      selectivity =  │ │ ⎜ --------------- ⎟
                     ┴ ┴ ⎝ old_distinct(i) ⎠
                    i in
                 {constrained
                   columns}
```
(2) If useMultiCol is true, we assume there is some correlation between
    columns. In this case, we calculate the selectivity using multi-column
    statistics.
```
                    ⎛ new_distinct({constrained columns}) ⎞
      selectivity = ⎜ ----------------------------------- ⎟
                    ⎝ old_distinct({constrained columns}) ⎠
```
    This formula looks simple, but the challenge is that it is difficult
    to determine the correct value for new_distinct({constrained columns})
    if each column is not constrained to a single value. For example, if
    new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2,
    3 or 4. We estimate the new distinct count as follows, using the concept
    of "soft functional dependency (FD) strength":
```
      new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

    where

      min_value = max(new_distinct(x), new_distinct(y))
      max_value = new_distinct(x) * new_distinct(y)
      range     = max_value - min_value

                    ⎛ max(old_distinct(x),old_distinct(y)) ⎞
      FD_strength = ⎜ ------------------------------------ ⎟
                    ⎝         old_distinct({x,y})          ⎠

                        ⎛ max(old_distinct(x), old_distinct(y)) ⎞
      min_FD_strength = ⎜ ------------------------------------- ⎟
                        ⎝   old_distinct(x) * old_distinct(y)   ⎠

                           ⎛ FD_strength - min_FD_strength ⎞
      FD_strength_scaled = ⎜ ----------------------------- ⎟
                           ⎝      1 - min_FD_strength      ⎠
```
    Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are
    perfectly correlated, old_distinct({x,y})=100. Using the example from
    above, new_distinct(x)=2 and new_distinct(y)=2. Plugging in the values
    into the equation, we get:
```
      FD_strength_scaled  = 1
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```
    If x and y are completely independent, however, old_distinct({x,y})=1000.
    In this case, we get:
```
      FD_strength_scaled  = 0
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```
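The two worked examples above can be reproduced with a short numeric sketch. This is a simplified standalone calculation following the formulas in this message, not the optimizer's actual code:

```python
def estimate_new_distinct(new_x, new_y, old_x, old_y, old_xy):
    """Estimate new_distinct({x,y}) from single-column distinct counts
    using the soft functional-dependency (FD) strength formula."""
    min_value = max(new_x, new_y)
    max_value = new_x * new_y
    rng = max_value - min_value
    fd_strength = max(old_x, old_y) / old_xy
    min_fd_strength = max(old_x, old_y) / (old_x * old_y)
    fd_strength_scaled = (fd_strength - min_fd_strength) / (1 - min_fd_strength)
    return min_value + rng * (1 - fd_strength_scaled)

# Perfectly correlated: old_distinct({x,y}) = 100, so the estimate is 2.
print(estimate_new_distinct(2, 2, 100, 10, 100))   # 2.0
# Completely independent: old_distinct({x,y}) = 1000, so the estimate is 4.
print(estimate_new_distinct(2, 2, 100, 10, 1000))  # 4.0
```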
Note that even if useMultiCol is true and we calculate the selectivity
based on equation (2) above, we still want to take equation (1) into
account. This is because it is possible that there are two predicates that
each have selectivity s, but the multi-column selectivity is also s. In
order to ensure that the cost model considers the two predicates combined
to be more selective than either one individually, we must give some weight
to equation (1). Therefore, instead of equation (2) we actually return the
following selectivity:
```
  selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```
where w is currently set to 0.9.

This selectivity will be used later to update the row count and the
distinct count for the unconstrained columns.
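The weighting step can be sketched as follows, with w = 0.9 as stated above; the eq. 1 and eq. 2 input values here are arbitrary illustrative numbers:

```python
def combined_selectivity(sel_independent, sel_multicol, w=0.9):
    """Blend the independence-based selectivity (eq. 1) with the
    multi-column selectivity (eq. 2) so that two combined predicates
    are always estimated as more selective than either one alone."""
    return (1 - w) * sel_independent + w * sel_multicol

# Example: eq. 1 gives 0.04 (= 0.2 * 0.2), eq. 2 gives 0.2.
print(combined_selectivity(0.04, 0.2))  # 0.184
```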

Fixes cockroachdb#34422

Release note (performance improvement): Added support for calculating the
selectivity of filter predicates in the optimizer using multi-column
statistics. This improves the cardinality estimates of the optimizer when
a query has filter predicates constraining multiple columns. As a result,
the optimizer may choose a better query plan in some cases.
rytaft added a commit to rytaft/cockroach that referenced this issue May 20, 2020
craig bot pushed a commit that referenced this issue May 29, 2020
49134: opt: add support for calculating selectivity from multi-column stats r=rytaft a=rytaft

**opt: fix bug with histograms and multi-span index constraints**

Prior to this commit, filtering a histogram with a multi-span index
constraint could lead to incorrect results if the histogram column
was part of the exact prefix of the constraint. This was due to the
fact that the same value appeared in multiple spans, breaking the
assumption of the histogram code that values in spans were always
ordered and non-overlapping. For example, filtering a histogram on
column b with the constraint `/b/c/: [/1/2 - /1/2] [/1/4 - /1/4]`
would return an incorrect result, because the values in the matching
histogram bucket would be counted twice.

This commit fixes the problem by only considering a single span if
the column is part of the exact prefix.
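A minimal sketch of the fix, using hypothetical helper and parameter names; spans are represented as (start, end) point tuples over the constraint columns:

```python
def histogram_values_for_column(spans, exact_prefix_len, col_idx):
    """Collect the values to feed into the histogram filter for one column.
    If the column is part of the exact prefix, every span repeats the same
    value, so only a single span is considered to avoid double-counting."""
    if col_idx < exact_prefix_len:
        spans = spans[:1]  # the fix: one span suffices for exact-prefix columns
    return [span[col_idx] for span in spans]

# Constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4]; b (index 0) is exact prefix.
spans = [(1, 2), (1, 4)]
# Before the fix the value 1 was counted once per span; now it appears once.
print(histogram_values_for_column(spans, exact_prefix_len=1, col_idx=0))  # [1]
```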

Release note (performance improvement): Fixed a bug in the optimizer's
histogram logic that caused an over-estimate of the cardinality of
constrained index scans in some cases when multiple columns of the
index were constrained. This problem was introduced early in
development for the 20.2 release, so it should never have appeared in
a released version. The fix improves the optimizer's cardinality
estimates and may result in better query plan selection.

**opt: fix distinct count estimates for index constraint columns**

Prior to this commit, the statistics_builder was incorrectly estimating
the distinct count of columns that were only slightly constrained as part
of an index constraint. For example, it was estimating based on
constraints such as `/a/b: [/1 - /5/6]` or `/a/b: [ - /5/6]` that the distinct
count of column b should be reduced by 2/3. However, in reality, we cannot
assume anything about the distinct count of column b based on those two
constraints.

This commit fixes the estimate by only reducing the distinct count for
columns that are part of the prefix of the constraint (columns for which
all the spans have the same start and end values) or the first column after.
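The rule above can be sketched as follows, using a hypothetical representation in which each constraint column is marked exact when all spans have the same start and end value for it:

```python
def reducible_columns(exact_flags):
    """Return indices of columns whose distinct count may be reduced:
    the exact-prefix columns plus the first column after the prefix."""
    prefix_len = 0
    for is_exact in exact_flags:
        if not is_exact:
            break
        prefix_len += 1
    return list(range(min(prefix_len + 1, len(exact_flags))))

# /a/b: [/1 - /5/6] -> neither column is exact, so only column a (index 0)
# may have its distinct count reduced; b keeps its original estimate.
print(reducible_columns([False, False]))  # [0]
# /a/b: [/3/1 - /3/9] -> a is exact, so both a and b may be reduced.
print(reducible_columns([True, False]))   # [0, 1]
```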

Release note (performance improvement): Fixed the optimizer's distinct count
estimate for columns constrained by an index constraint, which was too low
in some cases. The fix improves the optimizer's cardinality estimates, which
can lead to better query plan selection.

**opt: ensure multi-col stats are consistent with single-col stats**

Prior to this commit, it was possible that the estimated distinct count
of a multi-column statistic was larger than the product of the estimated
distinct counts of all of its individual columns. This could happen because
we reduce the distinct count of columns that are constrained by a predicate,
but we don't update the distinct counts of any multi-column stats that
contain that column. This is internally inconsistent and could lead to bad
cardinality estimates.

This commit fixes the problem by explicitly checking for such
inconsistencies and fixing them after the stats for each operator are
built or a column stat is calculated. If it is found that the estimated
distinct count is too high, it is reset to the product of the distinct
counts of its constituent columns.

This commit also adds a check that the distinct count of the multi-column
statistic is no smaller than the largest distinct count of its constituent
columns (although I haven't yet found a test case where this was a problem).
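The consistency check amounts to a clamp, sketched here as a simplified standalone version of the described fix:

```python
def clamp_multicol_distinct(multi_estimate, single_estimates):
    """Keep a multi-column distinct-count estimate consistent with its
    constituent single-column estimates: no larger than their product,
    and no smaller than the largest of them."""
    upper = 1
    for d in single_estimates:
        upper *= d
    lower = max(single_estimates)
    return min(max(multi_estimate, lower), upper)

# Estimate 50 exceeds the product 10 * 4 = 40, so it is reset to 40.
print(clamp_multicol_distinct(50, [10, 4]))  # 40
# Estimate 3 is below the largest single-column count, so it is raised to 10.
print(clamp_multicol_distinct(3, [10, 4]))   # 10
```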

Release note (performance improvement): Fixed the optimizer's estimated
distinct count for a multi-column statistic when all of the columns in the
statistic are constrained by a filter predicate. The fix can lead to
improved cardinality estimates, leading to better query plan selection
in some cases.

**opt: add support for calculating selectivity from multi-column stats**

Fixes #34422


Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@craig craig bot closed this as completed in df34c56 May 29, 2020
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020