
opt: add support for multi-column statistics #34422

Closed
rytaft opened this issue Jan 30, 2019 · 0 comments · Fixed by #49134
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


rytaft commented Jan 30, 2019

Currently, we only support creating table statistics on single columns. Although the syntax for creating statistics on multiple columns is accepted, executing such a statement results in an error. For example:

> CREATE STATISTICS employee_multi_col_stats ON first_name, last_name FROM employees;
ERROR:  multi-column sketches not supported yet

This issue will track support for this feature.

@rytaft rytaft self-assigned this Jan 30, 2019
@awoods187 awoods187 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. labels Mar 6, 2019
craig bot pushed a commit that referenced this issue Mar 6, 2019
34428: sql: Update error message for multi-column stats r=rmloveland a=rmloveland

Updated the error message as follows:

- Changed from referencing "multi-column sketches" (which are not
  mentioned in our docs anywhere) to "multi-column statistics"

- Added link to CockroachDB issue discussing multi-column stats for the
  user's reference.

See also: #34422.

Release note: None

35410: sqlsmith: major refactor r=mjibson a=mjibson

Refactor sqlsmith to use internal cockroach types for everything. Change reference handling to be more explicit and correct. These should enable us to more easily and correctly add 

35427: roachtest: Fix blacklist lookups to handle change from v2.2 to v19.1 r=BramGruneir a=BramGruneir

I've been working on this fix as part of #35423, but that's been taking too long and I don't want the tests to keep failing. So here is a direct fix that also updates the blacklist for psycopg, as a new test is now passing.

Fixes #35271.
Fixes #35285.

Release note: None

Co-authored-by: Rich Loveland <rich@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Bram Gruneir <bram@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 20, 2020
Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns.
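The prefix enumeration described above can be sketched as follows. This is a hypothetical `index_prefixes` helper for illustration only, not the actual CockroachDB implementation:

```python
def index_prefixes(index_cols):
    """Return the column sets to collect stats on for one index:
    every prefix of the index columns, e.g. (a), (a, b), (a, b, c)."""
    return [tuple(index_cols[:i]) for i in range(1, len(index_cols) + 1)]

# For an index on (a, b, c), stats are now collected on (a, b) and
# (a, b, c); the single-column stat on (a,) was already collected.
print(index_prefixes(["a", "b", "c"]))
```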
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 21, 2020
Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

A new cluster setting sql.stats.multi_column_collection.enabled
controls this feature. It is currently enabled by default, but automatic
collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns. This
feature can be disabled by setting the cluster setting
sql.stats.multi_column_collection.enabled to false.
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 21, 2020
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 22, 2020
craig bot pushed a commit that referenced this issue Apr 22, 2020
47729: sql: collect multi-column stats on index prefixes r=rytaft a=rytaft

Prior to this commit, only single-column statistics were supported
and collected. This commit adds support for collecting multi-column
statistics, and by default collects stats on all prefixes of each
index (in addition to all the single column statistics that were already
collected). For example, if there is an index on (a, b, c), we now
collect stats on (a, b) and (a, b, c) in addition to stats on each
individual column.

A new cluster setting `sql.stats.multi_column_collection.enabled`
controls this feature. It is currently enabled by default, but automatic
collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the
optimizer. That change will come in a future PR.

Informs #34422

Release note (sql change): Added support for collection of multi-column
statistics. By default, statistics are now collected on all prefixes of
each index, in addition to up to 100 other individual columns. This
feature can be disabled by setting the cluster setting
sql.stats.multi_column_collection.enabled to false.

47788: vendor: bump pebble to b976c257531658d4b96577393866c42d7b4ca801 r=petermattis a=petermattis

* internal/manifest: Fix merge skew with Formatter -> FormatKey change
* internal/manifest: Introduce L0SubLevels data structure
* *: introduce Comparer.FormatValue
* *: rename base.Formatter to base.FormatKey
* tool: enhance `sstable check` to catch bloom filter problems
* tool: add db properties subcommand
* tool: fix sstable panic

Enabled pretty-printing of values in `debug pebble` commands by hooking
up `MVCCComparer.FormatValue`.

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 28, 2020
Prior to this commit, histograms and multi-column stats were always used
by the optimizer for cardinality estimation if they were available. This
commit adds two new session settings along with two corresponding cluster
settings to control whether or not the optimizer should use histograms or
multi-column stats.

Note that even if the settings are enabled, the optimizer cannot use these
statistics if they haven't been collected. Collection of histograms and
multi-column stats is still controlled by the existing cluster settings
`sql.stats.histogram_collection.enabled` and
`sql.stats.multi_column_collection.enabled`.

The new settings to control usage by the optimizer are:

Cluster settings:
```
sql.defaults.optimizer_use_histograms.enabled
sql.defaults.optimizer_use_multicol_stats.enabled
```

Session settings:
```
optimizer_use_histograms
optimizer_use_multicol_stats
```

Both settings are enabled by default.

Fixes cockroachdb#43308
Informs cockroachdb#38082
Informs cockroachdb#34422

Release note (sql change): Added two new session settings and corresponding
cluster settings to control whether the optimizer uses histograms and
multi-column statistics for cardinality estimation. The session settings
are optimizer_use_histograms and optimizer_use_multicol_stats, with
corresponding cluster settings sql.defaults.optimizer_use_histograms.enabled
and sql.defaults.optimizer_use_multicol_stats.enabled. Both settings are
enabled by default.
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 28, 2020
craig bot pushed a commit that referenced this issue Apr 28, 2020
48105: opt: add settings to enable usage of histograms and multi-column stats r=rytaft a=rytaft


Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue May 4, 2020
This commit updates the injected stats for the TPC-C and TPC-H stats
quality tests to include multi-column stats. Since multi-column stats
are currently only used by the optimizer in a few limited cases, none
of the plans changed.

This commit is in preparation for work that will be done in the optimizer
to use multi-column stats for selectivity estimation.

Informs cockroachdb#34422

Release note: None
craig bot pushed a commit that referenced this issue May 5, 2020
48374: opt: update stats quality tests with multi-column stats r=rytaft a=rytaft

This commit updates the injected stats for the TPC-C and TPC-H stats
quality tests to include multi-column stats. Since multi-column stats
are currently only used by the optimizer in a few limited cases, none
of the plans changed.

This commit is in preparation for work that will be done in the optimizer
to use multi-column stats for selectivity estimation.

Informs #34422

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue May 15, 2020
This commit adds support for calculating selectivity from multi-column
statistics. It changes selectivityFromDistinctCounts to have the following
semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter
by using estimated distinct counts of each constrained column before
and after the filter was applied. We can perform this calculation in
two different ways: (1) by treating the columns as completely independent,
or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity
    by taking the product of selectivities of each constrained column. In
    the general case, this can be represented by the formula:
```
                     ┬-┬ ⎛ new_distinct(i) ⎞
      selectivity =  │ │ ⎜ --------------- ⎟
                     ┴ ┴ ⎝ old_distinct(i) ⎠
                    i in
                 {constrained
                   columns}
```
(2) If useMultiCol is true, we assume there is some correlation between
    columns. In this case, we calculate the selectivity using multi-column
    statistics.
```
                    ⎛ new_distinct({constrained columns}) ⎞
      selectivity = ⎜ ----------------------------------- ⎟
                    ⎝ old_distinct({constrained columns}) ⎠
```
    This formula looks simple, but the challenge is that it is difficult
    to determine the correct value for new_distinct({constrained columns})
    if each column is not constrained to a single value. For example, if
    new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2,
    3 or 4. We estimate the new distinct count as follows, using the concept
    of "soft functional dependency (FD) strength":
```
      new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

    where

      min_value = max(new_distinct(x), new_distinct(y))
      max_value = new_distinct(x) * new_distinct(y)
      range     = max_value - min_value

                    ⎛ max(old_distinct(x),old_distinct(y)) ⎞
      FD_strength = ⎜ ------------------------------------ ⎟
                    ⎝         old_distinct({x,y})          ⎠

                        ⎛ max(old_distinct(x), old_distinct(y)) ⎞
      min_FD_strength = ⎜ ------------------------------------- ⎟
                        ⎝   old_distinct(x) * old_distinct(y)   ⎠

                           ⎛ FD_strength - min_FD_strength ⎞
      FD_strength_scaled = ⎜ ----------------------------- ⎟
                           ⎝      1 - min_FD_strength      ⎠
```
    Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are
    perfectly correlated, old_distinct({x,y})=100. Using the example from
    above, new_distinct(x)=2 and new_distinct(y)=2. Plugging in the values
    into the equation, we get:
```
      FD_strength_scaled  = 1
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```
    If x and y are completely independent, however, old_distinct({x,y})=1000.
    In this case, we get:
```
      FD_strength_scaled  = 0
      new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```
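The two worked examples above can be reproduced with a short numeric sketch. This is a simplified standalone calculation following the formulas in this message, not the optimizer's actual code:

```python
def estimate_new_distinct(new_x, new_y, old_x, old_y, old_xy):
    """Estimate new_distinct({x,y}) from single-column distinct counts
    using the soft functional-dependency (FD) strength formula."""
    min_value = max(new_x, new_y)
    max_value = new_x * new_y
    rng = max_value - min_value
    fd_strength = max(old_x, old_y) / old_xy
    min_fd_strength = max(old_x, old_y) / (old_x * old_y)
    fd_strength_scaled = (fd_strength - min_fd_strength) / (1 - min_fd_strength)
    return min_value + rng * (1 - fd_strength_scaled)

# Perfectly correlated: old_distinct({x,y}) = 100, so the estimate is 2.
print(estimate_new_distinct(2, 2, 100, 10, 100))   # 2.0
# Completely independent: old_distinct({x,y}) = 1000, so the estimate is 4.
print(estimate_new_distinct(2, 2, 100, 10, 1000))  # 4.0
```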
Note that even if useMultiCol is true and we calculate the selectivity
based on equation (2) above, we still want to take equation (1) into
account. This is because it is possible that there are two predicates that
each have selectivity s, but the multi-column selectivity is also s. In
order to ensure that the cost model considers the two predicates combined
to be more selective than either one individually, we must give some weight
to equation (1). Therefore, instead of equation (2) we actually return the
following selectivity:
```
  selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```
where w is currently set to 0.9.

This selectivity will be used later to update the row count and the
distinct count for the unconstrained columns.
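The weighting step can be sketched as follows, with w = 0.9 as stated above; the eq. 1 and eq. 2 input values here are arbitrary illustrative numbers:

```python
def combined_selectivity(sel_independent, sel_multicol, w=0.9):
    """Blend the independence-based selectivity (eq. 1) with the
    multi-column selectivity (eq. 2) so that two combined predicates
    are always estimated as more selective than either one alone."""
    return (1 - w) * sel_independent + w * sel_multicol

# Example: eq. 1 gives 0.04 (= 0.2 * 0.2), eq. 2 gives 0.2.
print(combined_selectivity(0.04, 0.2))  # 0.184
```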

Fixes cockroachdb#34422

Release note (performance improvement): Added support for calculating the
selectivity of filter predicates in the optimizer using multi-column
statistics. This improves the cardinality estimates of the optimizer when
a query has filter predicates constraining multiple columns. As a result,
the optimizer may choose a better query plan in some cases.
rytaft added a commit to rytaft/cockroach that referenced this issue May 20, 2020
craig bot pushed a commit that referenced this issue May 29, 2020
49134: opt: add support for calculating selectivity from multi-column stats r=rytaft a=rytaft

**opt: fix bug with histograms and multi-span index constraints**

Prior to this commit, filtering a histogram with a multi-span index
constraint could lead to incorrect results if the histogram column
was part of the exact prefix of the constraint. This was due to the
fact that the same value appeared in multiple spans, breaking the
assumption of the histogram code that values in spans were always
ordered and non-overlapping. For example, filtering a histogram on
column b with the constraint `/b/c/: [/1/2 - /1/2] [/1/4 - /1/4]`
would return an incorrect result, because the values in the matching
histogram bucket would be counted twice.

This commit fixes the problem by only considering a single span if
the column is part of the exact prefix.
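A minimal sketch of the fix, using hypothetical helper and parameter names; spans are represented as (start, end) point tuples over the constraint columns:

```python
def histogram_values_for_column(spans, exact_prefix_len, col_idx):
    """Collect the values to feed into the histogram filter for one column.
    If the column is part of the exact prefix, every span repeats the same
    value, so only a single span is considered to avoid double-counting."""
    if col_idx < exact_prefix_len:
        spans = spans[:1]  # the fix: one span suffices for exact-prefix columns
    return [span[col_idx] for span in spans]

# Constraint /b/c/: [/1/2 - /1/2] [/1/4 - /1/4]; b (index 0) is exact prefix.
spans = [(1, 2), (1, 4)]
# Before the fix the value 1 was counted once per span; now it appears once.
print(histogram_values_for_column(spans, exact_prefix_len=1, col_idx=0))  # [1]
```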

Release note (performance improvement): Fixed a bug in the optimizer's
histogram logic that caused an over-estimate of the cardinality of
constrained index scans in some cases when multiple columns of the
index were constrained. This problem was introduced early in
development for the 20.2 release, so it should never have appeared in
a released version. The fix improves the optimizer's cardinality
estimates and may result in better query plan selection.

**opt: fix distinct count estimates for index constraint columns**

Prior to this commit, the statistics_builder was incorrectly estimating
the distinct count of columns that were only slightly constrained as part
of an index constraint. For example, it was estimating based on
constraints such as `/a/b: [/1 - /5/6]` or `/a/b: [ - /5/6]` that the distinct
count of column b should be reduced by 2/3. However, in reality, we cannot
assume anything about the distinct count of column b based on those two
constraints.

This commit fixes the estimate by only reducing the distinct count for
columns that are part of the prefix of the constraint (columns for which
all the spans have the same start and end values) or the first column after.
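The rule above can be sketched as follows, using a hypothetical representation in which each constraint column is marked exact when all spans have the same start and end value for it:

```python
def reducible_columns(exact_flags):
    """Return indices of columns whose distinct count may be reduced:
    the exact-prefix columns plus the first column after the prefix."""
    prefix_len = 0
    for is_exact in exact_flags:
        if not is_exact:
            break
        prefix_len += 1
    return list(range(min(prefix_len + 1, len(exact_flags))))

# /a/b: [/1 - /5/6] -> neither column is exact, so only column a (index 0)
# may have its distinct count reduced; b keeps its original estimate.
print(reducible_columns([False, False]))  # [0]
# /a/b: [/3/1 - /3/9] -> a is exact, so both a and b may be reduced.
print(reducible_columns([True, False]))   # [0, 1]
```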

Release note (performance improvement): Fixed the optimizer's distinct count
estimate for columns constrained by an index constraint, which was too low
in some cases. The fix improves the optimizer's cardinality estimates, which
can lead to better query plan selection.

**opt: ensure multi-col stats are consistent with single-col stats**

Prior to this commit, it was possible that the estimated distinct count
of a multi-column statistic was larger than the product of the estimated
distinct counts of all of its individual columns. This could happen because
we reduce the distinct count of columns that are constrained by a predicate,
but we don't update the distinct counts of any multi-column stats that
contain that column. This is internally inconsistent and could lead to bad
cardinality estimates.

This commit fixes the problem by explicitly checking for such
inconsistencies and fixing them after the stats for each operator are
built or a column stat is calculated. If it is found that the estimated
distinct count is too high, it is reset to the product of the distinct
counts of its constituent columns.

This commit also adds a check that the distinct count of the multi-column
statistic is no smaller than the largest distinct count of its constituent
columns (although I haven't yet found a test case where this was a problem).
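The consistency check amounts to a clamp, sketched here as a simplified standalone version of the described fix:

```python
def clamp_multicol_distinct(multi_estimate, single_estimates):
    """Keep a multi-column distinct-count estimate consistent with its
    constituent single-column estimates: no larger than their product,
    and no smaller than the largest of them."""
    upper = 1
    for d in single_estimates:
        upper *= d
    lower = max(single_estimates)
    return min(max(multi_estimate, lower), upper)

# Estimate 50 exceeds the product 10 * 4 = 40, so it is reset to 40.
print(clamp_multicol_distinct(50, [10, 4]))  # 40
# Estimate 3 is below the largest single-column count, so it is raised to 10.
print(clamp_multicol_distinct(3, [10, 4]))   # 10
```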

Release note (performance improvement): Fixed the optimizer's estimated
distinct count for a multi-column statistic when all of the columns in the
statistic are constrained by a filter predicate. The fix can lead to
improved cardinality estimates, leading to better query plan selection
in some cases.

**opt: add support for calculating selectivity from multi-column stats**

Fixes #34422


Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@craig craig bot closed this as completed in df34c56 May 29, 2020
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020