opt: add support for multi-column statistics #34422
Labels: A-sql-optimizer (SQL logical planning and optimizations), C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

awoods187 added the C-enhancement and A-sql-optimizer labels on Mar 6, 2019
craig bot pushed a commit that referenced this issue on Mar 6, 2019
34428: sql: Update error message for multi-column stats r=rmloveland a=rmloveland

Updated the error message as follows:
- Changed from referencing "multi-column sketches" (which are not mentioned in our docs anywhere) to "multi-column statistics"
- Added a link to the CockroachDB issue discussing multi-column stats for the user's reference.

See also: #34422.

Release note: None

35410: sqlsmith: major refactor r=mjibson a=mjibson

Refactor sqlsmith to use internal cockroach types for everything. Change reference handling to be more explicit and correct. These should enable us to more easily and correctly add

35427: roachtest: Fix blacklist lookups to handle change from v2.2 to v19.1 r=BramGruneir a=BramGruneir

I've been working on this fix as part of #35423 but that's been taking too long and I don't want the tests to keep failing. So here is a direct fix that also updates the blacklist for psycopg as a new test is now passing.

Fixes #35271. Fixes #35285.

Release note: None

Co-authored-by: Rich Loveland <rich@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Bram Gruneir <bram@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue on Apr 20, 2020
Prior to this commit, only single-column statistics were supported and collected. This commit adds support for collecting multi-column statistics, and by default collects stats on all prefixes of each index (in addition to all the single-column statistics that were already collected). For example, if there is an index on (a, b, c), we now collect stats on (a, b) and (a, b, c) in addition to stats on each individual column.

Note that we do not yet make full use of these statistics in the optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column statistics. By default, statistics are now collected on all prefixes of each index, in addition to all other individual columns.
rytaft added a commit to rytaft/cockroach that referenced this issue on Apr 21, 2020
Prior to this commit, only single-column statistics were supported and collected. This commit adds support for collecting multi-column statistics, and by default collects stats on all prefixes of each index (in addition to all the single-column statistics that were already collected). For example, if there is an index on (a, b, c), we now collect stats on (a, b) and (a, b, c) in addition to stats on each individual column.

A new cluster setting sql.stats.multi_column_collection.enabled controls this feature. It is currently enabled by default, but automatic collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the optimizer. That change will come in a future PR.

Informs cockroachdb#34422

Release note (sql change): Added support for collection of multi-column statistics. By default, statistics are now collected on all prefixes of each index, in addition to all other individual columns. This feature can be disabled by setting the cluster setting sql.stats.multi_column_collection.enabled to false.
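As an illustration of the prefix enumeration described above, here is a minimal Python sketch (a hypothetical helper, not CockroachDB's actual implementation) of which multi-column sets receive stats for a given index:

```python
def multi_column_stat_sets(index_columns):
    """Column sets that get multi-column stats for one index.

    For an index on (a, b, c), single-column stats already cover a, b,
    and c individually, so the new multi-column sets are the prefixes
    of length >= 2: (a, b) and (a, b, c).
    """
    return [tuple(index_columns[:n]) for n in range(2, len(index_columns) + 1)]

print(multi_column_stat_sets(["a", "b", "c"]))
# [('a', 'b'), ('a', 'b', 'c')]
```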
craig bot pushed a commit that referenced this issue on Apr 22, 2020
47729: sql: collect multi-column stats on index prefixes r=rytaft a=rytaft

Prior to this commit, only single-column statistics were supported and collected. This commit adds support for collecting multi-column statistics, and by default collects stats on all prefixes of each index (in addition to all the single-column statistics that were already collected). For example, if there is an index on (a, b, c), we now collect stats on (a, b) and (a, b, c) in addition to stats on each individual column.

A new cluster setting `sql.stats.multi_column_collection.enabled` controls this feature. It is currently enabled by default, but automatic collection of multi-column stats can be disabled by setting it to false.

Note that we do not yet make full use of these statistics in the optimizer. That change will come in a future PR.

Informs #34422

Release note (sql change): Added support for collection of multi-column statistics. By default, statistics are now collected on all prefixes of each index, in addition to all other individual columns. This feature can be disabled by setting the cluster setting sql.stats.multi_column_collection.enabled to false.

47788: vendor: bump pebble to b976c257531658d4b96577393866c42d7b4ca801 r=petermattis a=petermattis

* internal/manifest: Fix merge skew with Formatter -> FormatKey change
* internal/manifest: Introduce L0SubLevels data structure
* *: introduce Comparer.FormatValue
* *: rename base.Formatter to base.FormatKey
* tool: enhance `sstable check` to catch bloom filter problems
* tool: add db properties subcommand
* tool: fix sstable panic

Enabled pretty-printing of values in `debug pebble` commands by hooking up `MVCCComparer.FormatValue`.

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
rytaft added a commit to rytaft/cockroach that referenced this issue on Apr 28, 2020
Prior to this commit, histograms and multi-column stats were always used by the optimizer for cardinality estimation if they were available. This commit adds two new session settings along with two corresponding cluster settings to control whether or not the optimizer should use histograms or multi-column stats.

Note that even if the settings are enabled, the optimizer cannot use these statistics if they haven't been collected. Collection of histograms and multi-column stats is still controlled by the existing cluster settings `sql.stats.histogram_collection.enabled` and `sql.stats.multi_column_collection.enabled`.

The new settings to control usage by the optimizer are:

Cluster settings:
```
sql.defaults.optimizer_use_histograms.enabled
sql.defaults.optimizer_use_multicol_stats.enabled
```

Session settings:
```
optimizer_use_histograms
optimizer_use_multicol_stats
```

Both settings are enabled by default.

Fixes cockroachdb#43308
Informs cockroachdb#38082
Informs cockroachdb#34422

Release note (sql change): Added two new session settings and corresponding cluster settings to control whether the optimizer uses histograms and multi-column statistics for cardinality estimation. The session settings are optimizer_use_histograms and optimizer_use_multicol_stats, with corresponding cluster settings sql.defaults.optimizer_use_histograms.enabled and sql.defaults.optimizer_use_multicol_stats.enabled. Both settings are enabled by default.
craig bot pushed a commit that referenced this issue on Apr 28, 2020
48105: opt: add settings to enable usage of histograms and multi-column stats r=rytaft a=rytaft

Prior to this commit, histograms and multi-column stats were always used by the optimizer for cardinality estimation if they were available. This commit adds two new session settings along with two corresponding cluster settings to control whether or not the optimizer should use histograms or multi-column stats.

Note that even if the settings are enabled, the optimizer cannot use these statistics if they haven't been collected. Collection of histograms and multi-column stats is still controlled by the existing cluster settings `sql.stats.histogram_collection.enabled` and `sql.stats.multi_column_collection.enabled`.

The new settings to control usage by the optimizer are:

Cluster settings:
```
sql.defaults.optimizer_use_histograms.enabled
sql.defaults.optimizer_use_multicol_stats.enabled
```

Session settings:
```
optimizer_use_histograms
optimizer_use_multicol_stats
```

Both settings are enabled by default.

Fixes #43308
Informs #38082
Informs #34422

Release note (sql change): Added two new session settings and corresponding cluster settings to control whether the optimizer uses histograms and multi-column statistics for cardinality estimation. The session settings are optimizer_use_histograms and optimizer_use_multicol_stats, with corresponding cluster settings sql.defaults.optimizer_use_histograms.enabled and sql.defaults.optimizer_use_multicol_stats.enabled. Both settings are enabled by default.

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue on May 4, 2020
This commit updates the injected stats for the TPC-C and TPC-H stats quality tests to include multi-column stats. Since multi-column stats are currently only used by the optimizer in a few limited cases, none of the plans changed. This commit is in preparation for work that will be done in the optimizer to use multi-column stats for selectivity estimation.

Informs cockroachdb#34422

Release note: None
craig bot pushed a commit that referenced this issue on May 5, 2020
48374: opt: update stats quality tests with multi-column stats r=rytaft a=rytaft

This commit updates the injected stats for the TPC-C and TPC-H stats quality tests to include multi-column stats. Since multi-column stats are currently only used by the optimizer in a few limited cases, none of the plans changed. This commit is in preparation for work that will be done in the optimizer to use multi-column stats for selectivity estimation.

Informs #34422

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
rytaft added a commit to rytaft/cockroach that referenced this issue on May 15, 2020
This commit adds support for calculating selectivity from multi-column statistics. It changes selectivityFromDistinctCounts to have the following semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter by using estimated distinct counts of each constrained column before and after the filter was applied. We can perform this calculation in two different ways: (1) by treating the columns as completely independent, or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity by taking the product of selectivities of each constrained column. In the general case, this can be represented by the formula:

```
              ┬-┬ ⎛ new_distinct(i) ⎞
selectivity = │ │ ⎜ --------------- ⎟
              ┴ ┴ ⎝ old_distinct(i) ⎠
             i in {constrained columns}
```

(2) If useMultiCol is true, we assume there is some correlation between columns. In this case, we calculate the selectivity using multi-column statistics:

```
              ⎛ new_distinct({constrained columns}) ⎞
selectivity = ⎜ ----------------------------------- ⎟
              ⎝ old_distinct({constrained columns}) ⎠
```

This formula looks simple, but the challenge is that it is difficult to determine the correct value for new_distinct({constrained columns}) if each column is not constrained to a single value. For example, if new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2, 3 or 4.

We estimate the new distinct count as follows, using the concept of "soft functional dependency (FD) strength":

```
new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

where

min_value = max(new_distinct(x), new_distinct(y))
max_value = new_distinct(x) * new_distinct(y)
range     = max_value - min_value

                     ⎛ max(old_distinct(x), old_distinct(y)) ⎞
FD_strength        = ⎜ ------------------------------------- ⎟
                     ⎝         old_distinct({x,y})           ⎠

                     ⎛ max(old_distinct(x), old_distinct(y)) ⎞
min_FD_strength    = ⎜ ------------------------------------- ⎟
                     ⎝   old_distinct(x) * old_distinct(y)   ⎠

                     ⎛ FD_strength - min_FD_strength ⎞
FD_strength_scaled = ⎜ ----------------------------- ⎟
                     ⎝      1 - min_FD_strength      ⎠
```

Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are perfectly correlated, old_distinct({x,y})=100. Using the example from above, new_distinct(x)=2 and new_distinct(y)=2. Plugging the values into the equations, we get:

```
FD_strength_scaled  = 1
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```

If x and y are completely independent, however, old_distinct({x,y})=1000. In this case, we get:

```
FD_strength_scaled  = 0
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```

Note that even if useMultiCol is true and we calculate the selectivity based on equation (2) above, we still want to take equation (1) into account. This is because it is possible that there are two predicates that each have selectivity s, but the multi-column selectivity is also s. In order to ensure that the cost model considers the two predicates combined to be more selective than either one individually, we must give some weight to equation (1). Therefore, instead of equation (2) we actually return the following selectivity:

```
selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```

where w is currently set to 0.9. This selectivity will be used later to update the row count and the distinct count for the unconstrained columns.

Fixes cockroachdb#34422

Release note (performance improvement): Added support for calculating the selectivity of filter predicates in the optimizer using multi-column statistics. This improves the cardinality estimates of the optimizer when a query has filter predicates constraining multiple columns. As a result, the optimizer may choose a better query plan in some cases.
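The distinct-count estimate above can be checked numerically. Below is a minimal Python sketch of the soft-FD-strength formula (hypothetical helper name, not the actual CockroachDB code), reproducing the two worked examples:

```python
def estimate_multi_col_distinct(new_x, new_y, old_x, old_y, old_xy):
    """Estimate new_distinct({x, y}) using the soft FD-strength formula above."""
    min_value = max(new_x, new_y)
    max_value = new_x * new_y
    value_range = max_value - min_value
    fd_strength = max(old_x, old_y) / old_xy
    min_fd_strength = max(old_x, old_y) / (old_x * old_y)
    # Scale FD_strength into [0, 1]: 1 = perfectly correlated, 0 = independent.
    fd_strength_scaled = (fd_strength - min_fd_strength) / (1 - min_fd_strength)
    return min_value + value_range * (1 - fd_strength_scaled)

# Perfectly correlated columns: old_distinct({x,y}) = 100.
print(estimate_multi_col_distinct(2, 2, 100, 10, 100))   # 2.0
# Completely independent columns: old_distinct({x,y}) = 1000.
print(estimate_multi_col_distinct(2, 2, 100, 10, 1000))  # 4.0
```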
craig bot pushed a commit that referenced this issue on May 29, 2020
49134: opt: add support for calculating selectivity from multi-column stats r=rytaft a=rytaft

**opt: fix bug with histograms and multi-span index constraints**

Prior to this commit, filtering a histogram with a multi-span index constraint could lead to incorrect results if the histogram column was part of the exact prefix of the constraint. This was due to the fact that the same value appeared in multiple spans, breaking the assumption of the histogram code that values in spans were always ordered and non-overlapping. For example, filtering a histogram on column b with the constraint `/b/c/: [/1/2 - /1/2] [/1/4 - /1/4]` would return an incorrect result, because the values in the matching histogram bucket would be counted twice. This commit fixes the problem by only considering a single span if the column is part of the exact prefix.

Release note (performance improvement): Fixed a bug in the histogram logic in the optimizer which was causing an over-estimate for the cardinality of constrained index scans in some cases when multiple columns of the index were constrained. This problem was introduced early in development for the 20.2 release, so it should not have been part of any release. The fix improves the optimizer's cardinality estimates and may result in better query plan selection.

**opt: fix distinct count estimates for index constraint columns**

Prior to this commit, the statistics_builder was incorrectly estimating the distinct count of columns that were only slightly constrained as part of an index constraint. For example, it was estimating based on constraints such as `/a/b: [/1 - /5/6]` or `/a/b: [ - /5/6]` that the distinct count of column b should be reduced by 2/3. However, in reality, we cannot assume anything about the distinct count of column b based on those two constraints.

This commit fixes the estimate by only reducing the distinct count for columns that are part of the prefix of the constraint (columns for which all the spans have the same start and end values) or the first column after.

Release note (performance improvement): Fixed the optimizer's distinct count estimate for columns constrained by an index constraint, which was too low in some cases. The fix improves the optimizer's cardinality estimates, which can lead to better query plan selection.

**opt: ensure multi-col stats are consistent with single-col stats**

Prior to this commit, it was possible that the estimated distinct count of a multi-column statistic was larger than the product of the estimated distinct counts of all of its individual columns. This could happen because we reduce the distinct count of columns that are constrained by a predicate, but we don't update the distinct counts of any multi-column stats that contain that column. This is internally inconsistent and could lead to bad cardinality estimates.

This commit fixes the problem by explicitly checking for such inconsistencies and fixing them after the stats for each operator are built or a column stat is calculated. If the estimated distinct count is found to be too high, it is reset to the product of the distinct counts of its constituent columns. This commit also adds a check that the distinct count of the multi-column statistic is no smaller than the largest distinct count of its constituent columns (although I haven't yet found a test case where this was a problem).

Release note (performance improvement): Fixed the optimizer's estimated distinct count for a multi-column statistic when all of the columns in the statistic are constrained by a filter predicate. The fix can lead to improved cardinality estimates, leading to better query plan selection in some cases.

**opt: add support for calculating selectivity from multi-column stats**

This commit adds support for calculating selectivity from multi-column statistics. It changes `selectivityFromDistinctCounts` to have the following semantics:

`selectivityFromDistinctCounts` calculates the selectivity of a filter by using estimated distinct counts of each constrained column before and after the filter was applied. We can perform this calculation in two different ways: (1) by treating the columns as completely independent, or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity by taking the product of selectivities of each constrained column. In the general case, this can be represented by the formula:

```
              ┬-┬ ⎛ new_distinct(i) ⎞
selectivity = │ │ ⎜ --------------- ⎟
              ┴ ┴ ⎝ old_distinct(i) ⎠
             i in {constrained columns}
```

(2) If `useMultiCol` is true, we assume there is some correlation between columns. In this case, we calculate the selectivity using multi-column statistics:

```
              ⎛ new_distinct({constrained columns}) ⎞
selectivity = ⎜ ----------------------------------- ⎟
              ⎝ old_distinct({constrained columns}) ⎠
```

This formula looks simple, but the challenge is that it is difficult to determine the correct value for new_distinct({constrained columns}) if each column is not constrained to a single value. For example, if new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2, 3 or 4.

We estimate the new distinct count as follows, using the concept of "soft functional dependency (FD) strength":

```
new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

where

min_value = max(new_distinct(x), new_distinct(y))
max_value = new_distinct(x) * new_distinct(y)
range     = max_value - min_value

                     ⎛ max(old_distinct(x), old_distinct(y)) ⎞
FD_strength        = ⎜ ------------------------------------- ⎟
                     ⎝         old_distinct({x,y})           ⎠

                     ⎛ max(old_distinct(x), old_distinct(y)) ⎞
min_FD_strength    = ⎜ ------------------------------------- ⎟
                     ⎝   old_distinct(x) * old_distinct(y)   ⎠

                     ⎛ FD_strength - min_FD_strength ⎞
FD_strength_scaled = ⎜ ----------------------------- ⎟
                     ⎝      1 - min_FD_strength      ⎠
```

Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are perfectly correlated, old_distinct({x,y})=100. Using the example from above, new_distinct(x)=2 and new_distinct(y)=2. Plugging the values into the equations, we get:

```
FD_strength_scaled  = 1
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```

If x and y are completely independent, however, old_distinct({x,y})=1000. In this case, we get:

```
FD_strength_scaled  = 0
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```

Note that even if `useMultiCol` is true and we calculate the selectivity based on equation (2) above, we still want to take equation (1) into account. This is because it is possible that there are two predicates that each have selectivity s, but the multi-column selectivity is also s. In order to ensure that the cost model considers the two predicates combined to be more selective than either one individually, we must give some weight to equation (1). Therefore, instead of equation (2) we actually return the following selectivity:

```
selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```

where w is currently set to 0.9. This selectivity is used to update the row count and the distinct count for the unconstrained columns.

Fixes #34422

Release note (performance improvement): Added support for calculating the selectivity of filter predicates in the optimizer using multi-column statistics. This improves the cardinality estimates of the optimizer when a query has filter predicates constraining multiple columns. As a result, the optimizer may choose a better query plan in some cases.

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
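The weighted blend of the two selectivity estimates described above can be sketched in Python (a hypothetical illustration, not the actual implementation):

```python
def combined_selectivity(per_column_selectivities, multi_col_selectivity, w=0.9):
    """Blend the independence-based estimate (eq. 1) with the
    multi-column estimate (eq. 2), weighted by w as described above."""
    # eq. 1: independence assumption -- product of per-column selectivities.
    eq1 = 1.0
    for s in per_column_selectivities:
        eq1 *= s
    return (1 - w) * eq1 + w * multi_col_selectivity

# Two predicates, each with selectivity 0.5, whose multi-column selectivity
# is also 0.5: the blend (~0.475) is more selective than either alone (0.5).
print(combined_selectivity([0.5, 0.5], 0.5))
```

Because eq. 1 still gets weight 1 - w = 0.1, adding a second predicate strictly decreases the combined selectivity even when the multi-column estimate alone would not, which is the property the cost model relies on.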
Currently we only support creation of table statistics on single columns. Although the syntax for creation of statistics with multiple columns is supported, execution of such a request causes an error. For example:
This issue will track support for this feature.