opt: add support for calculating selectivity from multi-column stats
This commit adds support for calculating selectivity from multi-column statistics. It changes selectivityFromDistinctCounts to have the following semantics:

selectivityFromDistinctCounts calculates the selectivity of a filter by using estimated distinct counts of each constrained column before and after the filter was applied. We can perform this calculation in two different ways: (1) by treating the columns as completely independent, or (2) by assuming they are correlated.

(1) Assuming independence between columns, we can calculate the selectivity by taking the product of selectivities of each constrained column. In the general case, this can be represented by the formula:

```
                  ┬-┬ ⎛ new_distinct(i) ⎞
    selectivity = │ │ ⎜ --------------- ⎟
                  ┴ ┴ ⎝ old_distinct(i) ⎠
           i in {constrained columns}
```

(2) If useMultiCol is true, we assume there is some correlation between columns. In this case, we calculate the selectivity using multi-column statistics.

```
                  ⎛ new_distinct({constrained columns}) ⎞
    selectivity = ⎜ ----------------------------------- ⎟
                  ⎝ old_distinct({constrained columns}) ⎠
```

This formula looks simple, but the challenge is that it is difficult to determine the correct value for new_distinct({constrained columns}) if each column is not constrained to a single value. For example, if new_distinct(x)=2 and new_distinct(y)=2, new_distinct({x,y}) could be 2, 3, or 4.
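As an illustration only (this is a simplified sketch, not the actual Go implementation in this commit; the function names and dict-based inputs are hypothetical), the two calculations above can be written as:

```python
# Sketch of equations (1) and (2) above. Distinct counts are passed as
# plain dicts keyed by column name; these helpers are illustrative only.

def selectivity_independent(new_distinct, old_distinct):
    """Equation (1): product of per-column selectivities, assuming the
    constrained columns are completely independent."""
    s = 1.0
    for col in new_distinct:
        s *= new_distinct[col] / old_distinct[col]
    return s

def selectivity_multi_col(new_distinct_multi, old_distinct_multi):
    """Equation (2): ratio of the multi-column distinct count before and
    after the filter, capturing correlation between columns."""
    return new_distinct_multi / old_distinct_multi

# Example: x is constrained from 100 distinct values down to 2, and y from
# 10 distinct values down to 2. Under independence the selectivity is
# (2/100) * (2/10) = 0.004.
s1 = selectivity_independent({"x": 2, "y": 2}, {"x": 100, "y": 10})
```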
We estimate the new distinct count as follows, using the concept of "soft functional dependency (FD) strength":

```
new_distinct({x,y}) = min_value + range * (1 - FD_strength_scaled)

where

min_value = max(new_distinct(x), new_distinct(y))
max_value = new_distinct(x) * new_distinct(y)
range     = max_value - min_value

                  ⎛ max(old_distinct(x), old_distinct(y)) ⎞
FD_strength     = ⎜ ------------------------------------- ⎟
                  ⎝         old_distinct({x,y})           ⎠

                  ⎛ max(old_distinct(x), old_distinct(y)) ⎞
min_FD_strength = ⎜ ------------------------------------- ⎟
                  ⎝   old_distinct(x) * old_distinct(y)   ⎠

                     ⎛ FD_strength - min_FD_strength ⎞
FD_strength_scaled = ⎜ ----------------------------- ⎟
                     ⎝      1 - min_FD_strength      ⎠
```

Suppose that old_distinct(x)=100 and old_distinct(y)=10. If x and y are perfectly correlated, old_distinct({x,y})=100. Using the example from above, new_distinct(x)=2 and new_distinct(y)=2. Plugging the values into the equations, we get:

```
FD_strength_scaled  = 1
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 1) = 2
```

If x and y are completely independent, however, old_distinct({x,y})=1000. In this case, we get:

```
FD_strength_scaled  = 0
new_distinct({x,y}) = 2 + (4 - 2) * (1 - 0) = 4
```

Note that even if useMultiCol is true and we calculate the selectivity based on equation (2) above, we still want to take equation (1) into account. This is because it is possible that there are two predicates that each have selectivity s, but the multi-column selectivity is also s. In order to ensure that the cost model considers the two predicates combined to be more selective than either one individually, we must give some weight to equation (1). Therefore, instead of equation (2) we actually return the following selectivity:

```
selectivity = (1 - w) * (eq. 1) + w * (eq. 2)
```

where w is currently set to 0.9.

This selectivity will be used later to update the row count and the distinct count for the unconstrained columns.
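The FD-strength estimate above can be checked against the two worked examples (perfectly correlated and completely independent x and y). This is an illustrative sketch only; the function name and scalar-argument signature are hypothetical and do not match the Go code in this commit:

```python
# Sketch of the soft-FD-strength estimate for new_distinct({x,y}),
# following the formulas in the commit message. Illustrative only.

def estimate_new_multi_distinct(new_x, new_y, old_x, old_y, old_xy):
    min_value = max(new_x, new_y)
    max_value = new_x * new_y
    rng = max_value - min_value
    # FD_strength is 1 when the columns are perfectly correlated
    # (old_distinct({x,y}) == max(old_distinct(x), old_distinct(y))).
    fd_strength = max(old_x, old_y) / old_xy
    # min_FD_strength corresponds to complete independence
    # (old_distinct({x,y}) == old_distinct(x) * old_distinct(y)).
    min_fd_strength = max(old_x, old_y) / (old_x * old_y)
    fd_scaled = (fd_strength - min_fd_strength) / (1 - min_fd_strength)
    return min_value + rng * (1 - fd_scaled)

# Perfectly correlated: old_distinct({x,y}) = 100 -> estimate is 2.
correlated = estimate_new_multi_distinct(2, 2, 100, 10, 100)
# Completely independent: old_distinct({x,y}) = 1000 -> estimate is 4.
independent = estimate_new_multi_distinct(2, 2, 100, 10, 1000)
```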
Fixes #34422

Release note (performance improvement): Added support for calculating the selectivity of filter predicates in the optimizer using multi-column statistics. This improves the cardinality estimates of the optimizer when a query has filter predicates constraining multiple columns. As a result, the optimizer may choose a better query plan in some cases.