- Author(s): xuyifangreeneyes
- Discussion PR: pingcap#27223
- Tracking Issue: pingcap#26020
This proposal aims to improve rule based index selection including heuristics and skyline pruning, which helps avoid choosing the wrong index due to outdated statistics.
Currently, the choice of access path mainly depends on statistics. The wrong index may be chosen due to outdated or inaccurate statistics. In order to reduce the possibility of the situation, the optimizer introduces some rule based index selection mechanisms, including heuristics and skyline pruning. We try to improve rule based index selection to enhance robustness of index selection.
The improvement of rule based index selection consists of always-good heuristics, skyline pruning and maybe-good heuristics.
The rules of always-good heuristics are listed as following.
- If a unique index is covered by conditions and single scan is enough, we directly choose it.
- If a unique index is covered by conditions but it needs double scan(index scan + table scan), we push it into an array
uniqueIdxsWithDoubleScan
. After collecting all the unique indices satisfying the condition intouniqueIdxsWithDoubleScan
, we select the one that has the smallest number of rows to read asuniqueBest
. - If an index
idx1
only needs single scan, and there is a unique indexidx2
belonging touniqueIdxsWithDoubleScan
whose access condition is subset ofidx1
's, we can infer thatidx1
has limited rows to read. In this way, we refineidx2
to obtainidx1
so we push it into an arraysingleScanIdxs
. After collecting all the indices satisfying the condition intosingleScanIdxs
, we select the one that has the smallest number of rows to read asrefinedBest
. - If
uniqueBest
exists whilerefinedBest
doesn't, we chooseuniqueBest
. If bothuniqueBest
andrefinedBest
exist, we choose the one with the smaller number of rows to read.
In the above rules, an index being covered by conditions means for each index column there is an equal condition.
The following examples show how those rules work. The primary key a
in the first SELECT
statement matches rule 1, so it is chosen directly. For the second SELECT
statement, uniqueBest
is idx_b
and we refine idx_b
to obtain idx_b_c
which is refinedBest
. We choose idx_b_c
finally since it has the smaller number of rows to read.
CREATE TABLE t(a INT PRIMARY KEY, b INT, c INT, UNIQUE INDEX idx_b(b), UNIQUE INDEX idx_b_c(b, c));
SELECT * FROM t WHERE a = 2 OR a = 5;
SELECT b, c FROM t WHERE b = 5 AND c > 10;
TiDB has skyline pruning to prune the index which is strictly worse than another index. When comparing two indices, it considers three dimensions: single scan or double scan, the set of columns in the access condition, matching of the physical property. There are two improvements we can make for the first dimension and the third dimension respectively.
- If both indices need double scan, the current implementation regards the two indices equal on the first dimension. However, we can check the set of columns in
AccessPath.IndexFilters
to compare the number of table-read rows. For example, thoughidx_b
andidx_b_c
have the same access conditionb > 5
and both need double scan,idx_b_c
has the filter conditionc > 5
butidx_b
doesn't. Henceidx_b_c
is better thanidx_b
.
CREATE TABLE t(a INT, b INT, c INT, INDEX idx_b(b), INDEX idx_b_c(b, c));
SELECT * FROM t WHERE b > 5 and c > 5;
- The current implementation doesn't correctly identify whether the index matches the physical property in some cases and we can fix them. For example,
idx_a_b_c
matches the required orderORDER BY a, c
sinceb = 4
is constant.
CREATE TABLE t(a INT, b INT, c INT, d INT, INDEX idx_a_b_c(a, b, c))
SELECT * FROM t WHERE b = 4 ORDER BY a, c;
CREATE TABLE t(a INT PRIMARY KEY, b INT, c INT, INDEX idx_b(b));
SELECT * FROM t WHERE b = 2 ORDER BY a LIMIT 10;
In the above example, the optimizer may choose TableFullScan
which takes much more time than IndexRangeScan
. The cause is that it is hard to tell which is better between TableFullScan
matching the required order and fetching the limited number of rows and IndexRangeScan
. For the situation, TiDB has a switch tidb_opt_prefer_range_scan
. When the switch turns on, the optimizer always prefers range scan over full scan. There are three improvements we can do on the switch.
- The switch is session-level, which is not enough for use in real scenes. Hence we make the switch both session-level and global-level.
- For the unsigned handle of the table, the full range is
[0, +inf]
but the current implementation doesn't regard it as full range, which needs fixing.
Rules such as heuristics and skyline pruning can prevent the optimizer from choosing some obviously wrong index, which are implemented in many other databases.
It does not affect the compatibility.
Always-good heuristics have nothing to do with the physical property, so it will be implemented in (*DataSource).DeriveStats
after filling content for each (*DataSource).possibleAccessPaths
. Skyline pruning and prefer-range-scan are implemented in (*DataSource).findBestTask
and we just need to make the improvements in place.