Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

large overestimation when where conditions contain OR and matches several indexes with different selectivity #54323

Closed
time-and-fate opened this issue Jun 28, 2024 · 4 comments · Fixed by #56001
Assignees
Labels
affects-5.4 This bug affects 5.4.x versions. affects-6.1 affects-6.5 affects-7.1 affects-7.5 affects-8.1 sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.

Comments

@time-and-fate
Copy link
Member

time-and-fate commented Jun 28, 2024

Reproduce

create table t(a int, b int, c int, d int, index iabc(a,b,c), index ib(b));

-- run this many times
insert into t value (rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000),(rand()*100000, rand()*10, rand()*1000, rand()*1000);

-- run this many times
insert into t select * from t;

analyze table t;
explain analyze select * from t where (a = 1 and b = 1 and c = 1) or (a = 2 and b = 2 and c = 2) or (a = 3 and b = 3 and c = 3);
explain analyze select * from t use index (iabc) where (a = 1 and b = 1 and c = 1) or (a = 2 and b = 2 and c = 2) or (a = 3 and b = 3 and c = 3);

There is a big overestimation

> explain analyze select * from t where (a = 1 and b = 1 and c = 1) or (a = 2 and b = 2 and c = 2) or (a = 3 and b = 3 and c = 3);
+-------------------------+----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+------+
| id                      | estRows  | actRows | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                                                                  | operator info                                                                                                                                                                                 | memory    | disk |
+-------------------------+----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+------+
| TableReader_7           | 7014.40  | 0       | root      |               | time:36.3ms, loops:1, RU:36.071812, cop_task: {num: 1, max: 36.2ms, proc_keys: 29440, tot_proc: 35.2ms, tot_wait: 70.6µs, copr_cache_hit_ratio: 0.00, build_task_duration: 10.9µs, max_distsql_concurrency: 1}, rpc_info:{Cop:{num_rpc:1, total_time:36.1ms}}                                                                                                                                   | data:Selection_6                                                                                                                                                                              | 289 Bytes | N/A  |
| └─Selection_6           | 7014.40  | 0       | cop[tikv] |               | tikv_task:{time:37ms, loops:33}, scan_detail: {total_process_keys: 29440, total_process_keys_size: 1564192, total_keys: 29441, get_snapshot_time: 27.9µs, rocksdb: {delete_skipped_count: 28520, key_skipped_count: 57960, block: {}}}, time_detail: {total_process_time: 35.2ms, total_suspend_time: 202µs, total_wait_time: 70.6µs, total_kv_read_wall_time: 27ms, tikv_wall_time: 35.7ms}    | or(and(eq(test.t.a, 1), and(eq(test.t.b, 1), eq(test.t.c, 1))), or(and(eq(test.t.a, 2), and(eq(test.t.b, 2), eq(test.t.c, 2))), and(eq(test.t.a, 3), and(eq(test.t.b, 3), eq(test.t.c, 3))))) | N/A       | N/A  |
|   └─TableFullScan_5     | 29440.00 | 29440   | cop[tikv] | table:t       | tikv_task:{time:27ms, loops:33}                                                                                                                                                                                                                                                                                                                                                                 | keep order:false                                                                                                                                                                              | N/A       | N/A  |
+-------------------------+----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+------+
3 rows in set (0.04 sec)

> explain analyze select * from t use index (iabc) where (a = 1 and b = 1 and c = 1) or (a = 2 and b = 2 and c = 2) or (a = 3 and b = 3 and c = 3);
+-------------------------------+---------+---------+-----------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+------+
| id                            | estRows | actRows | task      | access object                | execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                 | operator info                                                       | memory    | disk |
+-------------------------------+---------+---------+-----------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+------+
| IndexLookUp_7                 | 8768.00 | 0       | root      |                              | time:762.9µs, loops:1, RU:0.496309                                                                                                                                                                                                                                                                                                                                                                                                                             |                                                                     | 257 Bytes | N/A  |
| ├─IndexRangeScan_5(Build)     | 8768.00 | 0       | cop[tikv] | table:t, index:iabc(a, b, c) | time:607.9µs, loops:1, cop_task: {num: 1, max: 532.8µs, proc_keys: 0, tot_proc: 63.9µs, tot_wait: 47.2µs, copr_cache_hit_ratio: 0.00, build_task_duration: 28.2µs, max_distsql_concurrency: 1}, rpc_info:{Cop:{num_rpc:1, total_time:509.9µs}}, tikv_task:{time:0s, loops:1}, scan_detail: {total_keys: 3, get_snapshot_time: 24µs, rocksdb: {block: {}}}, time_detail: {total_process_time: 63.9µs, total_wait_time: 47.2µs, tikv_wall_time: 206µs}           | range:[1 1 1,1 1 1], [2 2 2,2 2 2], [3 3 3,3 3 3], keep order:false | N/A       | N/A  |
| └─TableRowIDScan_6(Probe)     | 8768.00 | 0       | cop[tikv] | table:t                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                | keep order:false                                                    | N/A       | N/A  |
+-------------------------------+---------+---------+-----------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+------+
@time-and-fate time-and-fate added the type/enhancement The issue or PR belongs to an enhancement. label Jun 28, 2024
@time-and-fate
Copy link
Member Author

This corresponds to two places that need to be improved:

  1. The root cause
    In this case, Selectivity() can't correctly match the stats to the filters. It should prefer stats on (iabc) for estimation, but finally chooses stats on (ib).
  2. The root cause results in the overestimation of DataSource.stats.RowCount, and then the logic here brings the overestimation into the index range scan of index (iabc), then causes choosing the bad plan.
    if path.CountAfterAccess < ds.StatsInfo().RowCount && !isIm {
    path.CountAfterAccess = math.Min(ds.StatsInfo().RowCount/cost.SelectionFactor, float64(ds.StatisticTable.RealtimeCount))
    }

@Rustin170506
Copy link
Member

A simple rust script to reproduce it:

#!/usr/bin/env -S cargo +nightly -Zscript
---cargo
[dependencies]
sqlx = { version = "0.7", features = ["mysql", "runtime-tokio-native-tls"] }
tokio = { version = "1", features = ["full"] }
rand = "0.8"
---
use sqlx::mysql::MySqlPool;
use rand::Rng;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Replace with your MySQL connection string
    let pool = MySqlPool::connect("mysql://root@localhost:4000/test").await?;

    // Create table
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS t(
            a INT,
            b INT,
            c INT,
            d INT,
            INDEX iabc(a,b,c),
            INDEX ib(b)
        )"
    )
    .execute(&pool)
    .await?;

    // Function to generate random data
    fn generate_data() -> (i32, i32, i32, i32) {
        let mut rng = rand::thread_rng();
        (
            rng.gen_range(0..100000),
            rng.gen_range(0..10),
            rng.gen_range(0..1000),
            rng.gen_range(0..1000),
        )
    }

    // Insert initial data
    for _ in 0..200 {
        let data: Vec<_> = (0..20).map(|_| generate_data()).collect();

        for (a, b, c, d) in data {
            sqlx::query("INSERT INTO t (a, b, c, d) VALUES (?, ?, ?, ?)")
                .bind(a)
                .bind(b)
                .bind(c)
                .bind(d)
                .execute(&pool)
                .await?;
        }
    }

    // Double the data multiple times
    for _ in 0..5 {  // Adjust this number based on how much data you want
        sqlx::query("INSERT INTO t SELECT * FROM t")
            .execute(&pool)
            .await?;
    }

    // Analyze table
    sqlx::query("ANALYZE TABLE t")
        .execute(&pool)
        .await?;

    println!("Script completed successfully.");
    Ok(())
}

@Rustin170506
Copy link
Member

  1. In this case, Selectivity() can't correctly match the stats to the filters. It should prefer stats on (iabc) for estimation, but finally chooses stats on (ib).

GetUsableSetsByGreedy

// We greedy select the stats info based on:
// (1): The stats type, always prefer the primary key or index.
// (2): The number of expression that it covers, the more the better.
// (3): The number of columns that it contains, the less the better.
// (4): The selectivity of the covered conditions, the less the better.
//      The rationale behind is that lower selectivity tends to reflect more functional dependencies
//      between columns. It's hard to decide the priority of this rule against rule 2 and 3, in order
//      to avoid massive plan changes between tidb-server versions, I adopt this conservative strategy
//      to impose this rule after rule 2 and 3.
if (bestTp == ColType && set.Tp != ColType) ||
	bestCount < bits ||
	(bestCount == bits && bestNumCols > set.numCols) ||
	(bestCount == bits && bestNumCols == set.numCols && bestSel > set.Selectivity) {
	bestID, bestCount, bestTp, bestNumCols, bestMask, bestSel = i, bits, set.Tp, set.numCols, curMask, set.Selectivity
}

This is because that the column number of the stats on (ib) is less than the stats on (iabc), so it chooses the stats on (ib) first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects 5.4.x versions. affects-6.1 affects-6.5 affects-7.1 affects-7.5 affects-8.1 sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
4 participants