
Analyze is too slow when running DML in a million-table scenario #57631

Closed
lilinghai opened this issue Nov 22, 2024 · 1 comment · Fixed by #57756
Assignees
Labels
affects-8.5 This bug affects the 8.5.x(LTS) versions. severity/major sig/planner SIG: Planner type/bug The issue is confirmed as a bug.

Comments

@lilinghai
Contributor

lilinghai commented Nov 22, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Create about 100k partitioned tables, each with 10k rows:
CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `ec1` varchar(40) DEFAULT NULL,
  `ec2` varchar(40) DEFAULT NULL,
  `ec3` varchar(40) DEFAULT NULL,
  `ec4` varchar(40) DEFAULT NULL,
  `ec5` varchar(40) DEFAULT NULL,
  `ec6` varchar(40) DEFAULT NULL,
  `ec7` varchar(40) DEFAULT NULL,
  `ec8` varchar(40) DEFAULT NULL,
  `ec9` varchar(40) DEFAULT NULL,
  `ec10` varchar(40) DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  KEY `k_2` (`k`),
  KEY `ek1` (`ec1`(30)),
  KEY `ek2` (`ec2`(30)),
  KEY `ek3` (`ec3`(30)),
  KEY `ek4` (`ec4`(30))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=1087002
PARTITION BY RANGE (`id`)
(PARTITION `p1` VALUES LESS THAN (1001),
 PARTITION `p2` VALUES LESS THAN (2001),
 PARTITION `p3` VALUES LESS THAN (3001),
 PARTITION `p4` VALUES LESS THAN (4001),
 PARTITION `p5` VALUES LESS THAN (5001),
 PARTITION `p6` VALUES LESS THAN (6001),
 PARTITION `p7` VALUES LESS THAN (7001),
 PARTITION `p8` VALUES LESS THAN (8001),
 PARTITION `p9` VALUES LESS THAN (9001),
 PARTITION `p10` VALUES LESS THAN (MAXVALUE))
  2. Run DML on some of the tables.
  3. Execute ANALYZE TABLE. Without a DML workload it takes about 3 seconds; with DML running it takes over 20 minutes:
mysql> analyze table sbtest100.sbtest1;
Query OK, 0 rows affected, 10 warnings (21 min 48.62 sec)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

master

Release Version: v8.5.0
Edition: Community
Git Commit Hash: eb871f8
Git Branch: heads/refs/tags/v8.5.0
UTC Build Time: 2024-11-22 03:01:24
GoVersion: go1.23.3
Race Enabled: false
Check Table Before Drop: false
Store: tikv

@lilinghai lilinghai added the type/bug The issue is confirmed as a bug. label Nov 22, 2024
@jebter jebter added the affects-8.5 This bug affects the 8.5.x(LTS) versions. label Nov 22, 2024
@Rustin170506
Member

Rustin170506 commented Nov 26, 2024

During the analysis process, stats are expected to be collected from TiKV and stored in TiDB. Additionally, the stats cache needs to be updated to ensure it reflects the latest stats flushed to TiDB. This guarantees that users can execute SQL queries on the table immediately after the stats collection is completed.

For this issue, the problem originates from the stats cache updating process.

As you can tell from here:

"SELECT version, table_id, modify_count, count, snapshot from mysql.stats_meta where version > %? order by version",

TiDB queries all updated stats metadata from the system table and processes the rows one by one. In this case, with more than 100k tables updated at once, the query becomes a bulk read, and the row-by-row processing makes it even slower. As a result, analyzing a single table takes a very long time: it gets stuck at the final step of updating the stats cache.

To address this problem, we can do two things to reduce the duration.

We can limit the number of tables processed at a time. In this case, throughput isn’t a major concern, and a delay of a few minutes is acceptable. Adding a LIMIT clause to control how many tables are handled per batch can solve the issue without requiring significant changes in version 8.5. Additionally, we could use multiple threads to process these tables concurrently—for instance, leveraging half the number of CPU cores. This would help speed up the process and improve efficiency.

The reason we need to load the latest stats into the cache is to ensure that the most relevant stats are immediately available after the analysis. We don’t want to compromise this guarantee. However, we can improve the performance of this process by updating only the stats for tables that have been modified, rather than reloading stats for all tables. This approach will significantly enhance performance.
For example, if we analyze tables t1 and t2 using ANALYZE TABLE t1, t2, we can call the Update function with t1ID and t2ID. This way, we only update the stats for these two tables, significantly improving the efficiency of the process.
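A minimal sketch of that targeted update, assuming a simplified cache: `statsCache` and its `Update` method here are illustrative stand-ins, not TiDB's real types, and the stats are reduced to a row count per table ID.

```go
package main

import "fmt"

// statsCache is a toy stand-in for the in-memory stats cache,
// mapping tableID to a row count.
type statsCache struct {
	stats map[int64]int64
}

// Update refreshes only the given table IDs when any are supplied;
// with no IDs it falls back to refreshing every table, which is what
// the old code path effectively did.
func (c *statsCache) Update(latest map[int64]int64, tableIDs ...int64) {
	if len(tableIDs) == 0 {
		for id, cnt := range latest {
			c.stats[id] = cnt
		}
		return
	}
	for _, id := range tableIDs {
		if cnt, ok := latest[id]; ok {
			c.stats[id] = cnt
		}
	}
}

func main() {
	cache := &statsCache{stats: map[int64]int64{1: 100, 2: 200, 3: 300}}
	latest := map[int64]int64{1: 150, 2: 250, 3: 999}
	// ANALYZE TABLE t1, t2 -> refresh only t1ID=1 and t2ID=2.
	cache.Update(latest, 1, 2)
	fmt.Println(cache.stats[1], cache.stats[2], cache.stats[3])
	// prints: 150 250 300 (table 3 keeps its old stats)
}
```

Passing the analyzed table IDs explicitly keeps the post-ANALYZE guarantee for those tables while skipping the expensive refresh of every other table's stats.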

I have verified the short-term solution in this cluster, and it worked very well.
Before: [screenshot: analyze duration before the fix]

After: [screenshot: analyze duration after the fix]

Check my changes at #57638.
