ddl: improve the reorg task scheduling #38646

tangenta · 2022-10-26T06:55:10Z

What problem does this PR solve?

Issue Number: ref #35983

Problem Summary:

In the data reorganization stage of adding an index, we split the table ranges into a few tasks and send them to several workers. These workers, called "add index workers" or "backfill workers", run in parallel to speed up the creation of index records.

The main thread organizes the tasks in batches. Each batch contains @@tidb_ddl_reorg_worker_cnt tasks. The tasks in a batch are sent to the backfill workers one by one:

tidb/ddl/backfilling.go

Lines 393 to 396 in ac0d36b

    
    func (dc *ddlCtx) sendTasksAndWait(sessPool *sessionPool, reorgInfo *reorgInfo, totalAddedCount *int64, workers []*backfillWorker, batchTasks []*reorgBackfillTask) error {
    	for i, task := range batchTasks {
    		workers[i].taskCh <- task
    	}

We cannot proceed with the next batch until all the workers finish:

tidb/ddl/backfilling.go

Lines 367 to 369 in ac0d36b

    
    for i := 0; i < taskCnt; i++ {
    	worker := workers[i]
    	result := <-worker.resultCh

As a result, the time spent on a batch is determined by the slowest backfill worker. Workers that finish early sit idle until the batch completes, so CPU utilization is poor.

What is changed and how it works?

This PR proposes a better model: all backfill workers share the same task channel and the same result channel. Once a worker finishes a task, it can pick up the next one immediately without waiting.

However, this change breaks the order of task execution. For example, task 6 may be handled before task 5. We need another way to determine the "next handle", which is persisted to storage as a checkpoint. In this PR, doneTaskKeeper solves this problem.
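The idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the `task` type, the `doneKeeper` type, and `runTasks` are hypothetical stand-ins, and the real doneTaskKeeper may track completion differently. The sketch assumes tasks carry consecutive IDs and that the checkpoint may only advance across a contiguous prefix of finished tasks.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical stand-in for a backfill task.
type task struct {
	id      int
	nextKey string // key at which the following task starts
}

// doneKeeper is an illustrative stand-in for doneTaskKeeper (assumed design).
type doneKeeper struct {
	done       map[int]string // finished task id -> its nextKey
	nextID     int            // smallest task id not yet finished
	checkpoint string         // "next handle" that is safe to persist
}

func newDoneKeeper(startKey string) *doneKeeper {
	return &doneKeeper{done: map[int]string{}, checkpoint: startKey}
}

// finish records a completed task and advances the checkpoint across the
// longest contiguous run of finished tasks starting at nextID.
func (k *doneKeeper) finish(t task) {
	k.done[t.id] = t.nextKey
	for {
		key, ok := k.done[k.nextID]
		if !ok {
			return
		}
		delete(k.done, k.nextID)
		k.checkpoint = key
		k.nextID++
	}
}

// runTasks feeds all tasks through one shared task channel and collects
// results from one shared result channel, in whatever order they finish.
func runTasks(tasks []task, workerCnt int, startKey string) string {
	taskCh := make(chan task, len(tasks))
	resultCh := make(chan task, len(tasks))
	var wg sync.WaitGroup
	for w := 0; w < workerCnt; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range taskCh {
				resultCh <- t // a real worker would backfill index records here
			}
		}()
	}
	for _, t := range tasks {
		taskCh <- t
	}
	close(taskCh)
	go func() { wg.Wait(); close(resultCh) }()

	keeper := newDoneKeeper(startKey)
	for t := range resultCh {
		keeper.finish(t)
	}
	return keeper.checkpoint
}

func main() {
	tasks := []task{{0, "k1"}, {1, "k2"}, {2, "k3"}, {3, "k4"}}
	fmt.Println(runTasks(tasks, 2, "k0"))
}
```

In this sketch, if task 1 finishes before task 0, the checkpoint stays at the start key until task 0 is also done, so resuming from the checkpoint never skips unprocessed ranges.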

Check List

Tests

Unit test
Integration test

Manual test (add detailed scripts or steps below)

Local environment

Sysbench table sbtest1

sbtest1 | CREATE TABLE `sbtest1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`k` int(11) NOT NULL DEFAULT '0',
`c` char(120) NOT NULL DEFAULT '',
`pad` char(60) NOT NULL DEFAULT '',
PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
KEY `k_1` (`k`),
KEY `idx` (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=10224220

10 million records
@@tidb_ddl_enable_fast_reorg = 1
@@tidb_ddl_reorg_worker_cnt = 4
@@tidb_ddl_reorg_batch_size = 256

Before this PR:

mysql> alter table sbtest1 add index idx(k);
Query OK, 0 rows affected (36.71 sec)

[2022/10/26 13:54:49.549 +08:00] [INFO] [backfilling.go:713] ["[ddl] start backfill workers to reorg record"] [type="add index"] [workerCnt=4] [regionCnt=65] [startKey=7480000000000000485f728000000000000001] [endKey=7480000000000000485f72800000000098f84d]
[2022/10/26 13:54:49.549 +08:00] [INFO] [backfilling.go:289] ["[ddl] backfill worker start"] [type="add index"] [workerID=3]
...
[2022/10/26 13:55:10.579 +08:00] [INFO] [reorg.go:237] ["[ddl] run reorg job done"] ["handled rows"=10000000]
[2022/10/26 13:55:10.579 +08:00] [INFO] [backfilling.go:326] ["[ddl] backfill worker exit"] [type="add index"] [workerID=0]

It takes 21 seconds to finish the backfilling stage.

After this PR:

mysql> alter table sbtest1 add index idx(k);
Query OK, 0 rows affected (30.44 sec)

[2022/10/26 13:52:41.490 +08:00] [INFO] [backfilling.go:736] ["[ddl] start backfill workers to reorg record"] [type="add index"] [workerCnt=4] [regionCnt=65] [startKey=7480000000000000485f728000000000000001] [endKey=7480000000000000485f72800000000098f84d]
[2022/10/26 13:52:41.490 +08:00] [INFO] [backfilling.go:296] ["[ddl] backfill worker start"] [type="add index"] [workerID=3]
...
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=3]
[2022/10/26 13:52:56.673 +08:00] [INFO] [reorg.go:237] ["[ddl] run reorg job done"] ["handled rows"=10000000]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=0]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=1]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=2]

It takes 15 seconds to finish the backfilling stage.

No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2022-10-26T06:55:11Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

Defined2014
zimulala

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

zimulala

LGTM

tangenta · 2022-11-09T06:47:55Z

/hold because the unit test failed.

tangenta · 2022-11-09T08:20:10Z

/merge

ti-chi-bot · 2022-11-09T08:20:14Z

This pull request has been accepted and is ready to merge.

Commit hash: cb06517

hawkingrei · 2022-11-09T16:02:59Z

/merge

ti-chi-bot · 2022-11-09T16:03:04Z

This pull request has been accepted and is ready to merge.

Commit hash: 8099780

tangenta · 2022-11-10T02:22:30Z

/unhold

sre-bot · 2022-11-10T03:41:55Z

TiDB MergeCI notify

🔴 Bad News! New failing [1] after this pr merged.
These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

CI Name	Result	Duration	Compare with Parent commit
idc-jenkins-ci-tidb/integration-compatibility-test	🟥 failed 1, success 0, total 1	2 min 25 sec	New failing
idc-jenkins-ci-tidb/integration-ddl-test	🔴 failed 1, success 5, total 6	44 min	Existing failure
idc-jenkins-ci-tidb/mybatis-test	🔴 failed 1, success 0, total 1	11 min	Existing failure
idc-jenkins-ci/integration-cdc-test	🟢 all 39 tests passed	25 min	Existing passed
idc-jenkins-ci-tidb/tics-test	🟢 all 1 tests passed	16 min	Existing passed
idc-jenkins-ci-tidb/integration-common-test	🟢 all 17 tests passed	12 min	Existing passed
idc-jenkins-ci-tidb/common-test	🟢 all 11 tests passed	11 min	Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1	🟢 all 26 tests passed	5 min 33 sec	Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2	🟢 all 28 tests passed	5 min 14 sec	Existing passed
idc-jenkins-ci-tidb/plugin-test	🟢 build success, plugin test success	4min	Existing passed

ddl: improve the reorg task scheduling

3963614

ti-chi-bot added the release-note-none and size/L labels Oct 26, 2022

tangenta mentioned this pull request Oct 26, 2022

Improve the performance of adding index #35983

Closed

20 tasks

tangenta requested review from zimulala and xiongjiwei October 26, 2022 07:04

hawkingrei reviewed Oct 26, 2022

ddl: add backfill task scheduler

22edf5b

ti-chi-bot added the size/XL label and removed the size/L label Oct 26, 2022

Benjamin2037 self-requested a review October 27, 2022 01:08

Benjamin2037 reviewed Oct 27, 2022

Benjamin2037 reviewed Oct 27, 2022

refine code

0ea52c3

Defined2014 reviewed Oct 28, 2022

tangenta added 2 commits November 7, 2022 15:18

Merge remote-tracking branch 'upstream/master' into add-index-task-schedule

5887906

drain tasks if fail

ed9a260

Defined2014 reviewed Nov 8, 2022

refine waitTaskResults

06bbddc

Defined2014 approved these changes Nov 8, 2022

ti-chi-bot added the status/LGT1 label Nov 8, 2022

zimulala reviewed Nov 9, 2022

zimulala reviewed Nov 9, 2022

zimulala reviewed Nov 9, 2022

tangenta added 2 commits November 9, 2022 11:02

address comment

64946ca

refine log

ef125bd

zimulala approved these changes Nov 9, 2022

ti-chi-bot added the status/LGT2 label and removed the status/LGT1 label Nov 9, 2022

ti-chi-bot added the status/can-merge label Nov 9, 2022

ti-chi-bot added 3 commits November 9, 2022 12:24

Merge branch 'master' into add-index-task-schedule

d3bb32a

Merge branch 'master' into add-index-task-schedule

dbd09ea

Merge branch 'master' into add-index-task-schedule

be43257

ti-chi-bot added the do-not-merge/hold label Nov 9, 2022

fix send on closed channel

cb06517

ti-chi-bot added the size/XXL label and removed the status/can-merge and size/XL labels Nov 9, 2022

ti-chi-bot added the status/can-merge label Nov 9, 2022

fix data race

a8883fd

ti-chi-bot removed the status/can-merge label Nov 9, 2022

fix task discard issue

0746e5c

ti-chi-bot added the size/XL label and removed the size/XXL label Nov 9, 2022

fix linter

8099780

ti-chi-bot added the status/can-merge label Nov 9, 2022

ti-chi-bot removed the do-not-merge/hold label Nov 10, 2022

Merge branch 'master' into add-index-task-schedule

b09c1b6

Defined2014 approved these changes Nov 10, 2022

zimulala approved these changes Nov 10, 2022

ti-chi-bot merged commit cfbe3c9 into pingcap:master Nov 10, 2022

tangenta mentioned this pull request Oct 9, 2024

reorg handle not resumed after changing DDL owner #56506

Closed
