ddl: improve the reorg task scheduling #38646
LGTM
/hold because the unit test failed.
/merge
This pull request has been accepted and is ready to merge. Commit hash: cb06517
/merge
This pull request has been accepted and is ready to merge. Commit hash: 8099780
/unhold
TiDB MergeCI notify: 🔴 Bad News! New failing [1] after this PR merged.
What problem does this PR solve?
Issue Number: ref #35983
Problem Summary:
In the data reorganization stage of adding an index, we split the table ranges into a few tasks and send them to several workers. These workers, called "add index workers" or "backfill workers", run in parallel to speed up the creation of index records.
The main thread organizes the tasks in batches. Each batch contains `@@tidb_ddl_reorg_worker_cnt` tasks. The tasks in a batch are sent to the backfill workers one by one:
tidb/ddl/backfilling.go, lines 393 to 396 in ac0d36b
We cannot proceed with the next batch until all the workers finish:
tidb/ddl/backfilling.go, lines 367 to 369 in ac0d36b
As a result, the time spent on a batch is determined by its slowest backfill worker. Workers that finish early sit idle until the batch completes, so CPU utilization is poor.
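To make the cost of the batch barrier concrete, here is a minimal sketch that compares the two scheduling models on made-up task durations. The function names and the durations are hypothetical and are not taken from the TiDB code; the sketch only models elapsed time, not real backfill work.

```go
package main

import "fmt"

// batchMakespan models batch scheduling: tasks are dispatched in batches of
// `workers` tasks, and the next batch starts only after the slowest task in
// the current batch has finished.
func batchMakespan(durations []int, workers int) int {
	total := 0
	for start := 0; start < len(durations); start += workers {
		end := start + workers
		if end > len(durations) {
			end = len(durations)
		}
		slowest := 0
		for _, d := range durations[start:end] {
			if d > slowest {
				slowest = d
			}
		}
		total += slowest // the whole batch waits for its slowest task
	}
	return total
}

// sharedQueueMakespan models a shared task queue: each task, in order, is
// taken by whichever worker becomes free first.
func sharedQueueMakespan(durations []int, workers int) int {
	busyUntil := make([]int, workers)
	for _, d := range durations {
		earliest := 0
		for i := range busyUntil {
			if busyUntil[i] < busyUntil[earliest] {
				earliest = i
			}
		}
		busyUntil[earliest] += d
	}
	makespan := 0
	for _, t := range busyUntil {
		if t > makespan {
			makespan = t
		}
	}
	return makespan
}

func main() {
	// Hypothetical durations: one slow task in every group of four.
	durations := []int{1, 1, 1, 5, 1, 1, 1, 5}
	fmt.Println("batch:", batchMakespan(durations, 4))
	fmt.Println("shared queue:", sharedQueueMakespan(durations, 4))
}
```

With these assumed durations the batch model pays for the slow task twice in full, while the shared queue lets the idle workers start the second group early.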
What is changed and how it works?
This PR proposes a better model: all backfill workers share the same task channel and the same result channel. Once a worker finishes a task, it can pick up the next one immediately without waiting for the other workers.
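A minimal sketch of the shared-channel model, assuming simple integer task IDs. Because workers may finish in any order, the collector also records progress as the smallest task ID that has not yet finished. All names here (`task`, `result`, `checkpointKeeper`, `runTasks`) are hypothetical; in particular, `checkpointKeeper` only illustrates the idea and is not the PR's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// task and result are hypothetical stand-ins for the real reorg task types.
type task struct{ id int }
type result struct{ taskID int }

// checkpointKeeper (hypothetical) tracks finished task IDs. next is the
// smallest ID not yet finished; everything below it is safe to checkpoint.
type checkpointKeeper struct {
	next int
	done map[int]bool // finished tasks that are still ahead of next
}

func newCheckpointKeeper() *checkpointKeeper {
	return &checkpointKeeper{done: map[int]bool{}}
}

// finish records a finished task and advances next over any contiguous run.
func (k *checkpointKeeper) finish(id int) {
	k.done[id] = true
	for k.done[k.next] {
		delete(k.done, k.next)
		k.next++
	}
}

// runTasks runs n tasks on `workers` goroutines that share one task channel
// and one result channel, and returns the final checkpoint.
func runTasks(n, workers int) int {
	taskCh := make(chan task, n)
	resultCh := make(chan result, n)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A worker takes the next task as soon as it is free.
			for t := range taskCh {
				// Real backfill work would happen here.
				resultCh <- result{taskID: t.id}
			}
		}()
	}

	for i := 0; i < n; i++ {
		taskCh <- task{id: i}
	}
	close(taskCh)

	// Close the result channel once every worker has exited.
	go func() {
		wg.Wait()
		close(resultCh)
	}()

	keeper := newCheckpointKeeper()
	for r := range resultCh {
		keeper.finish(r.taskID) // results may arrive out of order
	}
	return keeper.next
}

func main() {
	fmt.Println("next unfinished task:", runTasks(10, 4))
}
```

The keeper advances only over a contiguous run of finished tasks, so a checkpoint taken from `next` never skips a task that is still in flight.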
However, this change breaks the order of task execution. For example, task 6 may be handled earlier than task 5. We need another way to determine the "next handle", which is persisted to storage as a checkpoint. In this PR, `doneTaskKeeper` solves this problem.
Check List
Tests
Manual test (add detailed scripts or steps below)
sbtest1
`@@tidb_ddl_enable_fast_reorg` = 1
`@@tidb_ddl_reorg_worker_cnt` = 4
`@@tidb_ddl_reorg_batch_size` = 256
Before this PR:
It takes 21 seconds to finish the backfilling stage.
After this PR:
It takes 15 seconds to finish the backfilling stage.
Side effects
Documentation
Release note