- Author: dhysum
- Tracking Issue:
DDL jobs are running asynchronously in the background, which may take a very long time to execute. The Admin could cancel the running DDL job if necessary, for example, out of resource.
As a DDL job may take a lot of effort, it could be a big waste to just cancel and restart the job. It has several stages and steps back in the yard to finish a DDL job, we may just pause it at some step and resume it right from that place.
Also, such feature will benefit the Upgrade of the TiDB Cluster.
Add two commands to pause and resume a long-running DDL job. In particularly:
-
admin pause ddl jobs 3,5;
The jobs (here are job 3 and job 5) should be in Running or Queueing(Wait) state. Other states should be rejected.
-
admin resume ddl jobs 3,5;
Only Paused jobs could be resumed. Other states should be rejected.
There is no change on the architecture, just following the current one.
The whole design is simple:
-
Add pause/resume in parser
-
Add builder in planner
-
Add builder in executor
-
Valid the job's state, and turn it to be Pausing, if
admin pause ...
by end user4.1 Valid the state first, only Running job could be paused, just like
admin cancel ...
4.2 Turn the state to be Pausing
4.3 Background worker will check the state, and turn the state to be Paused , then just return and stay in current stage; And this is what is different from
admin cancel ...
4.4 Background worker will continue fetching the job and check the state, and keep doing nothing until the state changed
4.5 Specially, the Reorg state could take a long time (maybe minutes, even hours), we also need to check the state of the job, and stop the reorg accordingly
-
Validate the job's state, and turn it to be Queueing, if
admin resume ...
by end user5.1 After the job's state changes, the background worker should check the job's state and continue the work
5.2 No other actions should be taken
-
Create an index
create index idx_usr on t_user(name);
-
In another session, show the DDL jobs
admin show ddl jobs 15;
It will show 15 DDL jobs, change the number if you want to see more. And then, find the running one (job_id) you want to pause.
-
Pause the job by
admin pause ddl jobs $job_id
---- $job_id is what you get from upper step -
Resume the paused job
Also, you can find the Paused jobs by (change the number if you want to see more):
admin show ddl jobs 15;
Then, reumse it by:
admin resume ddl jobs $job_id
None.