DDL progress can be blocked due to high concurrency #30400

Closed · tangenta opened this issue Dec 3, 2021 · 1 comment · Fixed by #30401
Assignees: tangenta
Labels: affects-4.0 · affects-5.0 · affects-5.1 · affects-5.2 · affects-5.3 · severity/major · sig/sql-infra · type/bug

Comments

tangenta (Contributor) commented Dec 3, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

The original scenario uses sysbench to create 10,000 tables through a load balancer. In a cluster with more than 10 TiDB instances, the issue is very easy to reproduce.

When the transaction that allocates DDL job IDs keeps rolling back due to write conflicts (e.g., more than 100 times), an error is sent back from another goroutine. However, this error is not handled properly. To reproduce the issue locally, inject a failpoint:

+++ b/ddl/ddl_worker.go
@@ -275,6 +275,8 @@ func (d *ddl) limitDDLJobs() {
        }
 }
 
+var firstTime = true
+
 // addBatchDDLJobs gets global job IDs and puts the DDL jobs in the DDL queue.
 func (d *ddl) addBatchDDLJobs(tasks []*limitJobTask) {
        startTime := time.Now()
@@ -300,6 +302,12 @@ func (d *ddl) addBatchDDLJobs(tasks []*limitJobTask) {
                        if err = t.EnQueueDDLJob(job, jobListKey); err != nil {
                                return errors.Trace(err)
                        }
+                       failpoint.Inject("mockAddBatchDDLJobsErr", func(val failpoint.Value) {
+                               if val.(bool) && job.SchemaName == "boom" && firstTime {
+                                       firstTime = false
+                                       failpoint.Return(errors.Errorf("mockAddBatchDDLJobsErr"))
+                               }
+                       })
                }
                return nil
        })
make failpoint-enable
make
GO_FAILPOINTS="github.com/pingcap/tidb/ddl/mockAddBatchDDLJobsErr=return(true)" ./bin/tidb-server
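Before running the repro session below, a note on the failpoint machinery used above: failpoint.Inject compiles to a no-op marker until `make failpoint-enable` rewrites it into an evaluation of the named failpoint, and GO_FAILPOINTS activates individual points at startup. A minimal standalone sketch of the same mechanism using pingcap/failpoint's public Eval/Enable API (the failpoint name example/mockErr is made up for illustration):

package main

import (
	"fmt"

	"github.com/pingcap/failpoint"
)

// mightFail returns an injected error only while the failpoint is active.
// This is roughly what `make failpoint-enable` rewrites a
// failpoint.Inject call into.
func mightFail() error {
	if val, err := failpoint.Eval("example/mockErr"); err == nil {
		if v, ok := val.(bool); ok && v {
			return fmt.Errorf("mockErr")
		}
	}
	return nil
}

func main() {
	fmt.Println(mightFail()) // <nil>: failpoint disabled

	// Equivalent to starting the binary with
	// GO_FAILPOINTS="example/mockErr=return(true)".
	_ = failpoint.Enable("example/mockErr", "return(true)")
	fmt.Println(mightFail()) // mockErr: failpoint enabled
	_ = failpoint.Disable("example/mockErr")
}

In the repro, the firstTime flag makes the injected error fire exactly once, which is enough to wedge the session: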
mysql> use test
Database changed
mysql> create database boom;
-- no response
^C^C -- query aborted
^C^C -- query aborted
^C^C -- query aborted
-- cannot be aborted
[2021/12/03 18:11:17.822 +08:00] [INFO] [ddl_worker.go:318] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:59, Type:create schema, State:none, SchemaState:queueing, SchemaID:58, TableID:0, RowCount:0, ArgLen:1, start time: 2021-12-03 18:11:17.821 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "]
[2021/12/03 18:11:17.822 +08:00] [INFO] [ddl.go:553] ["[ddl] start DDL job"] [job="ID:59, Type:create schema, State:none, SchemaState:queueing, SchemaID:58, TableID:0, RowCount:0, ArgLen:1, start time: 2021-12-03 18:11:17.821 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="create database boom"]

The connection leaks: even during graceful shutdown, the server still counts one open connection.

[2021/12/03 18:15:15.903 +08:00] [ERROR] [http_status.go:465] ["http server error"] [error="http: Server closed"]
[2021/12/03 18:15:15.905 +08:00] [ERROR] [http_status.go:460] ["grpc server error"] [error="mux: listener closed"]
[2021/12/03 18:15:15.905 +08:00] [INFO] [server.go:732] ["[server] graceful shutdown."]
[2021/12/03 18:15:15.905 +08:00] [INFO] [server.go:745] ["graceful shutdown..."] ["conn count"=1]

Fortunately, this does not affect DDL/DML from other sessions.
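The hang follows a classic dropped-error pattern. A minimal sketch in Go, based on the description above (the limitJobTask/addBatchDDLJobs/doDDLJob names follow the real code in ddl/, but the bodies here are illustrative assumptions, not the actual TiDB implementation): the batching goroutine does send the enqueue error back on the task's error channel, but the caller discards it and keeps waiting for a completion signal that can never arrive.

package main

import (
	"errors"
	"fmt"
	"time"
)

// limitJobTask mirrors the shape of the DDL task: a payload plus an
// error channel for the batching goroutine to report enqueue failures.
type limitJobTask struct {
	name string
	err  chan error
}

// addBatchDDLJobs stands in for the batching goroutine: the enqueue
// fails and the error *is* sent back on task.err.
func addBatchDDLJobs(task *limitJobTask) {
	task.err <- errors.New("mock enqueue failure after repeated write conflicts")
}

// doDDLJobBuggy receives the error but never checks it, then waits for
// a job-done signal that can no longer arrive -- the session hangs.
func doDDLJobBuggy(done <-chan struct{}) {
	task := &limitJobTask{name: "create database boom", err: make(chan error)}
	go addBatchDDLJobs(task)
	_ = <-task.err // error received here but silently dropped
	<-done         // blocks forever: the job was never enqueued
}

// doDDLJobFixed checks the received error before waiting.
func doDDLJobFixed(done <-chan struct{}) error {
	task := &limitJobTask{name: "create database boom", err: make(chan error)}
	go addBatchDDLJobs(task)
	if err := <-task.err; err != nil {
		return err // surface the failure instead of hanging
	}
	<-done
	return nil
}

func main() {
	done := make(chan struct{}) // never closed: the job never runs

	go doDDLJobBuggy(done)
	time.Sleep(100 * time.Millisecond)
	fmt.Println("buggy path is still blocked (the client connection appears hung)")

	if err := doDDLJobFixed(done); err != nil {
		fmt.Println("fixed path returns:", err)
	}
}

Checking the received error before entering the wait loop, as the fixed variant does, lets the session return the failure to the client instead of hanging; presumably this is the shape of the fix tracked in #30401.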

2. What did you expect to see? (Required)

Query OK, 1 row affected (0.06 sec)

3. What did you see instead (Required)

The query hangs indefinitely.

4. What is your TiDB version? (Required)

commit a04601477600b6804d7a4a2bd31a923bed7817c7 (HEAD, upstream/master)
Author: Song Gao <disxiaofei@163.com>
Date:   Wed Dec 1 11:23:53 2021 +0800

    planner: Add trace for proj elimination rule (#30275)
tangenta added the type/bug label Dec 3, 2021
tangenta self-assigned this Dec 3, 2021
github-actions bot commented Dec 6, 2021

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.
