dm-worker keeps retrying to execute ddl when encounter "invalid connection" error #4689

sleepymole · 2022-02-24T08:32:57Z

What did you do?

Replicate data from one MySQL to TiDB.
Execute ddl on upstream MySQL:

ALTER TABLE xxx MODIFY COLUMN xxx SMALLINT(4) NOT NULL DEFAULT _UTF8MB4'0'

What did you expect to see?

No error is reported.

What did you see instead?

DM hit an "invalid connection" error and kept retrying the DDL every 5 minutes.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v2.0.6

current status of DM cluster (execute `query-status <task-name>` in dmctl)

(paste current status of DM cluster here)

jiyfhust · 2022-02-24T11:03:40Z

I ran into this problem a few days ago. I found that MaxDDLConnectionTimeoutMinute, which is set to 5 minutes, causes it.

jiyfhust · 2022-02-24T11:05:28Z

Can I open a PR to fix this?

sleepymole · 2022-02-24T12:19:22Z

/cc @lance6716

lance6716 · 2022-02-24T12:24:46Z

> Can I open a PR to fix this?

Welcome! Before writing code, could you briefly describe your fix? We can discuss its effects in advance.

jiyfhust · 2022-02-25T02:03:48Z

1.

tiflow/dm/syncer/syncer.go

Line 3324 in c345857

dbCfg.RawDBCfg = config.DefaultRawDBConfig().SetReadTimeout(maxDDLConnectionTimeout)

SetReadTimeout(maxDDLConnectionTimeout)

2.

tiflow/dm/pkg/conn/basedb.go

Line 97 in c345857

if rawCfg.ReadTimeout != "" {
	dsn += fmt.Sprintf("&readTimeout=%s", rawCfg.ReadTimeout)
}

3. https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/dsn.go#L51

Timeout          time.Duration     // Dial timeout
ReadTimeout      time.Duration     // I/O read timeout
WriteTimeout     time.Duration     // I/O write timeout

4. https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/packets.go#L115

   conn.SetReadDeadline(time.Now().Add(mc.cfg.ReadTimeout))

DDL reorganization such as ADD INDEX or MODIFY COLUMN may take a long time, so would it be appropriate to set ReadTimeout to unlimited?
This "invalid connection" error is caused by the ReadTimeout.

However, going by its name, MaxDDLConnectionTimeoutMinute is a connection timeout, so it should map to the go-sql-driver field "Timeout time.Duration // Dial timeout".

Is DM setting the wrong parameter?

5.

tiflow/dm/syncer/error.go

Line 75 in c345857

invalidConnF := func(tctx *tcontext.Context, err error, ddls []string, index int, conn *dbconn.DBConn) error {

	// it only ignore `invalid connection` error (timeout or other causes) for `ADD INDEX`.
	// `invalid connection` means some data already sent to the server,
	// and we assume that the whole SQL statement has already sent to the server for this error.
	// if we have other methods to judge the DDL dispatched but timeout for executing, we can update this method.
	// NOTE: we must ensure other PK/UK exists for correctness.
	// NOTE: when we are refactoring the shard DDL algorithm, we also need to consider supporting non-blocking `ADD INDEX`.

There is a handler that ignores the "invalid connection" error for ADD INDEX. Is it still needed if ReadTimeout is set to unlimited?

So maybe there are three steps:

1. set ReadTimeout unlimited
2. add set connection timeout?
3. remove invalid connection ignore logic

/cc @lance6716

lance6716 · 2022-02-25T04:33:17Z

@jiyfhust
I have asked why we added a ReadTimeout here. The main scenario is that DM needs to distinguish a slow DDL (no COM_QUERY_Response for a long time) from a dead downstream. We can agree that for a slow DDL, DM should not retry it; for a dead downstream, DM should do something rather than wait.

"1. set ReadTimeout unlimited" by itself cannot tell whether a downstream is dead. And I'm not sure how to do "2. add set connection timeout". The Linux kernel has a TCP keepalive feature. If it is available, enabling it turns the problem into this: we can tell whether the TCP connection is dead, but will the downstream MySQL/TiDB/RDS still respond to the query later while the TCP connection is alive? For example, will MySQL/TiDB/RDS silently drop the query after receiving it? (I guess not, but I haven't checked the MySQL Client/Server Protocol.) Will MySQL/TiDB/RDS drop a query while reading from the socket and treat it as not received for some reason? Will MySQL/TiDB/RDS fail to send the COM_QUERY_Response and not retry?

If you can find some proof for the questions above and correctly set up TCP keepalive, I think it's OK to remove the ReadTimeout from the application layer entirely.

Another solution: after the ReadTimeout fires, we can use ADMIN SHOW DDL to check whether the downstream has really received the query. This is safer IMO but may need more code.
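The decision this would enable can be sketched as a small pure function. This is only an illustration under assumptions: `decideAfterTimeout` and the `action` type are hypothetical, and the state strings are modeled on TiDB DDL job states rather than taken from DM code.

```go
package main

import "fmt"

type action string

const (
	actionWait  action = "wait"  // DDL is still queued or running downstream
	actionSkip  action = "skip"  // DDL already finished; do not execute it again
	actionRetry action = "retry" // DDL was not received, or it did not succeed
)

// decideAfterTimeout picks the next step after an "invalid connection" error.
// found reports whether the downstream knows about the DDL at all, and
// state is the job state it reported (assumed names, see the lead-in).
func decideAfterTimeout(found bool, state string) action {
	if !found {
		return actionRetry
	}
	switch state {
	case "queueing", "running":
		return actionWait
	case "synced", "done":
		return actionSkip
	default:
		// Cancelled, rolled back, or unknown: treated as not applied here.
		return actionRetry
	}
}

func main() {
	fmt.Println(decideAfterTimeout(false, ""))
	fmt.Println(decideAfterTimeout(true, "running"))
	fmt.Println(decideAfterTimeout(true, "synced"))
}
```

Treating unknown states as "retry" is a simplification. A real implementation would likely want to surface them as errors instead.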

Feel free to discuss!

jiyfhust · 2022-02-25T09:55:47Z

By "2. add set connection timeout" I mean the timeout used when DM connects to the downstream, not something like TCP keepalive.

There seems to be no good way to check through the MySQL protocol whether the downstream is alive on a connection that is executing a SQL query. Does the "no COM_QUERY_Response for a long time" problem come from some MySQL proxy or LVS?

If we use ADMIN SHOW DDL or query information_schema.ddl_jobs, how do we identify the DDL issued by DM? By the DDL job_id, the query text, or some other way?
Hi @lance6716

jiyfhust · 2022-02-25T10:13:34Z

I think it would take me a long time to fix this by myself. It may be better for someone who is familiar with DM to fix it.

If DM syncs a DDL like "modify column", it may trigger a serious TiDB bug, whose fix was merged into 5.3.0 just three days ago.

ddl: fix concurrent column type changes(with changing data) that cause schema and data inconsistencies

lance6716 · 2022-02-27T03:50:57Z

In fact I haven't experienced "no COM_QUERY_Response for a long time" myself. I guess it can be caused by any component on the network path, for example a router going down.

DM knows the DDL in invalidConnF. Through ADMIN SHOW DDL JOB QUERIES <ID> or other means it can check whether the downstream has received the DDL, and through ADMIN SHOW DDL JOBS it can check whether DM's DDL has finished.
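One way to match DM's DDL against the downstream's recent jobs is by query text. The sketch below assumes the job list has already been fetched into memory. `ddlJob`, `normalize`, and `findJob` are hypothetical names, and the sample rows are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ddlJob is an in-memory stand-in for one row of the downstream's DDL job list.
type ddlJob struct {
	ID    int64
	Query string
	State string
}

// normalize trims a trailing semicolon and collapses runs of whitespace so
// that trivially different spellings of the same statement compare equal.
func normalize(q string) string {
	q = strings.TrimRight(strings.TrimSpace(q), ";")
	return strings.Join(strings.Fields(q), " ")
}

// findJob returns the most recent job (highest ID) whose query matches ddl.
func findJob(jobs []ddlJob, ddl string) (ddlJob, bool) {
	want := normalize(ddl)
	var best ddlJob
	found := false
	for _, j := range jobs {
		if normalize(j.Query) == want && (!found || j.ID > best.ID) {
			best, found = j, true
		}
	}
	return best, found
}

func main() {
	jobs := []ddlJob{ // made-up sample rows
		{ID: 101, Query: "ALTER TABLE t ADD INDEX idx_a (a)", State: "synced"},
		{ID: 102, Query: "ALTER TABLE t  MODIFY COLUMN c SMALLINT NOT NULL;", State: "running"},
	}
	if j, ok := findJob(jobs, "ALTER TABLE t MODIFY COLUMN c SMALLINT NOT NULL"); ok {
		fmt.Println("job", j.ID, "is", j.State)
	}
}
```

Matching on text alone can pick up an identical statement issued by someone else, so a real implementation may also want to narrow the search by time or job ID range.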

Don't worry, any kind of contribution is good!

sleepymole added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Feb 24, 2022

niubell self-assigned this Mar 3, 2022

lance6716 mentioned this issue Jun 21, 2022

DM-worker should not retry on long time ddls directly #5962

Closed

lance6716 assigned lyzx2001 and unassigned niubell Jul 7, 2022

lyzx2001 mentioned this issue Jul 13, 2022

Use 'ADMIN SHOW DDL JOB QUERIES LIMIT m OFFSET n' to retrieve DDL commands' content within a certain range (n+1, n+m) pingcap/tidb#36198

Closed

lyzx2001 mentioned this issue Aug 2, 2022

syncer(dm): add the function GetDDLStatusFromTidb() #6573

Merged

lance6716 added type/feature Issues about a new feature and removed type/bug The issue is confirmed as a bug. labels Aug 4, 2022

ti-chi-bot pushed a commit that referenced this issue Aug 12, 2022

syncer(dm): add the function GetDDLStatusFromTidb() (#6573)

2df4ad6

ref #4689

lyzx2001 mentioned this issue Aug 22, 2022

syncer(dm): Fix dm-worker keeps retrying to execute ddl when encountering "invalid connection" error #6848

Merged

ti-chi-bot closed this as completed in #6848 Sep 13, 2022

ti-chi-bot pushed a commit that referenced this issue Sep 13, 2022

syncer(dm): Fix dm-worker keeps retrying to execute ddl when encounte…

fd69a24

…ring "invalid connection" error (#6848) close #4689

lyzx2001 mentioned this issue Sep 16, 2022

syncer(dm): Add unit test and integration test for multi-schema change when encountering "invalid connection" error #7104

Merged

This was referenced Sep 26, 2022

add 6.3.0 release notes pingcap/docs#10249

Merged

add 6.3.0 release notes pingcap/docs-cn#11115

Merged

ti-chi-bot pushed a commit that referenced this issue Oct 27, 2022

syncer(dm): Add unit test and integration test for multi-schema chang…

b34ecff

…e when encountering "invalid connection" error (#7104) ref #4689
