Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dm-worker keeps retrying to execute ddl when encounter "invalid connection" error #4689

Closed
sleepymole opened this issue Feb 24, 2022 · 9 comments · Fixed by #6848
Closed
Assignees
Labels
area/dm Issues or PRs related to DM. type/feature Issues about a new feature

Comments

@sleepymole
Copy link
Contributor

What did you do?

  1. Replicate data from one MySQL to TiDB.
  2. Execute ddl on upstream MySQL:
ALTER TABLE xxx MODIFY COLUMN xxx SMALLINT(4) NOT NULL DEFAULT _UTF8MB4'0'

What did you expect to see?

No error is reported.

What did you see instead?

DM encountered "invalid connection" error and keeps retrying to execute ddl for every 5 minutes.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v2.0.6

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@sleepymole sleepymole added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Feb 24, 2022
@jiyfhust
Copy link

jiyfhust commented Feb 24, 2022

I meet this problem a few days ago,i found it is
MaxDDLConnectionTimeoutMinute set 5min cause the problem.

@jiyfhust
Copy link

can i make a pr to solve the problem?

@sleepymole
Copy link
Contributor Author

/cc @lance6716

@lance6716
Copy link
Contributor

lance6716 commented Feb 24, 2022

can i make a pr to solve the problem?

welcome! before writing codes, can you give some brief introduction of your fixing? we can discuss the effects in advance.

@jiyfhust
Copy link

  1. dbCfg.RawDBCfg = config.DefaultRawDBConfig().SetReadTimeout(maxDDLConnectionTimeout)

SetReadTimeout(maxDDLConnectionTimeout)

2.

if rawCfg.ReadTimeout != "" {

		dsn += fmt.Sprintf("&readTimeout=%s", rawCfg.ReadTimeout)

3.https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/dsn.go#L51

Timeout          time.Duration     // Dial timeout
ReadTimeout      time.Duration     // I/O read timeout
WriteTimeout     time.Duration     // I/O write timeout

4.https://github.com/go-sql-driver/mysql/blob/217d05049e5a88d529b9a2d5fe5675120831efab/packets.go#L115

   conn.SetReadDeadline(time.Now().Add(mc.cfg.ReadTimeout))

Because ddl reorganization like add index,modify column may take a long time, is this proper to set ReadTimeout ulimited?
this invalid connection error is the ReadTimeout problem.

but The MaxDDLConnectionTimeoutMinute is connection timeout argument, go-sql-driver packet should set the "Timeout time.Duration // Dial timeout"

Is dm not right set the argument?

5.

invalidConnF := func(tctx *tcontext.Context, err error, ddls []string, index int, conn *dbconn.DBConn) error {

    // it only ignore `invalid connection` error (timeout or other causes) for `ADD INDEX`.
// `invalid connection` means some data already sent to the server,
// and we assume that the whole SQL statement has already sent to the server for this error.
// if we have other methods to judge the DDL dispatched but timeout for executing, we can update this method.
// NOTE: we must ensure other PK/UK exists for correctness.
// NOTE: when we are refactoring the shard DDL algorithm, we also need to consider supporting non-blocking `ADD 
INDEX`.

there is a handler to ignore the invalid connection error when add index, is it also needed if set ReadTimeout unlimited?

so, maybe there is there step:

  1. set ReadTimeout ulimited
  2. add set connection timeout?
  3. remove invalid connetion ignore logic

/cc @lance6716

@lance6716
Copy link
Contributor

lance6716 commented Feb 25, 2022

@jiyfhust
I have ask the reason why we add a ReadTimeout here, the main scenario is DM needs to distinguish a slow DDL (no COM_QUERY_Response for a long time) from a dead downstream. We can agree that for a slow DDL, DM should not retry it again; for a dead downstream, DM should do something rather than waiting.

"1. set ReadTimeout ulimited" itself can't know a downstream is dead. And I'm not sure how to "2. add set connection timeout". There's are some linux kernel feature about TCP keepalive, if this feature is available, enabling it turns the problem into this: we can know the TCP connection is dead or not, but will the downstream MySQL/TiDB/RDS responses to the query in future when the TCP connection is not dead? For example, will MySQL/TiDB/RDS sliently drop the query after receiving it? (I guess no but haven't check it by MySQL Client/Server Protocol) Will MySQL/TiDB/RDS drop some query when reading from the socket and treat it as not received because of something? Will the MySQL/TiDB/RDS fail to send COM_QUERY_Response and not retry?

If you can find some proof about above question and correctly set the TCP keepalive feature, I think it's OK to totally remove the ReadTimeout from the application layer.

Another solution is after the ReadTimeout, we can use ADMIN SHOW DDL to check if downstream has really received the query, this is more safe IMO but may need more code work.

Feel free to discuss!

@jiyfhust
Copy link

jiyfhust commented Feb 25, 2022

"2. add set connection timeout" i mean it is the timeout when dm connecting the downstreams,not like tcp keepalive.

there seems no good method to check downstream alive through mysql protocol by a connection Executing sql query. Is the problem "no COM_QUERY_Response for a long time" occurred from some mysql proxy or lvs?

if we use ADMIN SHOW DDL or query information_schema.ddl_jobs, by what method to judge the dm ddl sql? May be ddl job_id or the query sql or some way else?
hi @lance6716

@jiyfhust
Copy link

I think it will take a long time to fix it by myself. Maybe some one who is familiar with dm to fix it is a better choice.

If dm syncer a ddl like "modify column", it may trigger a serious TiDB bug which is fixed and mergered to 5.3.0 just three days before.

ddl: fix concurrent column type changes(with changing data) that cause schema and data inconsistencies

@lance6716
Copy link
Contributor

In fact I haven't experienced "no COM_QUERY_Response for a long time", I guess it can be caused by any components in the network link, for example the router is down.

DM can know the DDL in invalidConnF. Through ADMIN SHOW DDL JOB QUERIES <ID> or other ways it can check if downstream has received the DDL, and through ADMIN SHOW DDL JOBS it can check if the DDL of DM is finished.

Don't worry, any kind of contribution is good!

@niubell niubell self-assigned this Mar 3, 2022
@lance6716 lance6716 assigned lyzx2001 and unassigned niubell Jul 7, 2022
@lance6716 lance6716 added type/feature Issues about a new feature and removed type/bug The issue is confirmed as a bug. labels Aug 4, 2022
ti-chi-bot pushed a commit that referenced this issue Sep 13, 2022
ti-chi-bot pushed a commit that referenced this issue Oct 27, 2022
…e when encountering "invalid connection" error (#7104)

ref #4689
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dm Issues or PRs related to DM. type/feature Issues about a new feature
Projects
None yet
5 participants