Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-worker gets stuck at baseDB.GetBaseConn #3733

Closed
GMHDBJD opened this issue Aug 25, 2021 · 1 comment · Fixed by #6055
Closed

DM-worker gets stuck at baseDB.GetBaseConn #3733

GMHDBJD opened this issue Aug 25, 2021 · 1 comment · Fixed by #6055
Assignees
Labels

Comments

@GMHDBJD
Copy link
Contributor

GMHDBJD commented Aug 25, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
# config haproxy for tidb-servers
backend tidb_peers
    balance roundrobin
    mode tcp
    server tidb1 host:port check
    server tidb2 host:port check
start-task
# scale out tidb, misconfigured haproxy
balance roundrobin
    mode tcp
    server tidb1 host:port check
    server tidb2 host:port check
    server tidb3 host:wrong_port check
    server tidb4 host:wrong_port check
# restart haproxy
  1. What did you expect to see?
    task paused by invalid connection error
  2. What did you see instead?
    task block after failed to reset connection for a long time
[2021/08/25 02:16:48.578 -04:00] [ERROR] [db.go:205] ["execute statements failed after retry"] [task=test] [unit="binlog replication"] [queries="[INSERT INTO `dm_meta`.`test_syncer_checkpoint`\n\t\t(id, cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global) VALUES\n\t\t(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)\n\t\tON DUPLICATE KEY UPDATE\n\t\t\tbinlog_name = VALUES(binlog_name),\n\t\t\tbinlog_pos = VALUES(binlog_pos),\n\t\t\tbinlog_gtid = VALUES(binlog_gtid),\n\t\t\texit_safe_binlog_name = VALUES(exit_safe_binlog_name),\n\t\t\texit_safe_binlog_pos = VALUES(exit_safe_binlog_pos),\n\t\t\texit_safe_binlog_gtid = VALUES(exit_safe_binlog_gtid),\n\t\t\ttable_info = VALUES(table_info),\n\t\t\tis_global = VALUES(is_global);\n\t INSERT INTO `dm_meta`.`test_syncer_checkpoint`\n\t\t(id, cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global) VALUES\n\t\t(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)\n\t\tON DUPLICATE KEY UPDATE\n\t\t\tbinlog_name = VALUES(binlog_name),\n\t\t\tbinlog_pos = VALUES(binlog_pos),\n\t\t\tbinlog_gtid = VALUES(binlog_gtid),\n\t\t\texit_s..."] [arguments="[[mysql-replica-01   mysql-bin.000001 1511   0  null true] [mysql-replica-01 shardddl1 tb1 mysql-bin.000001 1480   0  {\"id\":56,\"name\":{\"O\":\"tb1\",\"L\":\"tb1\"},\"charset\":\"utf8mb4\",\"collate\":\"utf8mb4_bin\",\"cols\":[{\"id\":1,\"name\":{\"O\":\"id\",\"L\":\"id\"},\"offset\":0,\"origin_default\":null,\"origin_default_bit\":null,\"default\":null,\"default_bit\":null,\"default_is_expr\":false,\"generated_expr_string\":\"\",\"generated_stored\":false,\"dependences\":null,\"type\":{\"Tp\":3,\"Flag\":4099,\"Flen\":11,\"Decimal\":0,\"Charset\":\"binary\",\"Collate\":\"binary\",\"Elems\":null},\"state\":5,\"comment\":\"\",\"hidden\":false,\"change_state_info\":null,\"version\":2}],\"index_info\":null,\"constraint_info\":null,\"fk_info\":null,\"state\":5,\"pk_is_handle\":true,\"is_common_handle\":false,\"common_handle_version\":0,\"comment\":\"\",\"auto_inc_id\":0,\"auto_id_cache\":0,\"auto_rand_id\":0,\"max_col_id\":1,\"max_idx_id\":0,\"max_cst_id\":0,\"update_timestamp\":427261212416278528,\"ShardRowIDBits\":0,\"max_shard_row_id_bits\":0,\"auto_random_bits\":0,\"pre_split_regions\":0,\"partition\":null,\"compression\":\"\",\"view\":nu..."] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]
[2021/08/25 02:16:48.580 -04:00] [WARN] [db.go:237] ["failed to close BaseConn in reset"] [task=test] [unit="binlog replication"]
goroutine 866 [IO wait]:
internal/poll.runtime_pollWait(0x7eff0c98ef18, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:222 +0x65
internal/poll.(*pollDesc).wait(0xc0070cd298, 0x72, 0x127a900, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x11a
internal/poll.(*pollDesc).waitRead(0xc0070cd298, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92 +0x74
internal/poll.(*FD).Read(0xc0070cd280, 0xc005e40000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x55f
net.(*netFD).Read(0xc0070cd280, 0xc005e40000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/fd_posix.go:55 +0xc5
net.(*conn).Read(0xc005e04060, 0xc005e40000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x115
github.com/go-sql-driver/mysql.(*buffer).fill(0xc0083478c0, 0x4, 0x0, 0x0)
	/home/gmhdbjd/develop/environment/go/pkg/mod/github.com/go-sql-driver/mysql@v1.5.0/buffer.go:90 +0x7f8
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc0083478c0, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/environment/go/pkg/mod/github.com/go-sql-driver/mysql@v1.5.0/buffer.go:119 +0xca
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc0083478c0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/environment/go/pkg/mod/github.com/go-sql-driver/mysql@v1.5.0/packets.go:31 +0xdf
github.com/go-sql-driver/mysql.(*mysqlConn).readHandshakePacket(0xc0083478c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/environment/go/pkg/mod/github.com/go-sql-driver/mysql@v1.5.0/packets.go:187 +0xb0
github.com/go-sql-driver/mysql.(*connector).Connect(0xc000010330, 0x8562ac0, 0xc00060e6c0, 0x0, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/environment/go/pkg/mod/github.com/go-sql-driver/mysql@v1.5.0/connector.go:81 +0xe71
database/sql.(*DB).conn(0xc000bb8c30, 0x8562ac0, 0xc00060e6c0, 0xc006cc6901, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:1301 +0x1622
database/sql.(*DB).Conn(0xc000bb8c30, 0x8562ac0, 0xc00060e6c0, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:1810 +0x105
github.com/pingcap/dm/pkg/conn.(*BaseDB).GetBaseConn(0xc0047143c0, 0x8562ac0, 0xc00060e6c0, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/code/dm/pkg/conn/basedb.go:162 +0x113
github.com/pingcap/dm/syncer/dbconn.CreateConns.func1(0xc0074d50b0, 0xc000c20888, 0x0, 0x0, 0x0)
	/home/gmhdbjd/develop/code/dm/syncer/dbconn/db.go:237 +0x298
github.com/pingcap/dm/syncer/dbconn.(*DBConn).ResetConn(0xc000c208a0, 0xc0074d50b0, 0x0, 0x0)
	/home/gmhdbjd/develop/code/dm/syncer/dbconn/db.go:55 +0x10e
github.com/pingcap/dm/syncer/dbconn.(*DBConn).ExecuteSQLWithIgnore.func1(0x0, 0x851de00, 0xc0070cd000, 0x851de00)
	/home/gmhdbjd/develop/code/dm/syncer/dbconn/db.go:157 +0x16f
github.com/pingcap/dm/pkg/retry.(*FiniteRetryStrategy).Apply(0xaa6f5f8, 0xc0074d50b0, 0x64, 0xb2d05e00, 0x1, 0xc00512a550, 0xc00512a5a0, 0x60c5780, 0x9e06580, 0x0, ...)
	/home/gmhdbjd/develop/code/dm/pkg/retry/strategy.go:72 +0x228
github.com/pingcap/dm/pkg/conn.(*BaseConn).ApplyRetryStrategy(0xc000c20888, 0xc0074d50b0, 0x64, 0xb2d05e00, 0x851e001, 0xc00512a550, 0xc00512a5a0, 0x0, 0x0, 0x0, ...)
	/home/gmhdbjd/develop/code/dm/pkg/conn/baseconn.go:224 +0x192
github.com/pingcap/dm/syncer/dbconn.(*DBConn).ExecuteSQLWithIgnore(0xc000c208a0, 0xc0074d50b0, 0x0, 0xc0000db500, 0x2, 0x64, 0xc000470000, 0x2, 0x64, 0x0, ...)
	/home/gmhdbjd/develop/code/dm/syncer/dbconn/db.go:184 +0xa8f
github.com/pingcap/dm/syncer/dbconn.(*DBConn).ExecuteSQL(0xc000c208a0, 0xc0074d50b0, 0xc0000db500, 0x2, 0x64, 0xc000470000, 0x2, 0x64, 0x0, 0x0, ...)
	/home/gmhdbjd/develop/code/dm/syncer/dbconn/db.go:216 +0x185
github.com/pingcap/dm/syncer.(*RemoteCheckPoint).FlushPointsExcept(0xc0009d37a0, 0xc000c30ee8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/gmhdbjd/develop/code/dm/syncer/checkpoint.go:556 +0x28c5
github.com/pingcap/dm/syncer.(*Syncer).flushCheckPoints(0xc000bf2400, 0x0, 0x0)
	/home/gmhdbjd/develop/code/dm/syncer/syncer.go:1154 +0x161a
github.com/pingcap/dm/syncer.(*Syncer).Run.func8(0xc004e486c0, 0xc005e9f768, 0xc000bf2400, 0xc0049aafc0, 0xc004806b40)
	/home/gmhdbjd/develop/code/dm/syncer/syncer.go:1663 +0xfde
github.com/pingcap/dm/syncer.(*Syncer).Run(0xc000bf2400, 0x8562a50, 0xc005125980, 0x851de00, 0xc0070ccd80)
	/home/gmhdbjd/develop/code/dm/syncer/syncer.go:1992 +0x82c6
github.com/pingcap/dm/syncer.(*Syncer).Process(0xc000bf2400, 0x8562a50, 0xc000a1e300, 0xc004bd5b60)
	/home/gmhdbjd/develop/code/dm/syncer/syncer.go:660 +0xa05
created by github.com/pingcap/dm/dm/worker.(*SubTask).run
	/home/gmhdbjd/develop/code/dm/dm/worker/subtask.go:230 +0xbcb
  1. Root Cause.
    We should use context with DefaultDBTimeout rather than parent ctx in DB.GetBaseConn, which makes db.Conn block for a long time
@lance6716 lance6716 transferred this issue from pingcap/dm Dec 6, 2021
@lance6716 lance6716 added area/dm Issues or PRs related to DM. type/bug The issue is confirmed as a bug. labels Dec 6, 2021
@GMHDBJD GMHDBJD self-assigned this Apr 11, 2022
@lance6716
Copy link
Contributor

Since this issue is related to cloud users, we should improve the priority. Hope we can fix it before the DM is publicly used in cloud

@lance6716 lance6716 assigned lance6716 and unassigned GMHDBJD Jun 27, 2022
ti-chi-bot pushed a commit that referenced this issue Jun 28, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this issue Jun 28, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this issue Jun 28, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this issue Jun 28, 2022
ti-chi-bot added a commit that referenced this issue Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants