Oracle syncer potential deadlock problems #1201

Open
Rustin170506 opened this issue Aug 25, 2022 · 5 comments
Comments

@Rustin170506
Member

Bug Report

  1. syncer.go line 279 calls OracleCheckPoint's Save method, which acquires the write lock on OracleCheckPoint and then executes _, err = sp.db.Exec(sql). That call blocks indefinitely inside cgo, so the write lock is never released and the syncer's main Run goroutine is blocked.

  2. server.go line 437 calls Syncer's updateStatus method, which in turn calls OracleCheckPoint's TS function. TS needs OracleCheckPoint's read lock, but it can never acquire it because of the stuck write lock above, so the two goroutines deadlock.

Because the write lock is never released, the server can never finish updateStatus, so drainer appears stuck and the checkpoint does not advance. A simplified sketch of the pattern is shown below.
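
A minimal sketch of the stall pattern (type, package, and statement are simplified for illustration; this is not the actual tidb-binlog code):

package checkpoint

import (
	"database/sql"
	"sync"
)

type oracleCheckPoint struct {
	sync.RWMutex
	db       *sql.DB
	commitTS int64
}

// Save takes the write lock and then blocks inside db.Exec (a cgo call in the
// Oracle driver). Because no context is passed, the call cannot be interrupted,
// so the write lock is never released.
func (sp *oracleCheckPoint) Save(ts int64) error {
	sp.Lock()
	defer sp.Unlock()

	sp.commitTS = ts
	query := "UPDATE checkpoint SET commit_ts = :1" // illustrative statement
	_, err := sp.db.Exec(query, ts)                 // may block forever in cgo
	return err
}

// TS only needs the read lock, but RLock queues behind the stuck writer above,
// so Server.updateStatus (its caller) hangs as well.
func (sp *oracleCheckPoint) TS() int64 {
	sp.RLock()
	defer sp.RUnlock()
	return sp.commitTS
}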

@Rustin170506
Member Author

The core reason for this problem is that we are using the API improperly.

We use Oracle's underlying driver https://github.com/godror/godror, and when executing SQL we call _, err = sp.db.Exec(sql) directly. Since no ctx is passed, there is no way to interrupt the execution if the lower level gets stuck or hits a problem while executing the statement. That is what causes the problem above.

To solve this we can call the db.ExecContext function and pass the upper ctx down to the lower level, so that execution is cancelled when the upper ctx times out or is cancelled. With that change the deadlock no longer occurs.
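
A sketch of the proposed change, continuing the simplified type from the sketch above (add "context" to its imports; this is not the actual patch):

// Save now threads the caller's ctx down to the driver, so a timeout or
// cancellation on the upper ctx interrupts the statement instead of holding
// the write lock forever.
func (sp *oracleCheckPoint) Save(ctx context.Context, ts int64) error {
	sp.Lock()
	defer sp.Unlock()

	sp.commitTS = ts
	query := "UPDATE checkpoint SET commit_ts = :1" // illustrative statement
	_, err := sp.db.ExecContext(ctx, query, ts)     // returns when ctx is canceled
	return err
}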

@Rustin170506
Member Author

Also, according to the underlying driver's documentation:

// ExecContext executes a query that doesn't return rows, such as an INSERT or UPDATE.
//
// ExecContext must honor the context timeout and return when it is canceled.
//
// Cancelation/timeout is honored, execution is broken, but you may have to disable out-of-bound execution - see https://github.com/oracle/odpi/issues/116 for details.
func (st *statement) ExecContext(ctx context.Context, args []driver.NamedValue) (driver.Result, error) {

We should always pass a cancelable ctx.
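
For example (a hypothetical helper for illustration; the one-minute timeout is only an example value, not a project default):

import (
	"context"
	"database/sql"
	"time"
)

// execWithTimeout derives a cancelable, time-bounded context from the caller's
// ctx before handing it to the driver, so a stuck statement is abandoned once
// the deadline passes.
func execWithTimeout(ctx context.Context, db *sql.DB, query string, args ...interface{}) error {
	execCtx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	_, err := db.ExecContext(execCtx, query, args...)
	return err
}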

Rustin170506 added the affects-6.1 and type/bug labels on Aug 25, 2022
@Rustin170506
Member Author

@cartersz Could you please take a look and help to fix it?

@Rustin170506
Member Author

Some goroutine stacks:

goroutine 8757414 [select, 2284 minutes]:
github.com/godror/godror.(*conn).handleDeadline.func3()
	/go/pkg/mod/github.com/godror/godror@v0.33.0/conn.go:128 +0xab
created by github.com/godror/godror.(*conn).handleDeadline
	/go/pkg/mod/github.com/godror/godror@v0.33.0/conn.go:127 +0x33b

goroutine 456 [syscall, 2284 minutes, locked to thread]:
github.com/godror/godror._Cfunc_dpiStmt_execute(0x7f3428046b00, 0x20, 0x0)
	_cgo_gotypes.go:4695 +0x4c
github.com/godror/godror.(*statement).ExecContext.func4.1(0x0?, 0x1de1118?)
	/go/pkg/mod/github.com/godror/godror@v0.33.0/stmt.go:452 +0x53
github.com/godror/godror.(*statement).ExecContext.func4()
	/go/pkg/mod/github.com/godror/godror@v0.33.0/stmt.go:452 +0x25
github.com/godror/godror.(*drv).checkExecNoLOT(0x0?, 0x13edfff?)
	/go/pkg/mod/github.com/godror/godror@v0.33.0/drv.go:269 +0x22
github.com/godror/godror.(*drv).checkExec(0xc002c534a0?, 0x61367f0?)
	/go/pkg/mod/github.com/godror/godror@v0.33.0/drv.go:263 +0x34
github.com/godror/godror.(*conn).checkExec(...)
	/go/pkg/mod/github.com/godror/godror@v0.33.0/conn.go:78
github.com/godror/godror.(*statement).ExecContext(0xc002c534a0, {0x4032b20, 0xc000052068}, {0x61367f0?, 0x0, 0x0})
	/go/pkg/mod/github.com/godror/godror@v0.33.0/stmt.go:458 +0xc17
database/sql.ctxDriverStmtExec({0x4032b20, 0xc000052068}, {0x4032d50, 0xc002c534a0}, {0x61367f0?, 0x0, 0x0})
	/usr/local/go/src/database/sql/ctxutil.go:65 +0xcc
database/sql.resultFromStatement({0x4032b20, 0xc000052068}, {0x4025ed0, 0xc00a24c000}, 0xc001de1818, {0x0, 0x0, 0x0})
	/usr/local/go/src/database/sql/sql.go:2663 +0x145
database/sql.(*DB).execDC(0x1?, {0x4032b20, 0xc000052068}, 0xc00bce0b40, 0xc001de19a0?, {0xc001e6cc40, 0x1af}, {0x0, 0x0, 0x0})
	/usr/local/go/src/database/sql/sql.go:1700 +0x473
database/sql.(*DB).exec(0x13c521e?, {0x4032b20, 0xc000052068}, {0xc001e6cc40, 0x1af}, {0x0, 0x0, 0x0}, 0x0?)
	/usr/local/go/src/database/sql/sql.go:1661 +0xe8
database/sql.(*DB).ExecContext(0x7f352b1413c8?, {0x4032b20, 0xc000052068}, {0xc001e6cc40, 0x1af}, {0x0, 0x0, 0x0})
	/usr/local/go/src/database/sql/sql.go:1635 +0xe5
database/sql.(*DB).Exec(...)
	/usr/local/go/src/database/sql/sql.go:1653
github.com/pingcap/tidb-binlog/drainer/checkpoint.(*OracleCheckPoint).Save.func1({0xc001de1ad0, 0x14c384a})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/checkpoint/oracle.go:163 +0x57
github.com/pingcap/tidb-binlog/pkg/util.RetryContext({0x4032b20, 0xc000052080}, 0x5, 0x4?, 0x1, 0xc001de1bb0)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:186 +0xbb
github.com/pingcap/tidb-binlog/drainer/checkpoint.(*OracleCheckPoint).Save(0xc0010b3700, 0x60af1078e900001, 0x0, 0x0, 0x2988)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/checkpoint/oracle.go:162 +0x386
github.com/pingcap/tidb-binlog/drainer.(*Syncer).savePoint(0xc005a8ac80, 0x60af1078e900001, 0x6102c40?, 0x2988)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:279 +0x3b3
github.com/pingcap/tidb-binlog/drainer.(*Syncer).handleSuccess(0xc005a8ac80, 0xc00340f1a0, 0xc00adacbb0)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:253 +0x365
github.com/pingcap/tidb-binlog/drainer.(*Syncer).run.func1()
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:299 +0x65
created by github.com/pingcap/tidb-binlog/drainer.(*Syncer).run
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:297 +0x193

goroutine 470 [semacquire, 2284 minutes]:
sync.runtime_SemacquireMutex(0xc0ac21366c8c4279?, 0x2c?, 0x6102c40?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:63
github.com/pingcap/tidb-binlog/drainer/checkpoint.(*OracleCheckPoint).TS(0xc0010b3700?)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/checkpoint/oracle.go:181 +0x59
github.com/pingcap/tidb-binlog/drainer.(*Syncer).GetLatestCommitTS(...)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/syncer.go:735
github.com/pingcap/tidb-binlog/drainer.(*Server).updateStatus(0xc001440000)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:437 +0x82
github.com/pingcap/tidb-binlog/drainer.(*Server).heartbeat(0x6e6f6974706f5f73?, {0x4032b90, 0xc0009ad950})
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:240 +0x56
github.com/pingcap/tidb-binlog/drainer.(*Server).Start.func1()
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:272 +0x7d
github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1()
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:79 +0xa7
created by github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:65 +0x171

@Rustin170506
Member Author

By the way, I think we should also set a logger for the driver. We can use godror.SetLogger for this, so that driver-level problems are visible in drainer's logs.
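
A minimal sketch, assuming the logr-based godror.SetLogger that godror v0.33 exposes; the go-logr/stdr adapter is just one convenient way to build a logr.Logger from the standard library logger:

package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
	"github.com/godror/godror"
)

func main() {
	// Route driver-level logs to stderr so stuck OCI/ODPI calls leave a trace.
	godror.SetLogger(stdr.New(log.New(os.Stderr, "[godror] ", log.LstdFlags)))
}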
