Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A context canceled error is not treated as undetermined #20733

Closed
zyguan opened this issue Oct 30, 2020 · 2 comments · Fixed by #20857
Closed

A context canceled error is not treated as undetermined #20733

zyguan opened this issue Oct 30, 2020 · 2 comments · Fixed by #20857
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/major sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug.

Comments

@zyguan
Copy link
Contributor

zyguan commented Oct 30, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Run rw-register case with nemesis of random_kill,kill_pd_leader_5min,partition_one,subcritical_skews,big_skews,shuffle-leader-scheduler,shuffle-region-scheduler,random-merge-scheduler. Here is the test record.

We found that one of transaction failed with context canceled:

[2020/10/29 09:08:46.035 +00:00] [WARN] [txn.go:514] ["wait tso failed"] [error="context canceled"]
[2020/10/29 09:08:46.035 +00:00] [INFO] [tidb.go:218] ["rollbackTxn for ddl/autocommit failed"]
[2020/10/29 09:08:46.035 +00:00] [WARN] [session.go:1066] ["run statement failed"] [conn=5] [schemaVersion=31] [error="[executor:1317]Query execution was interrupted"] [session="{\n  \"currDBName\": \"test\",\n  \"id\": 5,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"root\",\n    \"Hostname\": \"192.168.165.64\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"root\",\n    \"AuthHostname\": \"%\"\n  }\n}"]
[2020/10/29 09:08:46.035 +00:00] [INFO] [conn.go:787] ["command dispatched failed"] [conn=5] [connInfo="id:5, addr:192.168.165.64:14786 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="START TRANSACTION"] [txn_mode=PESSIMISTIC] [err="[executor:1317]Query execution was interrupted"]
[2020/10/29 09:08:46.035 +00:00] [ERROR] [terror.go:272] ["encountered error"] [error="write tcp 192.168.100.120:4000->192.168.165.64:14786: use of closed network connection"] [stack="github.com/pingcap/parser/terror.Log\n\t/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:272\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/conn.go:796\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/server.go:421"]
[2020/10/29 09:08:46.035 +00:00] [WARN] [client_batch.go:632] ["wait response is cancelled"] [to=tipocket-rw-register-qa-release-tikv-2.tipocket-rw-register-qa-release-tikv-peer.tipocket-rw-register-qa-release.svc:20160] [cause="context canceled"]
[2020/10/29 09:08:46.035 +00:00] [WARN] [session.go:472] ["can not retry txn"] [conn=7] [label=general] [error="context canceled"] [IsBatchInsert=false] [IsPessimistic=true] [InRestrictedSQL=false] [tidb_retry_limit=10] [tidb_disable_txn_auto_retry=true]
[2020/10/29 09:08:46.035 +00:00] [WARN] [session.go:487] ["commit failed"] [conn=7] ["finished txn"="Txn{state=invalid}"] [error="context canceled"]
[2020/10/29 09:08:46.035 +00:00] [WARN] [session.go:1066] ["run statement failed"] [conn=7] [schemaVersion=31] [error="previous statement: insert into register(id, sk, val) values (?, ?, ?) on duplicate key update val = ? [arguments: (380, 380, 593, 593)]: context canceled"] [session="{\n  \"currDBName\": \"test\",\n  \"id\": 7,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"root\",\n    \"Hostname\": \"192.168.165.64\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"root\",\n    \"AuthHostname\": \"%\"\n  }\n}"]
[2020/10/29 09:08:46.035 +00:00] [INFO] [conn.go:787] ["command dispatched failed"] [conn=7] [connInfo="id:7, addr:192.168.165.64:9402 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=COMMIT] [txn_mode=PESSIMISTIC] [err="context canceled\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/region_request.go:415\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/region_request.go:293\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/region_request.go:202\ngithub.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:1122\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:665\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:619\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:509\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).commitMutations\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:1270\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).execute\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/2pc.go:1409\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn).Commit\n\t/home/jenkins/agent/workspace/build-tidb/tidb/store/tikv/txn.go:308\ngithub.com/pingcap/tidb/session.(*TxnState).Commit\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/txn.go:279\ngithub.com/pingcap/tidb/session.(*session).doCommit\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:432\ngithub.com/pingcap/tidb/session.(*session).doCommitWithRetry\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:452\ngithub.com/pingcap/tidb/session.(*session).CommitTxn\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:510\ngithub.com/pingcap/tidb/session.autoCommitAfterStmt\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/tidb.go:230\ngithub.com/pingcap/tidb/session.finishStmt\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/tidb.go:196\ngithub.com/pingcap/tidb/session.runStmt\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/tidb.go:316\ngithub.com/pingcap/tidb/session.(*session).executeStatement\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:1063\ngithub.com/pingcap/tidb/session.(*session).execute\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:1175\ngithub.com/pingcap/tidb/session.(*session).Execute\n\t/home/jenkins/agent/workspace/build-tidb/tidb/session/session.go:1106\ngithub.com/pingcap/tidb/server.(*TiDBContext).Execute\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/driver_tidb.go:248\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/conn.go:1354\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/conn.go:985\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/conn.go:772\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/server.go:421\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\nprevious statement: insert into register(id, sk, val) values (?, ?, ?) on duplicate key update val = ? [arguments: (380, 380, 593, 593)]"]
[2020/10/29 09:08:46.036 +00:00] [ERROR] [terror.go:272] ["encountered error"] [error="write tcp 192.168.100.120:4000->192.168.165.64:9402: use of closed network connection"] [stack="github.com/pingcap/parser/terror.Log\n\t/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:272\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/conn.go:796\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-tidb/tidb/server/server.go:421"]

But actually the transaction is committed successfully becasue w(380,593) has been read by other transactions.

2020/10/29 09:08:45 suit.go:197: [info] 3 tipocket-rw-register-qa-release 172.16.5.96:32377: call {:type :invoke, :value [[:r 380 nil] [:w 380 592] [:r 383 nil] [:w 380 593]], :time 1603962525911018729}
2020/10/29 09:08:45 suit.go:204: [info] 3 tipocket-rw-register-qa-release 172.16.5.96:32377: return {:type :fail, :value [[:r 380 591] [:w 380 592] [:r 383 272] [:w 380 593]], :time 1603962525927760326, :error ["invalid connection"]}
2020/10/29 09:08:45 suit.go:197: [info] 4 tipocket-rw-register-qa-release 172.16.5.96:32377: call {:type :invoke, :value [[:r 379 nil] [:r 383 nil] [:r 380 nil] [:r 379 nil]], :time 1603962525940628688}
2020/10/29 09:08:45 suit.go:204: [info] 4 tipocket-rw-register-qa-release 172.16.5.96:32377: return {:type :ok, :value [[:r 379 953] [:r 383 274] [:r 380 593] [:r 379 953]], :time 1603962525961446789}

Thus the test failed because a G1a anomaly was found.

2. What did you expect to see? (Required)

TiDB should mark such an error as undetermined and return it to client.

3. What did you see instead (Required)

From the log of tidb-server, it didn't return a undetermined error.

4. What is your TiDB version? (Required)

release-4.0 @ 66ac9fc

@zyguan zyguan added the type/bug The issue is confirmed as a bug. label Oct 30, 2020
@cfzjywxk cfzjywxk added sig/transaction SIG:Transaction help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 30, 2020
@MyonKeminta
Copy link
Contributor

This is introduced by #20030 . I'm thinking of better way to solve the problem.

@ti-srebot
Copy link
Contributor

ti-srebot commented Nov 9, 2020

Please edit this comment to complete the following information

Bug

Note: Make Sure that 'component', and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

After #20030 , context canceled error is returned before setting rpcError field of RegionRequestSender, and that field is used in 2pc to check whether the RPC is in a undetermined state, so that it can check if the transaction is in undetermined state. Therefore context canceled error is not properly checked.

2. Symptom (optional)

In some rare cases, TiDB may report some transaction to be failed, however it might be succedded.

3. All Trigger Conditions (optional)

Context canceled during commit RPC (like disconnecting or RPC timeout)

4. Workaround (optional)

No

5. Affected versions

[v4.0.7:v4.0.8]

6. Fixed versions

v4.0.9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/major sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug.
Projects
None yet
5 participants