
TiDB panics when calling RegionStore.accessStore #20181

Closed
aylei opened this issue Sep 23, 2020 · 5 comments · Fixed by #20204
Assignees: lysu
Labels: component/tikv, severity/critical, type/bug (the issue is confirmed as a bug)



aylei commented Sep 23, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I'm not able to reproduce this issue.

2. What did you expect to see? (Required)

TiDB runs normally

3. What did you see instead (Required)

TiDB panics when calling RegionStore.accessStore

[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:829] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=47144] [currIdx=0] [leaderStoreID=66]
[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [current="region ID: 50568, meta: id:50568 start_key:\"t\\200\\000\\000\\000\\000\\000\\0001_i\\200\\000\\000\\000\\000\\000\\000\\001\\00186864293\\377-f4f6-59\\377f6-9d07-\\37731903362\\3778caf\\000\\000\\000\\000\\373\" end_key:\"t\\200\\000\\000\\000\\000\\000\\0001_i\\200\\000\\000\\000\\000\\000\\000\\001\\001869cf186\\377-162e-5c\\377bd-a0f0-\\3777549f6c7\\3779f13\\000\\000\\000\\000\\373\" region_epoch:<conf_ver:5 version:36 > peers:<id:50570 store_id:4 > peers:<id:50571 store_id:66 > , peer: id:50571 store_id:66 , addr: db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160, idx: 1, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="wait recvLoop: context deadline exceeded"]
[2020/09/23 09:15:43.978 +00:00] [INFO] [region_cache.go:414] ["invalidate current region, because others failed on same store"] [region=5600] [store=db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160]
[2020/09/23 09:15:43.980 +00:00] [INFO] [region_cache.go:829] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=11412] [currIdx=0] [leaderStoreID=66]
[2020/09/23 09:15:43.981 +00:00] [INFO] [region_cache.go:1630] ["[liveness] request kv status fail"] [store=0.0.0.0:20180] [error="Get https://0.0.0.0:20180/status: dial tcp 0.0.0.0:20180: connect: connection refused"]
panic: runtime error: index out of range [2] with length 2

goroutine 156282637 [running]:
github.com/pingcap/tidb/store/tikv.(*RegionStore).accessStore(...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:117
github.com/pingcap/tidb/store/tikv.(*RegionCache).OnSendFail(0xc0001ba2a0, 0xc18dc16b80, 0xc13ca4ed70, 0xc17619a000, 0x36cf480, 0xc1f42b9e40)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:578 +0x93a
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onSendFail(0xc0b05ae050, 0xc18dc16b80, 0xc13ca4ed70, 0x36cf480, 0xc1f42b9e40, 0xc13c675220, 0x4a817c800)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:455 +0x196
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion(0xc0b05ae050, 0xc18dc16b80, 0xc13ca4ed70, 0xc13c675220, 0x4a817c800, 0x0, 0x0, 0x36cf480, 0xc1f42b9e40)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:404 +0x261
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx(0xc0b05ae050, 0xc18dc16b80, 0xc13c675220, 0x1878, 0x5, 0x23, 0x4a817c800, 0x15c3800, 0xc13c904f70, 0xc1a67ca410, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:293 +0x2d0
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq(...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:202
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1076 +0x285
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1bce2d580, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1bce2d500, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1346efa00, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1346ef980, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc0941a5700, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc0941a5680, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x309b200, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*batchExecutor).startWorker.func1(0xc2631814c0, 0xc016cb3080, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1547 +0x17c
created by github.com/pingcap/tidb/store/tikv.(*batchExecutor).startWorker
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1529 +0x1b7
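
The panic itself is the classic Go out-of-range pattern: an index computed against one snapshot of a slice is later applied to a shorter slice. A minimal sketch of that failure class (illustrative only; the type and field names below are made up, not TiDB's actual code):

```go
package main

import "fmt"

// regionStore loosely mimics the shape of the real RegionStore: a slice of
// store addresses plus an access index recorded earlier. Names here are
// illustrative, not TiDB's actual fields.
type regionStore struct {
	stores    []string
	accessIdx int
}

// accessStore indexes into stores without re-validating accessIdx. If the
// slice was rebuilt with fewer elements in the meantime, this panics with
// "index out of range", the same class of failure as in the trace above.
func (r *regionStore) accessStore() string {
	return r.stores[r.accessIdx]
}

func main() {
	rs := &regionStore{
		stores:    []string{"store-4", "store-66", "store-7"},
		accessIdx: 2, // chosen while the region had 3 peers
	}
	// A concurrent cache reload filters out a down peer: length drops to 2.
	rs.stores = rs.stores[:2]

	defer func() {
		if err := recover(); err != nil {
			fmt.Println("recovered:", err)
		}
	}()
	fmt.Println(rs.accessStore()) // index 2 into a length-2 slice: panics
}
```

Running this prints "recovered: runtime error: index out of range [2] with length 2", matching the message in the trace.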

4. What is your TiDB version? (Required)

v4.0.6

@aylei aylei added the type/bug The issue is confirmed as a bug. label Sep 23, 2020
@lysu lysu self-assigned this Sep 23, 2020

lysu commented Sep 23, 2020

@aylei does this cluster have a TiFlash node, or is it using the follower-read feature?


aylei commented Sep 23, 2020

Related logs:

tidb.log.tar.gz
pd2.log.tar.gz
pd1.log.tar.gz
pd0.log.tar.gz


lysu commented Sep 24, 2020

After splitting the TiDB log, the entries look like this:

[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [current="region ID: 50568, meta: id:50568 start_key:"t\200\000\000\000\000\000\0001_i\200\000\000\000\000\000\000\001\00186864293\377-f4f6-59\377f6-9d07-\37731903362\3778caf\000\000\000\000\373" end_key:"t\200\000\000\000\000\000\0001_i\200\000\000\000\000\000\000\001\001869cf186\377-162e-5c\377bd-a0f0-\3777549f6c7\3779f13\000\000\000\000\373" region_epoch:<conf_ver:5 version:36 > peers:<id:50570 store_id:4 > peers:<id:50571 store_id:66 > , peer: id:50571 store_id:66 , addr: db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160, idx: 1, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="wait recvLoop: context deadline exceeded"]

grep TiKvOnly tidb.log | awk -F'peers' '{print NF}' | sort -n | uniq -c
  84351 3
  15104 4

we can see that many regions have only 2 peers (each occurrence of "peers" adds one awk field, so 2 peers give NF=3)

This issue was introduced by #17337: since that change, TiDB filters out unavailable peers after loading a region from PD and inserts the region with the filtered peer list into the region cache.

However, a change in DownPeers does not bump the region's epoch (TiKV collects down_peers just before heartbeating PD, based on peers that have not sent a heartbeat for a long time: https://github.com/tikv/tikv/blob/b7b0105d6e1bf889f35f67419c21dbbcdc041f07/components/raftstore/src/store/peer.rs#L818).

So TiDB can fetch different peer lists for the same region identity (id + conf_ver + ver) from PD at different times (https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35).

As a result, the peer list that was used to build the RpcCtx can change before OnSendFail runs (e.g. region1 has 3 alive peers before the request is sent, but one of them is marked down by the time the send fails) without any region epoch change, so the OnSendFail handler gets a new region whose peer list length no longer matches the RpcCtx's view.

Indexing with the stale position then panics.
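
The race described above can be sketched as follows. The types and the bounds-check mitigation are assumptions for illustration only (the names do not match TiDB's internals, and this is not necessarily what the fix in #20204 does):

```go
package main

import "fmt"

// Illustrative types only; names do not match TiDB's actual internals.
type rpcCtx struct {
	peerIdx int // peer index chosen when the request was built
}

type regionSnapshot struct {
	confVer, ver uint64   // region epoch: NOT bumped when a down peer is filtered
	peers        []uint64 // peer IDs after filtering
}

// onSendFail looks up the failed peer in the *current* snapshot. Because
// filtering down peers does not change the epoch, an epoch comparison passes
// even though the peer list shrank, so a stale peerIdx can be out of range.
// The bounds check below is one plausible mitigation: treat a stale index as
// "region changed, invalidate and retry" instead of panicking.
func onSendFail(ctx rpcCtx, cur regionSnapshot) (uint64, bool) {
	if ctx.peerIdx >= len(cur.peers) {
		return 0, false // peer list changed under us
	}
	return cur.peers[ctx.peerIdx], true
}

func main() {
	// Request built while the region had 3 alive peers; index 2 was chosen.
	ctx := rpcCtx{peerIdx: 2}
	// Meanwhile a reload filtered one down peer: same epoch, shorter list.
	cur := regionSnapshot{confVer: 5, ver: 36, peers: []uint64{50570, 50571}}
	if id, ok := onSendFail(ctx, cur); ok {
		fmt.Println("peer:", id)
	} else {
		fmt.Println("stale peer index; invalidate region and reload")
	}
}
```

Without the bounds check, the lookup would be exactly the out-of-range access seen in the stack trace.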


sre-bot commented Sep 27, 2020

Integrity check: the component, RCA, symptom, trigger_condition, affect_version, and fix_version fields are empty.
@lysu please comment /info to get the template.


ti-srebot commented Sep 27, 2020

Please edit this comment to complete the following information

Bug

1. Root Cause Analysis (RCA)

This issue was introduced by #17337: since that change, TiDB filters out unavailable peers after loading a region from PD and inserts the region with the filtered peer list into the region cache.

However, a change in DownPeers does not bump the region's epoch (TiKV collects down_peers just before heartbeating PD, based on peers that have not sent a heartbeat for a long time: https://github.com/tikv/tikv/blob/b7b0105d6e1bf889f35f67419c21dbbcdc041f07/components/raftstore/src/store/peer.rs#L818).

So TiDB can fetch different peer lists for the same region identity (id + conf_ver + ver) from PD at different times (https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35).

As a result, the peer list that was used to build the RpcCtx can change before OnSendFail runs (e.g. region1 has 3 alive peers before the request is sent, but one of them is marked down by the time the send fails) without any region epoch change, so the OnSendFail handler gets a new region whose peer list length no longer matches the RpcCtx's view.

Indexing with the stale position then panics.

2. Symptom

Users will see a panic containing "panic: runtime error: index out of range" and "(*RegionStore).accessStore" in tidb-stderr.log.

3. All Trigger Conditions

A TiKV node's peers have become down_peers, but PD has not yet removed them from the peer list; TiDB hits a send failure to a TiKV peer that is the last item in the peer list; and another SQL statement triggers loading the new (filtered) region from PD between sending the request and handling the send failure.

4. Workaround (optional)

Restart the TiDB server.

5. Affected versions

[v4.0.0:v4.0.7]

6. Fixed versions

v4.0.8
