Region heartbeat doesn't update region info properly #2149

Closed
youjiali1995 opened this issue Feb 25, 2020 · 0 comments · Fixed by #2198
Labels
type/bug The issue is confirmed as a bug.

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    In the Schrodinger test of the deadlock detector, I found that it took about 40s to find the true leader of the leader region from PD:
[2020/02/21 10:00:48.478 +00:00] [INFO] [deadlock.rs:593] ["changed from the leader of deadlock detector to follower!"] [self_id=6]
[2020/02/21 10:00:48.487 +00:00] [INFO] [deadlock.rs:574] ["stale leader info"]
...
[2020/02/21 10:01:28.040 +00:00] [INFO] [deadlock.rs:574] ["stale leader info"]
[2020/02/21 10:01:28.675 +00:00] [INFO] [deadlock.rs:576] ["leader changed"] [leader_addr=tidb-cluster-tikv-1.tidb-cluster-tikv-peer.deadlock-detector-exp28-cat0-tidb-cluster.svc:20160] [leader_id=4]
[2020/02/21 10:01:28.675 +00:00] [INFO] [deadlock.rs:638] ["reconnect leader succeeded"] [leader_id=4]
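
For context, the deadlock detector finds the leader of the "leader region" by asking PD which region contains the empty key and who its leader is. Below is a minimal Go sketch of that lookup. It assumes the release-3.0 Go pd client, whose `GetRegion` returns the region meta plus the leader peer; the actual detector logic lives in TiKV's deadlock.rs, so this is only illustrative.

```go
// Illustrative sketch, not the actual deadlock detector code: look up the
// "leader region", i.e. the region that currently contains the empty key "".
package main

import (
	"context"
	"fmt"
	"log"

	pd "github.com/pingcap/pd/client" // import path as used around release-3.0
)

func main() {
	// NewClient/GetRegion signatures follow the release-3.0 Go client; treat
	// them as an assumption if you are on another version.
	cli, err := pd.NewClient([]string{"127.0.0.1:2379"}, pd.SecurityOption{})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The "leader region" is whichever region covers the empty key.
	region, leader, err := cli.GetRegion(context.Background(), []byte(""))
	if err != nil {
		log.Fatal(err)
	}
	// leader may be unknown right after a leader change; GetStoreId() then returns 0.
	fmt.Printf("leader region id=%d, leader store=%d\n", region.GetId(), leader.GetStoreId())
}
```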

There is a leader shuffler that shuffles the leader of the leader region (the leader region is the region containing the key ""). As you can see from the log below, region #11920 is the source region of the merge and #11860 is the target region. Ten seconds after the merge, the shuffler still thinks the leader region is #11920.

[case-id: 210657] 2020/02/21 18:00:48 leader_shuffler.go:383: [info] [leader shuffler] [leader=6] Merge leader region #11920 and #11860
[case-id: 210657] 2020/02/21 18:00:58 leader_shuffler.go:282: [info] [leader shuffler] [leader=6] Transfer leader region #11920 from [5 4 6] to [1 4 5]
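
The two shuffler actions above correspond to PD operators (a merge of #11920 into #11860, then a leader/peer move). The shuffler itself (leader_shuffler.go) is part of the test harness and not shown here, but as a rough illustration, such operators can be added through PD's HTTP operator API. The endpoint and JSON field names below follow PD's /pd/api/v1/operators handler as I recall it; verify them against your PD version before relying on this sketch.

```go
// Hedged sketch: asking PD to merge two regions and transfer a leader via the
// HTTP operator API. Endpoint and field names are assumptions to verify.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func addOperator(pdAddr string, body map[string]interface{}) error {
	buf, err := json.Marshal(body)
	if err != nil {
		return err
	}
	resp, err := http.Post(pdAddr+"/pd/api/v1/operators", "application/json", bytes.NewReader(buf))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	log.Printf("add operator %v: %s", body, resp.Status)
	return nil
}

func main() {
	pdAddr := "http://127.0.0.1:2379"
	// Merge region #11920 (source) into region #11860 (target).
	_ = addOperator(pdAddr, map[string]interface{}{
		"name":             "merge-region",
		"source_region_id": 11920,
		"target_region_id": 11860,
	})
	// Transfer the leader of a region to another store.
	_ = addOperator(pdAddr, map[string]interface{}{
		"name":        "transfer-leader",
		"region_id":   11860,
		"to_store_id": 1,
	})
}
```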

Below is the log of the leader of region #11860. The region heartbeat was sent successfully.

[2020/02/21 10:00:48.497 +00:00] [INFO] [apply.rs:1825] ["execute CommitMerge"] [source_region="id: 11920 end_key: 7480000000000000FF0500000000000000F8 region_epoch { conf_ver: 435 version: 4777 } peers { id: 11921 store_id: 5 } peers { id: 11922 store_id: 4 } peers { id: 11923 store_id: 6 }"] [index=15] [term=6] [entries=1] [commit=10] [peer_id=11862] [region_id=11860]
[2020/02/21 10:00:48.497 +00:00] [INFO] [deadlock.rs:589] ["became the leader of deadlock detector!"] [self_id=4]
[2020/02/21 10:00:48.497 +00:00] [INFO] [peer.rs:1890] ["notify pd with merge"] [target_region="id: 11860 end_key: 7480000000000000FF0F00000000000000F8 region_epoch { conf_ver: 434 version: 4778 } peers { id: 11861 store_id: 5 } peers { id: 11862 store_id: 4 } peers { id: 11863 store_id: 6 }"] [source_region="id: 11920 end_key: 7480000000000000FF0500000000000000F8 region_epoch { conf_ver: 435 version: 4777 } peers { id: 11921 store_id: 5 } peers { id: 11922 store_id: 4 } peers { id: 11923 store_id: 6 }"] [peer_id=11862] [region_id=11860]
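
After CommitMerge is applied, the target region's leader reports the merge to PD through its region heartbeat (the "notify pd with merge" line above): the heartbeat carries the merged region's widened key range and bumped epoch. The following Go sketch only shows the rough shape of that message using the kvproto pdpb/metapb types; the values are copied from the log lines above, and the real message is built in TiKV's Rust code with raw key bytes rather than the hex strings printed in the log.

```go
// Hedged sketch: the approximate contents of the region heartbeat that the
// merged (target) region #11860 sends to PD after CommitMerge.
package main

import (
	"fmt"

	"github.com/pingcap/kvproto/pkg/metapb"
	"github.com/pingcap/kvproto/pkg/pdpb"
)

func mergedRegionHeartbeat() *pdpb.RegionHeartbeatRequest {
	region := &metapb.Region{
		Id:       11860,
		StartKey: []byte(""), // the merge extended the range to start from ""
		// Hex form as printed in the log; the real message carries raw bytes.
		EndKey: []byte("7480000000000000FF0F00000000000000F8"),
		RegionEpoch: &metapb.RegionEpoch{
			ConfVer: 434,
			Version: 4778, // bumped past both the source (4777) and old target (4755) versions
		},
		Peers: []*metapb.Peer{
			{Id: 11861, StoreId: 5},
			{Id: 11862, StoreId: 4},
			{Id: 11863, StoreId: 6},
		},
	}
	return &pdpb.RegionHeartbeatRequest{
		Region: region,
		Leader: &metapb.Peer{Id: 11862, StoreId: 4},
	}
}

func main() {
	fmt.Printf("%+v\n", mergedRegionHeartbeat())
}
```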

Below is the log of PD. PD received the heartbeat of region #11860 and updated its key range.

[2020/02/21 10:00:48.494 +00:00] [INFO] [operator_controller.go:293] ["add operator"] [region-id=11860] [operator="\"admin-merge-region (kind:admin,merge, region:11860(4755,434), createAt:2020-02-21 10:00:48.494316899 +0000 UTC m=+91025.692975359, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[merge region 11920 into region 11860]) \""]
[2020/02/21 10:00:48.498 +00:00] [INFO] [operator_controller.go:103] ["operator finish"] [region-id=11860] [operator="\"admin-merge-region (kind:admin,merge, region:11860(4755,434), createAt:2020-02-21 10:00:48.494316899 +0000 UTC m=+91025.692975359, startAt:2020-02-21 10:00:48.494511375 +0000 UTC m=+91025.693169846, currentStep:1, steps:[merge region 11920 into region 11860]) finished\""]
[2020/02/21 10:00:48.497 +00:00] [INFO] [cluster_info.go:551] ["region Version changed"] [region-id=11920] [detail="StartKey:{}, EndKey:{7480000000000000FF0500000000000000F8}"] [old-version=4776] [new-version=4777]
[2020/02/21 10:00:48.498 +00:00] [INFO] [cluster_info.go:551] ["region Version changed"] [region-id=11860] [detail="StartKey Changed:{7480000000000000FF0500000000000000F8} -> {}, EndKey:{7480000000000000FF0F00000000000000F8}"] [old-version=4755] [new-version=4778]
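
When PD writes back a region whose heartbeat now covers a wider range, it is expected to evict any previously cached regions that the new range overlaps, which is how the merged-away source region should disappear. If that overlap cleanup is skipped (or the heartbeat is mishandled), the source region #11920 lingers in PD and keeps being handed out to clients. The sketch below is a self-contained illustration of that cleanup step; it is not PD's actual core.RegionsInfo implementation, and the Region/RegionCache types are hypothetical.

```go
// Illustrative sketch (not PD's region cache code): storing an updated region
// should drop every other cached region whose key range it now covers.
package main

import "fmt"

type Region struct {
	ID       uint64
	StartKey string // "" means an open (infinite) end on either side
	EndKey   string
	Version  uint64
}

// overlaps reports whether [a.StartKey, a.EndKey) intersects [b.StartKey, b.EndKey),
// treating "" as an unbounded end key.
func overlaps(a, b Region) bool {
	startsBeforeBEnds := b.EndKey == "" || a.StartKey < b.EndKey
	endsAfterBStarts := a.EndKey == "" || b.StartKey < a.EndKey
	return startsBeforeBEnds && endsAfterBStarts
}

type RegionCache struct {
	regions map[uint64]Region
}

// SetRegion stores the updated region and evicts every other cached region
// whose range it now covers (e.g. the source region of a merge).
func (c *RegionCache) SetRegion(r Region) {
	for id, old := range c.regions {
		if id != r.ID && overlaps(old, r) {
			delete(c.regions, id) // drop the stale overlapped region
		}
	}
	c.regions[r.ID] = r
}

func main() {
	c := &RegionCache{regions: map[uint64]Region{
		11920: {ID: 11920, StartKey: "", EndKey: "7480...05...F8", Version: 4777},
		11860: {ID: 11860, StartKey: "7480...05...F8", EndKey: "7480...0F...F8", Version: 4755},
	}}
	// Heartbeat after CommitMerge: region #11860 now covers ["", 7480...0F...F8).
	c.SetRegion(Region{ID: 11860, StartKey: "", EndKey: "7480...0F...F8", Version: 4778})
	fmt.Println(c.regions) // #11920 should be gone; if it survives, we hit this bug
}
```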

Below is the log of TiDB. Region #11920 was destroyed at 10:00:48, but more than 20 seconds later it still existed in PD, so TiDB kept hitting "region 11920 is missing" errors.

[2020/02/21 10:01:09.016 +00:00] [WARN] [backoff.go:305] ["regionMiss backoffer.maxSleep 20000ms is exceeded, errors:\nmessage:\"region 11920 is missing\" region_not_found:<region_id:11920 >  at 2020-02-21T10:01:08.01143584Z\nmessage:\"region 11920 is missing\" region_not_found:<region_id:11920 >  at 2020-02-21T10:01:08.512857692Z\nmessage:\"region 11920 is missing\" region_not_found:<region_id:11920 >  at 2020-02-21T10:01:09.016322964Z"]
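
TiDB reacts to "region not found" by backing off and reloading region info from PD, and because PD kept answering with the stale region #11920, the retries burned through the whole 20s regionMiss budget seen above. The sketch below is a simplified backoff-and-reload loop; it is not TiDB's actual backoffer, and the function names and sleep parameters are illustrative only.

```go
// Illustrative sketch (not TiDB's tikv client code): retry a region lookup with
// capped exponential backoff until PD stops returning the merged-away region,
// or the regionMiss budget (20s in the log) is exhausted.
package main

import (
	"fmt"
	"time"
)

func lookupWithBackoff(load func() (uint64, error), missingID uint64, maxSleep time.Duration) (uint64, error) {
	sleep := 2 * time.Millisecond
	var slept time.Duration
	for {
		if id, err := load(); err == nil && id != missingID {
			return id, nil // PD finally served a fresh region
		}
		if slept >= maxSleep {
			return 0, fmt.Errorf("regionMiss backoffer.maxSleep %v exceeded: region %d is missing", maxSleep, missingID)
		}
		time.Sleep(sleep)
		slept += sleep
		if sleep < 500*time.Millisecond {
			sleep *= 2 // grow the backoff, roughly like TiDB's backoffer
		}
	}
}

func main() {
	// Fake PD: keeps answering with the stale region #11920 for a while,
	// then returns the merged region #11860.
	calls := 0
	load := func() (uint64, error) {
		calls++
		if calls < 5 {
			return 11920, nil // stale answer, treated as a region miss
		}
		return 11860, nil
	}
	id, err := lookupWithBackoff(load, 11920, 20*time.Second)
	fmt.Println(id, err)
}
```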

You can see all logs here: https://schrodinger.pingcap.net/?#/experimentdetail?experimentid=408178&boxid=210001

  2. What did you expect to see?
    PD updates region info correctly and in a timely manner.

  3. What version of PD are you using (pd-server -V)?
    branch: release-3.0
    hash: 1b67bbb
