Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tikv error when demote follower to learner #11463

Closed
dbsid opened this issue Nov 24, 2021 · 13 comments
Closed

tikv error when demote follower to learner #11463

dbsid opened this issue Nov 24, 2021 · 13 comments
Assignees
Labels
affects-5.3 This bug affects 5.3.x versions. severity/major type/bug The issue is confirmed as a bug.

Comments

@dbsid
Copy link
Contributor

dbsid commented Nov 24, 2021

Bug Report

What version of TiKV are you using?

[root@tikv-1-0 bin]# ./tikv-server -V
TiKV
Release Version:   5.3.0
Edition:           Community
Git Commit Hash:   2bba8cce69b1da1cfb1181ed4630b084ad7a542b
Git Commit Branch: heads/refs/tags/v5.2.3
UTC Build Time:    2021-11-20 04:29:29
Rust Version:      rustc 1.56.0-nightly (2faabf579 2021-07-27)
Enable Features:   jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile:           dist_release

What operating system and CPU are you using?

x86 linux

Steps to reproduce

  1. disable balance pd-ctl config set region-schedule-limit 0
  2. enable placement rules pd-ctl config placement-rules enable
  3. prepared rules.json
  {
    "group_id": "pd",
    "rules": [
      {
        "group_id": "pd",
        "id": "default",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1
      },
      {
        "group_id": "pd",
        "id": "learner",
        "start_key": "",
        "end_key": "",
        "role": "learner",
        "count": 2,
        "location_labels": [ "zone", "host"]
      }
    ]
  }
]
  1. apply the rules pd-ctl config placement-rules rule-bundle save --in="rules.json"

What did you expect?

followers demoted to learner

What did happened?

no learners from the pd monitor panel.
pd error observed

"fix-demote-voter {demote peer: store [2]} (kind:leader, region:27297(575,5), createAt:2021-11-24 14:13:03.029484698 +0800 CST m=+1723.126547708, startAt:2021-11-24 14:13:03.02956604 +0800 CST m=+1723.126629057, currentStep:1, steps:[transfer leader from store 2 to store 1, demote follower peer 27299 on store 2 to learner]) timeout",
  "fix-demote-voter {demote peer: store [2]} (kind:leader, region:27241(575,5), createAt:2021-11-24 14:13:03.026463593 +0800 CST m=+1723.123526603, startAt:2021-11-24 14:13:03.026487511 +0800 CST m=+1723.123550523, currentStep:1, steps:[transfer leader from store 2 to store 1, demote follower peer 27243 on store 2 to learner]) timeout",
  "fix-demote-voter {demote peer: store [2]} (kind:unknown, region:28201(531,5), createAt:2021-11-24 14:13:03.083497632 +0800 CST m=+1723.180560641, startAt:2021-11-24 14:13:03.083519035 +0800 CST m=+1723.180582046, currentStep:0, steps:[demote follower peer 28203 on store 2 to learner]) timeout",
  "fix-demote-voter {demote peer: store [2]} (kind:unknown, region:29229(470,5), createAt:2021-11-24 14:13:03.123877868 +0800 CST m=+1723.220940885, startAt:2021-11-24 14:13:03.123914335 +0800 CST m=+1723.220977375, currentStep:0, steps:[demote follower peer 29231 on store 2 to learner]) timeout",
  "fix-demote-voter {demote peer: store [2]} (kind:unknown, region:29541(580,5), createAt:2021-11-24 14:13:03.138161506 +0800 CST m=+1723.235224516, startAt:2021-11-24 14:13:03.138180225 +0800 CST m=+1723.235243236, currentStep:0, steps:[demote follower peer 29543 on store 2 to learner]) timeout"

tikv error observed

[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1755] ["exec ConfChange"] [epoch="conf_ver: 5 version: 363"] [type=AddLearner] [peer_id=26238] [region_id=26237]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1248] ["execute raft command"] [err_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/fsm/apply.rs:1884]: can't add duplicated learner id: 26223 store_id: 2 role: Learner to region id: 26221 start_key: 7480000000000000FF415F728000000000FF1F9C1A0000000000FA end_key: 7480000000000000FF415F728000000000FF25FD970000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26222 store_id: 1 } peers { id: 26223 store_id: 2 } peers { id: 26224 store_id: 5 }\")"] [peer_id=26222] [region_id=26221]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1877] ["can't add duplicated learner"] [region="id: 26237 start_key: 7480000000000000FF415F728000000000FF394C840000000000FA end_key: 7480000000000000FF415F728000000000FF3FB3090000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26238 store_id: 1 } peers { id: 26239 store_id: 2 } peers { id: 26240 store_id: 5 }"] [peer="id: 26239 store_id: 2 role: Learner"] [peer_id=26238] [region_id=26237]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1383] ["execute admin command"] [command="cmd_type: ChangePeer change_peer { change_type: AddLearnerNode peer { id: 26283 store_id: 2 role: Learner } }"] [index=933] [term=29] [peer_id=26282] [region_id=26281]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1755] ["exec ConfChange"] [epoch="conf_ver: 5 version: 363"] [type=AddLearner] [peer_id=26282] [region_id=26281]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1877] ["can't add duplicated learner"] [region="id: 26281 start_key: 7480000000000000FF415F728000000000FF7FA3860000000000FA end_key: 7480000000000000FF415F728000000000FF8602350000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26282 store_id: 1 } peers { id: 26283 store_id: 2 } peers { id: 26284 store_id: 5 }"] [peer="id: 26283 store_id: 2 role: Learner"] [peer_id=26282] [region_id=26281]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1248] ["execute raft command"] [err_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/fsm/apply.rs:1884]: can't add duplicated learner id: 26239 store_id: 2 role: Learner to region id: 26237 start_key: 7480000000000000FF415F728000000000FF394C840000000000FA end_key: 7480000000000000FF415F728000000000FF3FB3090000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26238 store_id: 1 } peers { id: 26239 store_id: 2 } peers { id: 26240 store_id: 5 }\")"] [peer_id=26238] [region_id=26237]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1248] ["execute raft command"] [err_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/fsm/apply.rs:1884]: can't add duplicated learner id: 26283 store_id: 2 role: Learner to region id: 26281 start_key: 7480000000000000FF415F728000000000FF7FA3860000000000FA end_key: 7480000000000000FF415F728000000000FF8602350000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26282 store_id: 1 } peers { id: 26283 store_id: 2 } peers { id: 26284 store_id: 5 }\")"] [peer_id=26282] [region_id=26281]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1383] ["execute admin command"] [command="cmd_type: ChangePeer change_peer { change_type: AddLearnerNode peer { id: 26503 store_id: 2 role: Learner } }"] [index=904] [term=23] [peer_id=26502] [region_id=26501]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1755] ["exec ConfChange"] [epoch="conf_ver: 5 version: 363"] [type=AddLearner] [peer_id=26502] [region_id=26501]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1877] ["can't add duplicated learner"] [region="id: 26501 start_key: 7480000000000000FF4B5F728000000000FF2C6AC20000000000FA end_key: 7480000000000000FF4B5F728000000000FF32DDE10000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26502 store_id: 1 } peers { id: 26503 store_id: 2 } peers { id: 26504 store_id: 5 }"] [peer="id: 26503 store_id: 2 role: Learner"] [peer_id=26502] [region_id=26501]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1383] ["execute admin command"] [command="cmd_type: ChangePeer change_peer { change_type: AddLearnerNode peer { id: 26495 store_id: 2 role: Learner } }"] [index=920] [term=27] [peer_id=26494] [region_id=26493]
[2021/11/24 14:28:32.985 +08:00] [INFO] [apply.rs:1755] ["exec ConfChange"] [epoch="conf_ver: 5 version: 363"] [type=AddLearner] [peer_id=26494] [region_id=26493]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1877] ["can't add duplicated learner"] [region="id: 26493 start_key: 7480000000000000FF4B5F728000000000FF1FA3EC0000000000FA end_key: 7480000000000000FF4B5F728000000000FF2610740000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26494 store_id: 1 } peers { id: 26495 store_id: 2 } peers { id: 26496 store_id: 5 }"] [peer="id: 26495 store_id: 2 role: Learner"] [peer_id=26494] [region_id=26493]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1248] ["execute raft command"] [err_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/fsm/apply.rs:1884]: can't add duplicated learner id: 26503 store_id: 2 role: Learner to region id: 26501 start_key: 7480000000000000FF4B5F728000000000FF2C6AC20000000000FA end_key: 7480000000000000FF4B5F728000000000FF32DDE10000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26502 store_id: 1 } peers { id: 26503 store_id: 2 } peers { id: 26504 store_id: 5 }\")"] [peer_id=26502] [region_id=26501]
[2021/11/24 14:28:32.985 +08:00] [ERROR] [apply.rs:1248] ["execute raft command"] [err_code=KV:Raftstore:Unknown] [err="Other(\"[components/raftstore/src/store/fsm/apply.rs:1884]: can't add duplicated learner id: 26495 store_id: 2 role: Learner to region id: 26493 start_key: 7480000000000000FF4B5F728000000000FF1FA3EC0000000000FA end_key: 7480000000000000FF4B5F728000000000FF2610740000000000FA region_epoch { conf_ver: 5 version: 363 } peers { id: 26494 store_id: 1 } peers { id: 26495 store_id: 2 } peers { id: 26496 store_id: 5 }\")"] [peer_id=26494] [region_id=26493]
@dbsid
Copy link
Contributor Author

dbsid commented Nov 24, 2021

/assign NingLin-P

@NingLin-P
Copy link
Member

NingLin-P commented Nov 24, 2021

When demoting a voter to a learner with simple confchange, the following check will fail because L1876 only checks whether the current region peer list contains the incoming learner peer id but not whether its role is the requested role.

Previously there are no such schedule (demoting a voter to a learner with simple confchange) but it is supported recently by PD and hence lead to this issue.

ConfChangeType::AddLearnerNode => {
PEER_ADMIN_CMD_COUNTER_VEC
.with_label_values(&["add_learner", "all"])
.inc();
if util::find_peer(&region, store_id).is_some() {
error!(
"can't add duplicated learner";
"region_id" => self.region_id(),
"peer_id" => self.id(),
"peer" => ?peer,
"region" => ?&self.region
);
return Err(box_err!(
"can't add duplicated learner {:?} to region {:?}",
peer,
self.region
));
}
region.mut_peers().push(peer.clone());
PEER_ADMIN_CMD_COUNTER_VEC
.with_label_values(&["add_learner", "success"])
.inc();
info!(
"add learner successfully";
"region_id" => self.region_id(),
"peer_id" => self.id(),
"peer" => ?peer,

@BusyJay
Copy link
Member

BusyJay commented Nov 24, 2021

Why PD wants to demote follower to learner with single step?

@HunDunDM
Copy link
Member

HunDunDM commented Dec 3, 2021

After PD enables joint consensus, joint consensus will not be used for changes that only have a single demote. But PD did not produce such scheduling before. DemoteFollower must be used as recorded in pd#4140, which is supported in v5.3.0.

@HunDunDM
Copy link
Member

HunDunDM commented Dec 3, 2021

DemoteFollower was tested when joint consensus was introduced in v5.0.0.

@w41ter
Copy link
Contributor

w41ter commented Dec 6, 2021

@BusyJay @NingLin-P
Do you have any suggestions for fixing this problem? It seems that may cause incompatibility or even insecurity if modify the logic in apply.rs directly.

@w41ter
Copy link
Contributor

w41ter commented Dec 6, 2021

HeartbeatResponse::change_peer_v2 would demoting a voter directly, how about use change_peer_v2 instead of change_peer to demote a voter to learner?

@BusyJay
Copy link
Member

BusyJay commented Dec 6, 2021

Isn't just rejecting the proposal should be enough?

@Lily2025
Copy link

Lily2025 commented Dec 6, 2021

/type bug

@ti-chi-bot ti-chi-bot added the type/bug The issue is confirmed as a bug. label Dec 6, 2021
@BusyJay
Copy link
Member

BusyJay commented Dec 7, 2021

HeartbeatResponse::change_peer_v2 would demoting a voter directly, how about use change_peer_v2 instead of change_peer to demote a voter to learner?

I see your question. It feels good to me. /cc @NingLin-P

@w41ter
Copy link
Contributor

w41ter commented Dec 9, 2021

@HunDunDM I have submit a PR to solve this problem, would you like to help me review the code?

@NingLin-P
Copy link
Member

HeartbeatResponse::change_peer_v2 would demoting a voter directly, how about use change_peer_v2 instead of change_peer to demote a voter to learner?

lgtm

@HunDunDM
Copy link
Member

HunDunDM commented Jan 6, 2022

fixed by tikv/pd#4534

@jebter jebter added the affects-5.3 This bug affects 5.3.x versions. label Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.3 This bug affects 5.3.x versions. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

8 participants