
Regions get stuck in 2 voters, 1 down peer, 1 learner state #6559

Closed
v01dstar opened this issue Jun 6, 2023 · 1 comment · Fixed by #6831
Assignees
nolouch

Labels
affects-6.1 (This bug affects the 6.1.x (LTS) versions.)
affects-6.5 (This bug affects the 6.5.x (LTS) versions.)
affects-7.1 (This bug affects the 7.1.x (LTS) versions.)
report/customer (Customers have encountered this bug.)
severity/critical
type/bug (The issue is confirmed as a bug.)

Comments

@v01dstar (Contributor) commented on Jun 6, 2023

Bug Report

What did you do?

In a 3-node cluster, replace a broken store with a new one.

What did you expect to see?

The cluster returns to normal after the operation.

What did you see instead?

The TiKVRegionPendingPeerTooLong alarm is fired.

There are 3 regions that have been experiencing the "pending peer" problem for 2 days. They all have 4 peers: 2 regular healthy voters, 1 healthy learner (located on the new store 2751139), and 1 down peer (on the manually deleted store 4).

Example region info:
{
  "id": 55929554,
  "epoch": {
    "conf_ver": 6,
    "version": 109399
  },
  "peers": [
    {
      "id": 55929555,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    },
    {
      "id": 55929557,
      "store_id": 5,
      "role_name": "Voter"
    },
    {
      "id": 55929558,
      "store_id": 2751139,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 55929555,
    "store_id": 1,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 40307,
      "peer": {
        "id": 55929556,
        "store_id": 4,
        "role_name": "Voter"
      }
    }
  ],
  "pending_peers": [
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 40960
}

This state is probably the result of an unfinished recovery process. Normally, PD can resolve this intermediate state automatically in one of 2 ways:

  • This state does not comply with the 3-replica rule, so PD tries to remove one replica, preferring the peer with an "unusual role" (in this case, the learner). However, this operation requires all other peers to be healthy, which is not true here, so it is skipped. This can be confirmed by the PD metric "skip-remove-orphan-peer".
  • This state does not comply with the "no down peer" rule, so PD tries to remove the down peer and add a new one. This is done in three steps: 1. add a learner; 2. promote the learner and demote the voter through joint consensus; 3. remove the demoted learner. However, since this cluster only has 3 nodes and all of them already hold a peer of these regions, this operation cannot proceed either. This can be confirmed by the PD metrics "replace-down" and "no-store-replace".

Because of the above constraints, these 3 regions get stuck in this state.
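
For illustration, here is a minimal Go sketch of why both checker paths skip a region shaped like this one. This is not PD code: the Peer type, the two helper functions, and their names are assumptions made up for this example; they only mirror the two conditions described above.

```go
package main

import "fmt"

// Peer is a simplified stand-in for a region peer (illustrative only).
type Peer struct {
	StoreID   uint64
	IsLearner bool
	IsDown    bool
}

// canRemoveOrphanPeer mirrors the "remove orphan peer" path: the extra peer
// is only removed when every other peer of the region is healthy.
func canRemoveOrphanPeer(peers []Peer) bool {
	for _, p := range peers {
		if p.IsDown {
			return false // corresponds to the "skip-remove-orphan-peer" metric
		}
	}
	return true
}

// canReplaceDownPeer mirrors the "replace down peer" path: PD needs a store
// that does not already hold a peer of this region to place the new learner.
func canReplaceDownPeer(peers []Peer, stores []uint64) bool {
	used := make(map[uint64]bool)
	for _, p := range peers {
		used[p.StoreID] = true
	}
	for _, s := range stores {
		if !used[s] {
			return true
		}
	}
	return false // corresponds to the "no-store-replace" metric
}

func main() {
	// The reported shape: 2 healthy voters, 1 down voter (old store 4),
	// 1 healthy learner (new store 2751139), on a 3-store cluster.
	peers := []Peer{
		{StoreID: 1},
		{StoreID: 5},
		{StoreID: 4, IsDown: true},
		{StoreID: 2751139, IsLearner: true},
	}
	stores := []uint64{1, 5, 2751139}

	fmt.Println(canRemoveOrphanPeer(peers))        // false: orphan removal is skipped
	fmt.Println(canReplaceDownPeer(peers, stores)) // false: no spare store for replacement
}
```

With the reported peer set, both checks return false, matching the "skip-remove-orphan-peer" and "no-store-replace" metrics mentioned above.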

PD should be able to handle this case. For example, when it finds a region with 4 peers (2 healthy voters + 1 down voter + 1 healthy learner), it could promote the learner to a voter and remove the down peer.
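
A minimal sketch of that suggestion, reusing the illustrative Peer type from the sketch above (again an assumption for this issue, not the real PD checker API):

```go
// fixStuckRegion is a hypothetical helper: given the 4-peer shape described
// above (2 healthy voters + 1 down voter + 1 healthy learner), it returns the
// learner to promote and the down peer to remove; otherwise it returns nils.
func fixStuckRegion(peers []Peer) (promote, remove *Peer) {
	healthyVoters := 0
	for i := range peers {
		p := &peers[i]
		switch {
		case p.IsLearner && !p.IsDown:
			promote = p // healthy learner: promotion candidate
		case !p.IsLearner && p.IsDown:
			remove = p // down voter: removal candidate
		case !p.IsLearner:
			healthyVoters++
		}
	}
	if len(peers) == 4 && healthyVoters == 2 && promote != nil && remove != nil {
		return promote, remove
	}
	return nil, nil
}
```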

What version of PD are you using (pd-server -V)?

6.5.0

@v01dstar added the type/bug label on Jun 6, 2023
@nolouch self-assigned this on Jul 13, 2023
@nolouch added the affects-6.5 and affects-7.1 labels and removed the may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.1, may-affects-6.5, may-affects-7.1, affects-7.1, and affects-6.5 labels on Jul 18, 2023
@nolouch added the affects-6.5 label on Jul 21, 2023
ti-chi-bot bot added a commit that referenced this issue Jul 26, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Jul 26, 2023
close tikv#6559

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Jul 26, 2023
close tikv#6559

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this issue Jul 26, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Aug 2, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
@ti-chi-bot added the affects-6.1 label on Apr 8, 2024
@seiya-annie

/found customer

@ti-chi-bot added the report/customer label on Jun 4, 2024