PD keeps transferring leader to a down store #3353

Closed
BusyJay opened this issue Jan 12, 2021 · 2 comments · Fixed by #4223
Assignees
Labels
severity/moderate status/TODO The issue will be done in the future. type/bug The issue is confirmed as a bug.

Comments

@BusyJay
Member

BusyJay commented Jan 12, 2021

Bug Report

What did you do?

I used a nightly build of PD to test joint consensus.

I enabled shuffle region scheduling and set the max store down time to 30s. After killing two stores with the same label, a region got stuck and kept 4 replicas until about 10 minutes had passed. One example looked like the following:

"region_id": 129,
"peers": [
    {
      "id": 180171,
      "store_id": 1
    },
    {
      "id": 181496,
      "store_id": 4
    },
    {
      "id": 181653,
      "store_id": 13
    },
    {
      "id": 181850,
      "store_id": 5,
      "role": 1
    }
  ],
  "leader": {
    "id": 180171,
    "store_id": 1
  },
  "down_peers": [
    {
      "peer": {
        "id": 181496,
        "store_id": 4
      },
      "down_seconds": 360
    }
  ],

No pending peer was reported. Store 4 had been killed, and PD's log kept reporting:

[2021/01/12 12:29:23.613 +08:00] [INFO] [operator_controller.go:626] ["send schedule command"] [region-id=129] [step="transfer leader from store 1 to store 4"] [source="active push"]
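
For context on the numbers above: the peer on store 4 had already been down for 360 seconds, far beyond the configured max store down time of 30s, so a leader transfer targeting that store could never succeed. A trivial Go sketch of that comparison, using only the values from the report above (nothing here is PD code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the region dump and the test configuration above.
	downTime := 360 * time.Second        // down_seconds reported for the peer on store 4
	maxStoreDownTime := 30 * time.Second // max store down time configured in the test

	if downTime > maxStoreDownTime {
		// Store 4 is considered down, yet PD kept pushing
		// "transfer leader from store 1 to store 4".
		fmt.Println("store 4 is down; transferring leader to it cannot succeed")
	}
}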
@BusyJay BusyJay added the type/bug The issue is confirmed as a bug. label Jan 12, 2021
@HunDunDM
Member

The current PD uses a 10-minute timeout (not configurable) for the entire operator, including peer movement. There is no separate timeout for each step, which is the main cause of this bug.
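
To illustrate the point, here is a minimal sketch (not PD's actual implementation; the types and names are hypothetical): the deadline belongs to the whole operator, so a step that can never finish, such as transferring the leader to a down store, keeps being re-dispatched until the global 10-minute limit expires.

package main

import (
	"fmt"
	"time"
)

// Operator is a hypothetical stand-in for a scheduling operator: a list of
// steps guarded by a single deadline rather than one deadline per step.
type Operator struct {
	Steps    []string
	start    time.Time
	deadline time.Duration
}

// Expired only looks at the operator-level deadline; an individual step has
// no timeout of its own, so a stuck step is retried until this returns true.
func (op *Operator) Expired(now time.Time) bool {
	return now.Sub(op.start) > op.deadline
}

func main() {
	op := &Operator{
		Steps:    []string{"transfer leader from store 1 to store 4"},
		start:    time.Now(),
		deadline: 10 * time.Minute, // fixed, not configurable
	}
	fmt.Println("operator expired:", op.Expired(time.Now())) // false for the first 10 minutes
}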

@HunDunDM HunDunDM self-assigned this Jan 12, 2021
@HunDunDM
Member

Maybe we should add more checks on the store state while an operator is being scheduled.
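
One possible shape of that check, as a sketch only with hypothetical types (the real change landed later as "operator: check store status for running operators"): before re-dispatching a step of a running operator, verify that the target store is still up, and cancel the operator if the store has been down longer than the max store down time.

package main

import (
	"fmt"
	"time"
)

// Store and TransferLeader are hypothetical stand-ins for PD's internal types.
type Store struct {
	ID            uint64
	LastHeartbeat time.Time
}

type TransferLeader struct{ ToStoreID uint64 }

// stepDispatchable reports whether the step's target store has sent a
// heartbeat recently enough to still be considered alive.
func stepDispatchable(step TransferLeader, stores map[uint64]*Store, maxDown time.Duration, now time.Time) bool {
	s, ok := stores[step.ToStoreID]
	if !ok {
		return false
	}
	return now.Sub(s.LastHeartbeat) <= maxDown
}

func main() {
	now := time.Now()
	stores := map[uint64]*Store{
		1: {ID: 1, LastHeartbeat: now},
		4: {ID: 4, LastHeartbeat: now.Add(-6 * time.Minute)}, // the killed store
	}
	step := TransferLeader{ToStoreID: 4}
	if !stepDispatchable(step, stores, 30*time.Second, now) {
		fmt.Println("cancel operator: target store 4 has been down too long")
	}
}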

@nolouch nolouch added status/TODO The issue will be done in the future. and removed status/TODO The issue will be done in the future. labels Oct 14, 2021
@nolouch nolouch assigned disksing and unassigned HunDunDM Oct 14, 2021
@nolouch nolouch added the status/TODO The issue will be done in the future. label Oct 14, 2021
disksing added a commit to oh-my-tidb/pd that referenced this issue Oct 19, 2021
fix tikv#3353

Signed-off-by: disksing <i@disksing.com>
disksing added a commit to oh-my-tidb/pd that referenced this issue Oct 19, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit that referenced this issue Nov 23, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
IcePigZDB pushed a commit to IcePigZDB/pd that referenced this issue Nov 29, 2021
* operator: check store status for running operators

close tikv#3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>
disksing pushed a commit that referenced this issue Nov 30, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>
disksing pushed a commit that referenced this issue Dec 1, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot added a commit that referenced this issue Dec 1, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>

* fix build

Signed-off-by: disksing <i@disksing.com>

* fix ci (try)

Signed-off-by: disksing <i@disksing.com>

Co-authored-by: disksing <i@disksing.com>
5 participants