PD keeps transferring leader to a down store #3353

Closed
BusyJay opened this issue Jan 12, 2021 · 2 comments · Fixed by #4223
Assignees
Labels
severity/moderate status/TODO The issue will be done in the future. type/bug The issue is confirmed as a bug.

Comments

@BusyJay
Member

BusyJay commented Jan 12, 2021

Bug Report

What did you do?

I used a nightly build of PD to test joint consensus.

I enabled shuffle region scheduling and set the max store down time to 30s. After killing two stores with the same label, a region got stuck and kept 4 replicas until about 10 minutes had passed. One example looked like the following:

"region_id": 129,
"peers": [
    {
      "id": 180171,
      "store_id": 1
    },
    {
      "id": 181496,
      "store_id": 4
    },
    {
      "id": 181653,
      "store_id": 13
    },
    {
      "id": 181850,
      "store_id": 5,
      "role": 1
    }
  ],
  "leader": {
    "id": 180171,
    "store_id": 1
  },
  "down_peers": [
    {
      "peer": {
        "id": 181496,
        "store_id": 4
      },
      "down_seconds": 360
    }
  ],

No pending peer was reported. Store 4 had been killed, and PD's log kept reporting:

[2021/01/12 12:29:23.613 +08:00] [INFO] [operator_controller.go:626] ["send schedule command"] [region-id=129] [step="transfer leader from store 1 to store 4"] [source="active push"]
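
For context on the numbers above: the peer on store 4 had already been down for 360 seconds, far beyond the configured max store down time of 30s, so a leader transfer targeting that store could never succeed. A trivial Go sketch of that comparison, using only the values from the report above (nothing here is PD code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the region dump and the test configuration above.
	downTime := 360 * time.Second        // down_seconds reported for the peer on store 4
	maxStoreDownTime := 30 * time.Second // max store down time configured in the test

	if downTime > maxStoreDownTime {
		// Store 4 is considered down, yet PD kept pushing
		// "transfer leader from store 1 to store 4".
		fmt.Println("store 4 is down; transferring leader to it cannot succeed")
	}
}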
@BusyJay BusyJay added the type/bug The issue is confirmed as a bug. label Jan 12, 2021
@HunDunDM
Member

The current PD uses a 10-minute timeout (not configurable) for the entire operator, including peer movement. There is no separate timeout for each step, which is the main cause of this bug.
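
To illustrate the point, here is a minimal sketch (not PD's actual implementation; the types and names are hypothetical): the deadline belongs to the whole operator, so a step that can never finish, such as transferring the leader to a down store, keeps being re-dispatched until the global 10-minute limit expires.

package main

import (
	"fmt"
	"time"
)

// Operator is a hypothetical stand-in for a scheduling operator: a list of
// steps guarded by a single deadline rather than one deadline per step.
type Operator struct {
	Steps    []string
	start    time.Time
	deadline time.Duration
}

// Expired only looks at the operator-level deadline; an individual step has
// no timeout of its own, so a stuck step is retried until this returns true.
func (op *Operator) Expired(now time.Time) bool {
	return now.Sub(op.start) > op.deadline
}

func main() {
	op := &Operator{
		Steps:    []string{"transfer leader from store 1 to store 4"},
		start:    time.Now(),
		deadline: 10 * time.Minute, // fixed, not configurable
	}
	fmt.Println("operator expired:", op.Expired(time.Now())) // false for the first 10 minutes
}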

@HunDunDM HunDunDM self-assigned this Jan 12, 2021
@HunDunDM
Member

Maybe we should add more checks on the store state while an operator is being scheduled.
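
One possible shape of that check, as a sketch only with hypothetical types (the real change landed later as "operator: check store status for running operators"): before re-dispatching a step of a running operator, verify that the target store is still up, and cancel the operator if the store has been down longer than the max store down time.

package main

import (
	"fmt"
	"time"
)

// Store and TransferLeader are hypothetical stand-ins for PD's internal types.
type Store struct {
	ID            uint64
	LastHeartbeat time.Time
}

type TransferLeader struct{ ToStoreID uint64 }

// stepDispatchable reports whether the step's target store has sent a
// heartbeat recently enough to still be considered alive.
func stepDispatchable(step TransferLeader, stores map[uint64]*Store, maxDown time.Duration, now time.Time) bool {
	s, ok := stores[step.ToStoreID]
	if !ok {
		return false
	}
	return now.Sub(s.LastHeartbeat) <= maxDown
}

func main() {
	now := time.Now()
	stores := map[uint64]*Store{
		1: {ID: 1, LastHeartbeat: now},
		4: {ID: 4, LastHeartbeat: now.Add(-6 * time.Minute)}, // the killed store
	}
	step := TransferLeader{ToStoreID: 4}
	if !stepDispatchable(step, stores, 30*time.Second, now) {
		fmt.Println("cancel operator: target store 4 has been down too long")
	}
}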

@nolouch nolouch added status/TODO The issue will be done in the future. and removed status/TODO The issue will be done in the future. labels Oct 14, 2021
@nolouch nolouch assigned disksing and unassigned HunDunDM Oct 14, 2021
@nolouch nolouch added the status/TODO The issue will be done in the future. label Oct 14, 2021
disksing added a commit to oh-my-tidb/pd that referenced this issue Oct 19, 2021
fix tikv#3353

Signed-off-by: disksing <i@disksing.com>
disksing added a commit to oh-my-tidb/pd that referenced this issue Oct 19, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit that referenced this issue Nov 23, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 23, 2021
close tikv#3353

Signed-off-by: disksing <i@disksing.com>
IcePigZDB pushed a commit to IcePigZDB/pd that referenced this issue Nov 29, 2021
* operator: check store status for running operators

close tikv#3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>
disksing pushed a commit that referenced this issue Nov 30, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>
disksing pushed a commit that referenced this issue Dec 1, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>
ti-chi-bot added a commit that referenced this issue Dec 1, 2021
* operator: check store status for running operators

close #3353

Signed-off-by: disksing <i@disksing.com>

* add test

Signed-off-by: disksing <i@disksing.com>

* add tests

Signed-off-by: disksing <i@disksing.com>

* address comment

Signed-off-by: disksing <i@disksing.com>

* fix build

Signed-off-by: disksing <i@disksing.com>

* fix ci (try)

Signed-off-by: disksing <i@disksing.com>

Co-authored-by: disksing <i@disksing.com>
5 participants