tidb-scheduler stuck filtering #468

gregwebs · 2019-05-08T22:29:20Z

operator version: the latest stable release

@jlerche started two clusters at once. Both were in a stuck state and had a failed PD member, and one had 3 failed TiDB members [1] and the other had its TiKV pods stuck in pending.

The scheduler logs showed a lot of errors. I don't remember seeing so many errors in kube-scheduler before. [2] The tidb-scheduler logs were a lot simpler [3]. I noticed error level logging. Looking at the source code, there is an error logged when a 500 is returned, so we have the erorr from the kube-scheduler 500.

I would have said that this was just a network issue. However, when we deleted the second cluster, the first cluster immediately finished coming up successfully.

So my first question is: can the failure to schedule one cluster interfere with another being scheduled? I think the answer is yes in that there could be a problem node used by another cluster?

It seems like I should be seeing more information in the logs and maybe some events?

We have destroyed the clusters now, but are willing to try to reproduce the issue again.

E0508 21:45:48.991666       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling

[1] failed members

  Tidb:
    Failure Members:
      Demo - Tidb - 0:
        Pod Name:  demo-tidb-0
      Demo - Tidb - 1:
        Pod Name:  demo-tidb-1
      Demo - Tidb - 2:
        Pod Name:  demo-tidb-2
    Members:
      Demo - Tidb - 0:
        Health:                false
        Last Transition Time:  2019-05-08T20:50:33Z
        Name:                  demo-tidb-0
      Demo - Tidb - 1:
        Health:                false
        Last Transition Time:  2019-05-08T20:50:33Z
        Name:                  demo-tidb-1
      Demo - Tidb - 2:
        Health:                false
        Last Transition Time:  2019-05-08T20:55:43Z
        Name:                  demo-tidb-2
      Demo - Tidb - 3:
        Health:                false
        Last Transition Time:  2019-05-08T20:55:44Z
        Name:                  demo-tidb-3
      Demo - Tidb - 4:
        Health:                false
        Last Transition Time:  2019-05-08T21:00:43Z
        Name:                  demo-tidb-4
    Phase:                     Normal

[2] kube-scheduler errors

E0508 20:45:07.553494       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:45:08.562582       1 trace.go:76] Trace[1069131296]: "Scheduling tidb30067/demo-pd-1" (started: 2019-05-08 20:45:07.558249532 +0000 UTC m=+1214730.597821390) (total time: 1.004284101s):
Trace[1069131296]: [1.004279887s] [1.004252357s] Prioritizing
E0508 20:45:08.563018       1 factory.go:1483] Error scheduling tidb30067/demo-pd-1: v1.PersistentVolumeClaim "tidb30067/pd-demo-pd-1" is out of sync; retrying
E0508 20:45:08.582255       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:45:09.355554       1 trace.go:76] Trace[128453207]: "Scheduling tidb30067/demo-pd-1" (started: 2019-05-08 20:45:08.588251216 +0000 UTC m=+1214731.627823086) (total time: 767.099494ms):
Trace[128453207]: [767.094251ms] [767.065493ms] Prioritizing
I0508 20:46:24.574313       1 trace.go:76] Trace[1671821716]: "Scheduling tidb30067/demo-pd-2" (started: 2019-05-08 20:46:24.448517113 +0000 UTC m=+1214807.488088969) (total time: 125.753801ms):
Trace[1671821716]: [125.748505ms] [125.706445ms] Prioritizing
E0508 20:46:24.576719       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: v1.PersistentVolumeClaim "tidb30067/pd-demo-pd-2" is out of sync; retrying
I0508 20:49:34.958528       1 trace.go:76] Trace[189696979]: "Scheduling tidb30066/demo-tikv-1" (started: 2019-05-08 20:49:34.84773392 +0000 UTC m=+1214997.887305778) (total time: 110.70208ms):
Trace[189696979]: [110.70208ms] [110.653898ms] END
E0508 20:50:05.473933       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-2: v1.PersistentVolume "local-pv-aea6b090" is out of sync; retrying
E0508 20:50:05.490738       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-0: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 20:50:05.507671       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-1: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:50:05.574707       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-2" bound to volume "local-pv-aea6b090"
I0508 20:50:10.577742       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-1" bound to volume "local-pv-6c3980e2"
I0508 20:50:23.176857       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-0" bound to volume "local-pv-f45c81db"
E0508 21:45:48.775170       1 scheduler.go:227] Error getting the updated preemptor pod object: pods "demo-tikv-0" not found
E0508 21:45:48.955823       1 factory.go:1483] Error scheduling tidb30066/demo-pd-1: v1.PersistentVolumeClaim "tidb30066/pd-demo-pd-1" is out of sync; retrying
E0508 21:45:48.992157       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.060587       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.082830       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.157664       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.176753       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.193168       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.297097       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 21:45:49.497747       1 trace.go:76] Trace[465471626]: "Scheduling tidb30066/demo-pd-3" (started: 2019-05-08 21:45:49.305212429 +0000 UTC m=+1218372.344784288) (total time: 192.488538ms):

[3] tidb-scheduler logs

I0508 20:50:05.556080       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-2
I0508 20:50:10.526156       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-1
I0508 20:50:23.105872       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-0
I0508 21:45:48.891459       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-1
I0508 21:45:48.986898       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:48.991666       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.033320       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-1
I0508 21:45:49.055381       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.060054       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.077434       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.082516       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.152212       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.157122       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.170433       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.176328       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.188298       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.192697       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.252236       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.296580       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.317265       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.497132       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.510087       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.696746       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.712138       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.897125       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.950284       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.096858       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.112637       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.349413       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.373193       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.497188       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.511223       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.849719       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.867678       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.896961       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.911135       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.096488       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.162597       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.296429       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.310276       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.554070       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.661973       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.697130       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.711211       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.949647       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.967350       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.097087       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.116087       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.297258       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.365722       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.496368       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.513839       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
I0508 21:45:53.469832       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3

The text was updated successfully, but these errors were encountered:

weekface · 2019-05-08T23:39:54Z

kubectl describe po -n tidb30066 demo-pd-1
kubectl describe po -n tidb30066 demo-pd-3
kubectl get po -n tidb30066 demo-pd-1 -oyaml
kubectl get po -n tidb30066 demo-pd-3 -oyaml

The TiDB Operator can emit events now: #427

weekface · 2019-05-09T00:07:16Z

The latest operator will log this message: https://github.com/pingcap/tidb-operator/blob/master/pkg/scheduler/scheduler.go#L78

So you may be using an old version operator.

weekface · 2019-05-09T00:08:02Z

We will release the v1.0.0.beta.2 before this weekend.

gregwebs · 2019-05-10T21:42:14Z

The clusters are destroyed now. Originally I meant to just destroy one, but the destruction of one fixed the other.

I will try reproducing thiw with the new operator, but I believe the issue is still present in the new code that when provisioning two clusters they both can get stuck due to a problematic node even though at least one of them should be succeeding.

weekface · 2019-05-21T02:53:23Z

Does this problem still exist in the v1.0.0.beta.2 version? @gregwebs

gregwebs · 2019-06-21T16:09:50Z

@weekface I can confirm that in beta3 one cluster scheduling issue will block a second cluster from being scheduled.

weekface · 2019-07-15T09:22:00Z

same with #602, closing this.

Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>

gregwebs mentioned this issue Jun 21, 2019

scheduler cannot schedule TiKV #602

Closed

weekface closed this as completed Jul 15, 2019

gregwebs mentioned this issue Jul 23, 2019

one tidb cluster's scheduling issues block other clusters #683

Closed

yahonda pushed a commit that referenced this issue Dec 27, 2021

en: remove steps in initialization headings (#468)

c587917

Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tidb-scheduler stuck filtering #468

tidb-scheduler stuck filtering #468

gregwebs commented May 8, 2019 •

edited

Loading

weekface commented May 8, 2019

weekface commented May 9, 2019

weekface commented May 9, 2019

gregwebs commented May 10, 2019

weekface commented May 21, 2019

gregwebs commented Jun 21, 2019

weekface commented Jul 15, 2019

tidb-scheduler stuck filtering #468

tidb-scheduler stuck filtering #468

Comments

gregwebs commented May 8, 2019 • edited Loading

weekface commented May 8, 2019

weekface commented May 9, 2019

weekface commented May 9, 2019

gregwebs commented May 10, 2019

weekface commented May 21, 2019

gregwebs commented Jun 21, 2019

weekface commented Jul 15, 2019

gregwebs commented May 8, 2019 •

edited

Loading