Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidb-scheduler stuck filtering #468

Closed
gregwebs opened this issue May 8, 2019 · 7 comments
Closed

tidb-scheduler stuck filtering #468

gregwebs opened this issue May 8, 2019 · 7 comments

Comments

@gregwebs
Copy link
Contributor

gregwebs commented May 8, 2019

operator version: the latest stable release

@jlerche started two clusters at once. Both were in a stuck state and had a failed PD member, and one had 3 failed TiDB members [1] and the other had its TiKV pods stuck in pending.

The scheduler logs showed a lot of errors. I don't remember seeing so many errors in kube-scheduler before. [2] The tidb-scheduler logs were a lot simpler [3]. I noticed error level logging. Looking at the source code, there is an error logged when a 500 is returned, so we have the erorr from the kube-scheduler 500.

I would have said that this was just a network issue. However, when we deleted the second cluster, the first cluster immediately finished coming up successfully.

So my first question is: can the failure to schedule one cluster interfere with another being scheduled? I think the answer is yes in that there could be a problem node used by another cluster?

It seems like I should be seeing more information in the logs and maybe some events?

We have destroyed the clusters now, but are willing to try to reproduce the issue again.

E0508 21:45:48.991666       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling

[1] failed members

  Tidb:
    Failure Members:
      Demo - Tidb - 0:
        Pod Name:  demo-tidb-0
      Demo - Tidb - 1:
        Pod Name:  demo-tidb-1
      Demo - Tidb - 2:
        Pod Name:  demo-tidb-2
    Members:
      Demo - Tidb - 0:
        Health:                false
        Last Transition Time:  2019-05-08T20:50:33Z
        Name:                  demo-tidb-0
      Demo - Tidb - 1:
        Health:                false
        Last Transition Time:  2019-05-08T20:50:33Z
        Name:                  demo-tidb-1
      Demo - Tidb - 2:
        Health:                false
        Last Transition Time:  2019-05-08T20:55:43Z
        Name:                  demo-tidb-2
      Demo - Tidb - 3:
        Health:                false
        Last Transition Time:  2019-05-08T20:55:44Z
        Name:                  demo-tidb-3
      Demo - Tidb - 4:
        Health:                false
        Last Transition Time:  2019-05-08T21:00:43Z
        Name:                  demo-tidb-4
    Phase:                     Normal

[2] kube-scheduler errors

E0508 20:45:07.553494       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:45:08.562582       1 trace.go:76] Trace[1069131296]: "Scheduling tidb30067/demo-pd-1" (started: 2019-05-08 20:45:07.558249532 +0000 UTC m=+1214730.597821390) (total time: 1.004284101s):
Trace[1069131296]: [1.004279887s] [1.004252357s] Prioritizing
E0508 20:45:08.563018       1 factory.go:1483] Error scheduling tidb30067/demo-pd-1: v1.PersistentVolumeClaim "tidb30067/pd-demo-pd-1" is out of sync; retrying
E0508 20:45:08.582255       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:45:09.355554       1 trace.go:76] Trace[128453207]: "Scheduling tidb30067/demo-pd-1" (started: 2019-05-08 20:45:08.588251216 +0000 UTC m=+1214731.627823086) (total time: 767.099494ms):
Trace[128453207]: [767.094251ms] [767.065493ms] Prioritizing
I0508 20:46:24.574313       1 trace.go:76] Trace[1671821716]: "Scheduling tidb30067/demo-pd-2" (started: 2019-05-08 20:46:24.448517113 +0000 UTC m=+1214807.488088969) (total time: 125.753801ms):
Trace[1671821716]: [125.748505ms] [125.706445ms] Prioritizing
E0508 20:46:24.576719       1 factory.go:1483] Error scheduling tidb30067/demo-pd-2: v1.PersistentVolumeClaim "tidb30067/pd-demo-pd-2" is out of sync; retrying
I0508 20:49:34.958528       1 trace.go:76] Trace[189696979]: "Scheduling tidb30066/demo-tikv-1" (started: 2019-05-08 20:49:34.84773392 +0000 UTC m=+1214997.887305778) (total time: 110.70208ms):
Trace[189696979]: [110.70208ms] [110.653898ms] END
E0508 20:50:05.473933       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-2: v1.PersistentVolume "local-pv-aea6b090" is out of sync; retrying
E0508 20:50:05.490738       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-0: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 20:50:05.507671       1 factory.go:1483] Error scheduling tidb30066/demo-tikv-1: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 20:50:05.574707       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-2" bound to volume "local-pv-aea6b090"
I0508 20:50:10.577742       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-1" bound to volume "local-pv-6c3980e2"
I0508 20:50:23.176857       1 pv_controller.go:874] claim "tidb30066/tikv-demo-tikv-0" bound to volume "local-pv-f45c81db"
E0508 21:45:48.775170       1 scheduler.go:227] Error getting the updated preemptor pod object: pods "demo-tikv-0" not found
E0508 21:45:48.955823       1 factory.go:1483] Error scheduling tidb30066/demo-pd-1: v1.PersistentVolumeClaim "tidb30066/pd-demo-pd-1" is out of sync; retrying
E0508 21:45:48.992157       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.060587       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.082830       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.157664       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.176753       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.193168       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0508 21:45:49.297097       1 factory.go:1483] Error scheduling tidb30066/demo-pd-3: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
I0508 21:45:49.497747       1 trace.go:76] Trace[465471626]: "Scheduling tidb30066/demo-pd-3" (started: 2019-05-08 21:45:49.305212429 +0000 UTC m=+1218372.344784288) (total time: 192.488538ms):

[3] tidb-scheduler logs

I0508 20:50:05.556080       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-2
I0508 20:50:10.526156       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-1
I0508 20:50:23.105872       1 scheduler.go:75] scheduling pod: tidb30066/demo-tikv-0
I0508 21:45:48.891459       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-1
I0508 21:45:48.986898       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:48.991666       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.033320       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-1
I0508 21:45:49.055381       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.060054       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.077434       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.082516       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.152212       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.157122       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.170433       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.176328       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.188298       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.192697       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.252236       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.296580       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.317265       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.497132       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.510087       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.696746       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.712138       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:49.897125       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:49.950284       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.096858       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.112637       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.349413       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.373193       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.497188       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.511223       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.849719       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.867678       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:50.896961       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:50.911135       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.096488       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.162597       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.296429       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.310276       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.554070       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.661973       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.697130       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.711211       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:51.949647       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:51.967350       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.097087       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.116087       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.297258       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.365722       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
E0508 21:45:52.496368       1 mux.go:107] unable to filter nodes: waiting for Pod tidb30066/-demo-pd-1 scheduling
I0508 21:45:52.513839       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
I0508 21:45:53.469832       1 scheduler.go:75] scheduling pod: tidb30066/demo-pd-3
@weekface
Copy link
Contributor

weekface commented May 8, 2019

kubectl describe po -n tidb30066 demo-pd-1
kubectl describe po -n tidb30066 demo-pd-3
kubectl get po -n tidb30066 demo-pd-1 -oyaml
kubectl get po -n tidb30066 demo-pd-3 -oyaml

The TiDB Operator can emit events now: #427

@weekface
Copy link
Contributor

weekface commented May 9, 2019

The latest operator will log this message: https://github.com/pingcap/tidb-operator/blob/master/pkg/scheduler/scheduler.go#L78

So you may be using an old version operator.

@weekface
Copy link
Contributor

weekface commented May 9, 2019

We will release the v1.0.0.beta.2 before this weekend.

@gregwebs
Copy link
Contributor Author

The clusters are destroyed now. Originally I meant to just destroy one, but the destruction of one fixed the other.

I will try reproducing thiw with the new operator, but I believe the issue is still present in the new code that when provisioning two clusters they both can get stuck due to a problematic node even though at least one of them should be succeeding.

@weekface
Copy link
Contributor

Does this problem still exist in the v1.0.0.beta.2 version? @gregwebs

@gregwebs
Copy link
Contributor Author

@weekface I can confirm that in beta3 one cluster scheduling issue will block a second cluster from being scheduled.

@weekface
Copy link
Contributor

same with #602, closing this.

yahonda pushed a commit that referenced this issue Dec 27, 2021
Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants