scheduler cannot schedule TiKV #602

Closed
gregwebs opened this issue Jun 21, 2019 · 11 comments
Labels
test/stability stability tests type/bug Something isn't working

Comments

@gregwebs
Contributor

Bug Report

The scheduler continually logs:

E0621 13:50:45.693332       1 mux.go:107] unable to filter nodes: waiting for Pod tidb21/demo-tikv-2 scheduling
I0621 13:50:45.706780       1 scheduler.go:105] scheduling pod: tidb21/demo-tikv-1
I0621 13:50:45.707363       1 scheduler.go:108] entering predicate: HighAvailability, nodes: [gke-alpha-tidb-n1-standard-2-375-63186231-flpw gke-alpha-tidb-n1-standard-2-375-c798b6cf-1zn1 gke-alpha-tidb-n1-standard-2-375-f079b5d0-b9q9]
E0621 13:50:46.093051       1 mux.go:107] unable to filter nodes: waiting for Pod tidb21/demo-tikv-2 scheduling
I0621 13:50:46.104954       1 scheduler.go:105] scheduling pod: tidb21/demo-tikv-0
I0621 13:50:46.104985       1 scheduler.go:108] entering predicate: HighAvailability, nodes: [gke-alpha-tidb-n1-standard-2-375-63186231-flpw gke-alpha-tidb-n1-standard-2-375-c798b6cf-1zn1 gke-alpha-tidb-n1-standard-2-375-f079b5d0-b9q9]

The kube-scheduler log is similar.

There are no events for tikv-2 when it is described.

I got to this state after creating a tidb cluster, then creating a 2nd tidb cluster and deleting the first cluster. I also deleted Released PVs.
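
To make the pattern in the tidb-scheduler log above easier to see, here is a minimal sketch that counts which pod the filter is waiting on and which pods it tried to schedule. It only matches the two message shapes quoted in this report; the function name `summarize_scheduler_log` is made up for illustration and is not part of tidb-operator.

```python
import re
from collections import Counter

# Message shapes copied from the log lines quoted in this issue.
WAITING_RE = re.compile(r"unable to filter nodes: waiting for Pod (\S+) scheduling")
SCHEDULING_RE = re.compile(r"scheduling pod: (\S+)")


def summarize_scheduler_log(text):
    """Return (blocked_on, attempted) counters keyed by namespace/pod."""
    blocked_on = Counter()  # pods the filter says it is waiting for
    attempted = Counter()   # pods the scheduler tried to place
    for line in text.splitlines():
        match = WAITING_RE.search(line)
        if match:
            blocked_on[match.group(1)] += 1
            continue
        match = SCHEDULING_RE.search(line)
        if match:
            attempted[match.group(1)] += 1
    return blocked_on, attempted
```

Run against the excerpt above, every "waiting for Pod" entry should point at tidb21/demo-tikv-2, while the scheduling attempts are for demo-tikv-0 and demo-tikv-1.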

@weekface
Contributor

weekface commented Jun 21, 2019

kubectl describe po -n tidb21 demo-tikv-2
kubectl get pvc -n tidb21
kubectl get pv

@gregwebs
Contributor Author

kubectl describe po -n tidb21 demo-tikv-2

Name:               demo-tikv-2
Namespace:          tidb21
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app.kubernetes.io/component=tikv
                    app.kubernetes.io/instance=tidb21
                    app.kubernetes.io/managed-by=tidb-operator
                    app.kubernetes.io/name=tidb-cluster
                    controller-revision-hash=demo-tikv-874b8bf89
                    statefulset.kubernetes.io/pod-name=demo-tikv-2
Annotations:        pingcap.com/last-applied-configuration:
                      {"volumes":[{"name":"annotations","downwardAPI":{"items":[{"path":"annotations","fieldRef":{"fieldPath":"metadata.annotations"}}]}},{"name...
                    prometheus.io/path: /metrics
                    prometheus.io/port: 20180
                    prometheus.io/scrape: true
Status:             Pending
IP:                 
Controlled By:      StatefulSet/demo-tikv
Init Containers:
  wait-for-pd:
    Image:      gcr.io/pingcap-tidb-alpha/tidb-operator:v1.0.0-beta.3.start-fast-16
    Port:       <none>
    Host Port:  <none>
    Command:
      wait-for-pd
    Environment:
      NAMESPACE:     tidb21 (v1:metadata.namespace)
      CLUSTER_NAME:  demo
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6265 (ro)
Containers:
  tikv:
    Image:      pingcap/tikv:v3.0.0-rc.1
    Port:       20160/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      /usr/local/bin/tikv_start_script.sh
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      NAMESPACE:              tidb21 (v1:metadata.namespace)
      CLUSTER_NAME:           demo
      HEADLESS_SERVICE_NAME:  demo-tikv-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /etc/podinfo from annotations (ro)
      /etc/tikv from config (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/tikv from tikv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6265 (ro)
Volumes:
  tikv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tikv-demo-tikv-2
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      demo-tikv
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      demo-tikv
    Optional:  false
  default-token-x6265:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-x6265
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 tidb.pingcap.com/tidb-scaler=n1-standard-2-375:NoSchedule
Events:          <none>

@gregwebs
Contributor Author

kubectl get pvc -n tidb21

NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-demo-pd-0       Bound     pvc-21708809-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
pd-demo-pd-1       Bound     pvc-2175417b-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
pd-demo-pd-2       Bound     pvc-217981bc-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
tikv-demo-tikv-0   Pending                                                                        local-storage   110m
tikv-demo-tikv-1   Pending                                                                        local-storage   110m
tikv-demo-tikv-2   Bound     local-pv-3c9d1093                          368Gi      RWO            local-storage   110m
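
For a quick read of the listing above, a small sketch that groups claims by their STATUS column. It assumes the text is plain `kubectl get pvc` output where NAME and STATUS are always filled in; `claims_by_status` is a hypothetical helper name.

```python
from collections import defaultdict


def claims_by_status(table_text):
    """Group PVC names by the STATUS column of `kubectl get pvc` output."""
    grouped = defaultdict(list)
    for line in table_text.splitlines():
        tokens = line.split()
        # NAME and STATUS are never empty, so they are the first two tokens.
        # Skip blank lines and the header row.
        if len(tokens) < 2 or tokens[0] == "NAME":
            continue
        grouped[tokens[1]].append(tokens[0])
    return dict(grouped)
```

On the output above this would put tikv-demo-tikv-0 and tikv-demo-tikv-1 under Pending, and tikv-demo-tikv-2 plus the three pd claims under Bound.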

@gregwebs
Contributor Author

kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                               STORAGECLASS    REASON   AGE
local-pv-1c02244d                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-3c9d1093                          368Gi      RWO            Retain           Bound       tidb21/tikv-demo-tikv-2             local-storage            113m
local-pv-52bb53c                           368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-5e3c2064                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-69e7f7f9                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-6a9c1bf9                          368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-82f4cde9                          368Gi      RWO            Delete           Available                                       local-storage            15h
local-pv-8b5c80f4                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-92134e5d                          368Gi      RWO            Delete           Available                                       local-storage            21h
local-pv-92524f84                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-99d360f                           368Gi      RWO            Delete           Available                                       local-storage            20h
local-pv-a2973354                          368Gi      RWO            Delete           Available                                       local-storage            20h
local-pv-b06a079e                          368Gi      RWO            Delete           Available                                       local-storage            21h
local-pv-b1e66ac4                          368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-ba5e9234                          368Gi      RWO            Delete           Available                                       local-storage            22h
local-pv-bb23005c                          368Gi      RWO            Delete           Available                                       local-storage            22h
local-pv-da125dd4                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-e8210ae5                          368Gi      RWO            Delete           Available                                       local-storage            18h
local-pv-f4f18899                          368Gi      RWO            Delete           Available                                       local-storage            22h
pvc-21708809-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-0                 pd-ssd-wait              112m
pvc-2175417b-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-1                 pd-ssd-wait              111m
pvc-217981bc-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-2                 pd-ssd-wait              112m
pvc-a518b9e1-920e-11e9-afc9-4201ac1f4006   2Gi        RWO            Delete           Bound       operations/tidb-data-mysql-0        standard                 2d18h
pvc-b2d05151-9200-11e9-afc9-4201ac1f4006   2Gi        RWO            Delete           Bound       monitor/database-netdata-master-0   standard                 2d19h
pvc-b2d39b21-9200-11e9-afc9-4201ac1f4006   1Gi        RWO            Delete           Bound       monitor/alarms-netdata-master-0     standard                 2d19h
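
The PV listing is harder to split on whitespace because CLAIM and REASON can be empty. A rough sketch that slices each row by the header's column offsets instead, then counts volumes per storage class and status. It assumes kubectl's usual fixed-width layout with at least two spaces between columns; both function names are hypothetical.

```python
import re
from collections import Counter


def parse_fixed_width_table(text):
    """Parse a kubectl-style table into dicts, using header column offsets."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    # A column name may contain single spaces ("ACCESS MODES"); two or more
    # spaces separate columns.
    columns = [(m.group(0), m.start())
               for m in re.finditer(r"\S+(?: \S+)*", header)]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(columns):
            end = columns[i + 1][1] if i + 1 < len(columns) else None
            row[name] = line[start:end].strip()  # empty cells become ""
        rows.append(row)
    return rows


def volumes_by_class_and_status(text):
    """Count PVs per (STORAGECLASS, STATUS) pair."""
    rows = parse_fixed_width_table(text)
    return Counter((r["STORAGECLASS"], r["STATUS"]) for r in rows)
```

Applied to the listing above, this should show that most local-storage volumes are Available and only one is Bound, so a shortage of local PVs does not look like the reason the two tikv claims are Pending.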

@weekface
Contributor

weekface commented Jun 21, 2019

The tikv-2 PVC is bound, but the pod can't be scheduled and there are no events, so could this be the kube-scheduler problem we have hit frequently in our k8s env recently? @cofyc

  • What are the logs of the kube-scheduler container in the tidb-scheduler pod?
  • How many pods (across all namespaces) are there in this k8s cluster?

@gregwebs
Contributor Author

As per #468 this blocks a new cluster from being scheduled.

The tidb-scheduler logs are listed above. kube-scheduler looks the same.

E0621 16:04:47.293641       1 factory.go:1519] Error scheduling tidb21/demo-tikv-0: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0621 16:04:47.296992       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
E0621 16:04:47.297663       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297676       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297912       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297920       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.298155       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.298163       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
I0621 16:04:47.693084       1 trace.go:76] Trace[1601680201]: "Scheduling tidb21/demo-tikv-1" (started: 2019-06-21 16:04:47.297100315 +0000 UTC m=+161567.461774652) (total time: 395.944579ms):
Trace[1601680201]: [395.944579ms] [395.883901ms] END
E0621 16:04:47.694697       1 factory.go:1519] Error scheduling tidb21/demo-tikv-1: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0621 16:04:47.701308       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
E0621 16:04:47.702889       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702909       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702929       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702950       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.703229       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.703242       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
I0621 16:04:48.092963       1 trace.go:76] Trace[147365297]: "Scheduling tidb21/demo-tikv-0" (started: 2019-06-21 16:04:47.7020882 +0000 UTC m=+161567.866762536) (total time: 390.828437ms):
Trace[147365297]: [390.828437ms] [390.756271ms] END

@cofyc cofyc self-assigned this Jun 24, 2019
@cofyc
Contributor

cofyc commented Jun 24, 2019

Is the unscheduled pod retried by the scheduler repeatedly? If the scheduler keeps retrying the pod but always fails, it is unrelated to the issue we found in our IDC k8s env. There, the tidb-scheduler didn't try to schedule the new TiKV pods at all.

@gregwebs
Contributor Author

Yes, it keeps trying to schedule.

@weekface
Contributor

weekface commented Jul 15, 2019

@cofyc suggests:

or upgrade to v1.14+

@gregwebs can you give it a try?

@gregwebs
Contributor Author

I filled out a form to be an alpha user of 1.14 on GKE. I am still waiting... tidb-operator is using kube-scheduler v1.13.6 which matches the GKE version when it was installed.
I will update my version of tidb-operator.
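
For reference, a tiny sketch for checking a kube-scheduler version string against the suggested v1.14+ minimum. The helper names are hypothetical, and it only compares major.minor.patch, ignoring any suffix after those numbers.

```python
import re


def parse_version(version):
    """Turn 'v1.13.6' or '1.14+' into a (major, minor, patch) tuple."""
    match = re.match(r"^v?(\d+)\.(\d+)(?:\.(\d+))?", version)
    if not match:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(current, minimum):
    """True if `current` is at least `minimum`."""
    return parse_version(current) >= parse_version(minimum)
```

With the versions mentioned in this thread, `meets_minimum("v1.13.6", "v1.14")` is false, so the kube-scheduler in use is below the suggested version.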

@gregwebs gregwebs added type/bug Something isn't working test/stability stability tests labels Jul 23, 2019
@gregwebs
Contributor Author

I cannot reproduce this anymore.
