TiKV state is upgrade when failover occurs #567

Closed
weekface opened this issue Jun 10, 2019 · 2 comments · Fixed by #598
Assignees
Labels
test/stability stability tests

Comments

@weekface
Contributor

The stability test failed with the following logs:

I0606 10:01:00.353136       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:01:00.369080       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:02:00.338380       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:02:00.354734       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:02:00.371318       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:03:00.343503       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:03:00.357949       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:03:00.372710       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:04:00.345816       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:04:00.362034       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:04:00.376680       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:05:00.343228       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:05:00.356885       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:05:00.370895       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:06:00.339083       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:06:00.354353       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:06:00.370211       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:06:00.386923       1 failover.go:226] cluster: [stability-cluster1/stability-cluster1]'s failover feature has complete
I0606 10:06:00.399697       1 failover.go:353] tidbCluster:[stability-cluster2/stability-cluster2]'s store pod:[stability-cluster2-tikv-2] have not become failuremember
I0606 10:06:00.413897       1 failover.go:226] cluster: [stability-cluster1/cluster-restore]'s failover feature has complete
I0606 10:06:16.371894       1 blockwriter.go:258] [block_writer] stoping...
I0606 10:06:16.372808       1 blockwriter.go:163] run stopped
I0606 10:06:16.373488       1 blockwriter.go:163] run stopped
I0606 10:06:16.373894       1 blockwriter.go:99] [block_writer] [stability-cluster2] [action: generate Query] stopped
I0606 10:06:16.374654       1 blockwriter.go:163] run stopped
I0606 10:06:16.375612       1 blockwriter.go:163] run stopped
I0606 10:06:16.375640       1 blockwriter.go:225] [block_writer] [stability-cluster2] stopped
E0606 10:06:25.581913       1 runtime.go:69] Observed a panic: &errors.errorString{s:"failed to check failover"} (failed to check failover)
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/runtime/runtime.go:76
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/runtime/runtime.go:65
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go/1.12.1/libexec/src/runtime/panic.go:522
/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/slack/slack.go:154
/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/failover.go:268
/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/cmd/stability/main.go:320
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:133
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:134
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:88
/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:79
/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/cmd/stability/main.go:65
/usr/local/Cellar/go/1.12.1/libexec/src/runtime/proc.go:200
/usr/local/Cellar/go/1.12.1/libexec/src/runtime/asm_amd64.s:1337
panic: failed to check failover [recovered]
	panic: failed to check failover

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1694c20, 0xc0005f7340)
	/usr/local/Cellar/go/1.12.1/libexec/src/runtime/panic.go:522 +0x1b5
github.com/pingcap/tidb-operator/tests/slack.NotifyAndPanic(0x1b39e40, 0xc0005f7340)
	/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/slack/slack.go:154 +0x194
github.com/pingcap/tidb-operator/tests.(*operatorActions).CheckFailoverOrDie(0xc00096c9c0, 0xc000560720, 0x3, 0x3, 0xc0009a7c70, 0xc)
	/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/failover.go:268 +0x16c
main.run()
	/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/cmd/stability/main.go:320 +0x3145
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0x1973950)
	/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:133 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x1973950, 0x45d964b800, 0x0, 0x1b3c901, 0xc000440060)
	/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:134 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:88
k8s.io/apimachinery/pkg/util/wait.Forever(...)
	/Users/changjunchang/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20181128191346-49ce2735e507/pkg/util/wait/wait.go:79
main.main()
	/Users/changjunchang/go/src/github.com/pingcap/tidb-operator/tests/cmd/stability/main.go:65 +0x1a8

The TidbCluster status is:

apiVersion: v1
items:
- apiVersion: pingcap.com/v1alpha1
  kind: TidbCluster
  metadata:
    annotations:
      pingcap.com/pd.stability-cluster2-pd.sha: 4a85b80b
      pingcap.com/tidb.stability-cluster2-tidb.sha: 08536b26
      pingcap.com/tikv.stability-cluster2-tikv.sha: 94bcfc52
      tidb.pingcap.com/tidb-partition: "0"
    creationTimestamp: 2019-06-06T08:08:29Z
    generation: 1
    labels:
      app.kubernetes.io/component: tidb-cluster
      app.kubernetes.io/instance: stability-cluster2
      app.kubernetes.io/managed-by: Tiller
      app.kubernetes.io/name: tidb-cluster
      helm.sh/chart: tidb-cluster-dev
    name: stability-cluster2
    namespace: stability-cluster2
    resourceVersion: "16936069"
    selfLink: /apis/pingcap.com/v1alpha1/namespaces/stability-cluster2/tidbclusters/stability-cluster2
    uid: 44c8a26d-8832-11e9-bde2-52540064ef3f
  spec:
    pd:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: pd
                  app.kubernetes.io/instance: stability-cluster2
              namespaces:
              - stability-cluster2
              topologyKey: rack
            weight: 50
      image: pingcap/pd:v3.0.0-rc.1
      imagePullPolicy: IfNotPresent
      limits:
        cpu: 1000m
        memory: 2Gi
      replicas: 3
      requests:
        cpu: 200m
        memory: 1Gi
        storage: 1Gi
      storageClassName: local-storage
    pvReclaimPolicy: Retain
    schedulerName: tidb-scheduler
    services:
    - name: pd
      type: ClusterIP
    tidb:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: tidb
                  app.kubernetes.io/instance: stability-cluster2
              namespaces:
              - stability-cluster2
              topologyKey: rack
            weight: 50
      image: pingcap/tidb:v3.0.0-rc.1
      imagePullPolicy: IfNotPresent
      limits:
        cpu: 8000m
        memory: 8Gi
      maxFailoverCount: 3
      replicas: 2
      requests:
        cpu: 500m
        memory: 1Gi
      slowLogTailer:
        image: busybox:1.26.2
        imagePullPolicy: IfNotPresent
        limits:
          cpu: 100m
          memory: 50Mi
        requests:
          cpu: 20m
          memory: 5Mi
    tikv:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: tikv
                  app.kubernetes.io/instance: stability-cluster2
              namespaces:
              - stability-cluster2
              topologyKey: rack
            weight: 50
      image: pingcap/tikv:v3.0.0-rc.1
      imagePullPolicy: IfNotPresent
      limits:
        cpu: 8000m
        memory: 8Gi
      replicas: 3
      requests:
        cpu: 1000m
        memory: 2Gi
        storage: 10Gi
      storageClassName: local-storage
    tikvPromGateway:
      image: ""
    timezone: UTC
  status:
    clusterID: "6699326581014711933"
    pd:
      failureMembers:
        stability-cluster2-pd-2:
          memberDeleted: true
          memberID: "5090906351690763226"
          podName: stability-cluster2-pd-2
          pvcUID: 44d6abcb-8832-11e9-afcc-52540005d356
      leader:
        clientURL: http://stability-cluster2-pd-0.stability-cluster2-pd-peer.stability-cluster2.svc:2379
        health: true
        id: "11060246199916407275"
        lastTransitionTime: 2019-06-06T08:37:37Z
        name: stability-cluster2-pd-0
      members:
        stability-cluster2-pd-0:
          clientURL: http://stability-cluster2-pd-0.stability-cluster2-pd-peer.stability-cluster2.svc:2379
          health: true
          id: "11060246199916407275"
          lastTransitionTime: 2019-06-06T08:37:37Z
          name: stability-cluster2-pd-0
        stability-cluster2-pd-1:
          clientURL: http://stability-cluster2-pd-1.stability-cluster2-pd-peer.stability-cluster2.svc:2379
          health: true
          id: "4466539441664974893"
          lastTransitionTime: 2019-06-06T08:37:20Z
          name: stability-cluster2-pd-1
        stability-cluster2-pd-3:
          clientURL: http://stability-cluster2-pd-3.stability-cluster2-pd-peer.stability-cluster2.svc:2379
          health: true
          id: "2317614290725171035"
          lastTransitionTime: 2019-06-06T09:26:22Z
          name: stability-cluster2-pd-3
      phase: Normal
      statefulSet:
        collisionCount: 0
        currentReplicas: 3
        currentRevision: stability-cluster2-pd-667df68896
        observedGeneration: 21
        readyReplicas: 3
        replicas: 4
        updateRevision: stability-cluster2-pd-667df68896
        updatedReplicas: 3
      synced: true
    tidb:
      members:
        stability-cluster2-tidb-0:
          health: true
          lastTransitionTime: 2019-06-06T08:45:33Z
          name: stability-cluster2-tidb-0
          node: 172.16.4.151
        stability-cluster2-tidb-1:
          health: true
          lastTransitionTime: 2019-06-06T08:43:13Z
          name: stability-cluster2-tidb-1
          node: 172.16.4.153
      phase: Normal
      statefulSet:
        collisionCount: 0
        currentReplicas: 2
        currentRevision: stability-cluster2-tidb-78fc9568c6
        observedGeneration: 15
        readyReplicas: 2
        replicas: 2
        updateRevision: stability-cluster2-tidb-78fc9568c6
        updatedReplicas: 2
    tikv:
      phase: Upgrade
      statefulSet:
        collisionCount: 0
        currentReplicas: 2
        currentRevision: stability-cluster2-tikv-f8c576546
        observedGeneration: 20
        readyReplicas: 2
        replicas: 3
        updateRevision: stability-cluster2-tikv-f8c576546
        updatedReplicas: 2
      stores:
        "1":
          id: "1"
          ip: stability-cluster2-tikv-2.stability-cluster2-tikv-peer.stability-cluster2.svc
          lastHeartbeatTime: 2019-06-06T09:08:13Z
          lastTransitionTime: 2019-06-06T09:20:06Z
          leaderCount: 0
          podName: stability-cluster2-tikv-2
          state: Down
        "4":
          id: "4"
          ip: stability-cluster2-tikv-0.stability-cluster2-tikv-peer.stability-cluster2.svc
          lastHeartbeatTime: 2019-06-10T06:45:47Z
          lastTransitionTime: 2019-06-06T08:42:13Z
          leaderCount: 17
          podName: stability-cluster2-tikv-0
          state: Up
        "5":
          id: "5"
          ip: stability-cluster2-tikv-1.stability-cluster2-tikv-peer.stability-cluster2.svc
          lastHeartbeatTime: 2019-06-10T06:45:53Z
          lastTransitionTime: 2019-06-06T08:39:50Z
          leaderCount: 32
          podName: stability-cluster2-tikv-1
          state: Up
      synced: true
      tombstoneStores:
        "112":
          id: "112"
          ip: stability-cluster2-tikv-3.stability-cluster2-tikv-peer.stability-cluster2.svc
          lastHeartbeatTime: null
          lastTransitionTime: null
          leaderCount: 0
          podName: stability-cluster2-tikv-3
          state: Tombstone
        "113":
          id: "113"
          ip: stability-cluster2-tikv-4.stability-cluster2-tikv-peer.stability-cluster2.svc
          lastHeartbeatTime: null
          lastTransitionTime: null
          leaderCount: 0
          podName: stability-cluster2-tikv-4
          state: Tombstone
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
@weekface weekface added the test/stability stability tests label Jun 10, 2019
@cofyc
Contributor

cofyc commented Jun 10, 2019

I reproduced this in cluster 3 too. It seems that no new TiKV pod is created. The cluster expects 3 replicas of TiKV, but only 2/3 pods are up.

BTW, TiKV.status.phase is Upgrade. Is that correct while a store is in failure?

@weekface
Contributor Author

weekface commented Jun 10, 2019

When TiKV crashes, PD creates an evict-leader scheduler, but TiDB Operator treats the presence of that scheduler as a sign that an upgrade is in progress. That is why the phase is Upgrade and the stability case failed. We have to delete this code:

evictLeaderSchedulers, err := pdControl.GetPDClient(tc).GetEvictLeaderSchedulers()
if err != nil {
	return false, err
}
return evictLeaderSchedulers != nil && len(evictLeaderSchedulers) > 0, nil
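
To make the false positive concrete, here is a minimal sketch (not the operator's real code, and not necessarily what the fix in #598 does). It contrasts the scheduler-based check quoted above with one possible alternative that only compares the StatefulSet revisions reported in the status. The type `statefulSetStatus`, both function names, and the scheduler name `evict-leader-scheduler-1` are assumptions made up for illustration; the revision strings are copied from the TidbCluster status in this issue.

```go
package main

import "fmt"

// statefulSetStatus holds the two revision fields shown under
// tikv.statefulSet in the status above. Hypothetical stand-in type,
// not the operator's real API type.
type statefulSetStatus struct {
	CurrentRevision string
	UpdateRevision  string
}

// upgradingBySchedulers mimics the check quoted above: any existing
// evict-leader scheduler is taken as evidence of an upgrade.
func upgradingBySchedulers(evictLeaderSchedulers []string) bool {
	// len of a nil slice is 0, so no separate nil check is needed.
	return len(evictLeaderSchedulers) > 0
}

// upgradingByRevision is an assumed alternative: report an upgrade
// only while the StatefulSet is rolling from one revision to another.
func upgradingByRevision(s statefulSetStatus) bool {
	return s.CurrentRevision != s.UpdateRevision
}

func main() {
	// Revisions copied from the tikv.statefulSet status in this issue.
	sts := statefulSetStatus{
		CurrentRevision: "stability-cluster2-tikv-f8c576546",
		UpdateRevision:  "stability-cluster2-tikv-f8c576546",
	}
	// Hypothetical scheduler left behind for the crashed store.
	schedulers := []string{"evict-leader-scheduler-1"}

	fmt.Println("upgrading (scheduler check):", upgradingBySchedulers(schedulers))
	fmt.Println("upgrading (revision check): ", upgradingByRevision(sts))
}
```

With the values from this issue, the scheduler check reports an upgrade while the revision check does not, which matches the symptom: currentRevision and updateRevision are identical, yet the phase is Upgrade.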
