2/4 PD in error state #568

Closed
gregwebs opened this issue Jun 11, 2019 · 10 comments
Labels
test/stability stability tests

Comments

@gregwebs
Contributor

Using the latest stable release (beta 3), I see PD pods in an Error state after the cluster has been running for several hours. The logs strongly suggest a problem with how PD members rejoin the cluster after a failure.

NAME                              READY   STATUS    RESTARTS   AGE
demo-discovery-6579dfd986-9kdpd   1/1     Running   0          5h30m
demo-monitor-dc6f8b689-9vrtk      2/2     Running   0          5h30m
demo-pd-0                         1/1     Running   0          5h30m
demo-pd-1                         0/1     Error     2          33m
demo-pd-2                         1/1     Running   0          5h30m
demo-pd-3                         0/1     Error     2          33m
demo-tidb-0                       1/1     Running   0          5h20m
demo-tidb-1                       1/1     Running   0          5h20m
demo-tikv-0                       1/1     Running   0          5h25m
demo-tikv-1                       1/1     Running   0          5h25m
demo-tikv-2                       1/1     Running   0          5h25m

kubectl logs -p -n tidb180007 demo-pd-3

Name:      demo-pd-3.demo-pd-peer.tidb180007.svc
Address 1: 172.30.1.196 demo-pd-3.demo-pd-peer.tidb180007.svc.cluster.local
nslookup domain demo-pd-3.demo-pd-peer.tidb180007.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=demo-pd-3 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://demo-pd-3.demo-pd-peer.tidb180007.svc:2379 --config=/etc/pd/pd.toml  --join=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380
[2019/06/11 00:08:02.505 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2019/06/11 00:08:02.505 +00:00] [INFO] [util.go:60] [PD] [release-version=v3.0.0-rc.1]
[2019/06/11 00:08:02.505 +00:00] [INFO] [util.go:61] [PD] [git-hash=67549be8b94e2465949de0a88ab07d0abb75abd0]
[2019/06/11 00:08:02.505 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2019/06/11 00:08:02.505 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-05-10 11:35:57"]
[2019/06/11 00:08:02.505 +00:00] [INFO] [metricutil.go:80] ["disable Prometheus push client"]
[2019/06/11 00:08:02.506 +00:00] [INFO] [server.go:110] ["PD Config"] [config="{\"client-urls\":\"http://0.0.0.0:2379\",\"peer-urls\":\"http://0.0.0.0:2380\",\"advertise-client-urls\":\"http://demo-pd-3.demo-pd-peer.tidb180007.svc:2379\",\"advertise-peer-urls\":\"http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380\",\"name\":\"demo-pd-3\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"initial-cluster\":\"demo-pd-0=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380\",\"initial-cluster-state\":\"existing\",\"join\":\"http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380\",\"lease\":3,\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"log-rotate\":true,\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"sampling\":null},\"log-file\":\"\",\"log-level\":\"\",\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"demo-pd-3\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":0,\"max-merge-region-keys\":0,\"split-merge-interval\":\"1h0m0s\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"region-schedule-limit\":4,\"replica-schedule-limit\":8,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":2,\"hot-region-cache-hits-threshold\":3,\"tolerant-size-ratio\":5,\"low-space-ratio\":0.8,\"high-space-ratio\":0.6,\"disable-raft-learner\":\"false\",\"disable-remove-down-replica\":\"false\",\"disable-replace-offline-replica\":\"false\",\"disable-make-up-replica\":\"false\",\"disable-remove-extra-replica\":\"false\",\"disable-location-replacement\":\"false\",\"disable-namespace-relocation\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false},{\"type\":\"balance-leader\",
\"args\":null,\"disable\":false},{\"type\":\"hot-region\",\"args\":null,\"disable\":false},{\"type\":\"label\",\"args\":null,\"disable\":false}]},\"replication\":{\"max-replicas\":3,\"location-labels\":\"region,zone,rack,host\"},\"namespace\":{},\"pd-server\":{\"use-region-storage\":\"false\"},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"0 B\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\"},\"label-property\":{},\"WarningMsgs\":null,\"namespace-classifier\":\"table\",\"LeaderPriorityCheckInterval\":\"1m0s\"}"]
[2019/06/11 00:08:02.508 +00:00] [INFO] [server.go:145] ["start embed etcd"]
[2019/06/11 00:08:02.509 +00:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2019/06/11 00:08:02.509 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2019/06/11 00:08:02.509 +00:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2019/06/11 00:08:02.509 +00:00] [INFO] [etcd.go:600] ["pprof is enabled"] [path=/debug/pprof]
[2019/06/11 00:08:02.509 +00:00] [INFO] [etcd.go:297] ["starting an etcd server"] [etcd-version=3.3.0+git] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.12] [go-os=linux] [go-arch=amd64] [max-cpu-set=2] [max-cpu-available=2] [member-initialized=false] [name=demo-pd-3] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster="demo-pd-0=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380"] [initial-cluster-state=existing] [initial-cluster-token=etcd-cluster] [quota-size-bytes=2147483648] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2019/06/11 00:08:02.510 +00:00] [INFO] [backend.go:79] ["opened backend db"] [path=/var/lib/pd/member/snap/db] [took=221.883µs]
[2019/06/11 00:08:02.511 +00:00] [INFO] [etcd.go:358] ["closing etcd server"] [name=demo-pd-3] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380]"] [advertise-client-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2379]"]
[2019/06/11 00:08:02.512 +00:00] [INFO] [etcd.go:362] ["closed etcd server"] [name=demo-pd-3] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380]"] [advertise-client-urls="[http://demo-pd-3.demo-pd-peer.tidb180007.svc:2379]"]
[2019/06/11 00:08:02.512 +00:00] [FATAL] [main.go:111] ["run server failed"] [error="couldn't find local name \"demo-pd-3\" in the initial cluster configuration",errorVerbose="couldn't find local name \"demo-pd-3\" in the initial cluster configuration\ngithub.com/pingcap/pd/server.(*Server).startEtcd\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:151\ngithub.com/pingcap/pd/server.(*Server).Run\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:302\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:110\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:59\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:111\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]

kubectl logs -p -n tidb180007 demo-pd-1

Name:      demo-pd-1.demo-pd-peer.tidb180007.svc
Address 1: 172.30.1.68 demo-pd-1.demo-pd-peer.tidb180007.svc.cluster.local
nslookup domain demo-pd-1.demo-pd-peer.tidb180007.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=demo-pd-1 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://demo-pd-1.demo-pd-peer.tidb180007.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://demo-pd-1.demo-pd-peer.tidb180007.svc:2379 --config=/etc/pd/pd.toml --join=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380,http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380
[2019/06/11 00:07:36.811 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2019/06/11 00:07:36.812 +00:00] [INFO] [util.go:60] [PD] [release-version=v3.0.0-rc.1]
[2019/06/11 00:07:36.812 +00:00] [INFO] [util.go:61] [PD] [git-hash=67549be8b94e2465949de0a88ab07d0abb75abd0]
[2019/06/11 00:07:36.812 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2019/06/11 00:07:36.812 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-05-10 11:35:57"]
[2019/06/11 00:07:36.812 +00:00] [INFO] [metricutil.go:80] ["disable Prometheus push client"]
[2019/06/11 00:07:36.812 +00:00] [ERROR] [join.go:180] ["failed to open directory"] [error="open /var/lib/pd/member: no such file or directory"] [stack="github.com/pingcap/log.Error\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:42\ngithub.com/pingcap/pd/server.isDataExist\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/join.go:180\ngithub.com/pingcap/pd/server.PrepareJoinCluster\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/join.go:99\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:83\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
2019/06/11 00:07:36.813 grpclog.go:45: [info] parsed scheme: "endpoint"
2019/06/11 00:07:36.813 grpclog.go:45: [info] ccResolverWrapper: sending new addresses to cc: [{http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380 0  <nil>} {http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380 0  <nil>} {http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380 0  <nil>}]
2019/06/11 00:07:36.821 grpclog.go:60: [warning] grpc: addrConn.createTransport failed to connect to {http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.30.1.196:2380: connect: connection refused". Reconnecting...
[2019/06/11 00:07:36.824 +00:00] [FATAL] [main.go:85] ["join meet error"] [error="there is a member that has not joined successfully",errorVerbose="there is a member that has not joined successfully\ngithub.com/pingcap/pd/server.PrepareJoinCluster\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/join.go:128\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:83\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:59\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:85\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
@gregwebs
Contributor Author

Oh, I should have said 2/4. PD seems to have recovered now, but demo-pd-3 is still in an Error state. Why is that?

@gregwebs changed the title from "2/3 PD in error state" to "2/4 PD in error state" on Jun 11, 2019
@weekface
Contributor

{
  "header": {
    "cluster_id": 6700974179107742051
  },
  "members": [
    {
      "name": "demo-pd-0",
      "member_id": 3987767476851708343,
      "peer_urls": [
        "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380"
      ],
      "client_urls": [
        "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2379"
      ]
    },
    {
      "member_id": 18131069412732890039,
      "peer_urls": [
        "http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380"
      ]
    },
    {
      "name": "demo-pd-2",
      "member_id": 18184684093988594571,
      "peer_urls": [
        "http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380"
      ],
      "client_urls": [
        "http://demo-pd-2.demo-pd-peer.tidb180007.svc:2379"
      ]
    }
  ],
  "leader": {
    "name": "demo-pd-0",
    "member_id": 3987767476851708343,
    "peer_urls": [
      "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380"
    ],
    "client_urls": [
      "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2379"
    ]
  },
  "etcd_leader": {
    "name": "demo-pd-0",
    "member_id": 3987767476851708343,
    "peer_urls": [
      "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380"
    ],
    "client_urls": [
      "http://demo-pd-0.demo-pd-peer.tidb180007.svc:2379"
    ]
  }
}

The demo-pd-3 entry has only a member_id; it has no name and no client_urls.
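A member in this half-joined state can be picked out of the member list mechanically. The following is a minimal sketch, assuming the JSON above is available as text; the function name `unnamed_members` is hypothetical and is not part of PD or pd-ctl.

```python
import json


def unnamed_members(members_json):
    """Return the member entries that have no (or an empty) "name" field."""
    doc = json.loads(members_json)
    return [m for m in doc.get("members", []) if not m.get("name")]


if __name__ == "__main__":
    # Trimmed copy of the member list shown above.
    sample = """
    {
      "members": [
        {"name": "demo-pd-0", "member_id": 3987767476851708343,
         "peer_urls": ["http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380"]},
        {"member_id": 18131069412732890039,
         "peer_urls": ["http://demo-pd-3.demo-pd-peer.tidb180007.svc:2380"]},
        {"name": "demo-pd-2", "member_id": 18184684093988594571,
         "peer_urls": ["http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380"]}
      ]
    }
    """
    for member in unnamed_members(sample):
        # Only the peer URL identifies which pod the entry belongs to.
        print(member["member_id"], member["peer_urls"])
```

An entry with a member_id and peer_urls but no name is a member that was added to the cluster but has not yet started successfully, which matches the demo-pd-3 logs above.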

@weekface
Contributor

#126

@weekface
Contributor

demo-pd-3 join file:

$ cat /var/lib/pd/join
demo-pd-0=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,demo-pd-2=http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380
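This join file explains the FATAL error in the demo-pd-3 log: the name demo-pd-3 does not appear in the list, so the "couldn't find local name" check fails. A minimal sketch of that kind of check follows, assuming the `name=url,name=url` format shown above. The helper names are hypothetical, and this is an illustration, not PD's or etcd's actual implementation.

```python
def parse_initial_cluster(value):
    """Parse "name=url,name=url" into a {name: url} dict.

    Simplification: if a name appears more than once, the last URL wins.
    """
    members = {}
    for entry in value.strip().split(","):
        if not entry:
            continue
        name, sep, url = entry.partition("=")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        members[name.strip()] = url.strip()
    return members


def contains_local_name(value, local_name):
    """Report whether local_name is one of the listed members."""
    return local_name in parse_initial_cluster(value)


if __name__ == "__main__":
    # Contents of /var/lib/pd/join on demo-pd-3, as shown above.
    join_file = (
        "demo-pd-0=http://demo-pd-0.demo-pd-peer.tidb180007.svc:2380,"
        "demo-pd-2=http://demo-pd-2.demo-pd-peer.tidb180007.svc:2380"
    )
    print(contains_local_name(join_file, "demo-pd-3"))
```

For the join file above, the check reports that demo-pd-3 is absent, which is consistent with the initial-cluster value printed in the demo-pd-3 startup log.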

https://transfer.sh/oS8B4/demo-pd-0.log
https://transfer.sh/10fjLC/demo-pd-2.log
https://transfer.sh/O71s1/demo-pd-3.log

@nolouch PTAL

@weekface added the test/stability (stability tests) label on Jun 21, 2019
@nolouch
Member

nolouch commented Jun 21, 2019

Is there no output from pd-3?

@nolouch
Member

nolouch commented Jun 24, 2019

Thanks. The join file is missing the member's own entry. I will add a retry mechanism to ensure the member can find itself.
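The general shape of such a retry can be sketched as below. This is only an illustration of the idea and is not the change that was made in PD. `list_member_names` is a hypothetical callable standing in for whatever returns the current member names, and the attempt count and delay are arbitrary.

```python
import time


def wait_for_self(list_member_names, local_name, attempts=5, delay=1.0,
                  sleep=time.sleep):
    """Poll until local_name shows up among the member names, or give up.

    list_member_names: hypothetical callable returning an iterable of names.
    Returns True as soon as local_name is found, False after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        if local_name in set(list_member_names()):
            return True
        if attempt < attempts:
            # Wait before asking again; skipped after the final attempt.
            sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated member list in which demo-pd-3 appears on the third poll.
    polls = []

    def fake_members():
        polls.append(1)
        names = ["demo-pd-0", "demo-pd-2"]
        if len(polls) >= 3:
            names.append("demo-pd-3")
        return names

    print(wait_for_self(fake_members, "demo-pd-3", delay=0.1))
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.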

@weekface
Contributor

Fixed by tikv/pd#1643; closing.

@weekface reopened this on Jul 23, 2019
@weekface removed this from the v1.0.0 milestone on Jul 23, 2019
@gregwebs
Contributor Author

We can close this out once we specify a PD version that includes the bug fix. The next possible version with the fix is 3.0.2.

@tennix
Member

tennix commented Oct 11, 2019

We've changed the default TiDB version to v3.0.4. Shall we close this issue now? @gregwebs

@gregwebs
Contributor Author

🎉

4 participants