The new PD pod fails to start when scaling out PD #698

Closed
xiaojingchen opened this issue Jul 27, 2019 · 6 comments
Labels
type/question Further information is requested


Bug Report

What's the status of the TiDB cluster pods?

NAME                                  READY   STATUS             RESTARTS   AGE
cluster1-discovery-78bf857f44-4786k   1/1     Running            0          76m
cluster1-monitor-7cfbbfff54-tvgrp     2/2     Running            0          76m
cluster1-pd-0                         1/1     Running            1          76m
cluster1-pd-1                         1/1     Running            0          76m
cluster1-pd-2                         1/1     Running            0          76m
cluster1-pd-3                         0/1     CrashLoopBackOff   18         72m
cluster1-tidb-0                       2/2     Running            0          75m
cluster1-tidb-1                       2/2     Running            0          75m
cluster1-tidb-2                       2/2     Running            0          72m
cluster1-tidb-initializer-mrkvz       0/1     Completed          4          76m
cluster1-tikv-0                       1/1     Running            0          76m
cluster1-tikv-1                       1/1     Running            0          76m
cluster1-tikv-2                       1/1     Running            0          76m
cluster1-tikv-3                       1/1     Running            0          72m
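
The cluster1-pd-3 crash log shown below can be re-collected from the restarting pod with kubectl; a minimal sketch, assuming namespace ns1 as in the peer URLs:

# Logs of the current pd-server container
kubectl logs -n ns1 cluster1-pd-3

# After a restart, inspect the previous attempt
kubectl logs -n ns1 cluster1-pd-3 --previous

# Restart count and last-state details
kubectl describe pod -n ns1 cluster1-pd-3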

What did you do?

Scaled out PD from 3 to 5 replicas.
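
For reference, the scale-out itself is normally just a replica bump on the TidbCluster object; a minimal sketch, assuming the v1alpha1 spec.pd.replicas field and a cluster named cluster1 in namespace ns1:

# Scale PD from 3 to 5 replicas by patching the TidbCluster spec
kubectl patch tidbcluster cluster1 -n ns1 --type merge \
  -p '{"spec":{"pd":{"replicas":5}}}'

# Watch the new PD pods being created by the StatefulSet
kubectl get pods -n ns1 | grep cluster1-pd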

The cluster1-pd-3 log:

nslookup domain cluster1-pd-3.cluster1-pd-peer.ns1.svc failed

nslookup domain cluster1-pd-3.cluster1-pd-peer.ns1.svc failed

Name:      cluster1-pd-3.cluster1-pd-peer.ns1.svc
Address 1: 10.233.102.62 cluster1-pd-3.cluster1-pd-peer.ns1.svc.cluster.local
nslookup domain cluster1-pd-3.cluster1-pd-peer.ns1.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=cluster1-pd-3 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2379 --config=/etc/pd/pd.toml  --join=http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2380,http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380,http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2380
[2019/07/27 11:09:38.953 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2019/07/27 11:09:38.953 +00:00] [INFO] [util.go:60] [PD] [release-version=v3.0.0-rc.1]
[2019/07/27 11:09:38.954 +00:00] [INFO] [util.go:61] [PD] [git-hash=67549be8b94e2465949de0a88ab07d0abb75abd0]
[2019/07/27 11:09:38.954 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2019/07/27 11:09:38.954 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-05-10 11:35:57"]
[2019/07/27 11:09:38.954 +00:00] [INFO] [metricutil.go:80] ["disable Prometheus push client"]
[2019/07/27 11:09:38.954 +00:00] [INFO] [server.go:110] ["PD Config"] [config="{\"client-urls\":\"http://0.0.0.0:2379\",\"peer-urls\":\"http://0.0.0.0:2380\",\"advertise-client-urls\":\"http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2379\",\"advertise-peer-urls\":\"http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380\",\"name\":\"cluster1-pd-3\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"initial-cluster\":\"cluster1-pd-2=http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2380,cluster1-pd-1=http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380,cluster1-pd-0=http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2380\",\"initial-cluster-state\":\"existing\",\"join\":\"http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2380,http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380,http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2380\",\"lease\":3,\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"log-rotate\":true,\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"sampling\":null},\"log-file\":\"\",\"log-level\":\"\",\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"cluster1-pd-3\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":20,\"max-merge-region-keys\":200000,\"split-merge-interval\":\"1h0m0s\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"region-schedule-limit\":4,\"replica-schedule-limit\":8,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":2,\"hot-region-cache-hits-threshold\":3,\"tolerant-size-ratio\":5,\"low-space-ratio\":0.8,\"high-space-ratio\":0.6,\"disable-raft-learner\":\"false\",\"disable-remove-down-replica\":\"false\",\"disable-replace-offline-replica\":\"false\",\"disable-make-up-replica\":\"false\",\"disable-remove-extra-replica\":\"false\",\"disable-location-replacement\":\"false\",\"disable-namespace-relocation\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false},{\"type\":\"hot-region\",\"args\":null,\"disable\":false},{\"type\":\"label\",\"args\":null,\"disable\":false}]},\"replication\":{\"max-replicas\":3,\"location-labels\":\"rack\"},\"namespace\":{},\"pd-server\":{\"use-region-storage\":\"false\"},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"0 B\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\"},\"label-property\":null,\"WarningMsgs\":null,\"namespace-classifier\":\"table\",\"LeaderPriorityCheckInterval\":\"1m0s\"}"]
[2019/07/27 11:09:38.958 +00:00] [INFO] [server.go:145] ["start embed etcd"]
[2019/07/27 11:09:38.958 +00:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2019/07/27 11:09:38.958 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2019/07/27 11:09:38.959 +00:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2019/07/27 11:09:38.959 +00:00] [INFO] [etcd.go:600] ["pprof is enabled"] [path=/debug/pprof]
[2019/07/27 11:09:38.959 +00:00] [INFO] [etcd.go:297] ["starting an etcd server"] [etcd-version=3.3.0+git] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.12] [go-os=linux] [go-arch=amd64] [max-cpu-set=10] [max-cpu-available=10] [member-initialized=false] [name=cluster1-pd-3] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster="cluster1-pd-0=http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2380,cluster1-pd-1=http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380,cluster1-pd-2=http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2380"] [initial-cluster-state=existing] [initial-cluster-token=etcd-cluster] [quota-size-bytes=2147483648] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2019/07/27 11:09:38.960 +00:00] [INFO] [backend.go:79] ["opened backend db"] [path=/var/lib/pd/member/snap/db] [took=775.074µs]
[2019/07/27 11:09:38.961 +00:00] [INFO] [etcd.go:358] ["closing etcd server"] [name=cluster1-pd-3] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380]"] [advertise-client-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2379]"]
[2019/07/27 11:09:38.961 +00:00] [INFO] [etcd.go:362] ["closed etcd server"] [name=cluster1-pd-3] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380]"] [advertise-client-urls="[http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2379]"]
[2019/07/27 11:09:38.961 +00:00] [FATAL] [main.go:111] ["run server failed"] [error="couldn't find local name \"cluster1-pd-3\" in the initial cluster configuration",errorVerbose="couldn't find local name \"cluster1-pd-3\" in the initial cluster configuration\ngithub.com/pingcap/pd/server.(*Server).startEtcd\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:151\ngithub.com/pingcap/pd/server.(*Server).Run\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:302\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:110\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:59\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:111\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
xiaojingchen (Contributor, Author) commented Jul 27, 2019

  • Log from another PD member:
[2019/07/27 11:19:11.640 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:11.640 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:16.640 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:16.640 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:20.742 +00:00] [WARN] [util.go:144] ["apply request took too long"] [took=374.323643ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/tidb/store/gcworker/saved_safe_point\" "] [response="range_response_count:0 size:5"] []
[2019/07/27 11:19:20.814 +00:00] [WARN] [util.go:144] ["apply request took too long"] [took=379.790856ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/tidb/store/gcworker/saved_safe_point\" "] [response="range_response_count:0 size:5"] []
[2019/07/27 11:19:21.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:21.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:26.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:26.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:31.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
[2019/07/27 11:19:31.641 +00:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=558f19585f5d3819] [rtt=0s] [error="dial tcp: lookup cluster1-pd-3.cluster1-pd-peer.ns1.svc on 10.233.0.10:53: no such host"]
  • PD members info:
{
  "header": {
    "cluster_id": 6718279638435828598
  },
  "members": [
    {
      "name": "cluster1-pd-2",
      "member_id": 1202411103240240618,
      "peer_urls": [
        "http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2380"
      ],
      "client_urls": [
        "http://cluster1-pd-2.cluster1-pd-peer.ns1.svc:2379"
      ]
    },
    {
      "name": "cluster1-pd-1",
      "member_id": 1783272702718335136,
      "peer_urls": [
        "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380"
      ],
      "client_urls": [
        "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2379"
      ]
    },
    {
      "member_id": 6165174282241259545,
      "peer_urls": [
        "http://cluster1-pd-3.cluster1-pd-peer.ns1.svc:2380"
      ]
    },
    {
      "name": "cluster1-pd-0",
      "member_id": 7834020172005702291,
      "peer_urls": [
        "http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2380"
      ],
      "client_urls": [
        "http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2379"
      ]
    }
  ],
  "leader": {
    "name": "cluster1-pd-1",
    "member_id": 1783272702718335136,
    "peer_urls": [
      "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380"
    ],
    "client_urls": [
      "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2379"
    ]
  },
  "etcd_leader": {
    "name": "cluster1-pd-1",
    "member_id": 1783272702718335136,
    "peer_urls": [
      "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2380"
    ],
    "client_urls": [
      "http://cluster1-pd-1.cluster1-pd-peer.ns1.svc:2379"
    ]
  }
}
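
Note the third member above: ID 6165174282241259545 carries the cluster1-pd-3 peer URL but has no name, meaning it was added to the etcd cluster but never started successfully. A possible manual workaround (a sketch, not the operator's built-in recovery; it assumes pd-ctl is shipped at /pd-ctl in the PD image) is to delete that half-joined member so cluster1-pd-3 can go through the join flow again from a clean state:

# List members as seen by a healthy PD
kubectl exec -n ns1 cluster1-pd-0 -- \
  /pd-ctl -u http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2379 member

# Remove the unnamed, never-started member by its ID
kubectl exec -n ns1 cluster1-pd-0 -- \
  /pd-ctl -u http://cluster1-pd-0.cluster1-pd-peer.ns1.svc:2379 member delete id 6165174282241259545

cluster1-pd-3's data directory may also need to be emptied (for example by deleting its PVC and pod) so that the next start registers as a brand-new member.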

nolouch (Member) commented Jul 29, 2019

@xiaojingchen Please show me the join file of PD 3. You can also test with the master branch; I think I have already fixed this there.

xiaojingchen (Contributor, Author) commented Jul 29, 2019

@nolouch
Our PD version:

Release Version: v3.0.0-rc.1
Git Commit Hash: 67549be8b94e2465949de0a88ab07d0abb75abd0
Git Branch: HEAD
UTC Build Time:  2019-05-10 11:35:57

And which PR did you merge to fix the bug?

That cluster has already been cleaned up, but I found another PD cluster that failed for the same reason.

cluster3-discovery-7c89bccf98-7nlxg   1/1     Running            0          3h48m
cluster3-monitor-d6c76f868-fg56l      2/2     Running            0          3h48m
cluster3-pd-0                         1/1     Running            0          3h48m
cluster3-pd-1                         1/1     Running            0          3h48m
cluster3-pd-2                         1/1     Running            0          3h48m
cluster3-pd-3                         1/1     Running            0          3h44m
cluster3-pd-4                         0/1     CrashLoopBackOff   47         3h43m
cluster3-tidb-0                       2/2     Running            0          3h47m
cluster3-tidb-1                       2/2     Running            0          3h47m
cluster3-tidb-2                       2/2     Running            0          3h44m
cluster3-tidb-initializer-d6zpg       0/1     Completed          4          3h48m
cluster3-tikv-0                       1/1     Running            0          3h48m
cluster3-tikv-1                       1/1     Running            0          3h48m
cluster3-tikv-2                       1/1     Running            0          3h48m
cluster3-tikv-3                       1/1     Running            0          3h44m
cluster3-tikv-4                       1/1     Running            0          3h43m

The cluster3-pd-4 join file:

cluster3-pd-0=http://cluster3-pd-0.cluster3-pd-peer.ns2.svc:2380,cluster3-pd-1=http://cluster3-pd-1.cluster3-pd-peer.ns2.svc:2380,cluster3-pd-3=http://cluster3-pd-3.cluster3-pd-peer.ns2.svc:2380,cluster3-pd-2=http://cluster3-pd-2.cluster3-pd-peer.ns2.svc:2380
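
For anyone reproducing this, the join file can be read straight from the failing pod; a sketch, assuming pd-server writes the join file to <data-dir>/join (/var/lib/pd/join here) and the container stays up long enough to exec into:

# Dump the join file written by pd-server into its data directory
kubectl exec -n ns2 cluster3-pd-4 -- cat /var/lib/pd/join

# Cross-check against what the healthy members report
kubectl exec -n ns2 cluster3-pd-0 -- \
  /pd-ctl -u http://cluster3-pd-0.cluster3-pd-peer.ns2.svc:2379 member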

nolouch (Member) commented Aug 5, 2019

tikv/pd#1643

tennix added the type/question label Oct 11, 2019
tennix (Member) commented Oct 11, 2019

@xiaojingchen Is this fixed in the new version of PD?

aylei (Contributor) commented Nov 15, 2019

Closed via tikv/pd#1643 and tikv/pd#1663.

aylei closed this as completed Nov 15, 2019