using pxe boot, the etcd always "heartbeat near election timeout" #963

kernel8liang · 2014-08-27T08:51:19Z

I set up a PXE server in a subnet and use it to boot CoreOS. Since "$public_ipv4" isn't supported under PXE,
I use cloud-config to start etcd. My cloud-config file is below:

#cloud-config

coreos:
  units:
    - name: etcd.service
      command: stop
    - name: config-etcd.service
      command: start
      content: |
        [Unit]
        Description=Config Etcd
        After=etcd.service

        [Service]
        ExecStartPre=-/usr/bin/systemctl stop etcd
        ExecStartPre=/usr/bin/wget http://192.168.1.2:8585/startEtcd.sh -P /home/core
        ExecStartPre=/usr/bin/chmod 755 /home/core/startEtcd.sh
        ExecStart=/home/core/startEtcd.sh
        Restart=always
        RestartSec=10s

        [Install]
        WantedBy=multi-user.target
    - name: fleet.service
      command: start

ssh_authorized_keys:
  - ssh-rsa AAAAA....

In cloud-config I download a script and use it to get the IP address and start etcd. The script is below.

#!/bin/bash

# Take the address from the second line of the ifconfig output for eno1.
export publicIP=$(ifconfig eno1 | sed -n 2p | awk '{ print $2 }')

systemctl stop etcd

etcd -name $publicIP -peer-addr $publicIP:7001 -addr 127.0.0.1:4001 -discovery http://192.168.1.2:4001/v2/keys/0000040 -peer-election-timeout=5000 -peer-heartbeat-interval=5000 -snapshot=true -v
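
The address lookup in the script depends on the line layout of ifconfig output, which varies between versions. A minimal sketch of an alternative, assuming iproute2's `ip` command is available on the host: in the one-line format of `ip -o -4 addr show`, the fourth field is ADDRESS/PREFIX. The helper name `parse_ipv4` and the sample line are illustrative and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: read one line of `ip -o -4 addr show` output on stdin
# and print the bare IPv4 address (field 4 is ADDRESS/PREFIX).
parse_ipv4() {
  awk '{ print $4; exit }' | cut -d/ -f1
}

# Illustrative sample line in the `ip -o -4 addr show dev eno1` format.
sample='2: eno1    inet 192.168.1.5/24 brd 192.168.1.255 scope global eno1'
publicIP=$(printf '%s\n' "$sample" | parse_ipv4)
echo "$publicIP"

# On the real host the lookup would be:
#   publicIP=$(ip -o -4 addr show dev eno1 | parse_ipv4)
```

The parsing is kept in a function so it can be tried against a pasted sample line before it is wired into the boot script.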

After CoreOS boots, the script is downloaded and works as expected. But etcd always
times out. I have two nodes, and the follower always gets the timeout, no matter which node becomes
the follower; the leader works fine.

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Wed 2014-08-27 08:19:35 UTC. --
Aug 27 08:20:02 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:02.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999946418s
Aug 27 08:20:07 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:07.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999720814s
Aug 27 08:20:12 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:12.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999857421s
Aug 27 08:20:17 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:17.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999935035s
Aug 27 08:20:22 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:22.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 5.000014421s
Aug 27 08:20:27 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:27.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999861909s
Aug 27 08:20:32 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:32.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999855009s
Aug 27 08:20:37 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:37.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999945439s
Aug 27 08:20:42 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:42.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999893319s
Aug 27 08:20:47 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:47.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.99987582s
Aug 27 08:20:52 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:52.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999934057s

If I set something on the follower node, an error occurs, see below:
core@localhost ~ $ etcdctl set /message test
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

If I set something on the leader, everything works fine, and the follower can also get the message
using etcdctl.

I tried all the versions listed on https://coreos.com/docs/running-coreos/bare-metal/booting-with-pxe/
(stable, beta, alpha); all have the same issue. But no matter which version I boot, the login
banner always shows CoreOS (beta).

I also tuned the etcd parameters with "-peer-election-timeout=5000 -peer-heartbeat-interval=5000", but
the issue is still there.
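
One thing the log suggests: the warnings arrive about every 5 s, which matches the heartbeat interval given on the command line, and that interval equals the election timeout. In Raft the heartbeat interval has to be well below the election timeout, so equal values may be enough to produce this warning. This is an inference from the log, not something confirmed in this thread. A minimal sanity check, with the hypothetical helper `check_raft_timing` and illustrative values:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the heartbeat interval (ms) is
# strictly smaller than the election timeout (ms).
check_raft_timing() {
  local heartbeat_ms="$1" election_ms="$2"
  if [ "$heartbeat_ms" -ge "$election_ms" ]; then
    echo "heartbeat ${heartbeat_ms}ms is not below election timeout ${election_ms}ms" >&2
    return 1
  fi
  echo "ok: heartbeat ${heartbeat_ms}ms < election timeout ${election_ms}ms"
}

# The values from this report fail the check.
check_raft_timing 5000 5000 || echo "values from this report would be rejected"

# Illustrative values only, not a tuning recommendation.
check_raft_timing 50 200
```

The check only enforces the strict ordering; how much headroom is appropriate depends on the network and is left open here.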

I googled and found some similar issues like #868, #594, #915; I am not sure whether these bugs are
the same as mine.

I also ran a 5-node CoreOS cluster with Vagrant and VirtualBox, and everything works fine. But at
login it shows CoreOS (alpha).

Did I do anything wrong? And how can I make it work?


kernel8liang · 2014-08-28T06:28:09Z

Any suggestions?

yichengq · 2014-08-28T07:37:15Z

Could you check the machine list? curl http://127.0.0.1:7001/v2/admin/machines

kernel8liang · 2014-08-28T08:18:29Z

Checked from both machines:

core@localhost ~ $ curl http://127.0.0.1:7001/v2/admin/machines
[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]
core@localhost ~ $ curl http://127.0.0.1:7001/v2/admin/machines
[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]

I can set and get on the leader, but can only get on the follower.
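
To see at a glance which client address each member advertises, the clientURL fields can be pulled out of that JSON. A minimal sketch, assuming only grep and cut and the JSON shape shown above; the helper name `client_urls` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print every clientURL found in the JSON returned by
# /v2/admin/machines, one per line.
client_urls() {
  grep -o '"clientURL":"[^"]*"' | cut -d'"' -f4
}

# The machine list exactly as reported above.
machines='[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]'
printf '%s\n' "$machines" | client_urls

# Against a live node:
#   curl -s http://127.0.0.1:7001/v2/admin/machines | client_urls
```

For the output above, both members advertise the loopback address http://127.0.0.1:4001 as their client URL.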

kernel8liang · 2014-08-28T08:37:52Z

Yesterday I used -name with the IP address, like -name 192.168.1.4. I changed it to -name 4, and now the
timeout has disappeared.

But when I set a message on the follower, I still get an error.
core@localhost ~ $ etcdctl set /test1 test
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

kernel8liang · 2014-08-28T08:41:20Z

And the log on the follower:

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Thu 2014-08-28 08:12:55 UTC. --
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.360 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.410 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [1]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.460 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.510 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.560 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.610 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.660 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.710 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.760 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]

Log on the leader:

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Thu 2014-08-28 08:12:49 UTC. --
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.286 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.336 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.386 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.436 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.486 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.536 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.586 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.636 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.686 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.710 DEBUG | URLs: /_etcd/machines: / (4,5)
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.736 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.786 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.836 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.886 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.936 DEBUG | Send LogEntries to http://192.168.1.5:7001

yichengq · 2014-08-28T17:38:01Z

etcdctl uses clientURL to connect to each machine.
Could you set -addr to 192.168.1.X also?
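
A minimal sketch of what that change could look like in the start script, assuming the same ports and discovery URL as in the original report. The helper `etcd_args` is hypothetical; it only prints the flags so they can be inspected before being passed to etcd.

```shell
#!/bin/bash
# Hypothetical helper: print the etcd flags with the client address (-addr)
# set to the node's routable IP instead of 127.0.0.1, so the clientURL that
# other machines see is reachable from them.
etcd_args() {
  local ip="$1" name="$2" discovery="$3"
  printf '%s\n' "-name $name -peer-addr $ip:7001 -addr $ip:4001 -discovery $discovery"
}

# Values taken from the report: node 192.168.1.5 named "5".
etcd_args 192.168.1.5 5 http://192.168.1.2:4001/v2/keys/0000040

# In startEtcd.sh this would be used roughly as:
#   etcd $(etcd_args "$publicIP" "$name" "$discovery")
```

With 127.0.0.1 as the client address, a follower that forwards a write is pointed at the leader's loopback URL, which from the follower's side is the follower itself.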

kernel8liang · 2014-08-29T02:20:52Z

it works correctly, thanks!!!

yichengq added the question label Aug 28, 2014

kernel8liang closed this as completed Aug 29, 2014
