using pxe boot, the etcd always "heartbeat near election timeout" #963

Closed
kernel8liang opened this issue Aug 27, 2014 · 7 comments

@kernel8liang

I set up a PXE server in a subnet and use it to boot CoreOS. Since "$public_ipv4" isn't supported under PXE,
I use cloud-config to start etcd. My cloud-config file is below:

#cloud-config

coreos:
  units:
    - name: etcd.service
      command: stop
    - name: config-etcd.service
      command: start
      content: |
        [Unit]
        Description=Config Etcd
        After=etcd.service

        [Service]
        ExecStartPre=-/usr/bin/systemctl stop etcd
        ExecStartPre=/usr/bin/wget http://192.168.1.2:8585/startEtcd.sh -P /home/core
        ExecStartPre=/usr/bin/chmod 755 /home/core/startEtcd.sh
        ExecStart=/home/core/startEtcd.sh
        Restart=always
        RestartSec=10s

        [Install]
        WantedBy=multi-user.target
    - name: fleet.service
      command: start

ssh_authorized_keys:
  - ssh-rsa AAAAA....

In cloud-config I download a script and use it to get the IP address and start etcd. The script is below.

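#!/bin/bash

# Take the second field of the second line of the ifconfig output as the node's IP.
export publicIP=$(ifconfig eno1 | sed -n 2p | awk '{ print $2 }')

# Stop the stock etcd unit before starting etcd by hand.
systemctl stop etcd

etcd -name "$publicIP" -peer-addr "$publicIP:7001" -addr 127.0.0.1:4001 -discovery http://192.168.1.2:4001/v2/keys/0000040 -peer-election-timeout=5000 -peer-heartbeat-interval=5000 -snapshot=true -v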

After CoreOS boots, the script is downloaded and works as expected. But etcd always
logs the timeout warning. I have two nodes, and the follower always gets the warning, no matter which node becomes
the follower; the leader works fine.

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Wed 2014-08-27 08:19:35 UTC. --
Aug 27 08:20:02 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:02.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999946418s
Aug 27 08:20:07 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:07.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999720814s
Aug 27 08:20:12 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:12.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999857421s
Aug 27 08:20:17 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:17.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999935035s
Aug 27 08:20:22 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:22.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 5.000014421s
Aug 27 08:20:27 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:27.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999861909s
Aug 27 08:20:32 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:32.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999855009s
Aug 27 08:20:37 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:37.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999945439s
Aug 27 08:20:42 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:42.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999893319s
Aug 27 08:20:47 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:47.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.99987582s
Aug 27 08:20:52 localhost startEtcd.sh[597]: [etcd] Aug 27 08:20:52.753 INFO | 192.168.1.5: warning: heartbeat near election timeout: 4.999934057s

If I set something on the follower node, an error occurs, see below:
core@localhost ~ $ etcdctl set /message test
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

If I set something on the leader, everything works fine, and the follower can also get the message
using etcdctl.

I tried all the versions listed on the page https://coreos.com/docs/running-coreos/bare-metal/booting-with-pxe/
(stable, beta, alpha), and all have the same issue. But no matter which version I boot, the login prompt
always shows CoreOS (beta).

I also tuned the etcd parameters with "-peer-election-timeout=5000 -peer-heartbeat-interval=5000", but
the issue is still there.
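
For reference, the warnings in the log above arrive about every 5 s and report values of roughly 4.9999 s, which is what one would expect when the heartbeat interval is configured equal to the election timeout. A minimal sketch of a pre-flight check for the two values follows. The function name `timing_ok` and the minimum ratio of 4 are assumptions made for illustration; neither comes from etcd or from this thread.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when the election timeout is at least
# min_ratio times the heartbeat interval (all values in milliseconds).
timing_ok() {
  local heartbeat_ms="$1" election_ms="$2" min_ratio="$3"
  [ "$election_ms" -ge $(( heartbeat_ms * min_ratio )) ]
}

# Values from the script above; the ratio 4 is an assumed threshold.
if timing_ok 5000 5000 4; then
  echo "timing looks reasonable"
else
  echo "election timeout is too close to the heartbeat interval"
fi
```

With the values from the script (5000 and 5000) the check fails, so it might be worth retrying with a heartbeat interval well below the election timeout.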

I googled and found some similar issues, like #868, #594, and #915, but I am not sure whether these bugs are the same
as mine.

I also ran a 5-node CoreOS cluster with Vagrant and VirtualBox, and everything works fine. But at login
it shows CoreOS (alpha).

Did I do anything wrong? And how can I make it work?

@kernel8liang
Author

Any suggestions?

@yichengq
Contributor

Could you check the machine list? curl http://127.0.0.1:7001/v2/admin/machines

@kernel8liang
Author

Checked from both machines:

core@localhost ~ $ curl http://127.0.0.1:7001/v2/admin/machines
[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]
core@localhost ~ $ curl http://127.0.0.1:7001/v2/admin/machines
[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]

I can set and get on the leader, but can only get on the follower.
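
Note that in the output above both members advertise the same clientURL, http://127.0.0.1:4001. A small sketch for spotting this from the machine list follows. `count_loopback_clients` is a hypothetical helper name, and the sample JSON is the response quoted above.

```shell
#!/bin/bash
# Hypothetical helper: read the machine list JSON on stdin and count the
# members whose advertised clientURL points at the loopback address.
count_loopback_clients() {
  grep -o '"clientURL":"http://127\.0\.0\.1:[0-9]*"' | wc -l | tr -d ' '
}

# Sample input: the machine list reported by both nodes in this thread.
machines='[{"name":"4","state":"leader","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.4:7001"},{"name":"5","state":"follower","clientURL":"http://127.0.0.1:4001","peerURL":"http://192.168.1.5:7001"}]'

echo "$machines" | count_loopback_clients   # prints 2
```

On a live node the same check could be fed from `curl -s http://127.0.0.1:7001/v2/admin/machines | count_loopback_clients`. A non-zero count means some members advertise a client address that other machines cannot reach.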

@kernel8liang
Author

Yesterday I used -name with the IP address, like -name 192.168.1.4. I changed it to -name 4, and now the
timeout has disappeared.

But when I set a message on the follower, I still get an error.
core@localhost ~ $ etcdctl set /test1 test
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@kernel8liang
Author

And the log on the follower:

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Thu 2014-08-28 08:12:55 UTC. --
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.360 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.410 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [1]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.460 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.510 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.560 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.610 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.660 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.710 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]
Aug 28 08:15:48 localhost startEtcd.sh[595]: [etcd] Aug 28 08:15:48.760 DEBUG | [recv] POST http://192.168.1.5:7001/log/append [0]

Log on the leader:

core@localhost ~ $ journalctl -fu config-etcd.service
-- Logs begin at Thu 2014-08-28 08:12:49 UTC. --
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.286 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.336 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.386 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.436 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.486 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.536 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.586 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.636 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.686 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.710 DEBUG | URLs: /_etcd/machines: / (4,5)
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.736 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.786 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.836 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.886 DEBUG | Send LogEntries to http://192.168.1.5:7001
Aug 28 08:41:03 localhost startEtcd.sh[592]: [etcd] Aug 28 08:41:03.936 DEBUG | Send LogEntries to http://192.168.1.5:7001

@yichengq
Contributor

etcdctl uses clientURL to connect to each machine.
Could you set -addr to 192.168.1.X also?
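
A minimal sketch of what that change might look like in startEtcd.sh is given here. The helper name `etcd_flags` is hypothetical, and deriving the node name from the last octet of the IP is an assumption based on the names "4" and "5" in the machine list; the remaining flags from the original script are omitted for brevity.

```shell
#!/bin/bash
# Hypothetical helper: print etcd flags for a node, advertising the node's
# own IP as both the peer address and the client address.
etcd_flags() {
  local ip="$1"
  local name="${ip##*.}"   # last octet of the IP, e.g. 192.168.1.5 -> 5
  echo "-name $name -peer-addr $ip:7001 -addr $ip:4001"
}

etcd_flags 192.168.1.5   # prints: -name 5 -peer-addr 192.168.1.5:7001 -addr 192.168.1.5:4001
```

With -addr set to the node's own IP, the clientURL in the machine list should become reachable from the other node, which is what etcdctl needs when it contacts the cluster.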

@kernel8liang
Author

it works correctly, thanks!!!
