advertise_addr not allowing cluster communication #2120

Closed
panophobicPanda opened this issue Jun 16, 2016 · 5 comments
Labels
type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments

@panophobicPanda

consul version for both Client and Server

Client: [v0.6.4]
Server: [v0.6.4]

consul info for both Client and Server

Client:

no client

Server (node C):

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 3
build:
        prerelease =
        revision = 26a0ef8c
        version = 0.6.4
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        server = true
raft:
        applied_index = 0
        commit_index = 0
        fsm_pending = 0
        last_contact = never
        last_log_index = 0
        last_log_term = 0
        last_snapshot_index = 0
        last_snapshot_term = 0
        num_peers = 0
        state = Follower
        term = 0
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 50
        max_procs = 2
        os = linux
        version = go1.6
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        intent_queue = 0
        left = 0
        member_time = 3
        members = 3
        query_queue = 0
        query_time = 1
serf_wan:
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1

Operating system and Environment details

centOS 7, docker 1.11.2

Description of the Issue (and unexpected/desired result)

I'm running a set of 3 Consul servers in a cluster with custom ports (the 10.0.4.63 address is different on each Docker host; it matches that host's eth0):

{
        "data_dir": "/data",
        "client_addr": "0.0.0.0",
        "disable_update_check": true,
        "advertise_addrs": {
                "serf_lan": "10.0.4.63:11301",
                "serf_wan": "10.0.4.63:11302",
                "rpc": "10.0.4.63:11300"
        }
}

The systems start up with default ports (Cluster Addr is the IP of the docker container):

       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 172.17.0.3 (LAN: 8301, WAN: 8302)

Those ports are mapped in docker:

            "PortBindings": {
                "8300/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11300"
                    }
                ],
                "8301/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11301"
                    }
                ],
                "8301/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11301"
                    }
                ],
                "8302/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11302"
                    }
                ],
                "8302/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11302"
                    }
                ],
                "8400/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11400"
                    }
                ],
                "8500/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11500"
                    }
                ],
                "8600/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11600"
                    }
                ],
                "8600/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "11600"
                    }
                ]
            },
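
As a quick sanity check on the two snippets above, the following Python sketch confirms that every port in advertise_addrs is a host port published by Docker. It is not part of Consul or Docker: the dictionaries are hand-copied from the config and PortBindings shown above, and the function name unpublished_ports is made up for illustration.

```python
# Advertised addresses, copied from the advertise_addrs block above.
advertise_addrs = {
    "serf_lan": "10.0.4.63:11301",
    "serf_wan": "10.0.4.63:11302",
    "rpc": "10.0.4.63:11300",
}

# Container port/protocol -> published host port, copied from PortBindings above.
port_bindings = {
    "8300/tcp": 11300,
    "8301/tcp": 11301,
    "8301/udp": 11301,
    "8302/tcp": 11302,
    "8302/udp": 11302,
    "8400/tcp": 11400,
    "8500/tcp": 11500,
    "8600/tcp": 11600,
    "8600/udp": 11600,
}


def unpublished_ports(advertised, bindings):
    """Return the advertised entries whose port is not a published host port."""
    published = set(bindings.values())
    missing = {}
    for name, addr in advertised.items():
        port = int(addr.rsplit(":", 1)[1])
        if port not in published:
            missing[name] = port
    return missing


print(unpublished_ports(advertise_addrs, port_bindings))
```

An empty result only says the advertised ports and the Docker mapping agree with each other; it says nothing about which port the other servers actually dial.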

Started node A with this config:

agent -server -bootstrap-expect 3 -dc us-2b -recursor 10.0.4.2 -config-dir /config

Started nodes B and C with this config ( 10.0.4.63 is node A ):

agent -server -dc us-2b -recursor 10.0.4.2 -config-dir /config -rejoin -retry-join 10.0.4.63:11301

I see the following errors on A:

    2016/06/16 21:07:22 [INFO] serf: EventMemberJoin: 0fefc65397fd 10.0.4.63
    2016/06/16 21:07:22 [INFO] serf: EventMemberJoin: 0fefc65397fd.us-2b 10.0.4.63
    2016/06/16 21:07:22 [INFO] raft: Node at 10.0.4.63:11300 [Follower] entering Follower state
    2016/06/16 21:07:22 [INFO] consul: adding LAN server 0fefc65397fd (Addr: 10.0.4.63:8300) (DC: us-2b)
    2016/06/16 21:07:22 [INFO] consul: adding WAN server 0fefc65397fd.us-2b (Addr: 10.0.4.63:8300) (DC: us-2b)
    2016/06/16 21:07:22 [ERR] agent: failed to sync remote state: No cluster leader
    2016/06/16 21:07:23 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2016/06/16 21:07:35 [INFO] serf: EventMemberJoin: ad61a3e569a6 10.0.5.12
    2016/06/16 21:07:35 [INFO] consul: adding LAN server ad61a3e569a6 (Addr: 10.0.5.12:8300) (DC: us-2b)
    2016/06/16 21:07:39 [ERR] agent: failed to sync remote state: No cluster leader
    2016/06/16 21:07:43 [ERR] agent: coordinate update error: No cluster leader
    2016/06/16 21:07:49 [INFO] serf: EventMemberJoin: 4b2ed1a6d767 10.0.5.231
    2016/06/16 21:07:49 [INFO] consul: adding LAN server 4b2ed1a6d767 (Addr: 10.0.5.231:8300) (DC: us-2b)
    2016/06/16 21:07:49 [INFO] consul: Attempting bootstrap with nodes: [10.0.4.63:8300 10.0.5.12:8300 10.0.5.231:8300]
    2016/06/16 21:07:50 [WARN] raft: Heartbeat timeout reached, starting election
    2016/06/16 21:07:50 [INFO] raft: Node at 10.0.4.63:11300 [Candidate] entering Candidate state
    2016/06/16 21:07:50 [ERR] raft: Failed to make RequestVote RPC to 10.0.4.63:8300: dial tcp 10.0.4.63:8300: getsockopt: connection refused
    2016/06/16 21:07:50 [ERR] raft: Failed to make RequestVote RPC to 10.0.5.231:8300: dial tcp 10.0.5.231:8300: getsockopt: connection refused
    2016/06/16 21:07:50 [ERR] raft: Failed to make RequestVote RPC to 10.0.5.12:8300: dial tcp 10.0.5.12:8300: getsockopt: connection refused
    2016/06/16 21:07:52 [WARN] raft: Election timeout reached, restarting election
    2016/06/16 21:07:52 [INFO] raft: Node at 10.0.4.63:11300 [Candidate] entering Candidate state
    2016/06/16 21:07:52 [ERR] raft: Failed to make RequestVote RPC to 10.0.4.63:8300: dial tcp 10.0.4.63:8300: getsockopt: connection refused
    2016/06/16 21:07:52 [ERR] raft: Failed to make RequestVote RPC to 10.0.5.12:8300: dial tcp 10.0.5.12:8300: getsockopt: connection refused
    2016/06/16 21:07:52 [ERR] raft: Failed to make RequestVote RPC to 10.0.5.231:8300: dial tcp 10.0.5.231:8300: getsockopt: connection refused
    2016/06/16 21:07:53 [WARN] raft: Election timeout reached, restarting election
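
One way to read the log above: the local Raft node reports port 11300, yet the servers are recorded with port 8300 and the RequestVote RPCs dial port 8300, which is not among the host ports published by Docker. The Python sketch below pulls those ports out of a few of the lines. The regular expressions are written for this illustration and only cover the line shapes shown here; they are not a general Consul log parser.

```python
import re

# A few lines copied from the node A log above.
LOG = """\
2016/06/16 21:07:22 [INFO] raft: Node at 10.0.4.63:11300 [Follower] entering Follower state
2016/06/16 21:07:22 [INFO] consul: adding LAN server 0fefc65397fd (Addr: 10.0.4.63:8300) (DC: us-2b)
2016/06/16 21:07:35 [INFO] consul: adding LAN server ad61a3e569a6 (Addr: 10.0.5.12:8300) (DC: us-2b)
2016/06/16 21:07:50 [ERR] raft: Failed to make RequestVote RPC to 10.0.5.231:8300: dial tcp 10.0.5.231:8300: getsockopt: connection refused
"""

NODE_AT = re.compile(r"raft: Node at ([\d.]+):(\d+)")
LAN_SERVER = re.compile(r"adding LAN server \S+ \(Addr: ([\d.]+):(\d+)\)")
DIAL = re.compile(r"dial tcp ([\d.]+):(\d+)")


def ports(pattern, text):
    """Collect the set of ports captured by `pattern` in `text`."""
    return {int(port) for _host, port in pattern.findall(text)}


local_raft_ports = ports(NODE_AT, LOG)
recorded_server_ports = ports(LAN_SERVER, LOG)
dialed_ports = ports(DIAL, LOG)

print("local Raft port:", local_raft_ports)
print("recorded server ports:", recorded_server_ports)
print("dialed ports:", dialed_ports)
```

If the dialed ports differ from the published host ports, "connection refused" is the expected symptom, whatever the root cause in the agent configuration turns out to be.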

I see the following errors on nodes B and C:

    2016/06/16 21:07:49 [INFO] raft: Node at 10.0.5.231:11300 [Follower] entering Follower state
    2016/06/16 21:07:49 [INFO] serf: EventMemberJoin: 4b2ed1a6d767 10.0.5.231
    2016/06/16 21:07:49 [INFO] serf: EventMemberJoin: 4b2ed1a6d767.us-2b 10.0.5.231
    2016/06/16 21:07:49 [INFO] consul: adding LAN server 4b2ed1a6d767 (Addr: 10.0.5.231:8300) (DC: us-2b)
    2016/06/16 21:07:49 [INFO] consul: adding WAN server 4b2ed1a6d767.us-2b (Addr: 10.0.5.231:8300) (DC: us-2b)
    2016/06/16 21:07:49 [ERR] agent: failed to sync remote state: No cluster leader
    2016/06/16 21:07:49 [INFO] agent: Joining cluster...
    2016/06/16 21:07:49 [INFO] agent: (LAN) joining: [10.0.4.63:11301]
    2016/06/16 21:07:49 [INFO] serf: EventMemberJoin: ad61a3e569a6 10.0.5.12
    2016/06/16 21:07:49 [INFO] serf: EventMemberJoin: 0fefc65397fd 10.0.4.63
    2016/06/16 21:07:49 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2016/06/16 21:07:49 [INFO] agent: Join completed. Synced with 1 initial agents
    2016/06/16 21:07:49 [INFO] consul: adding LAN server ad61a3e569a6 (Addr: 10.0.5.12:8300) (DC: us-2b)
    2016/06/16 21:07:49 [INFO] consul: adding LAN server 0fefc65397fd (Addr: 10.0.4.63:8300) (DC: us-2b)
    2016/06/16 21:07:51 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2016/06/16 21:08:13 [ERR] agent: failed to sync remote state: No cluster leader
    2016/06/16 21:08:18 [ERR] agent: coordinate update error: No cluster leader
    2016/06/16 21:08:33 [ERR] agent: failed to sync remote state: No cluster leader
    2016/06/16 21:08:35 [ERR] agent: coordinate update error: No cluster leader

They appear to be aware of each other, at least in the vague sense that they joined, but there is probably no communication afterwards:

# docker exec -it consul-us-2b consul members
Node          Address           Status  Type    Build  Protocol  DC
0fefc65397fd  10.0.4.63:11301   alive   server  0.6.4  2         us-2b
4b2ed1a6d767  10.0.5.231:11301  alive   server  0.6.4  2         us-2b
ad61a3e569a6  10.0.5.12:11301   alive   server  0.6.4  2         us-2b

Reproduction steps

add advertise_addrs to the config file on all nodes
start A with -bootstrap-expect 3
start B and C with -retry-join A:<custom_port>

Thanks much for the help!

@slackpad
Contributor

Hi @akabdog, it looks like there might be a mismatch between your configured advertise addresses and the address Consul is binding to, which is this one:

Cluster Addr: 172.17.0.3 (LAN: 8301, WAN: 8302)

Setting https://www.consul.io/docs/agent/options.html#_bind should help here.

@slackpad added the type/question label Aug 11, 2016
@gvenka008c

gvenka008c commented Oct 21, 2016

@slackpad I am seeing the same error on all 3 of my nodes:

s consul: 2016/10/21 14:06:36 [ERR] agent: failed to sync remote state: No cluster leader
Oct 21 14:06:36 consul[3795]: agent: failed to sync remote state: No cluster leader

Here is the config for bootstrap node

{
    "server": true,
    "bootstrap_expect": 3,
    "datacenter": "dc1",
    "data_dir": "/srv/consul",
    "client_addr": "0.0.0.0",
    "encrypt": "encryptkey",
    "bind_addr": "xx.xxx.xx.xx",    [ip address of the VM]
    "advertise_addr": "xx.xxx.xx.xx",  [ip address of the VM]
    "log_level": "INFO",
    "enable_syslog": true,
    "ui": true,
    "ui_dir": "/home/consul/dest",
    "start_join": ["xx.xxx.xx.xx","xx.xxx.xx.xx"]
}

Here is the config for 2 other nodes

{
    "server": true,
    "datacenter": "dc1",
    "data_dir": "/srv/consul",
    "client_addr": "0.0.0.0",
    "encrypt": "encryptkey",
    "bind_addr": "xx.xxx.xx.xx",    [ip address of the VM]
    "advertise_addr": "xx.xxx.xx.xx",  [ip address of the VM]
    "log_level": "INFO",
    "enable_syslog": true,
    "ui": true,
    "ui_dir": "/home/consul/dest",
    "start_join": ["xx.xxx.xx.xx","xx.xxx.xx.xx"]
}

Here is the consul version

# consul --version
Consul v0.7.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Thoughts?

@sean-
Contributor

sean- commented Mar 6, 2017

@gvenka008c can you try a build from master, remove the -client argument, and try again? #2786 probably fixes your issue; let us know if it does. You can also remove the advertise_addr argument. Lastly, try setting -bind={{GetPrivateIP}} and see if that works. If it does, can you let us know?
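
To illustrate the idea behind a template like {{GetPrivateIP}}, here is a rough Python sketch of "pick the first private address from a list". It is not how Consul or its template library evaluates the template: the function name first_private_ip, the use of the RFC 1918 ranges as the definition of "private", and the candidate list are all assumptions made for this example.

```python
import ipaddress

# RFC 1918 private IPv4 ranges (an assumption about what "private" means here).
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def first_private_ip(candidates):
    """Return the first candidate inside an RFC 1918 range, or None."""
    for text in candidates:
        addr = ipaddress.ip_address(text)
        if any(addr in net for net in PRIVATE_NETS):
            return text
    return None


# Hypothetical interface addresses; 172.17.0.3 and 10.0.4.63 appear earlier in this thread.
print(first_private_ip(["127.0.0.1", "172.17.0.3", "10.0.4.63"]))
```

With these inputs the sketch picks 172.17.0.3, a reminder that inside a bridged Docker container the first private address may be the container's own rather than the host's.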

@slackpad
Contributor

slackpad commented May 3, 2017

Closing this as we never heard back. Please let us know if you still need help.

@slackpad closed this as completed May 3, 2017
@shahamit2

shahamit2 commented May 22, 2018

@sean- Thanks, "bind" helped me with the latest version, 1.1.0. You saved my day.

Here is an example with "bind" for anyone who comes to this page while searching:
consul agent -dev -config-dir=/etc/consul.d -data-dir=/tmp/consul -advertise="172.28.128.1" -bind="172.28.128.1"

5 participants