
Adds node ID integrity checking to the catalog and the LAN and WAN clusters. #2832

Merged
merged 3 commits into master from node-id-integrity on Mar 27, 2017

Conversation

slackpad
Contributor

Since node ID integrity is critical for the 0.8 Raft features, we need to ensure that node IDs are unique. We also add integrity checking for the catalog so we can start relying on node IDs there in future versions of Consul.
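
For readers skimming the change, the catalog side of it boils down to a check of roughly this shape before a registration is accepted: a node name must keep its ID, and an ID must not be claimed by a second node name. A minimal Go sketch under those assumptions (the types and function here are illustrative stand-ins, not the actual state-store code):

package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for the catalog's node record; the real
// structs.Node in Consul has more fields.
type Node struct {
	ID      string
	Name    string
	Address string
}

// ensureNodeIntegrity sketches the kind of check the catalog can run before
// accepting a registration: a node name must not reappear with a different
// ID, and an ID must not be claimed by a different node name.
func ensureNodeIntegrity(existingByName, existingByID, incoming *Node) error {
	if existingByName != nil && incoming.ID != "" &&
		!strings.EqualFold(existingByName.ID, incoming.ID) {
		return fmt.Errorf("node %q has existing ID %q, refusing registration with conflicting ID %q",
			incoming.Name, existingByName.ID, incoming.ID)
	}
	if existingByID != nil && !strings.EqualFold(existingByID.Name, incoming.Name) {
		return fmt.Errorf("node ID %q is already claimed by node %q, refusing registration for node %q",
			incoming.ID, existingByID.Name, incoming.Name)
	}
	return nil
}

func main() {
	current := &Node{ID: "86bf24da-204d-4cf4-b530-e2d40ccb51e9", Name: "consul-ha-1", Address: "172.18.0.2"}
	// The same ID arriving under a different node name is rejected.
	fmt.Println(ensureNodeIntegrity(nil, current, &Node{
		ID:   "86bf24da-204d-4cf4-b530-e2d40ccb51e9",
		Name: "consul-ha-2",
	}))
}

The LAN and WAN side of the change applies the same idea at the Serf/memberlist layer, refusing joins and merges that would introduce a duplicate ID; a similar sketch of that shape appears further down in this thread.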

@slackpad
Contributor Author

Working through some test fallout from this, not quite ready for merge.

	}
	if existing != nil {
		n = existing.(*structs.Node)
		fmt.Printf("XXX %#v\n", *n)
@sean-
Contributor

sean- commented Mar 27, 2017

^^ ?

Also, q.Q() > fmt.Printf() 😉
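
(For context on that aside: q is a small third-party Go debugging package. A minimal usage sketch, assuming its current import path and using an illustrative stand-in type rather than Consul's actual structs.Node:)

package main

import (
	"github.com/ryboe/q" // the q debug package; this import path is an assumption (the package has moved over time)
)

// debugNode is an illustrative stand-in for the *structs.Node dumped above.
type debugNode struct {
	ID      string
	Name    string
	Address string
}

func main() {
	n := debugNode{ID: "86bf24da-204d-4cf4-b530-e2d40ccb51e9", Name: "consul-ha-1", Address: "172.18.0.2"}
	// q.Q pretty-prints its arguments to $TMPDIR/q instead of stdout, so a
	// forgotten debug call is less likely to pollute real output the way a
	// stray fmt.Printf("XXX ...") can.
	q.Q(n)
}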

@slackpad
Contributor Author

lol good catch!

Ended up removing the leader_test.go server address change test as part
of this. The join was failing because we were using a new node name with
the new logic here, but we realized this was hitting some of the memberlist
conflict logic and not working as we expected. We need some additional
work to fully support address changes, so we removed the test for now.
slackpad merged commit 7360928 into master on Mar 27, 2017
slackpad deleted the node-id-integrity branch on March 27, 2017 at 16:01
@micahlmartin

I'm working on upgrading to 0.8.0, and I'm testing by spinning up 3 CentOS Docker containers running Docker-in-Docker (on Docker for Mac), each of which then launches the Consul container. What's odd is that all 3 nodes appear to come up with the same node ID, which causes them all to throw errors and fail to join the cluster. Here is my config:

{
    "advertise_addr": "172.18.0.4",
    "client_addr": "0.0.0.0",
    "datacenter": "test",
    "disable_update_check": true,
    "dns_config": {
        "allow_stale": true,
        "max_stale": "10s",
        "service_ttl": {
            "*": "10s"
        }
    },
    "leave_on_terminate": true,
    "log_level": "info",
    "node_name": "0de2007c2117",
    "performance": {
        "raft_multiplier": 1
    },
    "ports": {
        "dns": 8600
    },
    "recursors": [
        "172.18.0.4"
    ],
    "rejoin_after_leave": true,
    "retry_join": [
        "consul-ha-1",
        "consul-ha-2",
        "consul-ha-3"
    ],
    "server": true,
    "telemetry": {
        "statsd_address": "172.18.0.4:8125"
    },
    "ui": true
}

Here are some of the logs showing the errors:

Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    | ==> Log data will now stream in as it occurs:
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] raft: Initial configuration (index=0): []
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] raft: Node at 172.18.0.2:8300 [Follower] entering Follower state (Leader: "")
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] serf: EventMemberJoin: 072950a258a0 172.18.0.2
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] consul: Adding LAN server 072950a258a0 (Addr: tcp/172.18.0.2:8300) (DC: test)
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] serf: EventMemberJoin: 072950a258a0.test 172.18.0.2
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] consul: Handled member-join event for server "072950a258a0.test" in area "wan"
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] agent: Joining cluster...
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] agent: (LAN) joining: [consul-ha-1 consul-ha-2 consul-ha-3 consul-ha-1 consul-ha-2 consul-ha-3]
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] agent: (LAN) joined: 2 Err: <nil>
Apr 06 02:17:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:17:59 [INFO] agent: Join completed. Synced with 2 initial agents
Apr 06 02:18:01 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:01 [WARN] raft: no known peers, aborting election
Apr 06 02:18:06 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:06 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:18:29 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:29 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:18:39 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:39 [ERR] memberlist: Failed push/pull merge: Member '0de2007c2117' has conflicting node ID '86bf24da-204d-4cf4-b530-e2d40ccb51e9' with this agent's ID from=172.18.0.4:42234
Apr 06 02:18:39 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:39 [ERR] memberlist: Failed push/pull merge: Member '0de2007c2117' has conflicting node ID '86bf24da-204d-4cf4-b530-e2d40ccb51e9' with this agent's ID from=172.18.0.4:42240
Apr 06 02:18:42 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:42 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:18:59 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:18:59 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:19:16 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:19:16 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:19:33 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:19:33 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:19:50 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:19:50 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:20:01 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:20:01 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:20:17 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:20:17 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:20:26 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:20:26 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:20:45 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:20:45 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:21:01 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:21:01 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:21:15 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:21:15 [ERR] agent: failed to sync remote state: No cluster leader
Apr 06 02:21:35 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:21:35 [ERR] agent: coordinate update error: No cluster leader
Apr 06 02:21:39 072950a258a0 docker-compose[2697]: consul    |     2017/04/06 02:21:39 [ERR] agent: failed to sync remote state: No cluster leader

I'm guessing it has something to do with this commit, but I could be wrong. Any help would be appreciated.
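
For context, the "has conflicting node ID ... with this agent's ID" errors above come from the merge-time check this PR adds to the LAN and WAN clusters. In rough outline the check looks something like the following (a simplified sketch with stand-in types; the real code sits in Consul's Serf/memberlist merge handling and differs in detail):

package main

import "fmt"

// Member is a simplified stand-in for a Serf member as seen during a
// push/pull merge; the real member carries the node ID in its tags.
type Member struct {
	Name   string
	NodeID string
}

// checkMerge sketches the LAN/WAN integrity check: if a remote member
// presents the local agent's node ID under a different name, the merge is
// refused, which is what produces the errors in the log above.
func checkMerge(localName, localID string, remote []Member) error {
	for _, m := range remote {
		if m.NodeID == localID && m.Name != localName {
			return fmt.Errorf("member %q has conflicting node ID %q with this agent's ID",
				m.Name, m.NodeID)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkMerge("072950a258a0", "86bf24da-204d-4cf4-b530-e2d40ccb51e9", []Member{
		{Name: "0de2007c2117", NodeID: "86bf24da-204d-4cf4-b530-e2d40ccb51e9"},
	}))
}

The reason identical containers can collide in the first place is that 0.8.0 derives the default node ID from host information, so containers that present the same host identity end up with the same ID, which is exactly the symptom in the logs.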

@slackpad
Contributor Author

slackpad commented Apr 6, 2017

Hi @micahlmartin, please take a look at the discussion on #2877. You can set the node ID to something unique by adding something like -node-id=$(uuidgen | awk '{print tolower($0)}') to your command line.

@micahlmartin

@slackpad That fixed it. Thanks for the quick response.

@dictvm

dictvm commented Apr 12, 2017

@slackpad that's a neat idea, but it doesn't work with Kubernetes pods because there seems to be no support for command substitution. I'd like to run Consul in Kubernetes, but I can't upgrade to 0.8.0 right now because of this issue. Do you have any suggestions for how to ensure Consul gets a new node ID each time the container restarts?

Of course, I could just build my own Consul Docker image where I first fill an environment variable with a UUID, but I was hoping to just use the vanilla images.

I'd be happy to test a few things, if you have any ideas. 👍

@slackpad
Contributor Author

@dictvm please take a look at #2877 (comment); that's probably the easiest solution for folks running containers. Do you think that would fix your particular issue with k8s?

@dictvm

dictvm commented Apr 12, 2017

@slackpad I'm pretty sure it would, thanks!
