Spurious failures of cross-DC requests #5595

Closed
nahratzah opened this issue Apr 2, 2019 · 10 comments
Labels: needs-investigation (The issue described is detailed and complex.)

Comments

nahratzah commented Apr 2, 2019

Overview of the Issue

Cross-datacenter requests sometimes fail because Consul selects a decommissioned server as the target.

We suspect this leads to failed KV requests, as well as very slow DNS responses.

Reproduction Steps

1: create a set of consul clusters

Using the consul:1.2.3 Docker image, create 4 Consul clusters, each with 5 servers.

docker-compose up -d --scale consul_nodes_dc1=4 --scale consul_nodes_dc2=4 --scale consul_nodes_dc3=4 --scale consul_nodes_dc4=4 --scale consul_nodes_dc5=4
docker-compose.yml contents:

version: "2"
services:
  consul_dc1:
    image: consul:1.2.3
    command: "agent -ui -server -client 0.0.0.0 -datacenter=dc1 -bootstrap-expect=5 -join consul_dc1"
  consul_nodes_dc1:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc1 -join consul_dc1"
    depends_on:
      - consul_dc1
  consul_dc2:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc2 -join consul_dc2"
  consul_nodes_dc2:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc2 -join consul_dc2"
    depends_on:
      - consul_dc2
  consul_dc3:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc3 -join consul_dc3"
  consul_nodes_dc3:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc3 -join consul_dc3"
    depends_on:
      - consul_dc3
  consul_dc4:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc4 -join consul_dc4"
  consul_nodes_dc4:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc4 -join consul_dc4"
    depends_on:
      - consul_dc4
  tribute:
    image: tribute
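
Not part of the original report, but a quick way to confirm that each DC formed a healthy 5-server cluster before continuing (the container name assumes the compose project is named docker, as in the exec commands below):

docker exec docker_consul_dc1_1 /bin/consul members
docker exec docker_consul_dc1_1 /bin/consul operator raft list-peers
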
2: WAN join all these clusters
docker exec docker_consul_dc2_1 /bin/consul join -wan consul_dc1
docker exec docker_consul_dc3_1 /bin/consul join -wan consul_dc1
docker exec docker_consul_dc4_1 /bin/consul join -wan consul_dc1
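
A sanity check that is not in the original steps: confirm the WAN pool now contains the server nodes of all four datacenters:

docker exec docker_consul_dc1_1 /bin/consul members -wan
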
3: create a KV in each cluster, for use later
for dc in dc1 dc2 dc3 dc4; do docker exec docker_consul_${dc}_1 /bin/consul kv put foobar "kv in ${dc}"; done
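
Optionally (also not in the original steps), confirm that cross-DC reads work before any cycling, e.g. from dc1 against dc2:

docker exec docker_consul_dc1_1 /bin/consul kv get -datacenter=dc2 foobar
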
4: cycle through the nodes, creating lots and lots of failed nodes in the process
for x in {1,2,3,4,5,6}{1,2,3,4,5,6,7,8,9,0} ; do ./cycle.sh ; done

Note that we leave dc1 untouched; only dc2, dc3, and dc4 are cycled.

cycle.sh contents:
#!/bin/bash

for dc in dc2 dc3 dc4; do
  let "tribute_scale_count = 1 + $(docker-compose ps | grep -c tribute)"
  docker-compose scale consul_nodes_${dc}=3
  sleep 10
  docker-compose scale tribute=${tribute_scale_count}
  sleep 10
  docker-compose scale consul_nodes_${dc}=4
  sleep 10
done
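
Not in the original report: while step 4 runs, the failed WAN entries can be watched accumulating from dc1 (the grep simply counts lines whose status column reads failed):

watch -n 30 'docker exec docker_consul_dc1_1 /bin/consul members -wan | grep -c failed'
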
5: Leave to simmer for 2+ days, while repeatedly querying the clusters

We collect the query log, because disk space is cheap.
(Use tmux so you can detach.)

while sleep 5; do ./query.sh ; done 2>&1 | tee query.log
query.sh contents:
#!/bin/bash
for origin_dc in dc1 dc2 dc3 dc4 ; do
  for target_dc in dc1 dc2 dc3 dc4 dc5; do
    docker exec docker_consul_${origin_dc}_1 /bin/consul kv get -datacenter=${target_dc} foobar
  done
done
6: Observe spurious failure
docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection'

Note: it took a couple of days for us to generate these log messages.
Have patience. <3

Also note the number of 500 HTTP statuses in query.log:

grep 500 query.log
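
A rough follow-up, not part of the original report, to extract which server addresses the failed RPCs point at so they can be matched against the decommissioned containers:

docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection' | grep -o 'RPC failed to server [0-9.]*' | sort | uniq -c | sort -rn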

Operating system and Environment details

uname -a: Linux z-kube-beach-avandersteldt-penguin 4.4.0-1077-aws #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

consul-1.2.3

Log Fragments

Output from step 5:

consul_dc4_1        |     2019/03/29 14:12:45 [ERR] consul: RPC failed to server 172.19.0.33:8300 in DC "dc2": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] http: Request GET /v1/kv/foobar?dc=dc2, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:57878
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] consul: RPC failed to server 172.19.0.77:8300 in DC "dc4": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] http: Request GET /v1/kv/foobar?dc=dc4, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:52206
pearkes (Contributor) commented Apr 3, 2019

Thank you for the detailed bug report and reproduction. Can you summarize the state of the cluster at the end of this experiment? "lead thread didn't get connection" usually means there is a network or reachability issue of some kind. Consul should try a few different servers when it can't reach one, but perhaps at the end of the experiment there are none left that are healthy for some reason. It'd be good to clarify that.

pearkes added the needs-investigation and waiting-reply labels Apr 3, 2019
nahratzah (Author) commented Apr 4, 2019

For the entire duration of the test, except during step 4, the clusters are all healthy, with 5 out of 5 servers up.

I can describe it:

  1. Each consul-DC (dc1, dc2, dc3, and dc4) had 5 servers in the alive state.
  2. Each cycled consul-DC (dc2, dc3, and dc4) additionally had 60 servers in the failed state.
  3. The servers in the failed state disappeared after 72 hours had elapsed (which is why I can't simply copy/paste the output of consul members -wan for you).

During execution of step 4, at most one cluster at a time operates with 4 out of 5 servers: a single server is decommissioned at line 5 of cycle.sh (scaling consul_nodes_${dc} down to 3), and a replacement is commissioned at line 9 (scaling back up to 4).

There were no network problems: the entire Docker installation ran locally and the machine (a c4.large) was pretty much idle. Apart from the regular attempts by Serf to reconnect to the decommissioned machines, which is the expected behaviour, there was no mention of connection issues in the logs.

The "lead thread didn't get connection" messages point at decommissioned servers.

avandersteldt@z-kube-beach-avandersteldt-penguin:~/docker$ docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection'                                                         
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] consul: RPC failed to server 172.19.0.33:8300 in DC "dc2": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] http: Request GET /v1/kv/foobar?dc=dc2, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:57878
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] consul: RPC failed to server 172.19.0.77:8300 in DC "dc4": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] http: Request GET /v1/kv/foobar?dc=dc4, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:52206

In this log fragment, the servers 172.19.0.33 and 172.19.0.77 were both decommissioned during step 4. These errors did not occur during the cycling, but many hours later. Yet those servers are still being used as targets for the cross-DC queries originating from dc4 and dc2 respectively.
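
For anyone reproducing this, a minimal check (assuming the default data directory of the official image, /consul/data) to see whether such a decommissioned address is still present in the WAN serf snapshot of the node reporting the error:

docker exec docker_consul_dc4_1 cat /consul/data/serf/remote.snapshot | grep 172.19.0.33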

pearkes removed the waiting-reply label Apr 11, 2019
CpuID commented Apr 17, 2019

Pretty sure we are seeing the same thing here; we have been seeing it for a while.

Some details:

  • 3 clusters with WAN federation between them
  • noticed that the "ops" cluster seems to be hanging onto an old serf entry for a "production" server node
  • the "ops" server node claims it reaps the failed "production" server node after 72 hours, but it keeps trying to hit it

Relevant log snippets from an "ops" server node:

user@ip-10-199-10-22:~# docker logs dc5310ea6d61 2>&1 | grep 10.202.0.5
    2019/03/19 21:02:38 [INFO] serf: EventMemberJoin: ip-10-202-0-5.aws-us-e1-production 10.202.0.5
    2019/03/19 21:02:38 [INFO] consul: Handled member-join event for server "ip-10-202-0-5.aws-us-e1-production" in area "wan"
    2019/03/20 00:11:09 [ERR] yamux: Failed to write header: write tcp 10.199.10.22:56998->10.202.0.5:8300: use of closed network connection
    2019/04/01 20:54:56 [INFO] memberlist: Suspect ip-10-202-0-5.aws-us-e1-production has failed, no acks received
    2019/04/01 20:55:17 [INFO] serf: EventMemberFailed: ip-10-202-0-5.aws-us-e1-production 10.202.0.5
    2019/04/01 20:55:17 [INFO] consul: Handled member-failed event for server "ip-10-202-0-5.aws-us-e1-production" in area "wan"
    2019/04/01 21:08:58 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/01 21:25:58 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
...
    2019/04/02 05:14:20 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/02 05:14:59 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/02 05:26:29 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/02 05:45:31 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/02 05:45:31 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
...
    2019/04/04 20:15:41 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:17:11 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:31:52 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:31:52 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:40:12 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:41:12 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:53:37 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:55:18 [INFO] serf: EventMemberReap: ip-10-202-0-5.aws-us-e1-production
    2019/04/05 02:19:33 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 02:19:33 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 03:24:32 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 04:05:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
...
    2019/04/17 20:59:45 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/17 23:04:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/17 23:04:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection

The evidence:

user@ip-10-199-10-22:~# docker exec -it dc5310ea6d61 cat /consul/data/serf/remote.snapshot | wc -l
67
user@ip-10-199-10-22:~# docker exec -it dc5310ea6d61 cat /consul/data/serf/remote.snapshot | grep 10.202.0.5
alive: ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
not-alive: ip-10-202-0-5.aws-us-e1-production
user@ip-10-199-10-22:~# date
Wed Apr 17 23:53:49 UTC 2019

CpuID commented Apr 17, 2019

Note: we are running Consul 1.0.3; I didn't spot anything obvious in the changelogs that fixes this. My suspicion is that if I replace the server node with a new one, this will all come good, and I'll try doing so now.

CpuID commented Jul 4, 2019

@pearkes any feedback on this...?

mkeeler (Member) commented Jul 4, 2019

@CpuID I believe this issue was fixed by #5317. That was in v1.4.3. I haven't verified yet with the reproduction in this issue, though, so I will leave it open until then.
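
A trivial check, not from the thread, to confirm which version a running agent is actually on before and after upgrading (the container name is a placeholder):

docker exec <consul_container> consul version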

CpuID commented Jul 4, 2019

@mkeeler I think you might be right, actually; I was reading through PRs and spotted that one today. I'm going to try to get us upgraded to the latest stable over the next week or so and see if the problem disappears permanently. Thanks for following up :)

stale bot commented Oct 21, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation, please reply with a comment here, which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

stale bot added the waiting-reply label Oct 21, 2019
stale bot commented Nov 20, 2019

Hey there,
This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍

stale bot closed this as completed Nov 20, 2019
ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

ghost locked and limited conversation to collaborators Jan 25, 2020
ghost removed the waiting-reply label Jan 25, 2020